multilingual-cpv-sector-classifier
This model is a fine-tuned version of bert-base-multilingual-cased on the Tenders Electronic Daily (TED) public procurement data. It achieves the following results on the evaluation set:
- F1 Score: 0.686
Model description
The model takes procurement descriptions written in any of the 104 languages supported by multilingual BERT and classifies them into 45 sector classes, represented by the CPV (Common Procurement Vocabulary) code descriptions listed below.
Common Procurement Vocabulary |
---|
Administration, defence and social security services. |
Agricultural machinery. |
Agricultural, farming, fishing, forestry and related products. |
Agricultural, forestry, horticultural, aquacultural and apicultural services. |
Architectural, construction, engineering and inspection services. |
Business services: law, marketing, consulting, recruitment, printing and security. |
Chemical products. |
Clothing, footwear, luggage articles and accessories. |
Collected and purified water. |
Construction structures and materials; auxiliary products to construction (excepts electric apparatus). |
Construction work. |
Education and training services. |
Electrical machinery, apparatus, equipment and consumables; Lighting. |
Financial and insurance services. |
Food, beverages, tobacco and related products. |
Furniture (incl. office furniture), furnishings, domestic appliances (excl. lighting) and cleaning products. |
Health and social work services. |
Hotel, restaurant and retail trade services. |
IT services: consulting, software development, Internet and support. |
Industrial machinery. |
Installation services (except software). |
Laboratory, optical and precision equipments (excl. glasses). |
Leather and textile fabrics, plastic and rubber materials. |
Machinery for mining, quarrying, construction equipment. |
Medical equipments, pharmaceuticals and personal care products. |
Mining, basic metals and related products. |
Musical instruments, sport goods, games, toys, handicraft, art materials and accessories. |
Office and computing machinery, equipment and supplies except furniture and software packages. |
Other community, social and personal services. |
Petroleum products, fuel, electricity and other sources of energy. |
Postal and telecommunications services. |
Printed matter and related products. |
Public utilities. |
Radio, television, communication, telecommunication and related equipment. |
Real estate services. |
Recreational, cultural and sporting services. |
Repair and maintenance services. |
Research and development services and related consultancy services. |
Security, fire-fighting, police and defence equipment. |
Services related to the oil and gas industry. |
Sewage-, refuse-, cleaning-, and environmental services. |
Software package and information systems. |
Supporting and auxiliary transport services; travel agencies services. |
Transport equipment and auxiliary products to transportation. |
Transport services (excl. Waste transport). |
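
A minimal inference sketch, assuming the checkpoint is loaded through the `transformers` text-classification pipeline and that its label names carry the sector descriptions listed above. The repository id shown in the comments is a placeholder rather than a confirmed path, and `load_classifier` and `top_sector` are hypothetical helper names introduced here for illustration.

```python
def load_classifier(model_id):
    """Build a text-classification pipeline for the given model id.

    `model_id` is an assumption: pass the actual Hub repository id or a
    local directory that contains the fine-tuned checkpoint.
    """
    # Imported lazily so top_sector() can be used without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def top_sector(classifier, description):
    """Return (label, score) of the highest-scoring sector for one description."""
    # truncation=True keeps long descriptions within the model's input limit.
    predictions = classifier(description, truncation=True)
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


# Example usage (placeholder repository id, not verified):
# clf = load_classifier("<hub-user>/multilingual-cpv-sector-classifier")
# label, score = top_sector(clf, "Lieferung von Laborgeräten für die Universität")
```

Keeping the pipeline construction separate from the scoring helper makes it easy to swap in a locally stored checkpoint or a stub classifier during testing.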
Intended uses & limitations
- Input descriptions should be written in one of the 104 languages that mBERT supports.
- The model has been evaluated on only 22 languages, so its performance on other languages is unknown.
- The domain is restricted to descriptions of awarded procurement notices in the European Union. Performance may differ when the model is applied to whole document texts.
Training and evaluation data
- The full dataset consists of 744,360 rows, shuffled and split 80%/20% into training and validation sets.
- Each row is a unique contract notice description for a contract awarded between 2011 and 2018.
- Both the training and validation data contain contract notice descriptions written in 22 European languages. (Maltese and Irish were excluded because they are scarce relative to the rest of the data.)
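
The shuffle-and-split step described above can be sketched with the standard library alone. This is an illustration, not the original preprocessing code: the seed and the `shuffle_split` helper name are assumptions, and row indices stand in for the actual descriptions.

```python
import random


def shuffle_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows and split them into (train, validation) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


TOTAL_ROWS = 744_360
train, val = shuffle_split(range(TOTAL_ROWS))
print(len(train), len(val))
```

With 744,360 rows, an 80%/20% split yields 595,488 training rows and 148,872 validation rows.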
Training procedure
Training was run on Google Cloud v3-8 TPUs. Thanks to Google for providing access to Cloud TPUs.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- num_epochs: 3
- gradient_accumulation_steps: 8
- batch_size_per_device: 4
- total_train_batch_size: 32
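
As a consistency check, the total batch size and the number of steps per epoch can be derived from the figures in this card. The sketch assumes that the total batch size is the per-device batch size multiplied by the gradient accumulation steps, and that the training split holds 80% of the 744,360 rows. The variable names are illustrative.

```python
TOTAL_ROWS = 744_360
TRAIN_ROWS = TOTAL_ROWS * 80 // 100  # 80% training split

batch_size_per_device = 4
gradient_accumulation_steps = 8

# Assumption: one optimizer step consumes per-device batch x accumulation steps.
total_train_batch_size = batch_size_per_device * gradient_accumulation_steps

steps_per_epoch = TRAIN_ROWS // total_train_batch_size
for epoch in (1, 2, 3):
    # Cumulative optimizer steps at the end of each epoch.
    print(epoch, epoch * steps_per_epoch)
```

Under these assumptions the total batch size is 32 and the cumulative step counts are 18,609, 37,218 and 55,827, which agree with the Step column reported in the training results.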
Training results
Epoch | Step | F1 Score |
---|---|---|
1 | 18,609 | 0.630 |
2 | 37,218 | 0.674 |
3 | 55,827 | 0.686 |

Language | F1 Score | Test Size |
---|---|---|
PL | 0.759 | 13950 |
RO | 0.736 | 3522 |
SK | 0.719 | 1122 |
LT | 0.687 | 2424 |
HU | 0.681 | 1879 |
BG | 0.675 | 2459 |
CS | 0.668 | 2694 |
LV | 0.664 | 836 |
DE | 0.645 | 35354 |
FI | 0.644 | 1898 |
ES | 0.643 | 7483 |
PT | 0.631 | 874 |
EN | 0.631 | 16615 |
HR | 0.626 | 865 |
IT | 0.626 | 8035 |
NL | 0.624 | 5640 |
EL | 0.623 | 1724 |
SL | 0.615 | 482 |
SV | 0.607 | 3326 |
DA | 0.603 | 1925 |
FR | 0.601 | 33113 |
ET | 0.572 | 458 |