Main use cases: A multilingual model based on DeBERTa, a further development of the BERT architecture. It was explicitly trained for zero-shot text classification, i.e. recognizing concerns without task-specific training examples.
Input length: 512 tokens (approx. 384 words) - theoretically up to 24,528 tokens, but considerable slowdowns are to be expected above 512 tokens.
Languages: Evaluated for 15 languages, including English and German; a further 85 languages are covered to a lesser extent.
Model size: ~86 million parameters
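For orientation, here is a minimal sketch of how such a zero-shot setup can be called via the Hugging Face transformers pipeline. The checkpoint name is an assumption for illustration; the report does not specify the exact mDeBERTa variant that was tested, and the label phrasings are taken from the concerns discussed below.

```python
# Minimal sketch of zero-shot concern classification with a multilingual
# DeBERTa checkpoint. The checkpoint name below is an assumption; the
# report does not name the exact mDeBERTa variant that was tested.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",  # assumed checkpoint
)

email = "My parcel still has not arrived. Where is it?"
concerns = [  # illustrative label phrasings, not the exact test set
    "parcel has not arrived",
    "transmit and record meter reading",
    "delete account or customer account",
    "change password",
]

result = classifier(email, candidate_labels=concerns)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```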
The F1 values for the individual concerns vary only moderately (0.46-0.8). However, not a single concern shows a balanced pattern of recall and precision.
The concern "parcel has not arrived" performs well, while "transmit and record meter reading", "delete account or customer account" and "change password" perform rather poorly. In all four cases, the model over-generalizes: although most targets are recognized (recall >80% - gold), many false positives are also reported, so precision (purple) is comparatively low.
The opposite pattern appears for the concerns "I would like to receive my money" and "Product defective/deficient", which still perform quite well, and for "Please no more advertising", which is no longer recognized adequately. In these three cases, the model under-generalizes: hardly any false positives are returned (precision >89% - purple), but only around half or fewer of the actual targets are found (61%, 51% and 34% recall - gold).
Overall, six out of seven concerns fall below an F1 value of 0.75. Since F1 is the harmonic mean of precision and recall, this essentially means (assuming both are roughly balanced) that one out of four hits is a false positive and one out of four targets is not found. Two out of seven concerns are even below an F1 value of 0.5, which means that without major improvement measures (training, fine-tuning, etc.), this model has considerable difficulties recognizing customer concerns in emails.
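To make that interpretation concrete, here is a quick check of the harmonic-mean arithmetic; the 0.75/0.75 split is the balanced case assumed above, not a measured result.

```python
# Sanity check of the interpretation above: with balanced precision and
# recall, an F1 of 0.75 corresponds to precision = recall = 0.75, i.e.
# one in four hits is a false positive and one in four targets is missed.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f1(0.75, 0.75))  # 0.75 - the balanced case assumed above
```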
Several variants were examined in the test. For example, different formulations of the concerns were compared, e.g. in terms of the simplicity of the wording, and positive vs. negative statements were evaluated - for example, "stop advertising" as opposed to "no more advertising". In addition, we investigated which threshold value works best for counting a similarity score as a hit; the best configuration used a value of 0.4. Finally, for each configuration, we also tested the effect of allowing a text to be assigned to several concerns at the same time, which led to poorer results in our test series. The results shown above represent the best combination of the parameters described; a sketch of the threshold logic follows below.
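The following sketch shows one plausible reading of that threshold logic, reusing the classifier, email and concerns from the first sketch. The report does not specify exactly how the 0.4 cut-off was applied, so the details here are assumptions.

```python
# Sketch of the threshold logic described above, reusing `classifier`,
# `email` and `concerns` from the first sketch. How exactly the 0.4
# cut-off was applied is not specified in the report; this is one
# plausible reading.
THRESHOLD = 0.4  # best-performing value in the test series

# multi_label=False performed better in the test series: scores are
# normalized across all labels and only the top label is considered.
result = classifier(email, candidate_labels=concerns, multi_label=False)
top_label, top_score = result["labels"][0], result["scores"][0]

if top_score >= THRESHOLD:
    print(f"Recognized concern: {top_label} ({top_score:.2f})")
else:
    print("No concern recognized above the threshold")
```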
Our tests for recognizing concerns showed comparatively long response times, averaging 1.26 seconds per email, which is why the model is only recommended to a limited extent for real-time applications.
Median: N/A
Mean: 1.26 sec.
Minimum: N/A
Maximum: N/A
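Not all of these statistics were recorded in our test, but collecting them is straightforward. Here is a minimal sketch, again reusing the objects from the first sketch and assuming a list `emails` of test messages.

```python
# Minimal sketch of how per-email response times like the mean above can
# be collected. `classifier` and `concerns` come from the first sketch;
# `emails` is an assumed list of test messages.
import statistics
import time

latencies = []
for email in emails:
    start = time.perf_counter()
    classifier(email, candidate_labels=concerns)
    latencies.append(time.perf_counter() - start)

print(f"Median:  {statistics.median(latencies):.2f} sec.")
print(f"Mean:    {statistics.mean(latencies):.2f} sec.")
print(f"Minimum: {min(latencies):.2f} sec.")
print(f"Maximum: {max(latencies):.2f} sec.")
```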
This model was run locally on our servers, so no direct costs were incurred. In practice, the price depends very much on the setup and the hardware used. In general, larger models are more expensive to run than smaller ones: at roughly 86 million parameters, mDeBERTa can be considered rather small.
Local hosting possible, GPU needed
Despite the probably very low costs, we cannot give a clear product recommendation for this model when the focus is on recognizing concerns in German-language customer emails, given the at best mediocre quality and the comparatively long response times. However, using it can be worthwhile if large volumes of text have to be (pre-)classified rather roughly and if hosting by VIER itself is strictly required, for example for data protection reasons, since the costs are low. In addition, the quality could be improved through fine-tuning.