To investigate the recognition of concerns, we selected a use case that we implement in practice with our customers. The data set consists of 790 email inquiries from end customers to customer service. The inquiries vary widely in register, from formal to colloquial, and some contain entire email conversations. Across all emails, 7 different concerns occur.
All potentially interesting models were first subjected to a pre-test with 5 messages. Only models that achieved acceptable results in this pre-test were considered for the detailed tests.
To give a better overview of the quality of the different models, the following graph shows the F1 score, the most meaningful single metric for assessing model performance, for each of the 7 tested concerns.
The model with the best performance is GPT-3.5 Turbo. For this LLM, we inserted the name of each concern together with an example into the prompt; the model then outputs the number(s) of the matching concern(s). There are two other generative models (text in, text out) in the overview: FLAN-T5-XXL, which shows results similar to GPT-3.5 Turbo for most concerns, and FLAN-T5-XL, which also shows very good results, even compared to its larger sibling. The FLAN models can be run locally, but require quite powerful GPUs.
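As an illustration of this setup, here is a minimal sketch of prompt-based concern detection, assuming the OpenAI Python SDK (version 1.x); the concern list and prompt wording are hypothetical stand-ins, not the prompt used in our test:

```python
# Minimal sketch of prompt-based concern detection with GPT-3.5 Turbo.
# The concern catalogue below is hypothetical; the real names and
# examples used in the test differ.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CONCERNS = [
    "1: Cancel contract (example: 'I want to terminate my subscription.')",
    "2: Change address (example: 'I moved, please update my address.')",
    "3: Invoice complaint (example: 'My last bill is too high.')",
]

def detect_concerns(email_text: str) -> str:
    prompt = (
        "Classify the customer email below. Possible concerns, each with "
        "a number, a name and one example:\n"
        + "\n".join(CONCERNS)
        + "\n\nReply only with the number(s) of the matching concern(s), "
        "separated by commas.\n\nEmail:\n" + email_text
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output helps for classification
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(detect_concerns("Hello, I moved last week and my bill seems wrong."))
# expected output along the lines of: "2, 3"
```

Restricting the answer to concern numbers keeps the output easy to parse, even though the model is a text generator.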
The two models used zero-shot for concern detection (i.e., given only the names of the concerns, without examples) show different performance. While mDeBERTa-v3-base-MNLI-XNLI shows a rather poor recognition rate for most concerns, XLM-RoBERTa-large-XNLI delivers promising results. Both models are significantly smaller than the FLAN models and can therefore be run more efficiently.
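Both checkpoints are publicly available, so the zero-shot setup can be sketched with the Hugging Face transformers pipeline; the labels and cut-off below are hypothetical:

```python
# Sketch of zero-shot concern detection with an NLI model via the
# Hugging Face zero-shot-classification pipeline.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
    # alternative: "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
)

labels = ["cancel contract", "change address", "invoice complaint"]  # hypothetical

result = classifier(
    "Hello, I moved last week, please send my bills to the new address.",
    candidate_labels=labels,
    multi_label=True,  # an email can contain several concerns
)

# Keep every concern whose score exceeds a cut-off tuned on test data.
CUT_OFF = 0.8  # hypothetical value
detected = [l for l, s in zip(result["labels"], result["scores"]) if s >= CUT_OFF]
print(detected)
```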
The two tested embedding models were given 10 examples per concern. These models compare the embedding of each test email with the embeddings of the learning examples. They showed the worst results in this comparison, apart from a few other small open-source models that passed the pre-test but performed so poorly in the test run that they are not listed separately here.
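The embedding comparison can be sketched as follows; the sentence-transformers model, the example sentences, and the cut-off are assumptions, not the exact configuration of our test:

```python
# Sketch of few-shot concern detection via embedding similarity:
# each email is compared with pre-computed embeddings of the
# learning examples (the test used 10 examples per concern).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

EXAMPLES = {  # hypothetical learning examples
    "cancel contract": [
        "I want to terminate my subscription.",
        "Please cancel my contract as soon as possible.",
    ],
    "change address": [
        "I moved, please update my address.",
        "Here is my new address for future invoices.",
    ],
}

# Pre-compute normalized embeddings so dot product equals cosine similarity.
EXAMPLE_VECS = {
    concern: model.encode(sents, normalize_embeddings=True)
    for concern, sents in EXAMPLES.items()
}

def detect(email_text: str, cut_off: float = 0.6) -> list[str]:
    vec = model.encode([email_text], normalize_embeddings=True)[0]
    detected = []
    for concern, vecs in EXAMPLE_VECS.items():
        # Similarity to the closest learning example of this concern.
        if float(np.max(vecs @ vec)) >= cut_off:
            detected.append(concern)
    return detected

print(detect("Hello, I would like to cancel my contract."))
```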
The embedding models and the FLAN models are far ahead in terms of response speed. GPT-3.5 Turbo sits in the middle of the field at approx. 0.6 seconds per email, while the locally executed models XLM-RoBERTa-large-XNLI and mDeBERTa-v3-base-MNLI-XNLI take around twice as long. Far behind is BART-large-MNLI (another zero-shot classification model) with extremely long response times. The local models ran on an A40 GPU (A100 for FLAN-T5-XXL) without any optimization, so the speed could most likely be improved for live use with some engineering effort.
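For reference, per-email means and medians like the ones below can be collected with a simple harness of this kind (a sketch; our actual measurement setup is not published):

```python
# Sketch of a latency benchmark that yields mean and median per email.
import statistics
import time

def benchmark(detect_fn, emails):
    latencies = []
    for mail in emails:
        start = time.perf_counter()
        detect_fn(mail)  # any of the detection functions sketched above
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), statistics.median(latencies)

def dummy_detect(mail):
    time.sleep(0.05)  # stand-in for a real model call

mean_s, median_s = benchmark(dummy_detect, ["example email ..."] * 10)
print(f"mean: {mean_s:.2f} sec., median: {median_s:.2f} sec.")
```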
Response time per email (mean / median):
GPT-3.5 Turbo: 0.60 sec. / 0.59 sec.
Luminous Base Embedding: 0.12 sec. / N/A
Ada: 0.52 sec. / 0.43 sec.
BART-large-MNLI: 6.67 sec. / 5.44 sec.
mDeBERTa-v3-base-MNLI-XNLI: 1.26 sec. / N/A
XLM-RoBERTa-large-XNLI: 1.28 sec. / N/A
FLAN-T5-XXL: 0.08 sec. / 0.07 sec.
FLAN-T5-XL: 0.12 sec. / 0.10 sec.
The fees shown below are for processing all 790 mails (complete threads).
The fees for GPT-3.5 Turbo are the highest, which matches its performance. This is partly because it is a language generation model, which requires significantly more computing power than pure embedding or classification models. The OpenAI embedding model (Ada) is significantly cheaper, while the Aleph Alpha embedding model (Luminous Base Embedding) costs almost the same as GPT-3.5 Turbo, according to the companies' official prices. The other models were run locally, so their price depends heavily on the setup and the hardware used. In general, larger models are more expensive to run than smaller ones: mDeBERTa-v3-base-MNLI-XNLI has 100 million parameters, XLM-RoBERTa-large-XNLI 355 million, FLAN-T5-XL 3 billion, and FLAN-T5-XXL 11 billion.
Fees (790 mails):
GPT-3.5 Turbo: 1.20 €
Luminous Base Embedding: 1.11 €
Ada: 0.06 €
BART-large-MNLI: local (hardware-dependent)
mDeBERTa-v3-base-MNLI-XNLI: local (hardware-dependent)
XLM-RoBERTa-large-XNLI: local (hardware-dependent)
FLAN-T5-XXL: local (hardware-dependent)
FLAN-T5-XL: local (hardware-dependent)
Hosting:
GPT-3.5 Turbo: Europe, via Microsoft Azure
Luminous Base Embedding: Germany
Ada: Europe, via Microsoft Azure
BART-large-MNLI: VIER, Frankfurt
mDeBERTa-v3-base-MNLI-XNLI: VIER, Frankfurt
XLM-RoBERTa-large-XNLI: VIER, Frankfurt
FLAN-T5-XXL: VIER, Frankfurt
FLAN-T5-XL: VIER, Frankfurt
GPT-3.5 Turbo requires the least effort. The model is easy to operate with a prompt and does not require any fine-tuning to function almost perfectly for concern detection. The FLAN models are similarly simple; however, they only deliver good results if the prompt is in English, and they must be set up locally. For the smaller local models (mDeBERTa-v3-base-MNLI-XNLI and XLM-RoBERTa-large-XNLI), it is also necessary to determine suitable formulations of the concerns and cut-off values for the classification, which requires a lot of testing. Testing and optimizing cut-off values and example formulations is also necessary for the embedding models; however, these models do not have to be set up locally.
However, all of the efforts mentioned above are low compared to the alternative of training a model from scratch for concern recognition.
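As an illustration of the cut-off tuning mentioned above, here is a minimal sketch that sweeps a threshold over per-email confidence scores and keeps the one with the best F1 score; scikit-learn and the toy data are assumptions:

```python
# Sketch of tuning a classification cut-off on labelled test mails.
# scores[i] is the model's confidence that mail i contains the concern,
# y_true[i] the gold label (toy values below).
import numpy as np
from sklearn.metrics import f1_score

def best_cut_off(scores: np.ndarray, y_true: np.ndarray) -> tuple[float, float]:
    best_t, best_f1 = 0.0, 0.0
    for t in np.arange(0.05, 1.0, 0.05):
        f1 = f1_score(y_true, (scores >= t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

scores = np.array([0.92, 0.35, 0.71, 0.10, 0.88])
y_true = np.array([1, 0, 1, 0, 1])
print(best_cut_off(scores, y_true))  # e.g. (0.4, 1.0)
```

In practice, this search is repeated per concern and combined with testing different formulations of the concern labels.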
Effort:
GPT-3.5 Turbo: low
Luminous Base Embedding: rather low
Ada: rather low
BART-large-MNLI: medium
mDeBERTa-v3-base-MNLI-XNLI: medium
XLM-RoBERTa-large-XNLI: medium
FLAN-T5-XXL: rather low
FLAN-T5-XL: rather low
Recommendation:
GPT-3.5 Turbo: yes (although intent detection is not the classic use case here, as it does not require language generation)
Luminous Base Embedding: no (very good results in some areas but very poor in others; more fine-tuning could help, but the price would still be very high)
Ada: no (very good results in some cases but very poor in others; more fine-tuning could help)
BART-large-MNLI: no
mDeBERTa-v3-base-MNLI-XNLI: no
XLM-RoBERTa-large-XNLI: yes
FLAN-T5-XXL: yes (although intent detection is not the classic use case here, as it does not require language generation)
FLAN-T5-XL: yes (although intent detection is not the classic use case here, as it does not require language generation)