To investigate the recognition of concerns, we selected a use case that we implement in practice with our customers. The data set consists of 790 email inquiries from end customers to customer service. The inquiries vary widely in register, from formal to colloquial, and some contain entire email conversations. Across all emails, 7 different concerns occur.
All potentially interesting models were first subjected to a pre-test with 5 messages. Only models that achieved acceptable results in this pre-test were considered for the detailed tests.
To give a better overview of the quality of the different models, the following graph shows the F1 score, the most meaningful single metric for assessing model performance, for each of the 7 tested concerns.
The model with the best performance is GPT-3.5 Turbo. For this LLM, we inserted the name of each concern together with an example into the prompt; the model then outputs the number(s) of the matching concern(s). There are two other generative models (text in, text out) in the overview: FLAN-T5-XXL, which shows results similar to GPT-3.5 Turbo for most concerns, and FLAN-T5-XL, which also shows very good results, even compared to its larger sibling. The FLAN models can be run locally, but require quite powerful GPUs.
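As an illustration of this setup, here is a minimal sketch of prompt-based concern detection, assuming the OpenAI Python SDK (version 1.x); the concern list and prompt wording are hypothetical stand-ins, not the prompt used in our test:

```python
# Minimal sketch of prompt-based concern detection with GPT-3.5 Turbo.
# The concern catalogue below is hypothetical; the real names and
# examples used in the test differ.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CONCERNS = [
    "1: Cancel contract (example: 'I want to terminate my subscription.')",
    "2: Change address (example: 'I moved, please update my address.')",
    "3: Invoice complaint (example: 'My last bill is too high.')",
]

def detect_concerns(email_text: str) -> str:
    prompt = (
        "Classify the customer email below. Possible concerns, each with "
        "a number, a name and one example:\n"
        + "\n".join(CONCERNS)
        + "\n\nReply only with the number(s) of the matching concern(s), "
        "separated by commas.\n\nEmail:\n" + email_text
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output helps for classification
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(detect_concerns("Hello, I moved last week and my bill seems wrong."))
# expected output along the lines of: "2, 3"
```

Restricting the answer to concern numbers keeps the output easy to parse, even though the model is a text generator.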
The two models used zero-shot for concern detection (i.e., given only the names of the concerns, without examples) show different performance. While mDeBERTa-v3-base-MNLI-XNLI shows a rather poor recognition rate for most concerns, XLM-RoBERTa-large-XNLI delivers promising results. Both models are significantly smaller than the FLAN models and can therefore be run more efficiently.
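Both checkpoints are publicly available, so the zero-shot setup can be sketched with the Hugging Face transformers pipeline; the labels and cut-off below are hypothetical:

```python
# Sketch of zero-shot concern detection with an NLI model via the
# Hugging Face zero-shot-classification pipeline.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
    # alternative: "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
)

labels = ["cancel contract", "change address", "invoice complaint"]  # hypothetical

result = classifier(
    "Hello, I moved last week, please send my bills to the new address.",
    candidate_labels=labels,
    multi_label=True,  # an email can contain several concerns
)

# Keep every concern whose score exceeds a cut-off tuned on test data.
CUT_OFF = 0.8  # hypothetical value
detected = [l for l, s in zip(result["labels"], result["scores"]) if s >= CUT_OFF]
print(detected)
```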
The two tested embedding models were given 10 examples per concern. These models compare the embedding of each test email with the embeddings of the learning examples. They showed the worst results in this comparison, apart from a few other small open-source models that passed the pre-test but performed so poorly in the test run that they are not listed separately here.
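The embedding comparison can be sketched as follows; the sentence-transformers model, the example sentences, and the cut-off are assumptions, not the exact configuration of our test:

```python
# Sketch of few-shot concern detection via embedding similarity:
# each email is compared with pre-computed embeddings of the
# learning examples (the test used 10 examples per concern).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

EXAMPLES = {  # hypothetical learning examples
    "cancel contract": [
        "I want to terminate my subscription.",
        "Please cancel my contract as soon as possible.",
    ],
    "change address": [
        "I moved, please update my address.",
        "Here is my new address for future invoices.",
    ],
}

# Pre-compute normalized embeddings so dot product equals cosine similarity.
EXAMPLE_VECS = {
    concern: model.encode(sents, normalize_embeddings=True)
    for concern, sents in EXAMPLES.items()
}

def detect(email_text: str, cut_off: float = 0.6) -> list[str]:
    vec = model.encode([email_text], normalize_embeddings=True)[0]
    detected = []
    for concern, vecs in EXAMPLE_VECS.items():
        # Similarity to the closest learning example of this concern.
        if float(np.max(vecs @ vec)) >= cut_off:
            detected.append(concern)
    return detected

print(detect("Hello, I would like to cancel my contract."))
```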
The embedding models and the FLAN models are far ahead in terms of response speed. GPT-3.5 Turbo sits in the middle of the field at approx. 0.6 seconds per email, while the locally executed models XLM-RoBERTa-large-XNLI and mDeBERTa-v3-base-MNLI-XNLI take around twice as long. Far behind is BART-large-MNLI (another zero-shot classification model) with extremely long response times. The local models ran on an A40 GPU (A100 for FLAN-T5-XXL) without any optimization, so the speed could most likely be improved for live use with some engineering effort.
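For reference, per-email means and medians like the ones below can be collected with a simple harness of this kind (a sketch; our actual measurement setup is not published):

```python
# Sketch of a latency benchmark that yields mean and median per email.
import statistics
import time

def benchmark(detect_fn, emails):
    latencies = []
    for mail in emails:
        start = time.perf_counter()
        detect_fn(mail)  # any of the detection functions sketched above
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), statistics.median(latencies)

def dummy_detect(mail):
    time.sleep(0.05)  # stand-in for a real model call

mean_s, median_s = benchmark(dummy_detect, ["example email ..."] * 10)
print(f"mean: {mean_s:.2f} sec., median: {median_s:.2f} sec.")
```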
Response time per email (mean / median):
GPT-3.5 Turbo: 0.60 sec. / 0.59 sec.
Luminous Base Embedding: 0.12 sec. / N/A
Ada: 0.52 sec. / 0.43 sec.
BART-large-MNLI: 6.67 sec. / 5.44 sec.
mDeBERTa-v3-base-MNLI-XNLI: 1.26 sec. / N/A
XLM-RoBERTa-large-XNLI: 1.28 sec. / N/A
FLAN-T5-XXL: 0.08 sec. / 0.07 sec.
FLAN-T5-XL: 0.12 sec. / 0.10 sec.
The fees shown below are for processing all 790 mails (complete threads).
The fees for GPT-3.5 Turbo are the highest, which matches its performance. This is partly because it is a language generation model, which requires significantly more computing power than pure embedding or classification models. The OpenAI embedding model (Ada) is significantly cheaper, while the Aleph Alpha embedding model (Luminous Base Embedding) costs almost the same as GPT-3.5 Turbo, according to the companies' official prices. The other models were run locally, so their price depends heavily on the setup and the hardware used. In general, larger models are more expensive to run than smaller ones: mDeBERTa-v3-base-MNLI-XNLI has 100 million parameters, XLM-RoBERTa-large-XNLI 355 million, FLAN-T5-XL 3 billion, and FLAN-T5-XXL 11 billion.
Fees (790 mails):
GPT-3.5 Turbo: 1.20 €
Luminous Base Embedding: 1.11 €
Ada: 0.06 €
BART-large-MNLI: local (hardware-dependent)
mDeBERTa-v3-base-MNLI-XNLI: local (hardware-dependent)
XLM-RoBERTa-large-XNLI: local (hardware-dependent)
FLAN-T5-XXL: local (hardware-dependent)
FLAN-T5-XL: local (hardware-dependent)
Hosting:
GPT-3.5 Turbo: Europe, via Microsoft Azure
Luminous Base Embedding: Germany
Ada: Europe, via Microsoft Azure
BART-large-MNLI: VIER, Frankfurt
mDeBERTa-v3-base-MNLI-XNLI: VIER, Frankfurt
XLM-RoBERTa-large-XNLI: VIER, Frankfurt
FLAN-T5-XXL: VIER, Frankfurt
FLAN-T5-XL: VIER, Frankfurt
GPT-3.5 Turbo requires the least effort. The model is easy to operate with a prompt and does not require any fine-tuning to function almost perfectly for concern detection. The FLAN models are similarly simple; however, they only deliver good results if the prompt is in English, and they must be set up locally. For the smaller local models (mDeBERTa-v3-base-MNLI-XNLI and XLM-RoBERTa-large-XNLI), it is also necessary to determine suitable formulations of the concerns and cut-off values for the classification, which requires a lot of testing. Testing and optimizing cut-off values and example formulations is also necessary for the embedding models; however, these models do not have to be set up locally.
However, all of the efforts mentioned above are low compared to the alternative of training a model from scratch for concern recognition.
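As an illustration of the cut-off tuning mentioned above, here is a minimal sketch that sweeps a threshold over per-email confidence scores and keeps the one with the best F1 score; scikit-learn and the toy data are assumptions:

```python
# Sketch of tuning a classification cut-off on labelled test mails.
# scores[i] is the model's confidence that mail i contains the concern,
# y_true[i] the gold label (toy values below).
import numpy as np
from sklearn.metrics import f1_score

def best_cut_off(scores: np.ndarray, y_true: np.ndarray) -> tuple[float, float]:
    best_t, best_f1 = 0.0, 0.0
    for t in np.arange(0.05, 1.0, 0.05):
        f1 = f1_score(y_true, (scores >= t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

scores = np.array([0.92, 0.35, 0.71, 0.10, 0.88])
y_true = np.array([1, 0, 1, 0, 1])
print(best_cut_off(scores, y_true))  # e.g. (0.4, 1.0)
```

In practice, this search is repeated per concern and combined with testing different formulations of the concern labels.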
Effort:
GPT-3.5 Turbo: low
Luminous Base Embedding: rather low
Ada: rather low
BART-large-MNLI: medium
mDeBERTa-v3-base-MNLI-XNLI: medium
XLM-RoBERTa-large-XNLI: medium
FLAN-T5-XXL: rather low
FLAN-T5-XL: rather low
Recommendation:
GPT-3.5 Turbo: yes (although intent detection is not the classic use case here, as it does not require language generation)
Luminous Base Embedding: no (very good results in some areas but very poor in others; more fine-tuning could help, but the price would still be very high)
Ada: no (very good results in some cases but very poor in others; more fine-tuning could help)
BART-large-MNLI: no
mDeBERTa-v3-base-MNLI-XNLI: no
XLM-RoBERTa-large-XNLI: yes
FLAN-T5-XXL: yes (although intent detection is not the classic use case here, as it does not require language generation)
FLAN-T5-XL: yes (although intent detection is not the classic use case here, as it does not require language generation)