Test methodology

Summarization

We tested how well different large language models (LLMs) can create summaries. As a basis, we used automatically generated transcripts of 109 call center calls, deliberately choosing transcripts of less-than-optimal quality to make the task harder for the models.

 
Before the detailed test, each model was pre-screened with two sample transcripts. Only models that performed well in this pre-test were included in the detailed evaluation.

 
This is how we tested:
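
A minimal sketch of what such a summarization run could look like, assuming an OpenAI-compatible client, a folder of plain-text transcripts, and a placeholder prompt and model name; the actual prompts, models, and evaluation criteria of the test are not reproduced here:

# Illustrative sketch only: model name, prompt, and file layout are
# assumptions, not the exact harness used in this test.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Summarize the following call center transcript in a few sentences. "
    "Focus on the customer's request and the agreed next steps."
)

def summarize(transcript: str, model: str = "gpt-4o-mini") -> str:
    # One chat completion per transcript; temperature 0 for comparability.
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

Path("summaries").mkdir(exist_ok=True)
for path in sorted(Path("transcripts").glob("*.txt")):
    summary = summarize(path.read_text(encoding="utf-8"))
    (Path("summaries") / path.name).write_text(summary, encoding="utf-8")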

Test methodology

Intent detection

To investigate intent detection, we selected a use case that we implement in practice with our customers. The data set consists of 790 email inquiries from end customers to customer service. The inquiries vary widely in tone, from formal to colloquial, and some contain entire email threads. In total, the emails cover 7 different intents.

 
All potentially interesting models were first subjected to a pre-test with 5 messages. Only models that achieved acceptable results in this pre-test were considered for the detailed tests.

 
This is how we tested:
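
As a rough illustration (not the exact pipeline behind this test), intent detection on such emails can be run as a simple classification loop; the intent labels, model name, prompt, and file format below are placeholders:

# Illustrative sketch of LLM-based intent classification; labels, model,
# prompt, and data file are assumptions, not the real test setup.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical label set; the real test distinguished 7 customer intents.
INTENTS = [
    "invoice_question", "contract_change", "cancellation",
    "address_change", "technical_issue", "complaint", "other",
]

def classify(email_text: str, model: str = "gpt-4o-mini") -> str:
    # Ask the model to answer with exactly one label from the list.
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Classify the customer email into exactly one of these intents: "
                + ", ".join(INTENTS) + ". Answer with the intent name only."
            )},
            {"role": "user", "content": email_text},
        ],
    )
    return response.choices[0].message.content.strip()

# Accuracy over a labeled set, e.g. emails.jsonl with one
# {"text": ..., "intent": ...} object per line.
with open("emails.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

hits = sum(classify(ex["text"]) == ex["intent"] for ex in examples)
print(f"accuracy: {hits / len(examples):.2%}")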
