Previous studies on the quality of text summaries have shown that it is not only the size of the large language model (LLM) that matters, but also whether the model has been explicitly trained on human preferences. LLMs trained in this way produce better summaries even than models fine-tuned for specific text types, such as newspaper articles. Furthermore, studies have shown that automatic evaluation of summary quality cannot replace human evaluation, as the results diverge widely. Most of the relevant studies were based on English newspaper articles.
We decided to test summary quality for a use case we have also implemented with our customers. As a basis, we used automatically generated transcripts of 109 call center calls. We deliberately selected transcripts that were not of optimal quality in order to make the task harder for the models. Because the data originates from spontaneous spoken dialog, it contains a large number of incomplete linguistic structures, in contrast to an edited newspaper text, for example.
Before the detailed test, we ran a preliminary check using two sample transcripts. Only models that performed well in this check were included in the detailed test, whose results are presented here.
Each of the 109 dialogs was additionally summarized by a human expert in order to obtain a reference point for the machine summaries. We also had each tested LLM generate a brief summary in German.
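To make the setup concrete, the following sketch shows how such a German summary could be requested from a chat-based LLM API, here using the OpenAI Python client. The prompt wording, model name, and temperature are illustrative assumptions, not the exact configuration used in our test.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instruction; the exact prompt used in the study is not reproduced here.
SYSTEM_PROMPT = "Summarize the following call center transcript briefly in German."

def summarize_transcript(transcript: str, model: str = "gpt-4o") -> str:
    """Request a brief German summary of one call transcript."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        temperature=0.2,  # keep the summary close to the source material
    )
    return response.choices[0].message.content
```

In the actual test, the same call would be repeated for every model under comparison, with only the model identifier swapped out.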
The LLM summaries and the reference summaries were rated by human experts on six quality criteria using a 5-point Likert scale.
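A minimal sketch of how such expert ratings can be aggregated is shown below; the criterion names and scores are placeholders, not the six criteria or the results of the study.

```python
import pandas as pd

# One row per (model, dialog, criterion) expert judgment on the 5-point Likert scale.
ratings = pd.DataFrame(
    [
        {"model": "model_a", "dialog_id": 1, "criterion": "coherence", "score": 4},
        {"model": "model_a", "dialog_id": 1, "criterion": "faithfulness", "score": 5},
        {"model": "human_reference", "dialog_id": 1, "criterion": "coherence", "score": 5},
        # ... one record per judgment for all 109 dialogs
    ]
)

# Mean score per model and criterion, plus an overall mean per model,
# which allows a direct comparison against the human reference summaries.
per_criterion = ratings.groupby(["model", "criterion"])["score"].mean().unstack()
overall = ratings.groupby("model")["score"].mean().sort_values(ascending=False)

print(per_criterion.round(2))
print(overall.round(2))
```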
In addition, we examined each model's response time and the costs it incurred for the German summarization use case.
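Response time can be measured as the wall-clock latency of the API call, and costs can be estimated from the reported token usage. The sketch below reuses the `client` from the earlier example; the per-token prices are placeholders, since actual pricing depends on the provider and changes over time.

```python
import time

# Illustrative per-1k-token prices; treat these purely as placeholders.
PRICE_PER_1K_INPUT_TOKENS = 0.0025
PRICE_PER_1K_OUTPUT_TOKENS = 0.0100

def timed_summary(transcript: str, model: str = "gpt-4o"):
    """Return the summary together with its latency and an estimated cost."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize briefly in German:\n{transcript}"}],
    )
    latency_s = time.perf_counter() - start

    usage = response.usage
    cost = (
        usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
        + usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    )
    return response.choices[0].message.content, latency_s, cost
```

Averaging latency and cost over all 109 transcripts gives per-model figures that can be set against the quality ratings.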