We tested how well various large language models (LLMs) can produce summaries. As input, we used automatically generated transcripts of 109 call center calls. We deliberately selected transcripts of less-than-optimal quality in order to make the task harder for the models.
Beforehand, however, the models went through a preliminary test with two sample transcripts. Only models that performed well there were included in the detailed test. Below you will find the models' results for the summarization use case in comparison with one another.
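To give an idea of the setup, here is a minimal sketch of such a batch summarization run, assuming an OpenAI-compatible Python client. The model name, prompt wording, file layout and the function summarize_transcript are illustrative assumptions, not our production pipeline.

```python
# Minimal sketch of a batch summarization run over call transcripts.
# Assumptions: an OpenAI-compatible API client; model name, prompt text and
# file layout are illustrative only.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Fasse das folgende Callcenter-Transkript in wenigen Sätzen "
    "auf Deutsch zusammen:\n\n{transcript}"
)

def summarize_transcript(transcript: str, model: str = "gpt-4") -> str:
    """Ask the model for a short German summary of one transcript."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
        temperature=0,
    )
    return response.choices[0].message.content

# Summarize every transcript file in a folder (one call per transcript).
summaries = {}
for path in sorted(Path("transcripts").glob("*.txt")):
    summaries[path.stem] = summarize_transcript(path.read_text(encoding="utf-8"))
```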
GPT-4 achieves higher scores than the reference summaries in all categories and also comes very close to the maximum possible score in each of them. This shows that summarizing short texts, some of which are of low quality, is exactly where this language generation model excels.
The quality of Claude 2's summaries is roughly on a par with the human-written references. Although Claude 2's texts are generally rated well and are also easy to read, some strings of numbers were incorrectly carried over into the summaries in our tests. Conciseness is also slightly below the human reference values.
Apart from the "Fluidity" category, in which all models score slightly above the reference summaries, the remaining models stay slightly below the references in the other categories. GPT-3.5 Turbo, Claude v1 and Llama2-7B-Chat are roughly on a par with each other. What they have in common is that they are rated lower than the references particularly for conciseness, i.e. the brevity of the summaries, and for the structure of the texts. All three models also include slightly more false information (hallucinations) in their summaries than the references, albeit very little. Overall, however, the quality of all three models can still be rated as very good.
Luminous Supreme Control, on the other hand, shows greater weaknesses: it includes more incorrect information in its summaries (hallucinations) and omits relevant content (completeness). As a result, its overall rating ends up lower. In terms of fluidity, structure and relevance, it is on a par with the three models mentioned above.
It is worth mentioning that both Llama2-7B-Chat and Luminous Supreme Control produced some complete failures in our tests, in which, for example, the dialog was merely repeated instead of summarized. Furthermore, Llama2-7B-Chat currently only produces English summaries of German texts, which may still need to be translated depending on the intended use.
The evaluation was based on a Likert scale; the scores achieved are given in each case, and a short sketch of how such per-category ratings can be aggregated follows the score listings below. For comparison, the results of the human-written reference summaries:
No hallucinations: 4.82
Completeness: 4.85
Structure: 4.97
Fluidity: 4.89
Relevance: 4.78
General evaluation: 4.79
No hallucinations: 4.05
Completeness: 4.19
Structure: 4.31
Fluidity: 4.31
Relevance: 3.88
General evaluation: 3.86
No hallucinations: 4.30
Completeness: 4.51
Structure: 4.64
Fluidity: 4.60
Relevance: 4.42
General evaluation: 4.36
No hallucinations: 4.06
Completeness: 4.39
Structure: 4.09
Fluidity: 4.29
Relevance: 3.59
General evaluation: 3.78
No hallucinations: 3.46
Completeness: 3.34
Structure: 4.06
Fluidity: 4.08
Relevance: 3.72
General evaluation: 3.25
No hallucinations: 4.12
Completeness: 4.20
Structure: 4.01
Fluidity: 4.12
Relevance: 3.69
General evaluation: 3.64
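The exact rating procedure is not spelled out above, but as a rough illustration, the following sketch shows how individual 1-5 ratings per summary could be averaged into per-category scores like those listed. The category names and example ratings are assumptions for illustration only.

```python
# Sketch: averaging per-summary Likert ratings (1-5) into per-category means.
# The category keys and example ratings below are illustrative assumptions.
from statistics import mean

CATEGORIES = ["no_hallucinations", "completeness", "structure",
              "fluidity", "relevance", "general"]

def aggregate(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average the 1-5 ratings of all summaries for each category."""
    return {cat: round(mean(r[cat] for r in ratings), 2) for cat in CATEGORIES}

# Two invented ratings for one model:
example = [
    {"no_hallucinations": 5, "completeness": 4, "structure": 5,
     "fluidity": 5, "relevance": 4, "general": 4},
    {"no_hallucinations": 4, "completeness": 5, "structure": 5,
     "fluidity": 4, "relevance": 4, "general": 4},
]
print(aggregate(example))  # e.g. {'no_hallucinations': 4.5, ...}
```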
In terms of response times, the GPT models take the longest at 10-11 seconds. Claude v1, Claude 2 and Luminous Supreme Control sit in the middle and, at 6-7 seconds, are almost twice as fast. The locally run Llama2-7B-Chat model shows the shortest response times here, although slightly longer response times would have to be expected in a real application setup. Any translation of Llama2-7B-Chat's English summaries is also not reflected in the response times. A sketch of how such timings can be measured follows the figures below.
In principle, however, all response times are absolutely acceptable for the use case at hand. Quality is far more important here than speed.
Mean: 10.2 sec.
Median: 9.7 sec.
Mean: 11.54 sec.
Median: 11.16 sec.
Mean: 6.67 sec.
Median: 6 sec.
Mean: 6.87 sec.
Median: 7 sec.
Mean: 10.02 sec.
Median: 6.6 sec.
Mean: 3.58 sec.
Median: 3.18 sec.
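As a rough illustration of how these numbers can be obtained, the following sketch times one summarization call per transcript and reports mean and median wall-clock latency. The summarize callable is an assumption (for example, the summarize_transcript function from the sketch at the top of this section).

```python
# Sketch: measuring mean and median response time per summarization call.
# `summarize` is any function mapping a transcript to a summary (assumption).
import time
from statistics import mean, median
from typing import Callable

def measure_latency(summarize: Callable[[str], str],
                    transcripts: list[str]) -> tuple[float, float]:
    """Return (mean, median) wall-clock seconds per summarization call."""
    durations = []
    for transcript in transcripts:
        start = time.perf_counter()
        summarize(transcript)  # timing includes network latency for hosted models
        durations.append(time.perf_counter() - start)
    return round(mean(durations), 2), round(median(durations), 2)
```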
(for 109 transcripts)
The costs for GPT-3.5 Turbo are by far the lowest. Claude v1 and Claude 2 are around ten times more expensive, but at least Claude 2 also provides higher quality summaries. At around 2 cents per summary, GPT-4 is twice as expensive as Claude 2, but also delivers outstanding quality. Only Luminous Supreme Control is even more expensive at approx. 3.5 cents per summary.
Llama2-7B-Chat was run locally, so the price depends very much on the setup and the hardware used. In general, operating larger models is more expensive than smaller ones.
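The per-summary figures follow directly from the total fees listed below divided by the 109 transcripts; the short check below reproduces that arithmetic (the model-to-fee mapping is taken from the prose and the fee list in this section).

```python
# Back-of-the-envelope check: total API fee (EUR) divided by 109 transcripts.
fees_eur = {
    "GPT-4": 2.22,
    "GPT-3.5 Turbo": 0.092,
    "Claude 2": 1.18,
    "Claude v1": 0.76,
    "Luminous Supreme Control": 3.82,
}
for model, fee in fees_eur.items():
    print(f"{model}: {fee / 109 * 100:.2f} cents per summary")
# GPT-4 ~2.04, GPT-3.5 Turbo ~0.08, Claude 2 ~1.08, Claude v1 ~0.70,
# Luminous Supreme Control ~3.50 cents per summary
```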
Fee: €2.22 ($2.37), about 2 cents per summary
Fee: €0.092 ($0.098), about 0.08 cents per summary
Fee: €1.18 ($1.26), about 1 cent per summary
Fee: €0.76 ($0.81), about 0.7 cents per summary
Fee: €3.82, about 3.5 cents per summary
Fee: none (own GPUs)
Hosting: Europe via Azure
Hosting: Europe via Azure
Hosting: currently only US and UK, access from Europe only via VPN (as of 01/2024)
Hosting: currently only US and UK, access from Europe only via VPN (as of 01/2024)
Hosting: Germany
Hosting: VIER Frankfurt
In principle, the effort required for all cloud-based models was very low. The models are easy to use with a simple prompt and do not require any fine-tuning to generate high-quality summaries. Only Luminous Supreme Control required a little more prompt engineering: in the end, we had to insert an example into the prompt in order to obtain the desired summaries (see the prompt sketch below).
We also had to optimize the prompt somewhat for the Llama2-7B-Chat model, and the model had to be set up locally.
However, all of the efforts mentioned are low compared to the alternative of training a dedicated summarization model from scratch.
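As an illustration of the example-in-the-prompt approach mentioned above, the following sketch shows what such a one-shot summarization prompt could look like. The instruction wording and the example dialog are invented for illustration; they are not our production prompt.

```python
# Sketch of a one-shot prompt: a short worked example is prepended so that
# the model imitates the desired summary style. Wording and example dialog
# are illustrative assumptions.
ONE_SHOT_PROMPT = """\
Fasse Callcenter-Transkripte kurz und sachlich auf Deutsch zusammen.

Transkript:
Kunde: Ich habe meine Rechnung diesen Monat doppelt erhalten.
Agent: Ich storniere die zweite Rechnung und sende Ihnen eine Bestätigung per E-Mail.

Zusammenfassung:
Der Kunde meldet eine doppelt gestellte Rechnung. Der Agent storniert die zweite
Rechnung und kündigt eine Bestätigung per E-Mail an.

Transkript:
{transcript}

Zusammenfassung:"""

def build_prompt(transcript: str) -> str:
    """Insert the current transcript into the one-shot template."""
    return ONE_SHOT_PROMPT.format(transcript=transcript)
```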
Effort: low
Effort: low
Effort: low
Effort: low
Effort: rather low
Effort: medium
All summaries were very good; no failures. The prompt must be in German, otherwise the summaries are considerably worse.
All summaries were relatively good. No total failures.
Shorter summaries that were very much to the point. No failures. More grammatical errors than the other models. Where hallucinations occurred in the summaries, they involved numbers.
Some summaries were a shortened dialog rather than an actual summary. This could possibly be fixed via the prompt.
7 texts could not be evaluated because they were too long for the context window. Of the 102 remaining texts, the model produced 4 total failures in which it got caught in a loop or merely repeated parts of the prompt.
Given the size of the model, the English summaries were good. However, there were 9 summaries in which the model only repeated parts of the input, which is even more than Luminous Supreme Control. The remaining summaries, however, were of higher quality.
(imperfect German transcripts)
Recommendation: yes
Recommendation: yes
Recommendation: yes
Recommendation: yes
Recommendation: cautiously yes, if a customer absolutely wants a model hosted in Germany
Recommendation: cautiously yes, if a customer absolutely wants a model hosted by us