We tested how well various large language models (LLMs) can produce summaries. As input, we used automatically generated transcripts of 109 call center calls. We deliberately selected transcripts of less-than-optimal quality in order to make the task harder for the models.
Beforehand, however, the models went through a preliminary test with two sample transcripts. Only models that performed well there were included in the detailed test. Below you will find the models' results for the summarization use case in comparison with one another.
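To give an idea of the setup, here is a minimal sketch of such a batch summarization run, assuming an OpenAI-compatible Python client. The model name, prompt wording, file layout and the function summarize_transcript are illustrative assumptions, not our production pipeline.

```python
# Minimal sketch of a batch summarization run over call transcripts.
# Assumptions: an OpenAI-compatible API client; model name, prompt text and
# file layout are illustrative only.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Fasse das folgende Callcenter-Transkript in wenigen Sätzen "
    "auf Deutsch zusammen:\n\n{transcript}"
)

def summarize_transcript(transcript: str, model: str = "gpt-4") -> str:
    """Ask the model for a short German summary of one transcript."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
        temperature=0,
    )
    return response.choices[0].message.content

# Summarize every transcript file in a folder (one call per transcript).
summaries = {}
for path in sorted(Path("transcripts").glob("*.txt")):
    summaries[path.stem] = summarize_transcript(path.read_text(encoding="utf-8"))
```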
GPT-4 achieves higher scores than the reference summaries in all categories and also comes very close to the maximum possible score in each of them. This shows that summarizing short texts, some of which are of low quality, is exactly where this language generation model excels.
The quality of Claude 2's summaries is roughly on a par with the human-written references. Although Claude 2's texts are generally rated well and are also easy to read, some strings of numbers were incorrectly carried over into the summaries in our tests. Conciseness is also slightly below the human reference values.
Apart from the "Fluidity" category, in which all models score slightly above the reference summaries, the remaining models stay slightly below the references in the other categories. GPT-3.5 Turbo, Claude v1 and Llama2-7B-Chat are roughly on a par with each other. What they have in common is that they are rated lower than the references particularly for conciseness, i.e. the brevity of the summaries, and for the structure of the texts. All three models also include slightly more false information (hallucinations) in their summaries than the references, albeit very little. Overall, however, the quality of all three models can still be rated as very good.
Luminous Supreme Control, on the other hand, shows greater weaknesses: it includes more incorrect information in its summaries (hallucinations) and omits relevant content (completeness). As a result, its overall rating ends up lower. In terms of fluidity, structure and relevance, it is on a par with the three models mentioned above.
It is worth mentioning that both Llama2-7B-Chat and Luminous Supreme Control produced some complete failures in our tests, in which, for example, the dialog was merely repeated instead of summarized. Furthermore, Llama2-7B-Chat currently only produces English summaries of German texts, which may still need to be translated depending on the intended use.
The evaluation was based on a Likert scale; the scores achieved are given in each case, and a short sketch of how such per-category ratings can be aggregated follows the score listings below. For comparison, the results of the human-written reference summaries:
No hallucinations: 4.82
Completeness: 4.85
Structure: 4.97
Fluidity: 4.89
Relevance: 4.78
General evaluation: 4.79
No hallucinations: 4.05
Completeness: 4.19
Structure: 4.31
Fluidity: 4.31
Relevance: 3.88
General evaluation: 3.86
No hallucinations: 4.30
Completeness: 4.51
Structure: 4.64
Fluidity: 4.60
Relevance: 4.42
General evaluation: 4.36
No hallucinations: 4.06
Completeness: 4.39
Structure: 4.09
Fluidity: 4.29
Relevance: 3.59
General evaluation: 3.78
No hallucinations: 3.46
Completeness: 3.34
Structure: 4.06
Fluidity: 4.08
Relevance: 3.72
General evaluation: 3.25
No hallucinations: 4.12
Completeness: 4.20
Structure: 4.01
Fluidity: 4.12
Relevance: 3.69
General evaluation: 3.64
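The exact rating procedure is not spelled out above, but as a rough illustration, the following sketch shows how individual 1-5 ratings per summary could be averaged into per-category scores like those listed. The category names and example ratings are assumptions for illustration only.

```python
# Sketch: averaging per-summary Likert ratings (1-5) into per-category means.
# The category keys and example ratings below are illustrative assumptions.
from statistics import mean

CATEGORIES = ["no_hallucinations", "completeness", "structure",
              "fluidity", "relevance", "general"]

def aggregate(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average the 1-5 ratings of all summaries for each category."""
    return {cat: round(mean(r[cat] for r in ratings), 2) for cat in CATEGORIES}

# Two invented ratings for one model:
example = [
    {"no_hallucinations": 5, "completeness": 4, "structure": 5,
     "fluidity": 5, "relevance": 4, "general": 4},
    {"no_hallucinations": 4, "completeness": 5, "structure": 5,
     "fluidity": 4, "relevance": 4, "general": 4},
]
print(aggregate(example))  # e.g. {'no_hallucinations': 4.5, ...}
```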
In terms of response times, the GPT models take the longest at 10-11 seconds. Claude v1, Claude 2 and Luminous Supreme Control sit in the middle and, at 6-7 seconds, are almost twice as fast. The locally run Llama2-7B-Chat model shows the shortest response times here, although slightly longer response times would have to be expected in a real application setup. Any translation of Llama2-7B-Chat's English summaries is also not reflected in the response times. A sketch of how such timings can be measured follows the figures below.
In principle, however, all response times are absolutely acceptable for the use case at hand. Quality is far more important here than speed.
Mean: 10.2 sec.
Median: 9.7 sec.
Mean: 11.54 sec.
Median: 11.16 sec.
Mean: 6.67 sec.
Median: 6 sec.
Mean: 6.87 sec.
Median: 7 sec.
Mean: 10.02 sec.
Median: 6.6 sec.
Mean: 3.58 sec.
Median: 3.18 sec.
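As a rough illustration of how these numbers can be obtained, the following sketch times one summarization call per transcript and reports mean and median wall-clock latency. The summarize callable is an assumption (for example, the summarize_transcript function from the sketch at the top of this section).

```python
# Sketch: measuring mean and median response time per summarization call.
# `summarize` is any function mapping a transcript to a summary (assumption).
import time
from statistics import mean, median
from typing import Callable

def measure_latency(summarize: Callable[[str], str],
                    transcripts: list[str]) -> tuple[float, float]:
    """Return (mean, median) wall-clock seconds per summarization call."""
    durations = []
    for transcript in transcripts:
        start = time.perf_counter()
        summarize(transcript)  # timing includes network latency for hosted models
        durations.append(time.perf_counter() - start)
    return round(mean(durations), 2), round(median(durations), 2)
```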
(for 109 transcripts)
The costs for GPT-3.5 Turbo are by far the lowest. Claude v1 and Claude 2 are around ten times more expensive, but at least Claude 2 also provides higher quality summaries. At around 2 cents per summary, GPT-4 is twice as expensive as Claude 2, but also delivers outstanding quality. Only Luminous Supreme Control is even more expensive at approx. 3.5 cents per summary.
Llama2-7B-Chat was run locally, so the price depends very much on the setup and the hardware used. In general, operating larger models is more expensive than smaller ones.
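The per-summary figures follow directly from the total fees listed below divided by the 109 transcripts; the short check below reproduces that arithmetic (the model-to-fee mapping is taken from the prose and the fee list in this section).

```python
# Back-of-the-envelope check: total API fee (EUR) divided by 109 transcripts.
fees_eur = {
    "GPT-4": 2.22,
    "GPT-3.5 Turbo": 0.092,
    "Claude 2": 1.18,
    "Claude v1": 0.76,
    "Luminous Supreme Control": 3.82,
}
for model, fee in fees_eur.items():
    print(f"{model}: {fee / 109 * 100:.2f} cents per summary")
# GPT-4 ~2.04, GPT-3.5 Turbo ~0.08, Claude 2 ~1.08, Claude v1 ~0.70,
# Luminous Supreme Control ~3.50 cents per summary
```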
Fee: €2.22 ($2.37), about 2 cents per summary
Fee: €0.092 ($0.098), about 0.08 cents per summary
Fee: €1.18 ($1.26), about 1 cent per summary
Fee: €0.76 ($0.81), about 0.7 cents per summary
Fee: €3.82, about 3.5 cents per summary
Fee: none (own GPUs)
Hosting: Europe via Azure
Hosting: Europe via Azure
Hosting: currently only US and UK, access from Europe only via VPN (as of 01/2024)
Hosting: currently only US and UK, access from Europe only via VPN (as of 01/2024)
Hosting: Germany
Hosting: VIER Frankfurt
In principle, the effort required for all cloud-based models was very low. The models are easy to use with a simple prompt and do not require any fine-tuning to generate high-quality summaries. Only Luminous Supreme Control required a little more prompt engineering: in the end, we had to insert an example into the prompt in order to obtain the desired summaries (see the prompt sketch below).
We also had to optimize the prompt somewhat for the Llama2-7B-Chat model, and the model had to be set up locally.
However, all of the efforts mentioned are low compared to the alternative of training a dedicated summarization model from scratch.
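As an illustration of the example-in-the-prompt approach mentioned above, the following sketch shows what such a one-shot summarization prompt could look like. The instruction wording and the example dialog are invented for illustration; they are not our production prompt.

```python
# Sketch of a one-shot prompt: a short worked example is prepended so that
# the model imitates the desired summary style. Wording and example dialog
# are illustrative assumptions.
ONE_SHOT_PROMPT = """\
Fasse Callcenter-Transkripte kurz und sachlich auf Deutsch zusammen.

Transkript:
Kunde: Ich habe meine Rechnung diesen Monat doppelt erhalten.
Agent: Ich storniere die zweite Rechnung und sende Ihnen eine Bestätigung per E-Mail.

Zusammenfassung:
Der Kunde meldet eine doppelt gestellte Rechnung. Der Agent storniert die zweite
Rechnung und kündigt eine Bestätigung per E-Mail an.

Transkript:
{transcript}

Zusammenfassung:"""

def build_prompt(transcript: str) -> str:
    """Insert the current transcript into the one-shot template."""
    return ONE_SHOT_PROMPT.format(transcript=transcript)
```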
Effort: low
Effort: low
Effort: low
Effort: low
Effort: rather low
Effort: medium
All summaries were very good; no failures. The prompt must be in German, otherwise the summaries are considerably worse.
All summaries were relatively good. No total failures.
Shorter summaries that were very much to the point. No failures. More grammatical errors than the other models. Where hallucinations occurred in the summaries, they involved numbers.
Some summaries were a shortened dialog rather than an actual summary. This could possibly be fixed via the prompt.
7 texts could not be evaluated because they were too long for the context window. Of the 102 remaining texts, the model produced 4 total failures in which it got caught in a loop or merely repeated parts of the prompt.
Given the size of the model, the English summaries were good. However, there were 9 summaries in which the model only repeated parts of the input, which is even more than Luminous Supreme Control. The remaining summaries, however, were of higher quality.
(imperfect German transcripts)
Recommendation: yes
Recommendation: yes
Recommendation: yes
Recommendation: yes
Recommendation: cautiously yes, if a customer absolutely wants a model hosted in Germany
Recommendation: cautiously yes, if a customer absolutely wants a model hosted by us