AI Benchmark

AI models

Here you will find an overview of the AI models we have tested. For a better overview, we have differentiated between open source and commercial models.

Commercial Models Open source models

AI Benchmark

AI models

Here you will find an overview of the AI models we have tested. For a better overview, we have differentiated between open source and commercial models.

Commercial Models Open source models

Commercial Models

Commercial models are generally only available via the interfaces of the respective providers and possibly their partners. Fine-tuning for special applications is also only possible for selected models.

GPT-3.5 Turbo

by OpenAI

Main use cases: Can be used for any task that requires speech generation. For example summaries, chatbots, voicebots, but also intent or sentiment recognition.

Input length: Two different models with 4,096 tokens (approx. 3,072 words) or 16,385 tokens (approx. 12,288 words)

Languages: 95 natural languages

Model size: 110 billion parameters

Evaluated in: VIER summarization, VIER intent detection, VIER dialog

See details

GPT-4

by OpenAI

Main use cases: Can be used for any task that requires speech generation. Very high quality for summaries, chat applications, concern and sentiment recognition as well as the generation of creative content, coding or general knowledge. Is able to evaluate the performance of other models in dialog tasks. Can also process images as input.

Input length: Three different models with 8,192 tokens (approx. 6,144 words), 32,769 tokens (approx. 24,576 words) and 128,000 tokens (over 300 pages of continuous text)

Languages: 95 natural languages, is better than GPT-3.5 Turbo in at least 26 languages

Model size: ~1.8 trillion parameters

Evaluated in: VIER summarization, VIER dialogs

See details

Luminous Supreme Control

by Aleph Alpha

Main use cases: The main task of the model is language generation. It can also recognize concerns and sentiment and formulate texts more optimally (e.g. simplified) if it is given a few examples. For chat applications, it needs to be fine-tuned beforehand so that there are no unwanted responses with discriminatory content or biases.

Input length: 2048 tokens (approx. 1536 words)

Languages: English, German, French, Italian and Spanish

Model size: ~70 billion parameters

Evaluated in: VIER summarization

See details

Claude 2

by Anthropic

Main use cases: Can be used similarly to GPT-4 for any form of language generation, for example for creative content creation, text summarization, text editing, in-depth dialog, understanding complex contexts or coding.

Input length: 100,000 tokens (approx. 300 pages of continuous text)

Languages: best in English, but also possible in at least 43 other languages

Model size: ~130 billion parameters

Evaluated in: VIER summarization

See details

Claude v1

by Anthropic

Main use cases: A general language generation model that can be used for any form of language generation, such as creative content creation, text summarization, text editing, advanced dialog or understanding complex contexts.

Input length: 9,000 tokens (approx. 6750 words)

Languages: Mainly English. Also Spanish and French. To a lesser extent also German, Italian, Portuguese and possibly other languages.

Model size: ~93 billion parameters (estimate, as unpublished)

Evaluated in: VIER summarization

See details

PaLM 2

by Google

Main use cases: Can generate speech in various contexts, for example for summaries, coding, dialog guidance, text editing or translation. Can also be used for generating creative content.

Input length: 8196 tokens (approx. 6100 words)

languages: more than 100 natural languages

Model size: ~340 billion parameters

Evaluated in: -

command

by Cohere

Main use cases: Model for speech generation that can be used for general chat tasks. It has been specially trained for business use cases such as summarizing or (re)formulating texts as well as information extraction or intent recognition.

Input length: 4096 tokens (approx. 3072 words)

Languages: Mainly English. To a lesser extent also German, French, Italian, Spanish and Arabic.

Model size: ~52 billion parameters

Evaluated in: -

Ada

by OpenAI

Main use cases: An embedding model that compares the input text with text references and calculates the similarity. This can be used, for example, to implement search functions (in knowledge databases), intent recognition and text classification.

Input length: 8191 tokens (approx. 6143 words)

Languages: Mainly English. Also Spanish, French, German, Italian, Portuguese, Mandarin and probably many others.

Model size: ~350 million parameters

Evaluated in: VIER intent detection

See details

Luminous Base Embedding

by Aleph Alpha

Input length: 2048 tokens (approx. 1536 words)

Languages: English, German, French, Italian and Spanish

Model size: ~13 billion parameters

Evaluated in: VIER intent detection

See details

Open Source Models

In principle, all open-source models can be operated on your own servers and fine-tuned for specific purposes. Some open source models are also available from cloud services via API.

Llama2-7B-Chat

by Meta

Main use cases: Model for language generation that can be used for general chat tasks. The model can also recognize the customer's most important intents and summarize texts in a simple way. However, the quality in these tasks is correspondingly poorer than with the larger alternatives.

Input length: 4096 tokens (approx. 3072 words)

Languages: English

Model size: ~7 billion parameters

Evaluated in: VIER intent detection

See details

Vicuna 1.5 7B

by Large Model Systems Organization

Main use cases: Model for speech generation, which can be used for general chat tasks and communicates in a human-like manner. The model is based on Llama2 and has been improved with chat conversations. Even if it can basically recognize the most important intents - and summarize texts, for example - the quality is limited by the size.

Input length: 4000 tokens (approx. 3000 words)

languages: predominantly English

Model size: ~7 billion parameters

Evaluated in: VIER summarization

Flan-t5-XXL

by Google

Main use cases: Model for speech generation, which can be used for translations, text summaries, sentiment analysis or intent recognition. The quality of language generation lags behind larger, more modern models, while intent recognition, for example, is similarly good.

Input length: 512 tokens (approx. 384 words) is basic, trained up to 2048 tokens (approx. 1536 words)

Languages: English, French, Romanian, German

Model size: ~11 billion parameters

Evaluated in: VIER intent detection

See details

FLAN-T5-XL

by Google

Input length: 512 tokens (approx. 384 words) is basic, up to 2048 tokens (approx. 1536 words) trained

Languages: English, French, Romanian, German

Model size: ~3 billion parameters

Evaluated in: VIER intent detection

See details

XLM-RoBERTa-large-XNLI

by Hugginface

Main use cases: Model that has been fine-tuned for intent recognition and text classification. It is based on the RoBERTa variation of the basic BERT model and can recognize intents even in complex emails if only the name of the intent is specified (zero shot).

Input length: 512 tokens (approx. 384 words)

Languages: English, French, German, Spanish, Greek & 10 others

Model size: ~355 million parameters

Evaluated in: VIER intent detection

See details

MPT-30b-Instruct

von MosaicML

Main use cases: A model for language generation that has been fine-tuned for the general answering of questions (instruction-tuned). This means that the model can be used for various tasks such as intent recognition, summarization, (re-)formulation or text classification.

Input length: 8192 tokens (approx. 6144 words)

Languages: predominantly English

Model size: ~30 million parameters

Evaluated in: -

Dolly2-12B

by Databricks

Main use cases: Model for language generation that has been fine-tuned for the general answering of questions (instruction-tuned). This means that the model can be used for various tasks such as concern recognition, summarization, (re-)formulation or text classification. This model is based on the open source LLM Pythia from EleutherAI. The basic model is only trained for English, so it cannot currently be used for German-language use cases.

Input length: 2048 tokens (approx. 1536 words)

Languages: English

Model size: ~12 million parameters

Evaluated in: -

mDeBERTa-v3-base-mnli-xnli

by Vrije Universiteit Amsterdam

Main use cases: A multilingual model based on a further development of the BERT architecture. It was explicitly trained for concern recognition and text classification without examples (zero shot).

Input length: 512 tokens (approx. 384 words) - theoretically 24,528 tokens, but considerable speed losses are to be expected above 512 tokens.

Languages: Evaluated for 15 languages, including English and German. To a lesser extent 85 other languages.

Model size: ~86 million parameters

Evaluated in: VIER intent detection

See details

BART-large-MNLI

by Meta

Main use cases: A general language model based on the Transformer architecture. It was explicitly trained for intent recognition and text classification without examples (zero shot).

Input length: 1024 tokens (approx. 768 words)

Languages: predominantly English

Model size: ~407 million parameters

Evaluated in: VIER intent detection

See details

Koala 13b

by Berkeley Artificial Intelligence Research – BAIR

Main use cases: A general language generation model based on Llama2 and fine-tuned for use in chat conversations. It can be used for dialog systems, text summarization or concern detection in English.

Input length: 4096 tokens (approx. 3072 words)

languages: predominantly English

Model size: ~13 billion parameters

Evaluated in: -

Falcon-40B-Instruct

by Technology Innovation Institute – TII

Main use cases: Model for language generation that has been fine-tuned to answer general questions (instruction-tuned). This means that the model can be used for various tasks such as concern recognition, summarization, (re-)formulation or text classification. However, possible use cases are limited by the rather short input length.

Input length: 2048 tokens (approx. 1536 words)

Languages: Mainly English. Also German, Spanish and French. To a lesser extent also Italian, Portuguese, Polish, Dutch, Romanian, Czech and Swedish.

Model size: ~40 billion parameters

Evaluated in: -

Falcon-180B-Chat

by Technology Innovation Institute – TII

Main use cases: Model for speech generation that can be used for general chat tasks. The model can be used for dialog systems, concern recognition or summaries. However, possible use cases are limited by the rather short input length.

Input length: 2048 tokens (approx. 1536 words)

Languages: Mainly English. Also German, Spanish, French. To a lesser extent also Italian, Portuguese, Polish, Dutch, Romanian, Czech and Swedish

Model size: ~180 billion parameters

Evaluated in: -

Llama2-13B-Chat

by Meta

Main use cases: Model for language generation that can be used for general chat tasks. The model can also recognize the customer's most important concerns and summarize texts in a simple way. The quality is better than the smaller alternative (Llama2-7B-Chat), depending on the size.

Input length: 4096 tokens (approx. 3072 words)

Languages: Mainly English. To a lesser extent also Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese (Mandarin), Japanese, Korean, Arabic, Hebrew, Hindi, etc.

Model size: ~13 billion parameters

Evaluated in: -

Guanaco 65B

by University of Washington - Seattle

Main use cases: Guanaco 65B is interesting because, thanks to a very efficient approach (QLoRA), a model of this size can be fine-tuned on very limited hardware. The result is a model for speech generation that solves general chat tasks in good quality. However, due to the license of the basic model (Llama1-65B), commercial use is excluded.

Input length: 2048 tokens (approx. 1536 words)

Languages: Mainly English. Additionally Spanish. To a lesser extent also Russian, German, French, Chinese, Thai, Brazilian Portuguese, Catalan, etc.

Model size: ~65 billion parameters

Evaluated in: -

Only non-commercial use permitted!