Main use cases: A language-generation model suitable for general chat tasks. It can also recognize a customer's most important intents and produce simple summaries. However, its quality in these tasks is correspondingly lower than that of larger alternatives.
Input length: 4096 tokens (approx. 3072 words)
Languages: English
Model size: ~7 billion parameters
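The stated word estimate follows the common rule of thumb of roughly 0.75 English words per token (4096 × 0.75 ≈ 3072). A minimal sketch, assuming access to the gated Hugging Face checkpoint meta-llama/Llama-2-7b-chat-hf, for checking whether a transcript fits the context window before sending it to the model:

```python
from transformers import AutoTokenizer

# Assumption: the public (gated) Hugging Face checkpoint; any tokenizer
# with the same Llama2 vocabulary would give identical counts.
MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
CONTEXT_WINDOW = 4096  # tokens, per the model spec above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def fits_context(transcript: str, reserved_for_output: int = 512) -> bool:
    """Check whether a transcript leaves room for the generated summary.

    `reserved_for_output` is an illustrative output budget,
    not a value taken from the evaluation.
    """
    n_tokens = len(tokenizer.encode(transcript))
    return n_tokens <= CONTEXT_WINDOW - reserved_for_output

print(fits_context("Agent: Guten Tag, wie kann ich Ihnen helfen? ..."))
```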
The quality of Llama2-7B-Chat's summaries is above average in all evaluation categories, i.e. it produces fluent, concise summaries with correct content. Overall, Llama2-7B-Chat reaches a level similar to the human-written reference summaries. In the "fluency" category, i.e. how pleasant the texts are to read, it even scores slightly better than the reference; in all other categories it performs slightly worse. Its biggest challenges here are generating well-structured summaries (structure) that contain only the important aspects and are as short as possible (relevance). The proportion of incorrect information in its summaries is only slightly higher than in the reference (hallucinations). However, in nine cases (9%) the model failed completely, merely repeating the dialog instead of generating a summary. In addition, Llama2 produces only English summaries of the German source texts, which may need to be translated depending on the application.
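Since the observed failure mode was the model echoing the dialog instead of summarizing it, such outputs can be flagged automatically before they reach downstream systems. A minimal heuristic sketch, not part of the original evaluation, that treats high verbatim overlap between input and output as a failed generation:

```python
from difflib import SequenceMatcher

def is_echo_failure(dialog: str, summary: str, threshold: float = 0.8) -> bool:
    """Flag outputs that largely repeat the input dialog.

    The 0.8 similarity threshold is an illustrative assumption,
    not a value from the evaluation.
    """
    ratio = SequenceMatcher(None, dialog.strip(), summary.strip()).ratio()
    return ratio >= threshold

dialog = "Customer: My router keeps disconnecting. Agent: Have you restarted it? ..."
print(is_echo_failure(dialog, dialog))  # True: output merely repeats the input
print(is_echo_failure(dialog, "Customer reports recurring router disconnects."))  # False
```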
Overall, however, the quality is very good and almost comparable to human summaries. From a quality perspective, the model shows great potential for summarizing imperfect German-language transcripts.
Response times in our test were very good for this area of application, averaging around 3 seconds. Note, however, that these values were measured against a locally hosted model that we queried directly; when the model is accessed via an API in a live setting, response times will be somewhat higher.
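For reference, a minimal sketch of how such latency figures can be collected. The `generate_summary` function is a hypothetical stand-in for the actual serving client:

```python
import time
from statistics import mean

def generate_summary(dialog: str) -> str:
    # Hypothetical stand-in for the locally hosted Llama2-7B-Chat call;
    # replace with the real serving client when measuring.
    time.sleep(0.01)  # simulated model latency for this sketch
    return "summary"

def mean_response_time(dialogs: list[str]) -> float:
    """Mean wall-clock time per request, in seconds."""
    timings = []
    for dialog in dialogs:
        start = time.perf_counter()
        generate_summary(dialog)
        timings.append(time.perf_counter() - start)
    return mean(timings)

print(f"avg response time: {mean_response_time(['example dialog'] * 5):.2f}s")
```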
This model was run locally on our servers, so there were no direct costs. In practice, the price depends heavily on the setup and the hardware used. In general, larger models are more expensive to run than smaller ones; with ~7 billion parameters, Llama2-7B-Chat is on the smaller end of current models.
Hosting is done on our research GPUs at VIER in Frankfurt.
Based on these results, we can give only a limited recommendation for this model. Although the quality of the summaries is very good and response times are very short, the model produces only English summaries and occasionally fails completely. It could still be interesting where hosting by VIER itself is strictly required, for example for data protection reasons.