Main use cases: The model's primary task is language generation. Given a few examples, it can also recognize intents and sentiment and rewrite texts (e.g. in simplified form). For chat applications, it must first be fine-tuned to avoid unwanted responses containing discriminatory content or biases.
Input length: 2048 tokens (approx. 1536 words)
Languages: English, German, French, Italian and Spanish
Model size: ~70 billion parameters
The annotators rated the quality of the Luminous Supreme Control summaries significantly lower than that of human summaries of the same transcripts. Only in the fluency category, which measures whether the text is pleasant to read, did the model achieve values comparable to the human summaries. In particular, the model shows a fairly high proportion of hallucinations, omits relevant parts of the texts, and produces overly long summaries, and is therefore rated significantly worse overall.
In addition (not visible in the overview), the model failed completely several times when asked to summarize relatively long dialogues: for 7 of the 109 texts, it generated no summary at all. The increased input length was nevertheless necessary because an example summary had to be included in the prompt to raise the quality; this in turn limits the length of the transcripts that can still be summarized. For three texts, the model got stuck in a loop and merely repeated individual sentences from the transcripts.
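The trade-off described above can be sketched as a simple token-budget calculation: the one-shot example consumes part of the 2048-token context window, shrinking the space left for the actual transcript. All token counts below are assumed for illustration; real counts depend on the tokenizer and prompt wording.

```python
# Illustrative token-budget check. CONTEXT_WINDOW matches the model's
# stated input length; the other numbers are hypothetical.
CONTEXT_WINDOW = 2048  # Luminous input length in tokens


def remaining_transcript_budget(instruction_tokens: int,
                                example_tokens: int,
                                reserved_output_tokens: int) -> int:
    """Tokens left for the transcript once the instruction text, the
    one-shot example, and the expected output are accounted for."""
    return (CONTEXT_WINDOW - instruction_tokens
            - example_tokens - reserved_output_tokens)


# Hypothetical numbers: a 50-token instruction, a 700-token example
# (transcript plus summary), and 300 tokens reserved for the generated
# summary leave under 1000 tokens (~750 words) for the transcript itself.
print(remaining_transcript_budget(50, 700, 300))  # -> 998
```

This is why the one-shot prompt, while improving quality, caps the length of transcripts that can be processed at all.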
As of our testing in 07/2023, we can recommend the model for imperfect German transcripts only to a very limited extent, e.g. when the criterion of hosting in Germany outweighs the quality of the summaries. As mentioned above, the content to be summarized should also not be too long.
Our tests showed considerable fluctuation in response times, so the median differs greatly from the mean. With a median of 6.6 seconds, responses were generally faster than with OpenAI, for example. At the same time, there was at least one extremely long response time of more than 2 minutes. For the summarization use case, latency does not seem very critical, unless the model is used in an agent's day-to-day work and a similar spike occurs roughly every 100 transcripts; a 2-minute wait is then very long.
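The gap between median and mean is exactly what a single extreme outlier produces. The latencies below are illustrative, not our measured data; they only show how one 2-minute response inflates the mean while barely moving the median.

```python
import statistics

# Hypothetical latencies in seconds: seven typical responses around the
# reported 6.6 s median, plus one 125 s outlier.
latencies = [5.8, 6.2, 6.4, 6.6, 6.6, 7.1, 7.4, 125.0]

median = statistics.median(latencies)  # robust to the outlier
mean = statistics.mean(latencies)      # pulled up by the outlier

print(f"median={median:.1f}s mean={mean:.1f}s")  # -> median=6.6s mean=21.4s
```

This is why the median is the more honest summary statistic for latency distributions with rare, extreme spikes.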
Luminous Supreme Control costs €0.044 per 1000 input tokens and €0.048 per 1000 output tokens. Aleph Alpha bills in credits, at €0.238 per credit. Summarizing all 109 transcripts cost 16.07 credits, i.e. €3.82 in total, or about 3.5 cents per summary.
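The per-summary figure follows directly from the reported credit total. A quick check of the arithmetic (the per-token prices are not needed here, since we work from the credit total stated above):

```python
# Cost figures as reported in the text.
CREDIT_PRICE_EUR = 0.238
credits_used = 16.07
n_transcripts = 109

total_eur = credits_used * CREDIT_PRICE_EUR
per_summary_eur = total_eur / n_transcripts

print(f"total: €{total_eur:.2f}")                          # -> total: €3.82
print(f"per summary: {per_summary_eur * 100:.1f} cents")   # -> per summary: 3.5 cents
```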
The high costs are partly due to the fact that the prompt had to be significantly longer at the time of testing than for the other models tested. This was necessary because the prompt had to include a sample summary to increase the quality of the summaries. A further transcript including a summary was therefore sent with each request.
The model is hosted in Germany on Aleph Alpha's own servers. This is a big plus.
While the speed and hosting of the model in Germany clearly speak in favor of using the product, the quality was not yet optimal at the time of testing. In addition, the price was very high due to the need to include an example in the prompt. Accordingly, we would only recommend this model for imperfect German transcripts if hosting in Germany has the highest priority.