by Aleph Alpha
Main use cases: An embedding model that compares input text with reference texts and calculates their similarity. This can be used, for example, to implement search functions (e.g. in knowledge bases), intent recognition and text classification (see the sketch after this overview).
Input length: 2048 tokens (approx. 1536 words)
Languages: English, German, French, Italian and Spanish
Model size: ~13 billion parameters
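To make the use case concrete, here is a minimal sketch of embedding-based concern recognition: each category is represented by a few reference texts, and an incoming e-mail is assigned to the category with the most similar reference. The `embed` function is a toy stand-in for the actual embedding API call (e.g. Aleph Alpha's semantic embedding for Luminous Base); the category names and reference texts are illustrative assumptions, not our test data.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for the real embedding API call.

    In practice this would call the provider's endpoint and return the
    text's embedding vector. Here a vector of character counts merely
    makes the sketch executable.
    """
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A few reference texts per concern category (illustrative examples)
references = {
    "Change password": ["How do I reset my password?"],
    "Delete account": ["Please delete my customer account."],
}

def recognize_concern(email_text: str) -> str:
    """Assign the e-mail to the category with the most similar reference text."""
    email_vec = embed(email_text)
    scores = {
        category: max(cosine_similarity(email_vec, embed(ref)) for ref in refs)
        for category, refs in references.items()
    }
    return max(scores, key=scores.get)

print(recognize_concern("I forgot my password, can you reset it?"))
```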
The F1 values for the individual concerns vary greatly (0.24 to 0.90). Only the concern "Change password" shows a balanced recognition pattern with a correspondingly high F1 value. In contrast, "Delete account or customer account" and "Please no more advertising" perform poorly to very poorly. In both cases the model over-generalizes: although all targets are recognized (high recall, gold), many false hits are also reported, so precision (purple) is comparatively low. We find the opposite pattern, i.e. under-generalization, for "Product defective/deficient" and, to a lesser extent, for "Transmit meter reading, record". Although no false-positive hits are returned here (precision at 100%, purple), only 39% and 70% of the actual targets, respectively, are found (low recall, gold).
Overall, three of the seven concerns fall below an F1 value of 0.75, which essentially means that one in four hits is a false positive and one in four targets is not found. Without major improvement measures (training, fine-tuning, etc.), this model therefore has considerable difficulty recognizing customers' concerns in their e-mails.
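As a quick plausibility check of that interpretation, here is the arithmetic with hypothetical counts: if precision and recall are both 0.75, the F1 value is also 0.75.

```python
# Hypothetical counts for one concern category (purely illustrative)
tp = 30  # targets correctly recognized
fp = 10  # reported hits that were not actual targets
fn = 10  # actual targets that were missed

precision = tp / (tp + fp)                          # 0.75 -> 1 in 4 hits is a false positive
recall = tp / (tp + fn)                             # 0.75 -> 1 in 4 targets is not found
f1 = 2 * precision * recall / (precision + recall)  # 0.75
```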
We varied the following parameters in the test. For the similarity metric, three variants are available: the maximum similarity value per category, the average similarity value per category, or the mean of the two previous values. In our test, the last variant (the mean of maximum and average) produced the best results. In addition, we tested two versions of the training/evaluation corpora: one constructed analogously to Aleph Alpha's examples and one optimized according to our own expertise. The optimized corpora produced better results. Finally, we investigated whether cleaning up the e-mails and splitting them into individual sentences leads to better results on the test set; these pre-processing measures did not improve concern recognition in our tests. The results shown above represent the best combination of the parameters described.
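The three aggregation variants can be sketched as follows. The function and parameter names are our own, and the reference vectors are assumed to be L2-normalized so that the dot product equals cosine similarity.

```python
import numpy as np

def category_score(email_vec: np.ndarray, reference_vecs: np.ndarray,
                   strategy: str = "mean_of_both") -> float:
    """Aggregate the similarities between an e-mail and one category's references.

    reference_vecs holds one L2-normalized embedding per reference example,
    so the matrix-vector product yields the cosine similarities directly.
    """
    sims = reference_vecs @ email_vec
    if strategy == "max":   # maximum similarity value per category
        return float(sims.max())
    if strategy == "avg":   # average similarity value per category
        return float(sims.mean())
    # mean of the two previous methods (best variant in our test)
    return float((sims.max() + sims.mean()) / 2)
```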
With an average of 0.12 seconds, the response times are short and in principle also suitable for real-time applications.
Median: N/A
Mean value: 0.12 sec.
Minimum: N/A
Maximum: N/A
Concern recognition for the 790 texts cost €1.11, i.e. around 1 cent per 7 customer e-mails (without data cleansing). For a pure embedding model, the costs of Luminous Base Embedding are therefore rather high.
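For reference, the cost figure breaks down as follows (simple arithmetic on the numbers above):

```python
total_cost_eur = 1.11
num_texts = 790

cost_per_text = total_cost_eur / num_texts             # ~0.0014 EUR per e-mail
emails_per_cent = num_texts / (total_cost_eur * 100)   # ~7.1 e-mails per cent
```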
Despite the fairly short response times, we cannot give a clear product recommendation for this model for recognizing the concerns in German customers' e-mails, given the at best mediocre quality and the comparatively high price. The model may nevertheless be worthwhile if you need a model hosted in Germany that can be addressed via an API, provided the quality can be improved through fine-tuning.