Main use cases: An embedding model that maps texts to vectors so that an input text can be compared with reference texts via a similarity score (see the sketch below). This can be used, for example, to implement search functions (e.g. in knowledge bases), intent recognition and text classification.
Input length: 8191 tokens (approx. 6143 words)
Languages: Mainly English. Also Spanish, French, German, Italian, Portuguese, Mandarin and probably many others.
Model size: ~350 million parameters
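To illustrate the basic workflow, here is a minimal sketch using the openai Python SDK (v1) and numpy; the reference texts and concern labels are invented for illustration and not taken from our test corpora:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(texts):
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented reference texts per concern, for illustration only
references = {
    "Change password": ["How can I change my password?"],
    "Parcel has not yet arrived": ["My parcel still has not been delivered."],
}

query = embed(["Where is my package? It was supposed to arrive last week."])[0]
for concern, texts in references.items():
    sims = [cosine(query, ref) for ref in embed(texts)]
    print(concern, round(max(sims), 3))
```

The input text is embedded once, compared against the embeddings of all reference texts, and the similarity scores then drive search, intent recognition or classification.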
The F1 scores for the individual concerns vary considerably (0.33-0.88). Only the concerns "Change password" and "Parcel has not yet arrived" show balanced recognition patterns with correspondingly high F1 scores. "Delete account or customer account" and "Transmit and record meter reading", in contrast, perform poorly to very poorly. In both cases the model over-generalizes: although all targets are recognized (high recall), many false hits are reported as well, resulting in comparatively low precision. The same pattern, though somewhat less pronounced, can be observed for "Please no more advertising" and "I would like to receive my money".
The opposite pattern, i.e. too little generalization, appears only for "Product defective/deficient". Although there are hardly any false positives here (precision of 81%), only slightly more than half (55%) of the actual targets are returned (low recall).
Overall, five out of seven concerns fall below an F1 score of 0.75, which roughly means that one in four hits is a false positive and one in four targets is not found. Without substantial improvement measures (training, fine-tuning, etc.), the model therefore has considerable difficulty recognizing customer concerns in emails.
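To make the arithmetic behind that rule of thumb explicit: F1 is the harmonic mean of precision and recall, so in the balanced case an F1 of 0.75 corresponds to precision and recall of 0.75 each:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.75, 0.75
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.75

# Precision 0.75: one in four reported hits is a false positive.
# Recall 0.75: one in four actual targets is missed.
```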
Different methods were compared in the tests. For the similarity metric, three variants were examined: the maximum similarity value per category, the average similarity value per category, and the mean of these two (a sketch of the variants follows below). In our test, the average similarity value produced the best results. We also examined the influence of the decision threshold on the similarity score, which proved optimal at 0.9. In addition, we tested two versions of the training/evaluation corpora: one constructed analogously to the OpenAI examples and one optimized based on our domain expertise. The optimized corpora produced better results. Finally, we investigated whether cleaning the emails and splitting them into individual sentences would improve results on the test set; these preprocessing measures did not improve concern recognition in our tests. The results shown above represent the best combination of the parameters described.
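As a rough sketch of how the three aggregation variants and the threshold could be combined (the similarity values are invented and the helper names are our own, not part of any library):

```python
import numpy as np

def category_score(sims, method="mean"):
    """Aggregate the similarity values between an input text and all
    reference texts of one category, using one of the three variants
    compared in our tests."""
    if method == "max":
        return float(np.max(sims))
    if method == "mean":
        return float(np.mean(sims))
    if method == "max_mean":  # mean of the two previous variants
        return (float(np.max(sims)) + float(np.mean(sims))) / 2
    raise ValueError(f"unknown method: {method}")

def classify(scores, threshold=0.9):
    """Return all categories whose aggregated score reaches the threshold."""
    return [cat for cat, s in scores.items() if s >= threshold]

# Invented similarity values for one email against two categories
scores = {
    "Change password": category_score([0.93, 0.88, 0.91]),
    "Delete account": category_score([0.86, 0.84, 0.89]),
}
print(classify(scores))  # -> ['Change password']
```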
In our concern-recognition tests, response times fluctuated considerably (0.002-4 seconds). On average, however, they were relatively short at 0.43 seconds, so the model is also suitable for real-time applications.
Concern recognition for the 790 texts cost €0.06, i.e. around 1 cent per 130 customer inquiries (without data cleansing). The costs for Ada embeddings are therefore very low.
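The per-cent figure follows directly from the total cost:

```python
total_cost_eur = 0.06
texts = 790
per_text = total_cost_eur / texts   # ≈ 0.000076 € per text
texts_per_cent = 0.01 / per_text    # ≈ 132 texts per cent
print(round(texts_per_cent))        # -> 132
```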
Despite the fairly short response times and the very low price, this model cannot be unreservedly recommended for recognizing concerns in German customer emails, given its mediocre quality. Using it can nevertheless be worthwhile when large volumes of text need to be (pre-)classified rather roughly, as the model is conveniently accessible via an API and the costs are very low. In addition, the quality could be improved through fine-tuning.