We also selected a use case for investigating concern recognition that can later be put into practice with our customers.
Data basis
790 email inquiries from end customers to customer service serve as the data basis. The inquiries vary widely, ranging from formal to colloquial, and some contain entire email conversations (threads). The emails are written in German; only four English emails are included, all belonging to concern #4 (see below). The 790 mails cover the following seven concerns:
- Concern #1 "Product defective", 239 examples
- Concern #2 "Package not received", 245 examples
- Concern #3 "I would like to receive my money", 242 examples
- Concern #4 "How can I change my password", 28 examples
- Concern #5 "I would like to send my meter reading", 10 examples
- Concern #6 "Please delete my account", 11 examples
- Concern #7 "Please stop sending me advertising", 29 examples
As can be seen, not all concerns are represented equally often. With over 200 examples each, the first three concerns test how reliably recognition copes with linguistic variation; their results can therefore be interpreted as more representative. Furthermore, 14 mails contain more than one of the seven concerns, with a maximum of three concerns per mail.
All potentially interesting models were subjected to a pre-test with 5 messages. Only models that achieved acceptable results in the pre-test were considered for the detailed tests.
Depending on a model's availability, we either tested concern detection via a web API or set the model up locally on our own servers. In each case, the 790 emails were passed to the model with the instruction to state whether one of the seven concerns mentioned above is present in the respective message. If one of the concerns is found, the corresponding number is to be output; otherwise a 0, as an explicit response that none of the concerns was found.
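The setup described above can be sketched as follows. The prompt wording, function names, and the answer-parsing rule are illustrative assumptions, not the exact pipeline used in the tests.

```python
# Sketch of the classification setup: each email is sent to a model with an
# instruction to answer with the concern number 1-7, or 0 if none applies.
# (Concern names taken from the list above; all other details are assumed.)

CONCERNS = {
    1: "Product defective",
    2: "Package not received",
    3: "I would like to receive my money",
    4: "How can I change my password",
    5: "I would like to send my meter reading",
    6: "Please delete my account",
    7: "Please stop sending me advertising",
}

def build_prompt(email_text: str) -> str:
    """Zero-shot instruction: list the concern names, ask for a number or 0."""
    concern_list = "\n".join(f"{i}: {name}" for i, name in CONCERNS.items())
    return (
        "Decide which of the following concerns (if any) the email contains.\n"
        f"{concern_list}\n"
        "Answer with the matching number, or 0 if none applies.\n\n"
        f"Email:\n{email_text}"
    )

def parse_answer(raw: str) -> int:
    """Reduce a free-form model answer to a single label between 0 and 7."""
    for token in raw.replace(",", " ").replace(".", " ").split():
        if token.isdigit() and 0 <= int(token) <= 7:
            return int(token)
    return 0  # unparsable answers count as "no concern found"
```

In a real run, `build_prompt` would be sent to the hosted or local model for each of the 790 emails and `parse_answer` applied to its reply.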
Depending on the model, we varied some of the following parameters. Where this is relevant, it is explicitly mentioned in the discussion of the respective model's results.
Prompt variants:
- ML setup: We initially test all models zero-shot. This means that we do not give the model any example of a concern, but simply state the name of the respective concern, such as "Package not received" or "Product defective". The models are then tested with one example (one-shot). Some models, in particular embedding models, are also tested with multiple examples.
- Multiple response: We test how the results are affected if we explicitly tell the model that a message can contain several concerns, all of which are to be output. In the comparison case, we do not mention whether a multiple response is possible or desired.
- Phrasing of concerns: For some models, we tested different phrasings of the concerns, as some models presumably had difficulties with negated statements, e.g. "Please do not send me any more advertising" vs. "I would like to unsubscribe from advertising".
Model settings:
- Temperature: The temperature determines how "creative" a generative model is when processing a request. In principle, concern recognition requires little creativity, so a low temperature is appropriate.
- Similarity threshold: With embedding models, the embedding of the input is compared with the embeddings of the learned categories. A comparison counts as a hit if the similarity exceeds the threshold. Optimizing this threshold therefore leads to better results.
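The threshold decision for embedding models can be illustrated with a toy example. The vectors and the threshold value below are made up; a real system would obtain the embeddings from a trained model.

```python
# Toy illustration of the similarity-threshold decision in an embedding
# setup. Vectors and threshold are invented for demonstration.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(email_vec, concern_vecs, threshold=0.75):
    """Return every concern id whose embedding is similar enough, else [0]."""
    hits = [cid for cid, vec in concern_vecs.items()
            if cosine(email_vec, vec) >= threshold]
    return hits or [0]  # 0 = explicit "no concern found"
```

Raising the threshold trades recall for precision, which is why tuning it affects the results.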
Result criteria
The evaluation is calculated automatically by comparing the model's answers with the annotation of the test data. We present this comparison using the following three classic evaluation criteria:
- Precision: High if the model makes many correct positive predictions and few false positive predictions.
- Recall: Also known as sensitivity; measures how many of the actual positives were recognized. If every example were classified as positive, this measure would be 1.
- F1 score: The harmonic mean of precision and recall.
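The three criteria can be computed per concern from the gold annotation and the model predictions. A minimal sketch, assuming both are given as sets of concern ids per mail (to accommodate the mails with multiple concerns):

```python
# Per-concern precision, recall, and F1 score from gold labels and model
# predictions, each given as one set of concern ids per mail.

def scores(gold, pred, concern_id):
    """Compute (precision, recall, f1) for one concern id."""
    tp = sum(1 for g, p in zip(gold, pred) if concern_id in g and concern_id in p)
    fp = sum(1 for g, p in zip(gold, pred) if concern_id not in g and concern_id in p)
    fn = sum(1 for g, p in zip(gold, pred) if concern_id in g and concern_id not in p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, predicting concern #2 for a mail that does not contain it lowers precision, while missing concern #2 in a mail that does contain it lowers recall; the F1 score balances the two.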