What data can be used to train socio-emotional dialogue generation systems?

27/02/2024

Lorraine Vanel
PhD student at Zaion and Télécom-Paris on the topic of natural language utterance generation for socially competent agents.
I also act as part-time culinary manager and caterer at Zaion Lab - someone's got to keep those brains on their toes.

Introduction

The success of deep learning technologies, supported by the growing accessibility of ever larger data corpora, has contributed to the development of increasingly advanced dialogue generation systems. At Zaion, we research these more advanced models with a view to applying them to customer service.

Beyond questions of grammar or the fluency of the generated content, human-machine interaction presents one main challenge for the conversational agent: coherence. This involves several secondary problems, such as:

  • Logical consistency (relevance of the generated response to the conversation history)
  • Persona consistency (consistency of the bot's behavior with its behavior earlier in the interaction)
  • Social coherence (compliance of the response with accepted social rules and etiquette)
  • Emotional consistency (relevance of the response to the user's emotional state)

Consistency is important to avoid situations where users find the bot "strange". A task-oriented agent often represents an entity providing a service; as such, it must be reliable from a business point of view, but also offer a pleasant social interaction. Although socio-emotional bots exist in task-oriented applications, most of these systems are designed and used for the open domain.

While the logical consistency criterion can largely be addressed with model scale and computing power, the other three criteria are much less studied, even though numerous works have shown that emotionally aware behavior benefits the overall user experience.

However, complex concepts such as emotion are tricky to annotate because of the subjectivity of the task, which makes it difficult to obtain relevant and reliable data. This is fast becoming a major challenge in deep learning, where data is at the heart of how systems learn and huge amounts of it are required.

In this series of two blog posts, I will present the different methods used in the literature to design the collection and annotation of socio-emotional data:

  1. How are socio-emotional data collected and annotated?
  2. What are the interesting socio-emotional strategies studied?

 

How are socio-emotional data collected and annotated?

 

In deep learning, data is crucial, since it is from data that systems learn their representations. At Zaion, for example, we can use our customers' data (within the limits of our contracts) and have it labeled by our expert annotation team to feed the training of our language models.

In this first article, I will review different ways of collecting and annotating conversational data that do not necessarily require access to such professional resources. These approaches are used in the socio-emotional conversational systems literature, and the datasets I'll mention are mostly corpora of textual data, sometimes transcribed from audio sources.

Data collection and annotation

We will review three main collection approaches and associated annotation methods that we have observed in the literature.

Crowdsourcing

Collection: Crowdsourcing, applied to data collection, is a participatory method in which a group of people contributes to the creation of data samples. Crowdsourced data are generally human-to-human (H-H) interactions. They usually involve a speaker, who conveys an emotion, and a listener, who must respond appropriately; dialogue systems are then trained to act as listeners. Data is collected by having the two speakers (crowd workers) interact according to set guidelines. For empathetic dialogues, speakers are invited to start the conversation from an emotional prompt, while listeners have to adapt their responses to the context presented by their interlocutor, without any prior knowledge of the stimulus or the situation. An example of a dataset constructed this way is ESConv.

Derived annotations: For crowdsourcing, the labels (emotions and dialogue strategies) associated with the data are derived directly from the instructions given to the crowd workers. In addition, surveys submitted to workers during the collection process, on both the listener and speaker sides, can yield further annotations, such as empathy assessments and utterance-level dialogue strategies.
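
To make this concrete, here is a minimal sketch of how such prompt-derived labels and survey answers might be attached to a crowdsourced sample. The field names are illustrative and do not reproduce the exact schema of ESConv or similar corpora.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Utterance:
    speaker: str                     # "speaker" (help-seeker) or "listener" (supporter)
    text: str
    strategy: Optional[str] = None   # e.g. "Question", "Reflection of feelings" (listener side)

@dataclass
class CrowdsourcedDialogue:
    prompt_emotion: str              # label derived from the emotional prompt given to the speaker
    situation: str                   # free-text situation written by the speaker
    turns: List[Utterance] = field(default_factory=list)
    empathy_rating: Optional[int] = None   # post-hoc survey score given by the speaker

# The labels come "for free" from the collection protocol itself.
dialogue = CrowdsourcedDialogue(
    prompt_emotion="anxious",
    situation="I have a job interview tomorrow and I can't stop worrying.",
    turns=[
        Utterance("speaker", "I have a big interview tomorrow and I'm panicking."),
        Utterance("listener", "That sounds stressful. What worries you the most about it?",
                  strategy="Question"),
    ],
    empathy_rating=5,
)
```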

Crawled from online sources

Collection: Another common way of collecting data is crawling, in other words, extracting information from online sources. For textual data, this often means posts and comments retrieved from social networks, which are natural conversations between humans. Data can also come from other sources (such as OpenSubtitles), where it is scripted. The data extracted from these websites is generally unlabeled, so annotation processes need to be designed to label these corpora.
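
As an illustration, here is a minimal crawling sketch against Reddit's public JSON listing. In practice, rate limits, authentication and the platform's terms of use have to be handled; this snippet deliberately ignores them.

```python
import requests

def fetch_posts(subreddit: str, limit: int = 25) -> list[dict]:
    """Fetch recent posts from a subreddit via Reddit's public JSON listing."""
    url = f"https://www.reddit.com/r/{subreddit}/new.json"
    headers = {"User-Agent": "research-crawler/0.1"}  # Reddit rejects requests without a user agent
    response = requests.get(url, headers=headers, params={"limit": limit}, timeout=10)
    response.raise_for_status()
    listing = response.json()["data"]["children"]
    return [
        {"id": post["data"]["id"],
         "subreddit": subreddit,
         "title": post["data"]["title"],
         "text": post["data"]["selftext"]}
        for post in listing
    ]

posts = fetch_posts("happy", limit=10)  # raw, unlabeled text to be annotated later
```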

Manual annotations: When datasets are small, or when the research team has the material resources, data can be fully annotated by human experts or by annotators trained for the specific annotation task. DailyDialog, for instance, was annotated by 3 experts with a good understanding of dialogue and communication theory, trained on the guidelines of the task (annotation of dialogue acts and emotions).

Semi-automatic annotations: Manual annotation is often combined with algorithms to speed up the work and lighten the load on human judges. This hybrid approach is known as semi-automatic annotation. Typically, the first step is to have human judges annotate a small fraction of the collected dialogues. To support these judges, the research team behind the EDOS dataset used a BERT-based model, trained on another dataset (EmpatheticDialogues), that proposes the three most likely emotion labels for each conversation. Rather than choosing among the 42 available labels, human judges select one of these 3 suggestions, with the option of picking a different label if necessary. The second and final step is to use this manually annotated data to train a classifier that automatically annotates the rest of the collected data.
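
Below is a sketch of the first step of such a pipeline: a pretrained emotion classifier proposes the top-3 labels, which a human judge then confirms or overrides. The checkpoint is a publicly available emotion classifier standing in for the EDOS authors' model, used purely for illustration.

```python
from transformers import pipeline

# Publicly available emotion classifier standing in for the EDOS authors' model,
# which was a BERT-based classifier fine-tuned on EmpatheticDialogues.
suggester = pipeline("text-classification",
                     model="j-hartmann/emotion-english-distilroberta-base")

def suggest_labels(utterance: str) -> list[str]:
    """Return the 3 most likely emotion labels to show to a human judge."""
    scores = suggester([utterance], top_k=3)[0]   # list of {"label": ..., "score": ...} dicts
    return [item["label"] for item in scores]

candidates = suggest_labels("I just found out my flight got cancelled again.")
# The judge picks one of these 3 candidates (or another label if none fits);
# the confirmed subset then trains the classifier that labels the remaining data.
```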

Derived annotations: We can also exploit the context in which the web data was posted. For the PEC dataset, for example, posts and comments were extracted from two subreddits: happy and offmychest. The subreddit of origin therefore provides a label, and all that remains is to perform quality control by asking human annotators to label a small set of conversations (in the case of PEC, 100 from the happy subreddit, 100 from offmychest and 100 from a control subreddit, casualconversations).
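
A minimal sketch of this kind of derived annotation, assuming each crawled conversation keeps track of its subreddit of origin; the set of sources and the per-source sample size are illustrative.

```python
import random

KNOWN_SOURCES = {"happy", "offmychest"}  # subreddits whose name serves directly as the label

def derive_labels(conversations: list[dict]) -> list[dict]:
    """Use the subreddit of origin as the conversation label."""
    return [
        {**conv, "label": conv["subreddit"]}
        for conv in conversations
        if conv["subreddit"] in KNOWN_SOURCES
    ]

def quality_control_sample(conversations: list[dict], per_source: int = 100) -> list[dict]:
    """Draw a fixed number of conversations per source for human verification."""
    sample = []
    for source in {c["subreddit"] for c in conversations}:
        pool = [c for c in conversations if c["subreddit"] == source]
        sample.extend(random.sample(pool, min(per_source, len(pool))))
    return sample
```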

Retrieved from deployed services

Collection: When services such as customer care or chatbots are already deployed, their logs can be collected and compiled into a corpus. Most of the time this concerns human-machine interactions, but it can also apply to human-to-human conversations (such as call center data).

Manual / semi-automatic annotation: This type of data can use the same annotation methods as crawled data: human annotation, possibly assisted by the algorithmic approaches described above. For the EmoContext dataset, 50 human annotators manually annotated 300 dialogues for each of the 4 classes, and each dialogue was reviewed by 7 judges. These annotated dialogues were then embedded as vectors and used with cosine similarity thresholds to find similar occurrences in the unannotated data pool; the results were then checked by human judges, who made the final decision.
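
The snippet below sketches this similarity-based label propagation with a generic sentence encoder and an arbitrary threshold; it illustrates the principle rather than the EmoContext authors' exact pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic sentence encoder, for illustration

def propagate_labels(labeled: list[dict], unlabeled: list[str],
                     threshold: float = 0.85) -> list[dict]:
    """Propose for each unlabeled dialogue the label of its most similar annotated neighbor,
    keeping only proposals above the cosine-similarity threshold for human review."""
    ref_vecs = encoder.encode([d["text"] for d in labeled], normalize_embeddings=True)
    new_vecs = encoder.encode(unlabeled, normalize_embeddings=True)
    proposals = []
    for text, vec in zip(unlabeled, new_vecs):
        sims = ref_vecs @ vec                    # cosine similarity (embeddings are normalized)
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            proposals.append({"text": text,
                              "label": labeled[best]["label"],
                              "needs_review": True})  # human judges make the final decision
    return proposals
```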

User feedback: Some bots in production can ask for feedback on customer satisfaction, either directly or via surveys. This information can be used to annotate certain conversations.
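
For example, assuming a 1-to-5 satisfaction score is attached to each conversation log, such feedback can be turned into weak conversation-level labels; the threshold below is illustrative.

```python
def label_from_feedback(conversations: list[dict]) -> list[dict]:
    """Turn post-conversation satisfaction scores (e.g. a 1-5 CSAT survey) into weak labels."""
    labeled = []
    for conv in conversations:
        score = conv.get("csat_score")
        if score is None:          # no feedback was given, leave the conversation unlabeled
            continue
        conv_label = "satisfied" if score >= 4 else "unsatisfied"
        labeled.append({**conv, "label": conv_label})
    return labeled
```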

Discussion

In the field of task-oriented systems, we are confronted with the question of data type. Public corpora often compile open-domain conversations with varied strategies, formulations and goals. For a model to perform well, the training data (as well as the label types) must resemble the real data the system will face once deployed with real users. The challenge is therefore to create data that comes as close as possible to the conversational context of a task-oriented setting.

 

Crowdsourcing has its advantages as a collection method: it allows precise control over dialogue content, the collection of numerous and diverse samples addressing issues defined by the research team, and a label given directly by the source of the data (the speaker). However, the scripted nature of the interaction introduces an important bias in the corpus, as it differs from the way a person would spontaneously express their feelings in a similar situation.

Open-domain data collected from social networks or other online sources raises a different problem: it is not structured in the same way as task-oriented dialogues.

Finally, for service providers who already have conversational agents deployed in production, the optimal solution is to retrieve and use conversation logs. The data obtained from deployed conversational agents consists of human-machine interactions, and while the acceptability of these bots may influence the conversation (and the way users express themselves), the interaction remains spontaneous and authentic. The biases intrinsic to the behavior and development of the machine will, however, be preserved in the corpus. Nevertheless, this option offers the best compromise between, on the one hand, task-oriented utterances that correspond to the bot's intended use and, on the other, spontaneous conversations and authentic reactions, even if they may be influenced by the acceptability of technology and AI in customer service.

 

Once the data has been collected, we believe a very important step must precede the annotation phase: the design of a labeling process, based on the study of real or expected models and data, which establishes what is to be annotated and which features are to be taught to the virtual agent. Once this process has been clearly defined, semi-automatic annotation offers a good optimization of resources while guaranteeing consistency through quality control by human annotators. If the quantity of data permits, manual annotation also guarantees high quality and control over the resulting labels. It should be pointed out that, due to its subjective nature, annotating emotional states is a complex task, even for humans: two people may not agree on the emotion expressed in the same sample, because of the personal and internal biases that color our perception of others' emotional states.

 

Although the main use of these end-to-end models is direct user interaction, as in chatbots, other applications have emerged, particularly in the field of customer experience. One example is the automatic generation of response suggestions for human agents to assist them in real time, as proposed by Zaion with Smart Coaching. These new services open up interesting horizons for the field, both in terms of data labeling (if the human agent chooses to use the bot's answer, it can automatically be labeled as relevant) and of the expected output of the generation step (such as suggesting the n best answers, with n generally between 3 and 4).
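
As a sketch of the n-best suggestion idea, the snippet below generates several candidate replies with beam search. The checkpoint is a publicly available open-domain dialogue model used purely for illustration, not Zaion's Smart Coaching model.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# A publicly available open-domain dialogue model, used purely for illustration.
checkpoint = "facebook/blenderbot-400M-distill"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def suggest_responses(context: str, n: int = 3) -> list[str]:
    """Return the n best candidate replies for a human agent to choose from."""
    inputs = tokenizer(context, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs,
                             num_beams=max(n, 4),      # beam search to obtain distinct candidates
                             num_return_sequences=n,
                             max_new_tokens=60)
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]

candidates = suggest_responses("My package never arrived and I need it for tomorrow.")
# If the agent picks one of the candidates, that choice can be logged as an
# implicit "relevant" label and fed back into future training data.
```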

In a few words

Various means of collection, with their associated annotation schemes, exist for gathering socio-emotional data:

  • Crowdsourcing (annotated via surveys or emotional prompts)
  • Online crawling (human annotation, possibly assisted by algorithms)
  • Human-machine interaction logs (manual/semi-automatic annotation or user feedback)

It is important to think through the collection and annotation process so as to gather the type of data best suited to the task at hand. Each method has its advantages and disadvantages, and the key is to find the best compromise for a given situation.

In the second part of this blog series, I'll define the various socio-emotional strategies studied in the literature, as well as why and how they are relevant to improving the effectiveness of task-oriented conversational systems. See you soon!
