How to deal with data in advanced socio-emotional dialogue generation systems

The rise of deep learning technologies, backed by the growing accessibility of ever bigger corpora of data, has contributed to the development of increasingly advanced dialogue generation systems. At Zaion, we aim to research these more developed models to eventually apply them to customer service.

26 September 2022

Lorraine Vanel
PhD candidate at Zaion and Télécom-Paris on the topic of generating natural language utterances for socially competent agents.
I also serve as the culinary manager and part-time caterer at Zaion Lab – someone’s got to keep those brains braining.


Beyond the grammar and fluency of the generated content, human-machine interaction poses one main challenge for a conversational agent: consistency. It breaks down into several sub-problems, such as:

  • Logical consistency (relevance of the generated answer to the conversation history)
  • Persona consistency (relevance of the bot’s behaviour to its past behaviour in the interaction)
  • Social consistency (relevance of the answer to a set of accepted social etiquette and rules)
  • Emotional consistency (relevance to the user’s emotional state)

Consistency is important to avoid situations where users would find the bot “strange”. A task-oriented agent often represents an entity that provides a service and, as such, needs to be reliable from a business standpoint while also offering an enjoyable social interaction. Even though socio-emotional bots exist in task-oriented applications, most of these systems are designed and used for the open domain.

While logical consistency can largely be addressed with more computing power, the other three points are much less studied, even though many studies have shown how beneficial emotionally aware behaviour is to the overall user experience.

However, complex concepts such as emotion are tricky to annotate because the task is so subjective, which makes relevant and reliable data hard to obtain. In deep learning, where data is at the heart of the systems’ learning processes and is required in huge amounts, this quickly becomes a significant challenge.

In this series of two blog posts, I will present the different methods used in the literature to design socio-emotional data collection and annotation:

  1. How is socio-emotional data collected and annotated?
  2. Which interesting socio-emotional strategies have been studied?

How is socio-emotional data collected and annotated?


In deep learning, data is crucial because it is what the systems learn their representations from. For example, at Zaion we can use our customers’ data (within the limits of our contracts) and have our expert annotation team label it before using it to train our language models.

Similar approaches are used in the literature on socio-emotional conversational systems. The datasets I will mention are, for the most part, corpora of textual data, sometimes transcribed from audio sources.

Dataset Collection and Annotation

We will go over three main collection approaches as well as the associated annotation methods we have observed in the literature.


Crowdsourcing

Collection: Crowdsourcing, when applied to data collection, is a participatory method where a group of people contributes to creating data samples. Crowd-sourced data is typically Human-Human interaction (H-H). It usually involves a speaker, or seeker, who conveys an emotion, and a listener, or helper, who has to answer accordingly. Dialogue systems are then trained to perform the role of the listener. The data is collected by having two crowd workers interact while following set guidelines. For empathetic dialogues, workers taking the role of speaker are asked to start the conversation from an emotional prompt. The listeners must adapt their replies to the context as presented by their interlocutor, without being aware of the prompt or the situation beforehand. One such example is the ESConv dataset.

Derived Annotations: With crowdsourcing, the labels (emotions and dialogue strategies) associated with the data are derived directly from the instructions given to the crowd workers. Additionally, surveys submitted to the workers during the collection process, on both the listener and the speaker side, can provide further annotations such as empathy grading and utterance-level dialogue strategies.

Crawled from online sources

Collection: Another common way to collect data is crawling, that is, extracting information from online sources. Textual data often consists of posts and comments crawled from social media and is thus natural Human-Human speech. It can also come from other sources (such as OpenSubtitles) where the data is scripted. The data extracted from those websites is usually not labelled, so an annotation process must be designed to label the corpora.

Manual Annotations: When datasets are small, or if the research team has the material means, data can be entirely annotated by human experts or by annotators who have been trained on that specific annotation task. DailyDialog was annotated by 3 experts with a good understanding of dialogue and communication theory, who were taught the guidelines of the task (emotion and dialogue act annotation).

Semi-automatic Annotations: Manual annotation is usually paired with algorithms to speed up the work and lighten the human judges’ workload. This hybrid approach is called semi-automatic annotation. In general, the first step is to have human judges annotate a small fraction of the collected dialogues. To further support the human judges, the EDOS dataset team used a BERT-based model, trained on the EmpatheticDialogues dataset, to output the top 3 most likely emotion labels for each dialogue. This spares the judges from having to choose among all 42 available labels: they pick between 3 suggestions instead, with the possibility of selecting one of the other labels if needed. The second and last step is to use this manually annotated data to train a classifier that automatically annotates the rest of the collected data.
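To make the shortlist step more concrete, here is a minimal Python sketch. It is not the EDOS team's code: the function name `shortlist`, the label names and the scores are all made up for illustration, and the scores are assumed to come from some pre-trained emotion classifier.

```python
from typing import Dict, List


def shortlist(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k labels with the highest classifier scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical classifier output for one dialogue (made-up numbers).
scores = {"sad": 0.41, "angry": 0.22, "afraid": 0.17, "proud": 0.12, "joyful": 0.08}

# The human judge is shown these suggestions first, but remains free
# to pick any other label from the full label set.
print(shortlist(scores))  # ['sad', 'angry', 'afraid']
```

The labels chosen by the judges would then serve as training data for the classifier that annotates the rest of the corpus.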

Derived Annotations: The context in which web data was posted can also be used. For example, for the PEC dataset, posts and comments were extracted from two subreddits: happy and offmychest. The original Reddit environment thus provides a label, and what is left to do is to run a quality check by asking human annotators to annotate a small set of the conversations (100 from the happy subreddit, 100 from the offmychest subreddit and 100 from another subreddit, casualconversations, for control).
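As a rough illustration of this idea, the Python sketch below derives each label from the source subreddit and then draws a fixed-size random sample per source for human review. The record layout (`text` and `subreddit` keys) and the function names are assumptions made for this example, not the PEC authors' actual pipeline.

```python
import random
from collections import defaultdict
from typing import Dict, List


def derive_labels(posts: List[dict]) -> List[dict]:
    """Use the subreddit a post comes from as its label."""
    return [{**post, "label": post["subreddit"]} for post in posts]


def quality_check_sample(labelled: List[dict], per_source: int = 100,
                         seed: int = 0) -> Dict[str, List[dict]]:
    """Draw up to `per_source` random items per label for human review."""
    by_source = defaultdict(list)
    for item in labelled:
        by_source[item["label"]].append(item)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        source: rng.sample(items, min(per_source, len(items)))
        for source, items in by_source.items()
    }


# Toy usage with made-up posts.
posts = [
    {"text": "Got the job!", "subreddit": "happy"},
    {"text": "I need to get this off my chest.", "subreddit": "offmychest"},
    {"text": "Anyone up for a chat?", "subreddit": "casualconversations"},
]
to_review = quality_check_sample(derive_labels(posts), per_source=100)
```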

Retrieved from deployed services

Collection: When services such as customer assistance or chatbots are already deployed, it is possible to retrieve the logs and compile them into corpora. Most of the time this concerns human-machine interactions, but it can also apply to human-human conversations (such as call center data).

Manual/Semi-Automatic Annotation: This type of data can use the same annotation schemes as crawled data: human annotation, possibly helped by AI approaches as described above. For the EmoContext dataset, 50 human annotators manually annotated 300 dialogues for each of the 4 classes, and each dialogue was reviewed by 7 judges. These annotated dialogues were embedded as vectors and then used, along with cosine similarity thresholds, to find similar occurrences in the non-annotated pool of data. The results were then checked by human judges, who made the final ruling.
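The similarity-based step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the embeddings are tiny made-up vectors, the threshold of 0.9 is arbitrary, and `propose_labels` is a hypothetical helper rather than the method actually used for EmoContext.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propose_labels(seeds: List[Tuple[Sequence[float], str]],
                   pool: List[Sequence[float]],
                   threshold: float = 0.9) -> List[Tuple[int, str, float]]:
    """For each unlabelled vector, propose the label of its most similar
    annotated seed, but only if the similarity reaches the threshold.
    Returns (pool index, proposed label, similarity) for human review."""
    candidates = []
    for idx, vec in enumerate(pool):
        best_label, best_sim = None, threshold
        for seed_vec, label in seeds:
            sim = cosine(vec, seed_vec)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label is not None:
            candidates.append((idx, best_label, best_sim))
    return candidates


# Toy 2-dimensional "embeddings" (real ones would have hundreds of dimensions).
seeds = [([1.0, 0.0], "angry"), ([0.0, 1.0], "happy")]
pool = [[0.9, 0.1], [0.5, 0.5], [0.0, 2.0]]
candidates = propose_labels(seeds, pool)
```

In this toy example the first and third vectors receive a proposed label, while the second is too far from every seed and stays unlabelled. The proposals are only candidates: a human judge still has the last word.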

User Feedback: Some bots in production can ask for customer satisfaction feedback, either directly or through surveys. This information can be used to annotate certain conversations.


In the field of task-oriented systems, we also have to deal with the type of data. Public corpora often compile open-domain conversations that exhibit different strategies, formulations and goals. For a model to perform well, the training data (as well as the type of labels) must resemble the data the system will be confronted with once deployed to actual users. The challenge is thus to create data that is as close as possible to the conversation context of a task-oriented setting.


As a means of collection, crowdsourcing has its advantages. It allows for precise control over the content of the dialogues, the gathering of numerous and diverse samples that address specific problems defined by the research team, and a label given directly by the source of the data (the speaker). However, the scripted nature of the interaction induces a significant bias in the corpus, as it differs from how one might spontaneously express one’s feelings in a similar situation. Open-domain data crawled from social media or other online sources raises another issue: it is not structured in the same way as task-oriented dialogues.

In the end, for service providers that already have conversational agents deployed in production, the best solution is to retrieve and use conversation logs. Data retrieved from deployed conversational agents falls squarely within the scope of Human-Machine interaction, and while the user’s acceptance of such bots can influence the conversation (and the way users express themselves), the interaction remains spontaneous and authentic. Biases that are intrinsic to the machine’s behaviour and development will, however, be preserved in the corpus. Still, this option offers the best compromise between task-oriented utterances that match the intended use of the bot and spontaneous conversations with genuine reactions, even if those reactions might be influenced by how well users accept technology and AI in customer service.


Once the data has been collected, we believe that a very important step must precede the annotation phase: designing a labelling process. This means establishing what needs to be annotated, including which strategies and emotions, by studying patterns in the data and extracting the features that will be taught to the agent. Once this process is clearly defined, semi-automatic annotation makes good use of resources while ensuring consistency through quality control by human annotators. If the quantity of data allows for it, manual annotation also guarantees very good quality and control over the resulting labels. It is important to note that annotating emotional states is a complex task even for humans, because it is subjective. Two people might not agree on the emotion expressed in the same sample, because of the personal and internal biases that colour our perception of others’ emotional states.
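One common way to put a number on this kind of disagreement is an inter-annotator agreement score such as Cohen’s kappa, which compares the observed agreement between two annotators with the agreement expected by chance. The Python sketch below is only an illustration with made-up labels; it is not a measure reported for any of the datasets mentioned above.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set.")
    n = len(labels_a)
    # Share of samples on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both annotators always used the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up annotations for four utterances.
annotator_1 = ["joy", "joy", "anger", "anger"]
annotator_2 = ["joy", "anger", "anger", "anger"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 3 samples out of 4, chance agreement is 0.5, and kappa is 0.5. A value of 1 means perfect agreement and a value around 0 means agreement no better than chance.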


Although the main use of end-to-end dialogue generation models is direct user interaction, as in chatbots, other applications have been appearing, especially in the field of customer experience. One example is the automatic generation of responses for human agents to assist them with their tasks, as Zaion offers with Smart Coaching, which suggests answers to human agents in real time. These new services open up interesting new horizons in the domain, such as in data labelling (if the human agent chooses to use the bot’s response, it will be automatically labelled as relevant) or in the desired output of the generation (like suggesting the n-best responses, with n usually set around 3 or 4).
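The two ideas in this paragraph, n-best suggestions and labelling through the agent’s choice, can be sketched as follows in Python. This is a hypothetical illustration, not Smart Coaching’s implementation: the `Suggestion` class, the function names and the scores are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Suggestion:
    text: str
    score: float  # hypothetical model confidence for this candidate


def n_best(candidates: List[Suggestion], n: int = 3) -> List[Suggestion]:
    """Keep the n highest-scoring candidate responses, best first."""
    return sorted(candidates, key=lambda s: s.score, reverse=True)[:n]


def label_from_choice(shown: List[Suggestion],
                      chosen_text: Optional[str]) -> List[Tuple[str, Optional[bool]]]:
    """Mark the suggestion the human agent used as relevant (True).
    Suggestions that were not used stay unlabelled (None), since not
    being picked does not prove a suggestion was irrelevant."""
    return [(s.text, True if s.text == chosen_text else None) for s in shown]


# Made-up candidates for one customer message.
candidates = [
    Suggestion("I understand, let me check that for you.", 0.62),
    Suggestion("Could you give me your order number?", 0.81),
    Suggestion("Have a nice day!", 0.05),
    Suggestion("I am sorry for the inconvenience.", 0.47),
]
shown = n_best(candidates, n=3)
labels = label_from_choice(shown, "I am sorry for the inconvenience.")
```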

In short

Different means of collection, with their associated annotation schemes, exist to gather socio-emotional data:

  • Crowdsourcing (annotated with surveys or emotional prompts)
  • Online Crawling (annotated with human annotation that can be helped with algorithms)
  • Human-Machine Interaction (manual/semi-automatic annotation or user feedback)

It is important to think through the collection and annotation process in order to gather the type of data best suited to the task at hand. Each method has its pros and cons, and it all becomes a matter of finding the best trade-off for a given situation.

In the second part of this blog, I will define the different socio-emotional strategies that are studied in the literature, as well as why and how they are relevant to making task-oriented conversational systems more efficient. See you then!