Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues


TL;DR:

  • BERT-SL is one of the approaches that propose complementary training tasks to address the limitation that models poorly capture sequential (coherence- and consistency-related) information during training.
  • Performance gains were observed for both the PLM-based and non-PLM-based models, which suggests that the auxiliary tasks substantially enhance the capabilities needed for dialogue response selection.

Introduction

  • Recent response selection models are the result of tackling incoherence or inconsistency. However, this paper's approach is different: it tries to improve the model's capacity by designing proper auxiliary tasks, whereas MDFN tried to do so by redesigning the architecture.

The limitation of the existing approaches

  • The most common approach to extracting a context representation for a response selection model is to concatenate the utterances and the response. Later studies proposed computing the matching score after aggregating the representation of each utterance turn individually. In addition, considering multiple granularities (layers) of representations or more complicated interaction mechanisms between the context and the response was suggested as a further improvement (a minimal input-construction sketch follows this list).
  • However, many models struggle to capture the potential training signals and to learn task-related knowledge (coherence and consistency in a dialogue system) during training, and this gets worse when the training corpus is limited.
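
A minimal, hedged sketch of the "concatenate everything" input format mentioned above, assuming a BERT-style matcher built with the Hugging Face transformers library; the bert-base-uncased checkpoint and joining turns with the [SEP] token are illustrative assumptions, not the paper's exact setup. The matching score would then be computed from the [CLS] representation.

```python
# Sketch: flatten the dialogue context and pair it with a candidate response,
# the common input format for PLM-based response selection.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def build_matching_input(context_utterances, response, max_length=256):
    # Join the context turns into one "sentence A"; many works instead add a
    # dedicated end-of-turn token to the vocabulary (assumption here: [SEP]).
    context_text = f" {tokenizer.sep_token} ".join(context_utterances)
    return tokenizer(
        context_text,            # sentence A: the flattened dialogue context
        response,                # sentence B: the candidate response
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )

batch = build_matching_input(
    ["hi , can you help me ?", "sure , what do you need ?"],
    "i am looking for a cheap restaurant nearby .",
)
print(batch["input_ids"].shape)  # (1, sequence_length)
```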

BERT-SL: Giving more assignments to the model

[Figure 1: overview of the context-response matching model jointly trained with the four self-supervised auxiliary tasks]

  • The auxiliary tasks below, which act as additional assignments for the model during training, are designed in a self-supervised manner and are jointly trained with the main task during fine-tuning, after domain-adaptive post-training. The post-training injects in-domain knowledge and also allows a fair comparison with previous works. Rough code sketches of the sample construction and of the ID/CD objectives follow this list.
    • NSP (Next Session Prediction):
      • split the utterances into two parts at a random position, and predict whether the two parts are consecutive using the representation of the [CLS] token
      • for negative samples, sample randomly from the whole training corpus
    • UR (Utterance Restoration):
      • utterance-level masking: mask a whole utterance randomly sampled from the utterances in the context, and restore it
    • ID (Incoherence Detection):
      • replace one randomly chosen utterance, and predict which utterance was replaced (utterance-wise classification)
      • apply a softmax after max-pooling each utterance's token representations
    • CD (Consistency Discrimination):
      • randomly choose one pivot utterance and one positive utterance from the same dialogue, and one negative utterance from another dialogue in the training corpus
      • define the learning objective as a triplet loss function
      • The premise is that the topic and speech style will be similar if both utterances belong to the same dialogue.
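
As a concrete illustration of the four tasks, here is a rough, hypothetical sketch of how self-supervised training samples could be built from raw dialogues (each dialogue being a list of utterance strings). The function names, the 50% negative rate for NSP, and the [MASK] placeholder are assumptions; the paper's exact sampling and masking details may differ.

```python
# Hypothetical construction of NSP / UR / ID / CD training samples.
import random

MASK = "[MASK]"

def next_session_prediction_sample(dialogue, corpus):
    """NSP: split a dialogue into two sessions at a random turn; for negatives,
    swap the second session with utterances from a random other dialogue."""
    split = random.randint(1, len(dialogue) - 1)
    left, right = dialogue[:split], dialogue[split:]
    if random.random() < 0.5:                      # negative: not consecutive
        right = random.choice(corpus)[:len(right)]
        return left, right, 0
    return left, right, 1                          # positive: consecutive

def utterance_restoration_sample(dialogue):
    """UR: mask one whole randomly chosen utterance; the model restores it."""
    target = random.randrange(len(dialogue))
    masked = list(dialogue)
    masked[target] = MASK
    return masked, dialogue[target]

def incoherence_detection_sample(dialogue, corpus):
    """ID: replace one randomly chosen utterance with an utterance from another
    dialogue; the label is the position of the replaced utterance."""
    target = random.randrange(len(dialogue))
    corrupted = list(dialogue)
    corrupted[target] = random.choice(random.choice(corpus))
    return corrupted, target

def consistency_discrimination_sample(dialogue, corpus):
    """CD: pivot and positive utterances come from the same dialogue, the
    negative utterance from another dialogue in the corpus."""
    pivot, positive = random.sample(dialogue, 2)
    negative = random.choice(random.choice(corpus))
    return pivot, positive, negative
```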
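
The ID and CD objectives can also be sketched as loss functions. This is a hedged PyTorch illustration, not the paper's implementation: the parameter-free scoring head, the tensor shapes, and the margin value are assumptions, and the paper may use a different similarity in its triplet loss.

```python
# Sketch of the ID (utterance-wise classification) and CD (triplet) losses.
import torch
import torch.nn.functional as F

def incoherence_detection_loss(token_states, utterance_spans, replaced_idx):
    """token_states: (seq_len, hidden) encoder outputs for one context;
    utterance_spans: list of (start, end) token offsets, one per utterance;
    replaced_idx: position of the utterance that was swapped in."""
    # Max-pool the token states inside each utterance span.
    pooled = torch.stack(
        [token_states[s:e].max(dim=0).values for s, e in utterance_spans]
    )                                          # (num_utterances, hidden)
    # Score each utterance (a learned head in practice; summing keeps the
    # sketch parameter-free), then softmax over utterances via cross-entropy.
    scores = pooled.sum(dim=-1)                # (num_utterances,)
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([replaced_idx]))

def consistency_discrimination_loss(pivot, positive, negative, margin=0.3):
    """Triplet objective over utterance embeddings: the pivot should be closer
    to the positive (same dialogue) than to the negative (other dialogue)."""
    return F.triplet_margin_loss(
        pivot.unsqueeze(0), positive.unsqueeze(0), negative.unsqueeze(0),
        margin=margin,
    )

# Toy usage with random tensors, just to show the expected shapes.
states = torch.randn(12, 8)                    # 12 tokens, hidden size 8
loss_id = incoherence_detection_loss(states, [(0, 4), (4, 8), (8, 12)], 1)
loss_cd = consistency_discrimination_loss(*torch.randn(3, 8))
```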

Conclusion

  • The auxiliary tasks can be regarded as regularization that enhances the model's generalization ability, leveraging the training corpus to learn both the characteristics of dialogue text and implicit knowledge (a sketch of the joint objective follows this list).
    • The auxiliary tasks generally improved performance on the two benchmark datasets when applied to the PLM-based model.
    • The improvement on the PLM-based model is larger than on the non-PLM-based models. A possible explanation is that the auxiliary tasks, being designed around the characteristics of dialogue data, are more effective when applied to general-purpose neural architectures than to dialogue-specific ones.
    • Inference time was not increased significantly.
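
A hedged sketch of the joint fine-tuning objective referenced above: the main context-response matching loss plus the four auxiliary losses. The equal weighting is an assumption (the paper may weight or schedule the tasks differently), and the auxiliary heads are used only during training, which is consistent with inference time barely changing.

```python
# Sketch: joint objective = main matching loss + summed auxiliary losses.
def joint_objective(main_loss, aux_losses, aux_weight=1.0):
    # aux_losses: iterable of the NSP / UR / ID / CD losses for the batch
    return main_loss + aux_weight * sum(aux_losses)
```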

Which auxiliary task is better?

  • Performance dropped whenever one of the auxiliary tasks was excluded, which means all of them are significant.
  • The drop when excluding ID (Incoherence Detection) was the largest. A likely explanation is that ID helps the model learn the relevance between the context and the response.
  • Excluding UR (Utterance Restoration) causes the smallest drop, which suggests its training signal is partly redundant with token-level MLM.
  • There were performance gains for both the PLM-based and non-PLM-based models → the auxiliary tasks substantially enhance capabilities such as semantic relevance, coherence, and consistency for the downstream task (dialogue response selection).
  • In general, performance tends to drop when the context is too short or too long. However, BERT-SL jointly trained on the four auxiliary tasks appears more robust; a plausible explanation is that the enhanced capabilities make the model less dependent on the context length.

Reference and Implementation:

  • Paper: Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues (Xu et al., AAAI 2021)




© 2022.03. by bigshane