in Posts on Reviews, Nlp, Deep_learning, Retrieval, Dialogue, Response_selection, Adversarial_dataset
TL;DR:
The authors argue that training a model to recognize the relevance between context and response is important for the response selection task.
The overall architecture seems similar to MDFN (Mask-based Decoupling-Fusing Network) in that it fuses information at different levels, such as word, utterance, and context, through an attention mechanism.
Experiments on MSN (Multi-hop Selector Network) showed that weighted utterance representations generally improve the performance of PLM-based response matching models.
in Posts on Reviews, Nlp, Deep_learning, Retrieval, Dialogue, Response_selection, Adversarial_dataset
TL;DR:
Previous neural response selection models lack a comprehensive understanding of context, which results in biased response selection.
The adversarial dataset, reviewed and filtered by experts, was proposed to confirm that a model has learned comprehensive information rather than merely comparing similar tokens.
The proposed debiasing strategy, which utilizes a biased model, seems effective at mitigating the model’s learning of biased patterns.
in Posts on Reviews, Nlp, Deep_learning, Retrieval, Dialogue, Response_selection, Auxiliary_task
TL;DR:
UMS-BERT is one of the approaches that propose complementary training tasks to deal with the limitation in learning sequential information.
There was a performance gain in both PLM-based and non-PLM-based models, which suggests that the auxiliary tasks substantially enhance the capabilities needed for dialogue response selection.
The background of the paper, such as the problem posed, the approach, and the conclusion, seems quite similar to that of BERT-SL.
in Posts on Reviews, Nlp, Deep_learning, Retrieval, Dialogue, Response_selection, Auxiliary_task
TL;DR:
BERT-SL is one of the approaches that propose complementary training tasks to deal with the limitation in learning sequential information.
There was a performance gain in both PLM-based and non-PLM-based models, which suggests that the auxiliary tasks substantially enhance the capabilities needed for dialogue response selection.
in Posts on Reviews, Nlp, Deep_learning, Language_model
TL;DR:
Switch Transformer is a sparsely activated Transformer that reduces training time by introducing an MoE (Mixture of Experts) algorithm and parallelizing parts of the model.
The advantage of Switch Transformer is that some layers can be parallelized, which accelerates computation. Efficiency can increase with the number of cores across which the experts are distributed. In addition, Switch Transformer improves quality even under low compute budgets.
However, optimizing the router and MoE layers can also be a source of training instability.
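To make the routing idea above concrete, here is a minimal sketch of top-1 ("switch") routing in plain Python. It is an illustration under assumptions, not the paper's implementation: the names `switch_route`, `router_weights`, and `experts` are hypothetical, the experts are toy functions, and capacity limits and the load-balancing loss are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(tokens, router_weights, experts):
    """Send each token to its single highest-probability expert (top-1).

    tokens:         list of d-dimensional vectors
    router_weights: one d-dimensional weight vector per expert (assumed shape)
    experts:        list of callables, each mapping a vector to a vector
    """
    outputs, assignments = [], []
    for x in tokens:
        # Router logits: one dot product per expert.
        logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
        probs = softmax(logits)
        k = max(range(len(probs)), key=probs.__getitem__)
        # Only expert k runs; its output is scaled by the router probability.
        outputs.append([probs[k] * v for v in experts[k](x)])
        assignments.append(k)
    return outputs, assignments


# Toy setup: 3 experts that simply scale their input by 1, 2, or 3.
random.seed(0)
d_model, n_experts = 4, 3
router_weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]

outputs, assignments = switch_route(tokens, router_weights, experts)
print(assignments)  # expert index chosen for each token
```

Because each token activates only one expert, the experts can live on different devices and run in parallel. The hard `max` choice in the router is also where the instability mentioned above can come from.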
in Posts on Reviews, Nlp, Deep_learning, Language_model, Generative_model
Introduction
Having read backward from GPT-3 to GPT-1, one of the interesting things is the apparent change in how much potential merit each set of authors attributes to unsupervised learning.
In GPT-1, the authors suggest that unsupervised pretraining may boost performance on supervised downstream tasks, whereas in GPT-3 the authors emphasize that supervised learning is not indispensable if the model is large enough.
in Posts on Reviews, Nlp, Deep_learning, Language_model, Multilingual
TL;DR:
M-BERT (Multilingual BERT) is BERT trained on corpora from various languages.
M-BERT does not seem to learn systematic transformations between languages (complicated syntactic/semantic relationships between languages).
The significant factors in M-BERT’s performance:
Vocabulary memorization: the fraction of word overlap between languages, and
Mapping new vocabularies onto learned structure
Merely pre-training general representations of languages from unannotated corpora guarantees baseline performance on downstream tasks in some circumstances.
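As a rough illustration of the "fraction of word overlap" factor listed above, the sketch below computes overlap as the shared share of two token sets. The set-ratio (Jaccard) definition, the function name `wordpiece_overlap`, and the toy token lists are all assumptions made for illustration. The summary above does not specify the exact measure.

```python
def wordpiece_overlap(vocab_a, vocab_b):
    """Fraction of distinct tokens shared by two token collections.

    Assumed definition: |A ∩ B| / |A ∪ B| over the distinct tokens.
    """
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical wordpieces seen in an English and a German dataset.
en_tokens = ["the", "##ing", "berlin", "2019"]
de_tokens = ["der", "##ung", "berlin", "2019"]

overlap = wordpiece_overlap(en_tokens, de_tokens)
print(round(overlap, 3))  # 2 shared tokens out of 6 distinct ones
```

If transfer quality tracked this number closely, vocabulary memorization would explain most of the performance. If transfer stays strong when overlap is near zero, the second factor (mapping new vocabulary onto learned structure) has to be doing the work.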
in Posts on Reviews, Nlp, Deep_learning, Attention
TL;DR:
Previous attention mechanisms need a source vector to infer relevance to the task. In an encoder-decoder architecture, the encoder output is the source vector while the decoder output is the target task vector.
Self-attention embeds contextual information (each token’s significance for the given task) into a matrix rather than a vector, assuming that one sentence can have more than one contextual weight vector, whereas earlier attention embeds contextual information into a single vector.
Self-attention helps the model attend to the parts of a sentence that are significant for the target task, relieving the LSTM of some of its long-term memorization burden, and it provides an attention matrix for visualization.
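The "matrix rather than a vector" point above can be sketched as follows. This is one common way to write such a layer, shown under assumptions: the weight names `W1` and `W2`, the sizes, and the random toy inputs are hypothetical, and `H` stands in for the LSTM hidden states. Each of the `r` rows of `A` is a separate attention distribution over the tokens, so the sentence embedding `M` has `r` rows instead of one.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_matrix(H, W1, W2):
    """Return (A, M) for hidden states H of shape n x d.

    W1 is d_a x d and W2 is r x d_a (assumed shapes).
    A is r x n: r attention distributions over the n tokens.
    M = A @ H is r x d: the matrix-valued sentence embedding.
    """
    hidden = [[math.tanh(v) for v in row] for row in matmul(W1, transpose(H))]  # d_a x n
    scores = matmul(W2, hidden)                                                 # r x n
    A = [softmax(row) for row in scores]
    M = matmul(A, H)
    return A, M


# Toy sizes: 5 tokens, hidden size 4, attention size 3, 2 attention rows.
random.seed(0)
n, d, d_a, r = 5, 4, 3, 2
H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d_a)]
W2 = [[random.gauss(0, 1) for _ in range(d_a)] for _ in range(r)]

A, M = self_attention_matrix(H, W1, W2)
for row in A:
    print([round(w, 2) for w in row])  # one attention distribution per row
```

Plotting the rows of `A` over the input tokens gives the kind of visualization mentioned above.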
in Posts on Reviews, Nlp, Deep_learning, Attention
TL;DR:
This paper provided clues for solving the long-term dependency problem and for developing self-attention, the Transformer, and BERT, the most popular model in 2019.
It is undeniable that attention, which helps the decoder search for the relatively significant parts of the source sentence, is a novel approach in itself compared to previous ones, which embed the source sentence into one fixed-length vector according to the distributional hypothesis.
Eventually, attention helped broaden the variety of NLP models, expanding the existing options, which had been limited to the recurrent network family.
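The contrast above between one fixed-length vector and a per-step search over the source can be sketched as below. This is a simplified illustration: it scores each encoder state with a plain dot product, which is an assumption made for brevity rather than the scoring function of the reviewed paper, and the name `attend` and the toy vectors are hypothetical.

```python
import math


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    weights: softmax over the relevance of each source position
    context: weighted sum of the encoder states, recomputed at every step
    """
    # Relevance score of each source position (dot product, an assumption).
    scores = [sum(s * h for s, h in zip(decoder_state, h_j)) for h_j in encoder_states]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * h_j[i] for w, h_j in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


# Three source positions with 2-dimensional encoder states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

The decoder gets a different `context` at each step, so the whole source sentence no longer has to be squeezed into a single vector.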