NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better


TL;DR:

  • NoisyTune is a simple regularization technique, but it consistently improves the performance of PLMs.
  • It adds matrix-wise noise that accounts for the differences between parameter matrices in the PLM.
  • It works better on small datasets, which suggests it mitigates the gap between pretraining data and downstream domain data.

Introduction

  • While pretrained language models (PLMs) have reduced the difficulty of downstream tasks, they still have limitations to address. One limitation is suboptimal performance caused by the distribution gap between the pretraining data and the downstream task data. The authors argue that this gap can be mitigated by applying NoisyTune, the approach proposed in this paper, to the parameters of a PLM before finetuning.

Difference

  • There have been many regularization tricks before. The key difference of NoisyTune is that it adds matrix-wise noise scaled by the standard deviations of the different parameter matrices. Since parameter matrices in a PLM have very different characteristics (e.g., self-attention parameters vs. feed-forward network parameters), this matrix-wise perturbation performed better. In fact, a performance drop was observed in the experiments when global noise from a single shared distribution was added to all PLM parameters; see the contrast below.
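
For contrast, one way to write a variance-agnostic global-noise perturbation (my notation for the baseline idea, not necessarily the paper's exact variant) is to add noise with a single fixed scale to every matrix:

$\tilde{W_i} = W_i + U(-\frac{\lambda}{2}, \frac{\lambda}{2})$

Without the per-matrix $std(W_i)$ factor, matrices with small parameter variance receive relatively much larger perturbations, which is consistent with the performance drop reported for global noise.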

NoisyTune

$\tilde{W_i} = W_i + U(-\frac{\lambda}{2}, \frac{\lambda}{2}) * std(W_i)$

  • The authors showed experimentally that adding uniform noise works better than using Gaussian noise.
  • The hyperparameter $\lambda$ was set to 0.15 on GLUE and 0.1 on XTREME (a minimal implementation sketch follows below).
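
Below is a minimal PyTorch-style sketch of this perturbation, assuming a standard torch model whose parameters are the matrices to perturb; the function name noisy_tune and its default lam value are illustrative, not taken from the authors' released code.

```python
import torch

def noisy_tune(model: torch.nn.Module, lam: float = 0.15) -> torch.nn.Module:
    """Perturb each parameter matrix with uniform noise scaled by its own std.

    lam corresponds to the lambda hyperparameter (0.15 for GLUE, 0.1 for
    XTREME in the paper). This is an illustrative sketch, not the authors'
    official implementation.
    """
    with torch.no_grad():
        for param in model.parameters():
            if param.numel() <= 1:
                continue  # skip scalar parameters, whose std is undefined
            # Elementwise noise from U(-lam/2, lam/2), same shape as the matrix
            noise = (torch.rand_like(param) - 0.5) * lam
            # Scale the noise by this matrix's standard deviation before adding
            param.add_(noise * param.std())
    return model
```

This would be applied once to the pretrained checkpoint before finetuning, e.g. `noisy_tune(model, lam=0.15)` for GLUE-style tasks.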

Performance

  • After applying NoisyTune, consistent improvements in PLM performance were observed across different tasks. The gains are not dramatic, but they are consistent, roughly 0.3 to 1.0 percentage points. In addition, the improvement is usually larger on relatively small datasets, which supports the claim that NoisyTune mitigates the gap between pretraining data and downstream data.

Reference and Implementation:




© 2022.03. by bigshane