How to Fine-Tune BERT for Text Classification?
BERT for Text Classification
How to Fine-Tune BERT for Text Classification?
BERT for Text Classification, by Fudan University
2019 CCL, Over 1700 Citations (Sik-Ho Tsang @ Medium)
Text Classification
==== My Other Paper Readings Are Also Over Here ====
Outline
- Motivations
- BERT for Text Classification
- Results
1. Motivations
1.1. Fine-Tuning
- A simple softmax classifier is added on top of BERT to predict the probability of label c:
p(c | h) = softmax(Wh)
- where h is the final hidden state of the [CLS] token and W is the task-specific parameter matrix.
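Below is a minimal sketch of this classification head, assuming PyTorch and the Hugging Face transformers library; the model name, label count, and example sentences are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    """BERT encoder with a task-specific softmax classifier on the [CLS] state."""
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # W: task-specific parameter matrix mapping hidden size -> number of labels
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h = outputs.last_hidden_state[:, 0]      # final hidden state of [CLS]
        logits = self.classifier(h)              # W h
        return torch.softmax(logits, dim=-1)     # p(c | h)

# Illustrative usage (in training, cross-entropy would be applied to the logits):
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_labels=2)
batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
probs = model(batch["input_ids"], batch["attention_mask"])
```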
1.2. Motivations
There are different strategies for fine-tuning: further pre-training and multi-task fine-tuning.
Different domain data can also be used for further pre-training: within-task, in-domain, and cross-domain.
- In this paper, these strategies and data choices are evaluated to find the optimal training recipe.
2. BERT for Text Classification
2.1. Further Pre-training
- Within-task pre-training: BERT is further pre-trained on the training data of the target task.
- In-domain pre-training: data from the same domain as the target task is used, e.g., several different sentiment classification tasks with a similar data distribution are combined as further pre-training data.
- Cross-domain pre-training: data from both the same and other domains is used for further pre-training.
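A rough sketch of such further pre-training with the masked-LM objective, assuming the transformers library; the placeholder texts, batch size, learning rate, and single pass over the data are illustrative only, not the paper's 100K-step schedule.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForMaskedLM, BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Unlabeled text drawn from the target task (within-task), the same domain
# (in-domain), or several domains (cross-domain).
texts = ["example review one ...", "example review two ..."]
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# Randomly masks 15% of tokens, the standard masked-LM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=2, collate_fn=collator)

optimizer = torch.optim.AdamW(mlm_model.parameters(), lr=5e-5)
mlm_model.train()
for batch in loader:
    loss = mlm_model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# The further pre-trained encoder weights are then reused for task fine-tuning.
```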
2.2. Multi-Task Fine-Tuning
- Multi-task learning is also an effective approach to share the knowledge obtained from several related supervised tasks.
All tasks share the BERT layers and the embedding layer. The only layer that is not shared is the final classification layer, i.e., each task has a private classifier layer.
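A minimal sketch of this weight sharing, again assuming PyTorch and transformers; the task names and label counts are hypothetical examples.

```python
import torch.nn as nn
from transformers import BertModel

class MultiTaskBert(nn.Module):
    """Shared BERT encoder and embeddings with one private classifier per task."""
    def __init__(self, num_labels_per_task, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)       # shared across tasks
        self.heads = nn.ModuleDict({
            task: nn.Linear(self.bert.config.hidden_size, n)    # private per-task layer
            for task, n in num_labels_per_task.items()
        })

    def forward(self, task, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.heads[task](h)   # logits for the selected task

# e.g. jointly fine-tune on a sentiment task and a topic task, alternating batches:
model = MultiTaskBert({"sentiment": 2, "topic": 4})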
3. Results
3.1. Datasets
3.2. Investigating Different Fine-Tuning Strategies
- BERT has a maximum sequence length of 512 tokens, leaving 510 tokens for the text after [CLS] and [SEP]. Long inputs therefore have to be truncated:
- head-only: keep the first 510 tokens.
- tail-only: keep the last 510 tokens.
- head+tail: empirically select the first 128 and the last 382 tokens.
The head+tail truncation method achieves the best performance on the IMDb and Sogou datasets.
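A small sketch of head+tail truncation applied to a token-id list before the special tokens are added; the function name and defaults are illustrative.

```python
def truncate_head_tail(token_ids, max_tokens=510, head=128, tail=382):
    """Keep the first `head` and last `tail` tokens of an over-long sequence;
    the remaining 2 of the 512 positions are reserved for [CLS] and [SEP]."""
    if len(token_ids) <= max_tokens:
        return token_ids
    return token_ids[:head] + token_ids[-tail:]
```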
Among features taken from different layers, the last layer of BERT gives the best performance.
3.3. Investigating Further Pre-training
Further pre-training is useful for improving the performance of BERT on a target task, with the best performance achieved after 100K further pre-training steps.
- Almost all further pre-trained models perform better than the original BERT-base model (row ‘w/o pretrain’ in Table 5) on all seven datasets.
Generally, in-domain pre-training brings better performance than within-task pre-training.
Cross-domain pre-training (row ‘all’ in Table 5) does not bring an obvious benefit in general.
- BERT-Feat is implemented by using the features from the BERT model as input embeddings to a biLSTM with self-attention.
- The results of BERT-IDPT-FiT correspond to the rows ‘all sentiment’, ‘all question’, and ‘all topic’ in Table 5, and the result of BERT-CDPT-FiT corresponds to the row ‘all’.
BERT-Feat performs better than all other baselines except ULMFiT.
BERT-IDPT-FiT performs best, reducing the average error rate by 18.57%.
3.4. Multi-task Fine-Tuning
- Multi-task fine-tuning on top of BERT improves the results.
However, when further pre-training is already applied, multi-task learning may not be necessary to improve generalization.
3.5. Few-Shot Learning
Further pre-training can further boost BERT's performance in the few-shot setting, reducing the error rate from 17.26% to 9.23% with only 0.4% of the training data.
3.6. BERT-Large
Fine-tuning BERT-Large with task-specific further pre-training achieves state-of-the-art results.