How to Fine-Tune BERT for Text Classification?

May 14, 2024

BERT for Text Classification, by Fudan University
2019 CCL, Over 1700 Citations (Sik-Ho Tsang @ Medium)

Exhaustive experiments are conducted to investigate different fine-tuning methods of BERT on text classification task and provide a general solution for BERT fine-tuning.

Outline

Motivations
BERT for Text Classification
Results

1. Motivations

1.1. Fine-Tuning

A simple softmax classifier is added on top of BERT to predict the probability of label c from h, the final hidden state of the first token [CLS]:

$$p(c \mid h) = \mathrm{softmax}(W h)$$

where W is the task-specific parameter matrix.
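
As a rough illustration of this classifier, the sketch below computes softmax(Wh) in plain Python. It assumes h denotes BERT's final hidden state for the [CLS] token; the tiny 3-dimensional h and 2-label W are made-up toy values, not numbers from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W: List[List[float]], h: List[float]) -> List[float]:
    # One logit per label: the dot product of that label's row of W with h.
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)


# Toy values (assumptions for illustration only).
h = [0.5, -1.0, 2.0]           # stand-in for the [CLS] hidden state
W = [[0.1, 0.2, 0.3],          # weights for label 0
     [0.0, -0.1, 0.4]]         # weights for label 1
probs = classify(W, h)
```

In real fine-tuning, W and all BERT parameters are trained jointly; here W is fixed only to show the forward computation.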

1.2. Motivations

There are different strategies for fine-tuning: Further Pre-training and Multi-Task Fine-Tuning.
There are also different kinds of domain data that can be used: Task-Specific, In-Domain, and Cross-Domain.

In this paper, different strategies and domain data are evaluated to find the optimal training recipes.

2. BERT for Text Classification

2.1. Further Pre-training

Within-task pre-training: BERT is further pre-trained on the training data of a target task.
In-domain pre-training: the pre-training data comes from the same domain as the target task. For example, several different sentiment classification tasks with a similar data distribution are combined as training data for further pre-training.
Cross-domain pre-training: data from both the same domain and other, different domains is used for further pre-training.

2.2. Multi-Task Fine-Tuning

Multi-task learning is also an effective approach to share the knowledge obtained from several related supervised tasks.

All the tasks share the BERT layers and the embedding layer. The only layer that is not shared is the final classification layer, which means that each task has its own private classifier layer.
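
The sharing scheme described above can be sketched structurally as one shared encoder with a private classifier per task. This is only a minimal sketch: `toy_encoder` is a hypothetical stand-in for BERT, and the class name, task names, and label counts are illustrative assumptions rather than anything from the paper.

```python
from typing import Callable, Dict, List

Vector = List[float]


class MultiTaskModel:
    """Shared encoder plus one private linear classifier per task."""

    def __init__(self, encoder: Callable[[str], Vector], hidden_size: int,
                 num_labels: Dict[str, int]) -> None:
        self.encoder = encoder  # shared by all tasks
        # One private weight matrix per task (zero-initialised for simplicity).
        self.heads: Dict[str, List[Vector]] = {
            task: [[0.0] * hidden_size for _ in range(n)]
            for task, n in num_labels.items()
        }

    def logits(self, task: str, text: str) -> Vector:
        # Every task goes through the same encoder, then its own head.
        h = self.encoder(text)
        return [sum(w * x for w, x in zip(row, h)) for row in self.heads[task]]


def toy_encoder(text: str) -> Vector:
    # Hypothetical stand-in for BERT: maps text to a 3-dimensional vector.
    return [float(len(text)), float(text.count(" ")), 1.0]


model = MultiTaskModel(toy_encoder, hidden_size=3,
                       num_labels={"task_a": 2, "task_b": 4})
```

During training, a gradient step on any task would update the shared encoder, while only that task's head receives updates from its own examples.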

3. Results

3.1. Datasets

3.2. Investigating Different Fine-Tuning Strategies

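BERT has a maximum sequence length of 512 tokens, which leaves 510 tokens for the text once [CLS] and [SEP] are added. Longer texts are truncated in one of three ways:
head-only: keep the first 510 tokens.
tail-only: keep the last 510 tokens.
head+tail: empirically select the first 128 and the last 382 tokens.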
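
The three truncation methods above can be sketched as a small helper. This is a minimal sketch operating on an already tokenized list; the function name `truncate` and its parameters are assumptions for illustration, and the 510-token budget assumes two positions are reserved for [CLS] and [SEP].

```python
from typing import List

MAX_TOKENS = 510  # 512 positions minus [CLS] and [SEP]


def truncate(tokens: List[str], method: str = "head+tail",
             head_len: int = 128, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Shorten a token list to at most max_tokens using the chosen method."""
    if len(tokens) <= max_tokens:
        return list(tokens)  # short enough, nothing to cut
    if method == "head-only":
        return tokens[:max_tokens]
    if method == "tail-only":
        return tokens[len(tokens) - max_tokens:]
    if method == "head+tail":
        tail_len = max_tokens - head_len  # 382 with the default settings
        return tokens[:head_len] + tokens[len(tokens) - tail_len:]
    raise ValueError(f"unknown truncation method: {method}")


# Example: a 1000-token document with tokens named "0" ... "999".
doc = [str(i) for i in range(1000)]
head_tail = truncate(doc, "head+tail")
head_only = truncate(doc, "head-only")
tail_only = truncate(doc, "tail-only")
```

With the defaults, `head_tail` keeps tokens 0 to 127 followed by tokens 618 to 999, so the middle of the document is what gets dropped.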

The head+tail truncation method achieves the best performance on the IMDb and Sogou datasets.

When features from different layers are used for classification, the last layer of BERT gives the best performance.

3.3. Investigating the Further Pretraining

Further pre-training improves the performance of BERT on a target task, with the best performance reached after 100K training steps.

**In-Domain and Cross-Domain Further Pre-Training**

Almost all further pre-training models perform better on all seven datasets than the original BERTbase model (row ‘w/o pretrain’ in Table 5).

Generally, in-domain pretraining can bring better performance than within-task pretraining.
Cross-domain pre-training (row ‘all’ in Table 5) does not bring an obvious benefit in general.

BERT-Feat uses the features from the BERT model as the input embedding of a biLSTM with self-attention.
The result of BERT-IDPT-FiT corresponds to the rows ‘all sentiment’, ‘all question’, and ‘all topic’ in Table 5, and the result of BERT-CDPT-FiT corresponds to the row ‘all’.

BERT-Feat performs better than all other baselines except for ULMFiT.
BERT-IDPT-FiT performs best, reducing the average error rate by 18.57%.

3.4. Multi-task Fine-Tuning

Multi-task fine-tuning based on BERT improves performance.

However, multi-task learning may not be necessary to improve generalization.

3.5. Few-Shot Learning

Further pre-training also boosts BERT's performance in the few-shot setting: with only 0.4% of the training data, the error rate improves from 17.26% to 9.23%.
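
To put those two error rates in perspective, the short calculation below derives the absolute and relative reductions from the reported 17.26% and 9.23%. The helper name `relative_reduction` is an assumption for illustration; the derived figures are simple arithmetic on the reported numbers, not values stated in the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error that was removed."""
    return (before - after) / before


before, after = 17.26, 9.23          # error rates in percent, as reported
abs_drop = before - after            # drop in percentage points
rel_drop = relative_reduction(before, after)

summary = f"{abs_drop:.2f} points absolute, {rel_drop:.1%} relative"
print(summary)  # 8.03 points absolute, 46.5% relative
```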