Summary
- Mostly followed the original BERT setup
- Excluded the NSP task; the NSP loss isn't necessary.
- Dynamic masking is slightly better than static masking (actually, almost no difference)
- RoBERTa is essentially a reconfigured BERT, and it performs much better than the original.
- Training longer, with bigger batches and more data, is desirable.
Essentially, RoBERTa is just a reconfiguration of BERT. Then why is this important?
- Revealing the importance of pretraining strategies
- Eliminating the NSP task
- Larger batch sizes and training data matter
- Dynamic masking for MLM
Introduction
This is a replication study of BERT.
- Found that BERT was significantly undertrained.
- Proposed an improved recipe for training BERT models: RoBERTa
Modifications (from the original BERT)
- (1) Training the model longer (with bigger batches, over more data)
- (2) Removing the NSP(next sentence prediction) objective
- (3) Training on longer sequences
- (4) Dynamically changing the masking pattern
Contributions
- (1) Presented a set of important BERT design choices and training strategies.
- (2) Used a novel dataset (CC-News), and confirmed that using more data for pretraining further improves performance on downstream tasks.
- (3) Showed that MLM pretraining, with the right design choices, is competitive with other recently published methods.
Experimental Setup
Implementation
- Mostly followed the original BERT setting
- “Didn’t randomly inject short sequences” → because the NSP task wasn’t used, no sentence-pair negative samples were needed.
Training Procedure Analysis
Which choices are important for pretraining BERT models?
Static vs. Dynamic Masking
Static masking
- The original BERT masks each sequence only once during preprocessing → a single static mask.
- To avoid reusing the exact same mask, the training data was duplicated 10 times → 10 different masks per sequence, so each mask is seen 4 times over the 40 training epochs.
Dynamic masking
- Generate a new masking pattern every time a sequence is fed to the model (see the sketch below).
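A minimal sketch of the two strategies in plain Python. The 15% selection rate and the 80/10/10 replacement scheme follow BERT's MLM recipe; the token IDs, vocab size, and helper name are hypothetical, not the authors' code.

```python
import random

MASK_ID = 103        # assumed [MASK] token id (hypothetical)
VOCAB_SIZE = 30522   # assumed vocabulary size (hypothetical)
MASK_PROB = 0.15     # 15% of positions are selected for prediction

def apply_mlm_mask(token_ids):
    """Return (masked_inputs, labels) with a freshly sampled mask."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < MASK_PROB:
            labels[i] = tok
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                    # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return inputs, labels

corpus = [[7592, 2088, 2003, 1037, 3231], [2023, 2003, 2178, 6251]]

# Static masking: mask once during preprocessing and reuse the result every epoch
# (the original BERT mitigates this by duplicating the data 10 times).
static_corpus = [apply_mlm_mask(seq) for seq in corpus]

# Dynamic masking: re-sample the mask every time a sequence is fed to the model.
for epoch in range(3):
    for seq in corpus:
        masked_inputs, labels = apply_mlm_mask(seq)  # a new mask on every pass
```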
Results

- Dynamic masking is comparable to or slightly better than static masking (in practice, almost no difference).
Model Input Format and Next Sentence Prediction
Is NSP really important?
- The original study said NSP loss was crucial, but some recent works have questioned the necessity.
Comparison results

- Using individual sentences hurts performance on downstream tasks (better to pack full, contiguous sentences; see the packing sketch below)
- Removing the NSP loss matches or slightly improves downstream task performance
- Then why did the original BERT paper say it's important?
- The BERT ablation likely removed only the NSP loss term while still keeping the SEGMENT-PAIR input format.
- Suitable format for suitable loss!
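For reference, a minimal sketch of how FULL-SENTENCES inputs could be packed: contiguous sentences, possibly crossing document boundaries, up to 512 tokens, with no NSP loss. The function name, `SEP` handling, and toy data below are illustrative, not the paper's implementation (documents are assumed to be pre-tokenized into sentences).

```python
MAX_LEN = 512   # maximum number of tokens per training input
SEP = "[SEP]"   # extra separator added when an input crosses a document boundary

def pack_full_sentences(documents, max_len=MAX_LEN):
    """Yield inputs packed with contiguous full sentences, possibly
    crossing document boundaries, until max_len tokens are reached."""
    buffer = []
    for doc in documents:                      # doc = list of tokenized sentences
        for sentence in doc:
            if buffer and len(buffer) + len(sentence) > max_len:
                yield buffer                   # emit one training input
                buffer = []
            buffer.extend(sentence)
        if buffer and len(buffer) < max_len:   # next document may continue this input
            buffer.append(SEP)
    if buffer:
        yield buffer

# Toy example: two short "documents", each a list of pre-tokenized sentences.
docs = [
    [["the", "cat", "sat"], ["it", "purred"]],
    [["a", "new", "document", "starts", "here"]],
]
inputs = list(pack_full_sentences(docs, max_len=8))
```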
Training with large batches
- Training with large batches (with a correspondingly tuned learning rate) improves perplexity on the MLM objective as well as end-task accuracy; in compute terms, 1M steps at batch size 256 roughly equals 125K steps at batch size 2K, or 31K steps at batch size 8K (see the gradient-accumulation sketch below).
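One common way to reach such batch sizes on limited hardware is gradient accumulation. A runnable PyTorch sketch with toy stand-ins for the model and data (the paper does not prescribe this particular implementation):

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs end to end (hypothetical, not RoBERTa itself).
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(256, 16), torch.randint(0, 2, (256,))) for _ in range(64)]
loss_fn = nn.CrossEntropyLoss()

accum_steps = 32   # 32 micro-batches of 256 sequences ~= an effective batch of 8K

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the accumulated gradient is an average
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one parameter update per "large" batch
        optimizer.zero_grad()
```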
RoBERTa
(Robustly optimized BERT approach)
Configurations
- Dynamic masking
- FULL-SENTENCES without NSP loss
- Large mini-batches
- A larger byte-level BPE vocabulary (50K subword units)
- Architecture: \( \text{BERT}_{\text{LARGE}} \) (see the config sketch below)
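A summary of the setup as a plain dict; the field names are my own shorthand and the dict is illustrative, not a real library config, but the values follow the paper's description.

```python
roberta_config = {
    "architecture": "BERT-LARGE (L=24, H=1024, A=16)",
    "masking": "dynamic",            # re-sample the MLM mask on every pass over the data
    "objective": "MLM only",         # NSP loss removed
    "input_format": "FULL-SENTENCES",
    "tokenizer": "byte-level BPE",   # ~50K subword vocabulary
    "batch_size": 8192,              # large mini-batches
    "max_seq_len": 512,
}
```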
Results

- Large improvements!
- Increasing the number of pretraining steps (100K → 300K → 500K) further improved downstream task performance.