Pre-LN Transformer: Warm-Up Stage Can Be Skipped (On Layer Normalization in the Transformer Architecture)
Pre-LN Transformer, by Microsoft Research Asia, University of Chinese Academy of Sciences, Peking University, Microsoft Research, and Nankai University
2020 ICML, Over 100 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, Machine Translation, Transformer, Layer Normalization, BERT
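To make the "Pre-LN" naming concrete before the review: in the original Post-LN Transformer, layer normalization is applied after the residual addition, x = LayerNorm(x + Sublayer(x)), whereas the Pre-LN variant moves it inside the residual branch, x = x + Sublayer(LayerNorm(x)), with one extra LayerNorm after the final layer. Below is a minimal PyTorch-style sketch of this placement; the class name, dimensions, and hyperparameters (PreLNEncoderLayer, d_model, n_heads, d_ff) are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class PreLNEncoderLayer(nn.Module):
    """Illustrative Pre-LN Transformer encoder layer (names are assumptions).

    Post-LN (original Transformer): x = LayerNorm(x + Sublayer(x))
    Pre-LN (this paper's variant):  x = x + Sublayer(LayerNorm(x))
    """
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # LayerNorm is applied *before* each sub-layer; the residual path stays identity.
        h = self.ln1(x)
        x = x + self.drop(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.drop(self.ffn(self.ln2(x)))
        return x

# A final LayerNorm is applied after the last Pre-LN layer.
x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
y = nn.LayerNorm(512)(PreLNEncoderLayer()(x))
print(y.shape)  # torch.Size([2, 16, 512])
```

This placement keeps the residual path free of normalization, which is what allows the Pre-LN model to be trained without the learning-rate warm-up stage discussed in the paper.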