# Brief Review — RoFormer: Enhanced Transformer with Rotary Position Embedding

## RoFormer: **Rotary Position Embedding (RoPE) for Position Information**

RoFormer: Enhanced Transformer with Rotary Position Embedding, RoFormer, by Zhuiyi Technology Co., Ltd.
2021 arXiv v4, Over 70 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Language Model, LM, Transformer, BERT

- **Rotary Position Embedding (RoPE)** is proposed to effectively **leverage the positional information**.
- RoPE encodes the absolute position with a **rotation matrix** and meanwhile incorporates the **explicit relative position dependency** in the self-attention formulation.
- RoPE has **multiple advantages**: the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding.

# Outline

1. **Preliminaries**
2. **RoFormer**
3. **Results**

# 1. Preliminaries

## 1.1. Self-Attention

- Let *S_N* = {*w_i*}, with *i* from 1 to *N*, be a sequence of *N* input tokens, *w_i* being the *i*-th element.

- In the Transformer, self-attention first incorporates position information into the word embeddings and transforms them into query, key, and value representations:

  *q_m* = *f_q*(*x_m*, *m*), *k_n* = *f_k*(*x_n*, *n*), *v_n* = *f_v*(*x_n*, *n*),

- where *q_m*, *k_n* and *v_n* incorporate the *m*-th and *n*-th positions through *f_q*, *f_k* and *f_v*, respectively.

- The query and key are then used to compute the attention weights, while the output is computed as the weighted sum over the value representations:

  *a_{m,n}* = exp(*q_m*ᵀ*k_n* / √*d*) / Σ_{j=1..N} exp(*q_m*ᵀ*k_j* / √*d*), *o_m* = Σ_{n=1..N} *a_{m,n}* *v_n*.
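As a minimal NumPy sketch (the function name `self_attention` is my own, not from the paper), the attention weights and weighted sum above look like this:

```python
import numpy as np

def self_attention(q, k, v):
    """Scaled dot-product self-attention: softmax(q k^T / sqrt(d)) @ v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (N, N) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: a_{m,n}
    return weights @ v                               # o_m = sum_n a_{m,n} v_n

rng = np.random.default_rng(0)
N, d = 4, 8
q, k, v = rng.normal(size=(3, N, d))                 # stand-ins for q_m, k_n, v_n
out = self_attention(q, k, v)
print(out.shape)  # (4, 8)
```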

## 1.2. Absolute Position Embedding in Original Transformer

- For the **original Transformer**, in *f_q*, *f_k* and *f_v*:

  *f_{t: t∈{q,k,v}}*(*x_i*, *i*) = *W_t*(*x_i* + *p_i*),

  where *p_i* is a *d*-dimensional vector depending on the position of token *x_i*.
- One choice is a set of trainable vectors *p_i* ∈ {*p_t*}, *t* = 1, …, *L*, where *L* is the **maximum sequence length**.
- In the original Transformer, *p_i* is instead generated using the sinusoidal function:

  *p_{i,2t}* = sin(*i* / 10000^{2t/d}), *p_{i,2t+1}* = cos(*i* / 10000^{2t/d}).
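The sinusoidal formula can be sketched as follows (a minimal NumPy version; the helper name `sinusoidal_embedding` is my own):

```python
import numpy as np

def sinusoidal_embedding(i, d):
    """p_i with p_{i,2t} = sin(i / 10000^(2t/d)), p_{i,2t+1} = cos(i / 10000^(2t/d))."""
    t = np.arange(d // 2)
    freqs = 1.0 / (10000.0 ** (2 * t / d))  # one frequency per sin/cos pair
    p = np.empty(d)
    p[0::2] = np.sin(i * freqs)             # even dimensions: sine
    p[1::2] = np.cos(i * freqs)             # odd dimensions: cosine
    return p

p0 = sinusoidal_embedding(0, 8)
print(p0)  # position 0: all sines are 0, all cosines are 1
```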

## 1.3. Relative Position Embedding in Shaw NAACL’18

- Trainable relative position embeddings *~p_r^k*, *~p_r^v* are used in Shaw NAACL’18:

  *f_q*(*x_m*) = *W_q* *x_m*, *f_k*(*x_n*, *n*) = *W_k*(*x_n* + *~p_r^k*), *f_v*(*x_n*, *n*) = *W_v*(*x_n* + *~p_r^v*),

- where *r* = clip(*m* − *n*, *r_min*, *r_max*) represents the relative distance between positions *m* and *n*. They **clipped** the relative distance with the hypothesis that **precise relative position information is not useful beyond a certain distance**.
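The clipping step can be illustrated with a few lines of Python (the function name and the symmetric window `[-r_max, r_max]` are illustrative assumptions):

```python
def clip_distance(m, n, r_max):
    """Relative distance r = clip(m - n, -r_max, r_max), as in Shaw et al.-style clipping."""
    return max(-r_max, min(r_max, m - n))

# Beyond the clipping window, all distances collapse into the same bucket:
print(clip_distance(10, 2, r_max=4))   # 4 (clipped from 8)
print(clip_distance(3, 2, r_max=4))    # 1 (within the window, kept exact)
```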

# 2. RoFormer

Specifically, incorporating the relative position embedding in RoFormer is to:

**Simply rotate the affine-transformed word embedding vector by an angle that is a multiple of its position index**, which is the intuition behind the name **Rotary Position Embedding**.

## 2.1. Goal

- In order to incorporate relative position information, we **require the inner product of query** *q_m* **and key** *k_n* in Equation (2) to be formulated by a function *g* that **takes only the word embeddings** *x_m*, *x_n* **and their relative position** *m* − *n* **as input variables**:

  ⟨*f_q*(*x_m*, *m*), *f_k*(*x_n*, *n*)⟩ = *g*(*x_m*, *x_n*, *m* − *n*).

- The **ultimate goal** is to find an equivalent encoding mechanism, i.e. functions *f_q*(*x_m*, *m*) and *f_k*(*x_n*, *n*), that conform to the relation above.

## 2.2. 2D Form (for Simplicity)

- Consider a **simpler case of 2D form**; a solution to the above equation is:

  *f_q*(*x_m*, *m*) = (*W_q* *x_m*) e^{imθ}, *f_k*(*x_n*, *n*) = (*W_k* *x_n*) e^{inθ},
  *g*(*x_m*, *x_n*, *m* − *n*) = **Re**[(*W_q* *x_m*)(*W_k* *x_n*)\* e^{i(m−n)θ}],

- where **Re[·]** is the **real part** of a complex number and (*W_k* *x_n*)\* represents the **complex conjugate of** (*W_k* *x_n*). {*f_q*, *f_k*} can further be written as a matrix multiplication:

  *f_{q,k}*(*x_m*, *m*) = [[cos *mθ*, −sin *mθ*], [sin *mθ*, cos *mθ*]] *W_{q,k}* *x_m*,

- where (*x_m*^{(1)}, *x_m*^{(2)}) is *x_m* expressed in **2D coordinates**.
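A quick NumPy check makes the equivalence between the complex form and the rotation-matrix form concrete (random 2D vectors stand in for the affine-transformed embeddings *W_q x_m* and *W_k x_n*):

```python
import numpy as np

theta, m, n = 0.3, 5, 2
rng = np.random.default_rng(1)
qx = rng.normal(size=2)   # stand-in for W_q x_m (already affine-transformed)
kx = rng.normal(size=2)   # stand-in for W_k x_n

# Complex form: g = Re[(W_q x_m)(W_k x_n)* e^{i(m-n)theta}]
qc = qx[0] + 1j * qx[1]
kc = kx[0] + 1j * kx[1]
g = (qc * np.conj(kc) * np.exp(1j * (m - n) * theta)).real

# Rotation-matrix form: rotate each vector by its own position angle, then dot
def rot(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

g_rot = (rot(m * theta) @ qx) @ (rot(n * theta) @ kx)
print(np.isclose(g, g_rot))  # True: both depend on the relative position m - n only
```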

## 2.3. General Form

- In **general form**:

  *f_{q,k}*(*x_m*, *m*) = *R*^{d}_{Θ,m} *W_{q,k}* *x_m*,

- where *R*^{d}_{Θ,m} is the **rotary matrix** with pre-defined parameters Θ = {θ_i = 10000^{−2(i−1)/d}, *i* ∈ [1, …, *d*/2]}.
- When **applying RoPE to self-attention**, *q_m*ᵀ*k_n* becomes:

  *q_m*ᵀ*k_n* = (*R*^{d}_{Θ,m} *W_q* *x_m*)ᵀ (*R*^{d}_{Θ,n} *W_k* *x_n*) = *x_m*ᵀ *W_q*ᵀ *R*^{d}_{Θ,n−m} *W_k* *x_n*,

- with *R*^{d}_{Θ,n−m} = (*R*^{d}_{Θ,m})ᵀ *R*^{d}_{Θ,n}.
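The general form can be sketched in NumPy as below (the helper name `rope` and the pairing of consecutive dimensions are my own choices; random vectors stand in for *W_q x_m* and *W_k x_n*). The check confirms that the attention score depends only on the relative position *m* − *n*:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply the rotary matrix R^d_{Theta,pos} to a vector x (d even)."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)       # Theta = {10000^(-2(i-1)/d)}, zero-indexed here
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]          # each consecutive pair forms a 2D sub-vector
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin    # 2x2 rotation applied per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(2)
q, k = rng.normal(size=(2, 8))         # stand-ins for W_q x_m and W_k x_n
s1 = rope(q, 5) @ rope(k, 2)           # positions (m, n) = (5, 2), offset 3
s2 = rope(q, 13) @ rope(k, 10)         # positions (13, 10), same offset 3
print(np.isclose(s1, s2))  # True: the score depends only on m - n
```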

# 3. Results

## 3.1. Machine Translation

The proposed RoFormer achieves **better BLEU** scores than its baseline alternative (Vaswani et al.) on the WMT 2014 English-to-German translation task.

## 3.2. Language Model Pretraining

Compared to the vanilla BERT, RoFormer exhibits **faster** convergence.

## 3.3. GLUE

RoFormer significantly outperforms BERT in three out of the six GLUE datasets, and the improvements are considerable.

To encode position, the conventional Transformer uses the absolute position, Shaw NAACL’18 uses the distance between two positions, and the proposed RoFormer uses rotation.

## Reference

[2021 arXiv v4] [RoFormer] RoFormer: Enhanced Transformer with Rotary Position Embedding

## 2.1. Language Model / Sequence Model

(Some are not related to NLP, but I just group them here)

**1991** … **2020 **[ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT] [BART] [Longformer] [ELECTRA] [Megatron-LM] [SpanBERT] [UniLMv2] **2021 **[Performer] [gMLP] [RoFormer]

## 2.2. Machine Translation

**2013 …** **2021 **[ResMLP] [GPKD] [RoFormer]