Brief Review — Fastformer: Additive Attention Can Be All You Need

Additive Attention Mechanism With Linear Complexity Outperforms or Is On Par With Transformer and Its Efficient Variants

Sik-Ho Tsang
3 min read · Sep 29, 2024


Fastformer, by Tsinghua University and Microsoft Research Asia
2021 arXiv v6, Over 140 Citations (Sik-Ho Tsang @ Medium)


  • Fastformer is proposed, in which an additive attention mechanism models global contexts, and each token representation is then further transformed based on its interaction with the global context representations.
  • In this way, Fastformer can achieve effective context modeling with linear complexity.

Outline

  1. Fastformer
  2. Results

1. Fastformer

Fastformer
  • First, three independent linear transformation layers transform the input into the attention query, key, and value matrices Q, K, and V.
  • An additive attention mechanism then summarizes the query sequence into a global query vector q = Σ_i α_i·q_i, with attention weights α_i = exp(w_q·q_i/√d) / Σ_j exp(w_q·q_j/√d).
  • Next, the interaction between the global query vector and each attention key is modeled with an element-wise product: p_i = q*k_i.
  • Similarly, the vectors p_i are summarized into a global key vector k = Σ_i β_i·p_i via additive attention, with attention weights β_i = exp(w_k·p_i/√d) / Σ_j exp(w_k·p_j/√d).
  • The interaction between the global key and each attention value is then modeled via an element-wise product: u_i = k*v_i.
  • A linear transformation is applied to learn global context-aware attention values, which are finally added to the attention query to form the final output.

In this way, the computational complexity is reduced from quadratic to linear in the sequence length.
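
Below is a minimal single-head PyTorch sketch of the steps above. It is my own illustration, not the authors' official implementation; the class and parameter names (FastAttention, to_q, w_q, etc.) are made up for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastAttention(nn.Module):
    """Single-head sketch of Fastformer additive attention.

    Input x has shape (batch, seq_len, d). Every step below costs
    O(seq_len * d) or O(seq_len * d^2), i.e. linear in seq_len.
    """
    def __init__(self, d):
        super().__init__()
        self.d = d
        # Three independent linear transformations for Q, K, V.
        self.to_q = nn.Linear(d, d)
        self.to_k = nn.Linear(d, d)
        self.to_v = nn.Linear(d, d)
        # Learnable vectors producing the additive attention scores.
        self.w_q = nn.Linear(d, 1, bias=False)
        self.w_k = nn.Linear(d, 1, bias=False)
        # Final linear transformation of the context-aware values.
        self.out = nn.Linear(d, d)

    def forward(self, x):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # Summarize all queries into one global query vector.
        alpha = F.softmax(self.w_q(q) / self.d ** 0.5, dim=1)   # (B, N, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)         # (B, 1, d)

        # Element-wise interaction between global query and each key.
        p = global_q * k                                        # (B, N, d)

        # Summarize the interactions into one global key vector.
        beta = F.softmax(self.w_k(p) / self.d ** 0.5, dim=1)    # (B, N, 1)
        global_k = (beta * p).sum(dim=1, keepdim=True)          # (B, 1, d)

        # Element-wise interaction with the values, then transform
        # and add the query back to form the final output.
        u = global_k * v                                        # (B, N, d)
        return self.out(u) + q
```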

  • Multiple such Fastformer modules are then stacked to form the overall Fastformer model, in the same way Transformer blocks are stacked (a hypothetical usage sketch follows below).
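
As a hypothetical usage example, a few of the FastAttention layers sketched above can be stacked like Transformer blocks (layer norm and feed-forward sublayers omitted for brevity):

```python
# Hypothetical usage of the FastAttention sketch above.
x = torch.randn(2, 128, 64)  # (batch, seq_len, hidden)
layers = nn.ModuleList([FastAttention(64) for _ in range(4)])
for layer in layers:
    x = layer(x)             # the query is added back inside each layer
print(x.shape)               # torch.Size([2, 128, 64])
```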

2. Results

Sentiment and topic classification tasks

Fastformer achieves performance competitive with or better than other efficient Transformer variants in both long and short text modeling.

News recommendation task

Fastformer achieves the best performance, and it also outperforms the NRMS base model it builds on.

  • In addition, Fastformer can further improve the performance of PLM-NR, and the ensemble model achieves the best results on the MIND leaderboard.
Text summarization task

  • On the CNN/DM dataset, many efficient variants (except Poolingformer and Fastformer) are inferior to the vanilla Transformer.

Fastformer achieves the best performance on most metrics, which shows its advantage in natural language generation.

Computational Complexity

  • The complexity of Fastformer depends only on the sequence length and the hidden dimension: O(N·d) for the attention itself, versus O(N²·d) for vanilla self-attention. It has the lowest complexity among the compared methods.
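
As a rough cost breakdown (my own accounting, with N the sequence length and d the hidden dimension):

```latex
\underbrace{O(N d^2)}_{\text{Q, K, V projections}}
+ \underbrace{O(N d)}_{\text{additive attention summaries}}
+ \underbrace{O(N d)}_{\text{element-wise interactions}}
= O(N d^2),
\quad \text{vs.} \quad O(N^2 d + N d^2) \text{ for vanilla self-attention.}
```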
Training and Inference Speed

Fastformer is much more efficient than other linear-complexity variants in terms of both training and inference time.
