Encoder-decoder vs. decoder-only transformer models

Modern state-of-the-art Transformer models are based on the decoder-only architecture. However, I was initially confused by the name “encoder”, which to me suggested encoding the words into numbers; that is actually the role of the embedding layer.
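
As a side note, that embedding step is easy to see in code. The following is a minimal sketch in PyTorch with made-up sizes: the tokenizer turns text into integer token IDs, and the embedding layer maps each ID to a learned vector.

```python
import torch
import torch.nn as nn

# The embedding layer (not the "encoder") maps token IDs to vectors.
vocab_size, d_model = 50_000, 512             # illustrative sizes, not from a real model
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[15, 203, 4071]])   # e.g. the output of a tokenizer
vectors = embedding(token_ids)
print(vectors.shape)                          # torch.Size([1, 3, 512])
```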

So, what are the differences between an encoder-decoder and a decoder-only transformer model?

High-level comparison

| Aspect | Encoder-Decoder Transformers | Decoder-Only Models |
|---|---|---|
| Architecture | Separate encoder and decoder | Single stack of decoder layers |
| Components | Encoder, decoder | Decoder only |
| Input Processing | Encoder processes entire input at once | Sequential, one token at a time |
| Typical Use Cases | Translation, summarization, Q&A | Text generation, autocomplete, conversational AI |
| Attention Mechanism | Self-attention + cross-attention | Self-attention with causal masking |
| Parallelization | Encoder: parallel; decoder: sequential | Inherently sequential |
| Directionality | Encoder: bidirectional; decoder: unidirectional | Unidirectional |
| Training Objective | Sequence-to-sequence tasks | Next-token prediction |
| Model Size | Generally larger | Often more compact |
| Flexibility | Better for distinct input/output sequences | Simpler for open-ended generation |
| Examples | BERT+GPT hybrids, T5, BART | GPT series, BLOOM, LLaMA |

The main difference is that encoder-decoder models distinguish between the input sequence, processed by the encoder, and the tokens generated by the decoder. In decoder-only models there is no such distinction: prompt tokens and generated tokens form a single sequence, and the model's only job is to predict the next token.

This is simpler, and it has also turned out to work better in practice, which is why it is the most common approach for modern LLMs such as GPT or LLaMA.
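
A concrete way to see this “single stream of tokens” idea is the causal mask used by decoder-only self-attention: whether a position holds a prompt token or a previously generated token, it may only attend to positions at or before itself. Here is a minimal sketch, using the boolean-mask convention of PyTorch's nn.MultiheadAttention, where True means “blocked”:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Upper-triangular boolean mask: position i may only attend to positions <= i.
    # True entries are blocked, matching nn.MultiheadAttention's attn_mask convention.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```

The same mask applies to every position, so prompt tokens and generated tokens are handled identically.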

1. Traditional Encoder-Decoder Model:

  • Separate Input and Output Sequences:
    • The encoder processes an input sequence (e.g., a sentence in one language) and creates a contextual representation of that sequence.
    • The decoder then uses this contextual information from the encoder (via cross-attention), together with the tokens it has already generated, to produce the next token of the output sequence (e.g., a sentence in another language); cross-attention is sketched in code after this list.
    • There is a clear distinction between the input sequence (handled by the encoder) and the output sequence (generated by the decoder).
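
As referenced above, here is a minimal, illustrative sketch of the cross-attention step using PyTorch's nn.MultiheadAttention. The dimensions and random tensors are placeholders, not from any real model: the decoder's hidden states act as queries, while the encoder's output provides keys and values, so every target position can look at the full input sequence.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4                                 # toy sizes
src_len, tgt_len, batch = 10, 7, 2

encoder_output = torch.randn(batch, src_len, d_model)    # contextual representation of the input
decoder_hidden = torch.randn(batch, tgt_len, d_model)    # states for the tokens generated so far

cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Queries come from the decoder, keys/values from the encoder.
out, weights = cross_attn(query=decoder_hidden,
                          key=encoder_output,
                          value=encoder_output)
print(out.shape)       # torch.Size([2, 7, 64])
print(weights.shape)   # torch.Size([2, 7, 10]): one attention distribution over the input per target token
```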

2. Decoder-Only Model:

  • Single Sequence Processing:
    • In a decoder-only model, there is no separate encoder to process an initial input. Instead, the model receives a single sequence, which could be a prompt or the beginning of a sentence.
    • The model uses (causal) self-attention to process this sequence and predict the next token. After generating a token, it appends it to the sequence and then processes the updated sequence to predict the following one; a greedy-decoding loop is sketched after this list.
    • No Distinction Between Input and Generated Tokens: The model treats the entire sequence (including both the initial prompt and the tokens it has generated so far) as a single input. It doesn’t distinguish between tokens that were part of the original input and those it has generated during the process.
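
Putting this together, a greedy decoding loop makes the point concrete: the prompt's token IDs and the tokens generated so far live in the same tensor, and the whole tensor is fed back in at every step. This is a minimal sketch assuming the Hugging Face transformers library and the public gpt2 checkpoint (chosen purely for illustration); real systems add KV caching and smarter sampling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Prompt tokens: the model never treats these differently from generated tokens.
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(5):                                 # generate 5 tokens greedily
    with torch.no_grad():
        logits = model(input_ids).logits           # (batch, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1)   # most likely next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```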