Encoder-decoder vs. decoder-only transformer models
Modern state-of-the-art Transformer models are based on the decoder-only architecture. However, I was initially confused by the name “encoder”, which I assumed referred to encoding words into numbers; that is actually the role of the embedding layer.
So, what are the differences between an encoder-decoder and a decoder-only transformer model?
High-level comparison
| Aspect | Encoder-Decoder Transformers | Decoder-Only Models |
|---|---|---|
| Architecture | Separate encoder and decoder | Single stack of decoder layers |
| Components | Encoder, Decoder | Decoder only |
| Input Processing | Encoder processes the entire input at once | Prompt and generated tokens form one sequence, extended one token at a time |
| Typical Use Cases | Translation, Summarization, Q&A | Text generation, Autocomplete, Conversational AI |
| Attention Mechanism | Self-attention + Cross-attention | Self-attention with causal masking |
| Parallelization | Encoder: parallel; decoder: sequential at generation | Parallel over the sequence during training; sequential during generation |
| Directionality | Encoder: bidirectional; decoder: unidirectional | Unidirectional |
| Training Objective | Sequence-to-sequence tasks | Next-token prediction |
| Model Size | Two stacks (encoder + decoder), so generally larger for comparable capacity | Single stack, though modern LLMs can still be very large |
| Flexibility | Better for distinct input/output sequences | Simpler for open-ended generation |
| Examples | BERT+GPT hybrid, T5, BART | GPT series, BLOOM, LLaMA |
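To make the “Attention Mechanism” row concrete, here is a minimal PyTorch sketch of the three attention patterns involved. The dimensions, sequence lengths, and the reuse of a single `nn.MultiheadAttention` module are illustrative assumptions, not a real model.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

src = torch.randn(1, 10, d_model)  # "encoder side": e.g. a source sentence of 10 tokens
tgt = torch.randn(1, 7, d_model)   # "decoder side": e.g. 7 target/prompt tokens

# 1) Encoder self-attention: bidirectional, no mask -- every token attends to every token.
enc_out, _ = attn(src, src, src)

# 2) Decoder (or decoder-only) self-attention: causal mask -- token i only sees tokens <= i.
causal_mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)
dec_out, _ = attn(tgt, tgt, tgt, attn_mask=causal_mask)

# 3) Cross-attention (encoder-decoder only): queries come from the decoder,
#    keys and values come from the encoder output.
cross_out, _ = attn(tgt, enc_out, enc_out)

print(enc_out.shape, dec_out.shape, cross_out.shape)
# torch.Size([1, 10, 64]) torch.Size([1, 7, 64]) torch.Size([1, 7, 64])
```

In a real model each of these would be a separate attention module inside its own layer; a decoder-only model keeps only pattern 2.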
The main difference is that encoder-decoder models separate the input, processed by the encoder, from the tokens generated by the decoder, whereas decoder-only models make no such distinction: prompt tokens and generated tokens form a single sequence, and the model’s only task is to predict the next token.
This is simpler, and it has also worked well in practice, which is why it is the most common approach in modern LLMs such as GPT and LLaMA.
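As a rough illustration of that difference, here is a sketch using the Hugging Face `transformers` library (the model names, toy sentence, and prompt format are arbitrary choices): the encoder-decoder model receives two separate sequences, while the decoder-only model receives a single one.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder-decoder: separate input and output sequences
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
src = t5_tok("translate English to German: The cat sleeps.", return_tensors="pt")
tgt = t5_tok("Die Katze schläft.", return_tensors="pt")
t5_out = t5(input_ids=src.input_ids, labels=tgt.input_ids)  # encoder sees src, decoder predicts tgt

# Decoder-only: one sequence, prompt and continuation are not distinguished
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("English: The cat sleeps. German: Die Katze schläft.", return_tensors="pt").input_ids
gpt_out = gpt(input_ids=ids, labels=ids)  # every position simply predicts the next token

print(t5_out.loss, gpt_out.loss)
```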
1. Traditional Encoder-Decoder Model:
- Separate Input and Output Sequences:
- The encoder processes an input sequence (e.g., a sentence in one language) and creates a contextual representation of that sequence.
- The decoder then uses this contextual information from the encoder (via cross-attention) along with the tokens it has already generated to produce the next token in the output sequence (e.g., a sentence in another language).
- There is a clear distinction between the input sequence (handled by the encoder) and the output sequence (generated by the decoder).
2. Decoder-Only Model:
- Single Sequence Processing:
- In a decoder-only model, there is no separate encoder to process an initial input. Instead, the model receives a single sequence, which could be a prompt or the beginning of a sentence.
- The model uses self-attention to process this sequence and predict the next token. After generating a token, it appends this token to the sequence and then processes the updated sequence to predict the next token.
- No Distinction Between Input and Generated Tokens: The model treats the entire sequence (including both the initial prompt and the tokens it has generated so far) as a single input. It doesn’t distinguish between tokens that were part of the original input and those it has generated during the process.
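The loop described above can be written down in a few lines. This is a minimal greedy-decoding sketch, again using GPT-2 from `transformers` as a stand-in for any decoder-only model; real implementations sample from the distribution instead of taking the argmax and use a KV cache so the growing sequence is not reprocessed from scratch at every step.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids  # the prompt

with torch.no_grad():
    for _ in range(10):                          # generate 10 new tokens
        logits = model(ids).logits               # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)  # append and feed the whole sequence back in

print(tok.decode(ids[0]))
```

Note that the prompt tokens and the freshly generated tokens end up in the same `ids` tensor, which is exactly the “no distinction between input and generated tokens” point above.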