Encoder-decoder vs. decoder-only transformer models

Modern state-of-the-art Transformer models are based on the decoder-only architecture. However, I was initially confused by the name “encoder”, which to me suggested encoding the words into numbers; that is actually the role of the embedding layer.
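
As a side note, that embedding step is easy to see in code. The following is a minimal sketch in PyTorch with made-up sizes: the tokenizer turns text into integer token IDs, and the embedding layer maps each ID to a learned vector.

```python
import torch
import torch.nn as nn

# The embedding layer (not the "encoder") maps token IDs to vectors.
vocab_size, d_model = 50_000, 512             # illustrative sizes, not from a real model
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[15, 203, 4071]])   # e.g. the output of a tokenizer
vectors = embedding(token_ids)
print(vectors.shape)                          # torch.Size([1, 3, 512])
```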

So, what are the differences between an encoder-decoder and a decoder-only transformer model?

High-level comparison

| Aspect | Encoder-Decoder Transformers | Decoder-Only Models |
|---|---|---|
| Architecture | Separate encoder and decoder | Single stack of decoder layers |
| Components | Encoder, decoder | Decoder only |
| Input Processing | Encoder processes entire input at once | Sequential, one token at a time |
| Typical Use Cases | Translation, summarization, Q&A | Text generation, autocomplete, conversational AI |
| Attention Mechanism | Self-attention + cross-attention | Self-attention with causal masking |
| Parallelization | Encoder: parallel; decoder: sequential | Inherently sequential |
| Directionality | Encoder: bidirectional; decoder: unidirectional | Unidirectional |
| Training Objective | Sequence-to-sequence tasks | Next-token prediction |
| Model Size | Generally larger | Often more compact |
| Flexibility | Better for distinct input/output sequences | Simpler for open-ended generation |
| Examples | BERT+GPT hybrids, T5, BART | GPT series, BLOOM, LLaMA |

The main difference is that encoder-decoder models distinguish between the input sequence, processed by the encoder, and the tokens generated by the decoder. In decoder-only models there is no such distinction: prompt tokens and generated tokens form a single sequence, and the model's only job is to predict the next token.

This is simpler, and it has also turned out to work better in practice, which is why it is the most common approach for modern LLMs such as GPT or LLaMA.
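
A concrete way to see this “single stream of tokens” idea is the causal mask used by decoder-only self-attention: whether a position holds a prompt token or a previously generated token, it may only attend to positions at or before itself. Here is a minimal sketch, using the boolean-mask convention of PyTorch's nn.MultiheadAttention, where True means “blocked”:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Upper-triangular boolean mask: position i may only attend to positions <= i.
    # True entries are blocked, matching nn.MultiheadAttention's attn_mask convention.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```

The same mask applies to every position, so prompt tokens and generated tokens are handled identically.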

1. Traditional Encoder-Decoder Model:

  • Separate Input and Output Sequences:
    • The encoder processes an input sequence (e.g., a sentence in one language) and creates a contextual representation of that sequence.
    • The decoder then uses this contextual information from the encoder (via cross-attention), together with the tokens it has already generated, to produce the next token of the output sequence (e.g., a sentence in another language); cross-attention is sketched in code after this list.
    • There is a clear distinction between the input sequence (handled by the encoder) and the output sequence (generated by the decoder).
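
As referenced above, here is a minimal, illustrative sketch of the cross-attention step using PyTorch's nn.MultiheadAttention. The dimensions and random tensors are placeholders, not from any real model: the decoder's hidden states act as queries, while the encoder's output provides keys and values, so every target position can look at the full input sequence.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4                                 # toy sizes
src_len, tgt_len, batch = 10, 7, 2

encoder_output = torch.randn(batch, src_len, d_model)    # contextual representation of the input
decoder_hidden = torch.randn(batch, tgt_len, d_model)    # states for the tokens generated so far

cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Queries come from the decoder, keys/values from the encoder.
out, weights = cross_attn(query=decoder_hidden,
                          key=encoder_output,
                          value=encoder_output)
print(out.shape)       # torch.Size([2, 7, 64])
print(weights.shape)   # torch.Size([2, 7, 10]): one attention distribution over the input per target token
```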

2. Decoder-Only Model:

  • Single Sequence Processing:
    • In a decoder-only model, there is no separate encoder to process an initial input. Instead, the model receives a single sequence, which could be a prompt or the beginning of a sentence.
    • The model uses (causal) self-attention to process this sequence and predict the next token. After generating a token, it appends it to the sequence and then processes the updated sequence to predict the following one; a greedy-decoding loop is sketched after this list.
    • No Distinction Between Input and Generated Tokens: The model treats the entire sequence (including both the initial prompt and the tokens it has generated so far) as a single input. It doesn’t distinguish between tokens that were part of the original input and those it has generated during the process.
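
Putting this together, a greedy decoding loop makes the point concrete: the prompt's token IDs and the tokens generated so far live in the same tensor, and the whole tensor is fed back in at every step. This is a minimal sketch assuming the Hugging Face transformers library and the public gpt2 checkpoint (chosen purely for illustration); real systems add KV caching and smarter sampling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Prompt tokens: the model never treats these differently from generated tokens.
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(5):                                 # generate 5 tokens greedily
    with torch.no_grad():
        logits = model(input_ids).logits           # (batch, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1)   # most likely next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```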