Exploring the Functional Roles of Transformer Layers

Engineering
August 7, 2024
Aakash Nain

Marc Pickett

Introduction

A new paper, "Transformer Layers as Painters," co-authored by researchers at Emergence and Sakana AI, investigates the internal workings of transformers, focusing on how removing or reorganizing layers affects pretrained models. This understanding is crucial for enhancing the efficacy of large language models (LLMs) and potentially developing new model variants. The research uses empirical studies on frozen, pretrained models to reveal how different transformer layers behave.

The authors put forth a helpful way of conceptualizing the middle layers of a transformer through an analogy to an “assembly line of painters.” Imagine a canvas (the input) passed along a series of painters, some of whom specialize in painting birds while others specialize in painting wheels. Each painter receives the canvas from the painter below her, then decides whether to add a few strokes to the painting or simply pass it along to the painter above her.

In this analogy, each painter uses the same “vocabulary” for understanding paintings, so a painter can receive the canvas from any painter earlier in the assembly line without catastrophe. The painters may also be reordered without complete catastrophe (even if parts of the background get painted after foreground objects, occluding them), and the painters may even all add their strokes at the same time (in parallel). The experiments summarized below show how far this analogy holds.
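One rough way to make the “same vocabulary” idea concrete is to compare the hidden states that different layers emit for the same input. The sketch below is illustrative only and is not the paper's methodology or code: it uses the small, openly available gpt2 checkpoint as a stand-in (the paper works with Llama2-7B and BERT-Large) and mean-pools token activations before comparing layers.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch, not the paper's code: compare per-layer hidden states
# of a small, ungated model. "gpt2" is a stand-in for the models in the paper.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

with torch.no_grad():
    out = model(**tok("The painter adds a few strokes.", return_tensors="pt"))

# hidden_states: embedding output plus one tensor per layer, each (1, seq, hidden).
states = torch.stack(out.hidden_states[1:]).squeeze(1).mean(dim=1)  # mean-pool tokens per layer
sims = torch.nn.functional.cosine_similarity(
    states.unsqueeze(0), states.unsqueeze(1), dim=-1
)  # (n_layers, n_layers) pairwise cosine similarity
print(sims.round(decimals=2))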

Models and Benchmarks

The experiments detailed in the paper primarily use two transformer models: Llama2-7B, a decoder-only model with 7 billion parameters and 32 layers, and BERT-Large, an encoder-only model with 24 layers and 340 million parameters. The benchmarks include ARC, HellaSwag, GSM8K, WinoGrande, and LAMBADA for Llama2, and tasks from the GLUE benchmark for BERT.
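As a point of reference, the snippet below loads the two models from Hugging Face and confirms their layer and parameter counts. The checkpoint names are assumptions for illustration; the paper does not tie its results to these specific hosted weights (the Llama-2 checkpoint is access-gated).

```python
from transformers import AutoModel, AutoModelForCausalLM

# Checkpoint names are illustrative assumptions, not specified by the paper.
llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
bert = AutoModel.from_pretrained("bert-large-uncased")

print(len(llama.model.layers), sum(p.numel() for p in llama.parameters()))  # 32 layers, ~7B params
print(len(bert.encoder.layer), sum(p.numel() for p in bert.parameters()))   # 24 layers, ~340M params

# All experiments run on frozen, pretrained weights: no fine-tuning or gradient updates.
for model in (llama, bert):
    for p in model.parameters():
        p.requires_grad = False
```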

Key Questions and Findings

  1. Do layers use the same representation space?
    1. Middle layers share a common representation space, unlike the first and last few layers.
    2. Skipping middle layers or switching their order does not catastrophically affect performance, indicating a shared representation space.
  2. Are all the layers necessary?
    1. Not all layers are essential; skipping several middle layers results in graceful performance degradation.
    2. This suggests some redundancy in the middle layers.
  3. Are middle layers all performing the same function?
    1. Middle layers perform different functions despite sharing a representation space.
    2. Replacing middle layers with a single layer's weights results in significant performance drops, worse than skipping layers.
  4. Does the layer order matter? 
    1. The order of middle layers affects performance, but changes result in graceful degradation.
    2. Reversing or randomizing the order of middle layers is less harmful than initially expected.
  5. Can we run the layers in parallel?
    1. Running middle layers in parallel is feasible, with only minor performance drops, except for math-heavy tasks (see the sketch after this list).
    2. This could offer latency benefits without severely compromising accuracy.
  6. Does the order matter for some tasks more than others?
    1. Mathematical and reasoning tasks are more sensitive to layer order than semantic tasks.
    2. This indicates that certain tasks require a more structured layer order to maintain performance.
  7. Does looping help parallelized layers?
    1. Iterating parallelized layers improves performance, with optimal iterations proportional to the number of parallelized layers.
    2. This suggests a method for enhancing parallel execution.
  8. Which variants harm performance the least?
    1. Randomizing layer order and looped parallel execution degrade performance the least.
    2. Repeating a single layer is the most damaging variant, with performance quickly dropping to the random baseline.
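The layer manipulations behind questions 2, 4, 5, and 7 (skipping, reordering, parallelizing, and looping the middle layers) are easy to express in code. The toy model below is a minimal, self-contained sketch, not the authors' implementation or either of the real models: a stack of simplified residual blocks that shows where each variant intervenes in the forward pass.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for one pre-norm transformer layer (residual MLP only, for brevity)."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, h):
        return h + self.mlp(self.norm(h))

class ToyTransformer(nn.Module):
    def __init__(self, d=64, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(Block(d) for _ in range(n_layers))

    def forward(self, h, variant="baseline", first=1, last=7, loops=1):
        head, middle, tail = self.layers[:first], self.layers[first:last], self.layers[last:]
        for layer in head:                 # first layer(s): always run in order
            h = layer(h)
        if variant == "skip":              # drop the middle block entirely
            pass
        elif variant == "reverse":         # run the middle layers in reverse order
            for layer in reversed(middle):
                h = layer(h)
        elif variant == "parallel":        # average the middle layers' outputs; with
            for _ in range(loops):         # loops > 1 the average is fed back in (looped parallel)
                h = torch.stack([layer(h) for layer in middle]).mean(dim=0)
        else:                              # baseline: normal sequential order
            for layer in middle:
                h = layer(h)
        for layer in tail:                 # last layer(s): always run in order
            h = layer(h)
        return h

x = torch.randn(2, 16, 64)                 # (batch, sequence, hidden)
model = ToyTransformer().eval()
with torch.no_grad():
    for v in ("baseline", "skip", "reverse", "parallel"):
        print(v, model(x, variant=v).shape)
    print("looped-parallel", model(x, variant="parallel", loops=3).shape)
```

Applied to a real frozen model, the same forward-pass surgery is what produces the graceful degradation reported above; the looped-parallel variant simply feeds the averaged middle output back through the same layers for a few iterations.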

Discussion

The study reveals that transformers have three distinct classes of layers (beginning, middle, and ending), with middle layers exhibiting robustness and varied functionality. These findings suggest potential for architectural improvements and methods to trade accuracy for latency in transformer models. The insights gained can guide the development of more efficient and effective LLMs.

Conclusion

Understanding the distinct roles of transformer layers is key to improving LLMs and optimizing their architectures. Our paper provides foundational insights that are crucial for advancing AI technologies. For a deeper dive into the experiments and findings, read the full paper on arXiv.
