A new paper, "Transformer Layers as Painters," co-authored by Emergence and Sakana AI researchers, investigates the internal workings of transformers, focusing on how removing or reorganizing information across layers affects pretrained models. This understanding is crucial for improving the effectiveness of large language models (LLMs) and potentially for developing new model variants. The research uses empirical studies of frozen models to reveal how the different transformer layers behave.
The authors offer a helpful way to conceptualize the middle layers of a transformer through an analogy to an “assembly line of painters.” Imagine a canvas (the input) passed along a series of painters, some of whom specialize in painting birds while others specialize in painting wheels. Each painter receives the canvas from the painter below her, then decides whether to add a few strokes to the painting or simply pass it along to the painter above her.
In this analogy, each painter uses the same “vocabulary” for understanding paintings, so a painter can receive the canvas from a painter earlier in the assembly line without catastrophe. The painters may also be reordered without complete catastrophe (even if parts of the background get painted after foreground objects, occluding them), and the painters may even all add their strokes at the same time (in parallel). The experiments summarized below make this picture concrete.
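To make the painter analogy concrete, here is a minimal PyTorch sketch (not code from the paper) that runs a small stack of identical-interface encoder layers in the original order, with some layers skipped, in a shuffled order, and in parallel with their outputs averaged. The toy layer stack, dimensions, layer orders, and averaging rule are illustrative assumptions; the paper performs the analogous manipulations on the middle layers of frozen pretrained models.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy "assembly line": 8 encoder layers that all share the same interface,
# standing in for the middle layers of a frozen pretrained transformer.
d_model, n_layers = 64, 8
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(n_layers)
)
x = torch.randn(1, 10, d_model)  # (batch, sequence, hidden) -- the "canvas"

def run_sequential(x, layer_order):
    """Pass the hidden states through the listed layers, in the given order."""
    h = x
    for i in layer_order:
        h = layers[i](h)
    return h

def run_parallel(x, layer_ids):
    """Give every painter the same canvas and average their outputs."""
    return torch.stack([layers[i](x) for i in layer_ids]).mean(dim=0)

with torch.no_grad():
    full     = run_sequential(x, range(n_layers))           # normal forward pass
    skipped  = run_sequential(x, [0, 2, 4, 6])              # skip half the layers
    shuffled = run_sequential(x, [0, 3, 1, 5, 2, 7, 4, 6])  # reorder the layers
    parallel = run_parallel(x, range(n_layers))             # all layers at once
```

In a trained model, the interesting question is how much each of these interventions degrades benchmark performance relative to the normal forward pass.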
The experiments in the paper primarily use two transformer models: Llama2-7B, a decoder-only model with 7 billion parameters and 32 layers, and BERT-Large, an encoder-only model with 24 layers and 340 million parameters. The benchmarks include ARC, HellaSwag, GSM8K, WinoGrande, and LAMBADA for Llama2-7B, and tasks from the GLUE benchmark for BERT-Large.
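As a quick sanity check on the layer counts quoted above, they can be read directly from the Hugging Face model configurations. This is a small illustrative snippet rather than code from the paper; it assumes the standard `meta-llama/Llama-2-7b-hf` and `bert-large-uncased` checkpoints, and note that the Llama 2 repository is gated, so reading its config requires accepting the model license.

```python
from transformers import AutoConfig

# Layer counts for the two models used in the paper, read from their configs
# (this downloads only the small config.json files, not the weights).
llama_cfg = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
bert_cfg = AutoConfig.from_pretrained("bert-large-uncased")

print("Llama2-7B decoder layers:", llama_cfg.num_hidden_layers)   # 32
print("BERT-Large encoder layers:", bert_cfg.num_hidden_layers)   # 24
```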
The study shows that transformers have three distinct classes of layers (beginning, middle, and ending), with the middle layers proving robust to skipping, reordering, and parallel execution while still performing varied functions. These findings point to possible architectural improvements and to methods for trading accuracy for latency in transformer models, and they can guide the development of more efficient and effective LLMs.
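One practical reading of the accuracy-versus-latency point is that, because middle layers tolerate skipping, a frozen model can be shallowed at inference time simply by removing some of them. Below is a hedged sketch of that idea for BERT-Large using Hugging Face transformers; the specific layers dropped and the model name are illustrative assumptions, not a recipe from the paper, and any such pruning should be validated on the target benchmark.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

model_name = "bert-large-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Drop a handful of middle layers (indices chosen purely for illustration);
# the first and last layers are kept, since the paper finds they behave
# differently from the middle of the stack.
drop = {10, 12, 14, 16}
model.encoder.layer = nn.ModuleList(
    layer for i, layer in enumerate(model.encoder.layer) if i not in drop
)
model.config.num_hidden_layers = len(model.encoder.layer)  # keep config consistent

inputs = tokenizer("Transformer layers as painters.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # the forward pass now runs 20 layers instead of 24
print(outputs.last_hidden_state.shape)
```

The trade-off is explicit: fewer layers mean less compute per token, at the cost of whatever accuracy those layers contributed on the task of interest.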
Understanding the distinct roles of transformer layers is key to improving LLMs and optimizing their architectures. Our paper provides foundational insights that are crucial for advancing AI technologies. For a deeper dive into the experiments and findings, read the full paper on arXiv.