Attention Residuals (AttnRes)
Attention Residuals (AttnRes) modifies LLM residual aggregation using learned, input-dependent softmax attention; Block AttnRes reduces large-scale training overhead.
What is Attention Residuals (AttnRes)?
Attention Residuals (AttnRes) is a model architecture change for large language models that modifies how residual connections aggregate information across layers. In many modern LLM setups, residual connections with PreNorm accumulate all preceding layer outputs using fixed unit weights, which can lead to uncontrolled hidden-state growth with depth and can dilute how much each layer contributes.
AttnRes replaces the fixed accumulation with learned, input-dependent softmax attention over preceding layer outputs, so each layer can selectively aggregate earlier representations. To make this practical for large-scale training, the paper introduces Block AttnRes, which reduces the memory and communication overhead by attending over block-level representations instead of all preceding layer outputs.
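The aggregation step described above can be sketched in plain Python. Note the details here are illustrative assumptions: in a real model the query would come from a learned projection of the current layer's input, and the paper's exact score function and parameterization may differ.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attn_res(query, layer_outputs):
    """Replace fixed unit-weight residual accumulation with softmax attention.

    query         -- vector derived from the current layer's input (in a real
                     model this would be produced by a learned projection).
    layer_outputs -- outputs of all preceding layers, each a vector.
    Returns a weighted sum of preceding outputs instead of their plain sum.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, h) / scale for h in layer_outputs]
    weights = softmax(scores)  # input-dependent, sums to 1
    dim = len(layer_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, layer_outputs))
            for i in range(dim)]
```

Because the weights sum to 1, the aggregated state does not grow without bound as layers are added, in contrast to the fixed unit-weight sum.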
Key Features
- Softmax attention over preceding layer outputs (AttnRes): Uses learned, input-dependent weights to decide how much earlier layer representations should contribute to the current layer.
- Block-wise attention (Block AttnRes): Partitions layers into blocks and performs attention at the block level to reduce memory footprint compared with full attention over all preceding layers.
- Cache-based pipeline communication: Incorporates cache mechanisms for pipeline parallelism to help reduce communication overhead during training.
- Two-phase computation strategy: Adds a computation structure intended to make the block attention approach practical during large-scale model training.
- Drop-in replacement framing for residual connections: Designed to replace standard residual connections with minimal overhead relative to the baseline residual setup.
- Validated across model sizes with scaling law experiments and ablations: Reports consistent improvement across model sizes and ablation results supporting the benefit of content-dependent depth-wise selection.
How to Use Attention Residuals (AttnRes)
If you are implementing or evaluating this research idea, start by identifying the residual connection pattern used in your target model (specifically residual connections with PreNorm and fixed unit-weight accumulation). Then:
- Replace the residual aggregation with AttnRes, using softmax attention to compute input-dependent weights over preceding layer outputs.
- If training cost is a concern, use Block AttnRes by partitioning layers into blocks and attending over block-level representations to reduce memory usage.
- Follow the training practicality components described in the paper—cache-based pipeline communication and a two-phase computation strategy—to manage overhead when scaling up.
- Evaluate on downstream tasks and/or run ablations to confirm that content-dependent selection improves performance in your setting.
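The second step above (Block AttnRes) can be sketched as follows. Summarizing each block by the mean of its outputs is an illustrative assumption; the paper's block-level representation may be computed differently.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_attn_res(query, layer_outputs, block_size):
    """Attend over per-block summaries instead of every preceding layer.

    Layers are partitioned into contiguous blocks; each block is summarized
    here by the mean of its outputs (an illustrative choice). Attention then
    runs over roughly len(layer_outputs) / block_size items rather than all
    preceding layers, shrinking the attended set and its memory footprint.
    """
    blocks = [layer_outputs[i:i + block_size]
              for i in range(0, len(layer_outputs), block_size)]
    dim = len(layer_outputs[0])
    summaries = [[sum(h[j] for h in blk) / len(blk) for j in range(dim)]
                 for blk in blocks]
    scale = math.sqrt(len(query))
    scores = [dot(query, s) / scale for s in summaries]
    weights = softmax(scores)
    return [sum(w * s[j] for w, s in zip(weights, summaries))
            for j in range(dim)]
```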
Use Cases
- Improving deep LLM training stability where PreNorm dilution is a concern: Apply AttnRes to address the reported issue that uniform aggregation can lead to uncontrolled hidden-state growth and progressively diluted layer contributions.
- Large-scale training setups sensitive to attention memory/communication costs: Use Block AttnRes to keep the selective aggregation benefits while reducing the overhead of attending across all preceding layers.
- Model architecture experiments on residual connection variants: Compare standard residual connections against attention-based residual aggregation to quantify how content-dependent selection affects performance.
- Downstream evaluation of representation quality across tasks: Use the method in a pretrained architecture to test whether mitigating dilution yields better downstream results across evaluated tasks.
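The dilution concern behind the first use case can be illustrated with a toy simulation of the baseline fixed unit-weight accumulation (the dimension, depth, and unit-variance layer outputs are arbitrary assumptions for illustration):

```python
import math
import random

def l2(v):
    return math.sqrt(sum(x * x for x in v))

random.seed(0)
dim, depth = 64, 48
hidden = [random.gauss(0.0, 1.0) for _ in range(dim)]
relative_contrib = []
for _ in range(depth):
    # Baseline residual stream: each layer's unit-variance output is added
    # with a fixed weight of 1, so the stream's norm keeps growing.
    out = [random.gauss(0.0, 1.0) for _ in range(dim)]
    before = l2(hidden)
    hidden = [a + b for a, b in zip(hidden, out)]
    relative_contrib.append(l2(out) / before)

# The accumulated norm grows roughly like sqrt(depth), so each new layer's
# output is progressively smaller relative to the accumulated state.
print(relative_contrib[0] > relative_contrib[-1])
```

In this toy setting, later layers contribute a steadily shrinking fraction of the residual stream's magnitude, which is the effect AttnRes's learned weighting is intended to counteract.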
FAQ
- What problem does AttnRes address? The approach targets residual connections (notably with PreNorm) that accumulate all layer outputs using fixed unit weights, which the paper says can cause uncontrolled hidden-state growth with depth and dilute each layer’s contribution.
- How does AttnRes differ from standard residual connections? Instead of fixed unit-weight aggregation, AttnRes uses learned, input-dependent softmax attention to selectively aggregate preceding layer outputs.
- Why introduce Block AttnRes? According to the paper, full attention over all preceding layer outputs introduces memory and communication overhead at large scale; Block AttnRes reduces this by attending over block-level representations.
- Is Block AttnRes intended to be practical for training? Yes. The description ties Block AttnRes to additional training components (cache-based pipeline communication and a two-phase computation strategy) aimed at reducing overhead and enabling use as a drop-in replacement for residual connections.
- Where was AttnRes integrated and tested? The content mentions integration into a “Kimi Linear” architecture (48B total / 3B activated parameters) and pretraining on 1.4T tokens, along with reported downstream improvements across evaluated tasks.
Alternatives
- Standard residual connections with PreNorm (baseline): The most direct alternative; it uses fixed unit-weight accumulation across layer outputs and serves as the baseline AttnRes aims to improve.
- Residual connection variants that change normalization or aggregation mechanics: If your goal is to manage depth-related effects, you could compare other architectural modifications that alter how information is combined across layers without using attention over preceding outputs.
- Other attention-efficient mechanisms for deep networks: For training-cost constraints, consider methods that reduce attention memory or communication (for example, approaches that limit attention scope or restructure computation); the specific algorithms would differ from the block attention design described here.
- Content selection techniques outside residual aggregation: If you want input-dependent depth-wise selection, you can consider alternative ways to gate or route information across layers rather than applying softmax attention directly to preceding layer outputs.