Rethinking Attention Through Semirings: Toward Linear-Time Log-Domain Transformers
By Anooj Patel
This post builds on some intuition that is worked through in:
Transformers are defined by the self-attention mechanism—a kernel of extraordinary power and cost. Standard attention computes weighted averages of value vectors with weights derived from exponentiated query-key interactions:

$$\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$
This operation is nonlinear, non-separable, and quadratic in sequence length. Over the years, the community has attacked each of these properties, yielding families of linear-time transformers such as Performer [Choromanski et al., 2020], Linear Transformers [Katharopoulos et al., 2020], and FlashAttention [Dao et al., 2022].
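As a concrete baseline, here is a minimal NumPy sketch of the quadratic operation above (shapes and names are illustrative):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard O(n^2) softmax attention over a full sequence."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k) query-key logits
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the exponentials
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)               # each row is a softmax over keys
    return W @ V                                     # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(7, 4))
V = rng.normal(size=(7, 3))
out = softmax_attention(Q, K, V)
```

The `scores` matrix is the $n \times n$ object that every linear-time method tries to avoid materializing.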
But there is a subtler, algebraic perspective that invites us to see attention through the lens of semirings.
1. The Log-Sum-Exp Semiring: Attention as Algebra
The softmax normalization can be expressed in the log-sum-exp (LSE) semiring, defined by:

$$a \oplus b = \log\!\left(e^{a} + e^{b}\right), \qquad a \otimes b = a + b,$$

with identities $\mathbf{0} = -\infty$ and $\mathbf{1} = 0$.
This semiring has been used for decades in dynamic programming and probabilistic inference—most notably in HMM forward-backward and CRF computations [Eisner, 2002].
In this semiring, the self-attention output for a query $q_i$ becomes:

$$o_i = \sum_j \exp\!\big((q_i \cdot k_j) - \log Z_i\big)\, v_j, \qquad \log Z_i = \bigoplus_j \,(q_i \cdot k_j),$$

where the normalizer $\log Z_i$ is a pure semiring sum over the scores. Here, softmax attention is linear within the semiring algebra (associative and distributive) but nonlinear in the reals. Computing it exactly still costs $O(n^2)$, though it yields perfect numerical stability.
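A minimal NumPy sketch of the two semiring operations, using the usual max-shift trick so $\oplus$ never overflows (names are illustrative):

```python
import numpy as np
from functools import reduce

def oplus(a, b):
    """LSE-semiring 'addition': log(e^a + e^b), stabilized against overflow."""
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def otimes(a, b):
    """LSE-semiring 'multiplication': ordinary addition in the log domain."""
    return a + b

scores = np.array([2.0, -1.0, 0.5])
log_Z = reduce(oplus, scores, -np.inf)   # -inf is the additive identity
```

Folding `oplus` over the scores computes the softmax log-normalizer without ever leaving the log domain.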
2. Linearization: Kernel vs. Semiring Perspectives
There are two distinct routes to making attention linear-time:
| Method | Key Idea | Example |
|---|---|---|
| Kernel-based | Approximate the exponential kernel $e^{q \cdot k} \approx \phi(q)^\top \phi(k)$. | Performer, Linear Transformer |
| Semiring-based | Change the underlying algebra so exponentials become additive. | Log-sum-exp semiring, tropical semiring, expectation semiring |
Kernel linearization (Performer)
Performer and related models introduce a positive feature map $\phi$ such that:

$$e^{q \cdot k} \approx \phi(q)^\top \phi(k).$$

Then, query-independent states can be accumulated:

$$S = \sum_j \phi(k_j)\, v_j^\top, \qquad z = \sum_j \phi(k_j), \qquad o_i = \frac{S^\top \phi(q_i)}{z^\top \phi(q_i)}.$$
These query-independent states are what make attention linear-time: they summarize all keys and values in a finite-dimensional space.
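A sketch of this accumulation in NumPy, using the simple $\mathrm{elu}(x) + 1$ feature map from the Linear Transformer paper (non-causal case for brevity):

```python
import numpy as np

def phi(x):
    """A positive feature map, elu(x) + 1 (Katharopoulos et al., 2020)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n) attention: the states S and z summarize all keys and values."""
    S = phi(K).T @ V            # (d_feat, d_v) = sum_j phi(k_j) v_j^T
    z = phi(K).sum(axis=0)      # (d_feat,)    = sum_j phi(k_j)
    return (phi(Q) @ S) / (phi(Q) @ z)[:, None]

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(5, 4)), rng.normal(size=(7, 4))
V = rng.normal(size=(7, 3))
out = linear_attention(Q, K, V)
```

Because `S` and `z` do not depend on any query, the cost is linear in sequence length once they are built.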
Semiring linearization
In the LSE semiring, one can express causal attention recursively. For a fixed query $q$, the log-normalizer over the growing key prefix obeys:

$$\log Z_j = \log Z_{j-1} \oplus (q \cdot k_j), \qquad \log Z_0 = -\infty.$$

This recurrence is exact, differentiable, and numerically stable, but it remains $O(n^2)$ because each query involves $n$ new inner products $q_i \cdot k_j$.
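A sketch of the recurrence for a single query, checked against a direct log-sum-exp (the loop over keys is what must be repeated per query, hence the quadratic total cost):

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))

# log Z_j = log Z_{j-1} (+) (q . k_j), starting from the identity -inf
log_Z = -np.inf
for k in K:
    s = q @ k
    m = max(log_Z, s)
    log_Z = m + np.log(np.exp(log_Z - m) + np.exp(s - m))
```

This is the same running-maximum trick used by online softmax and FlashAttention.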
Semiring linearity $\neq$ computational linearity.
3. Toward Linear-Time Semirings: Separable Approximations to LSE
The obstacle to reusing state in log-semiring attention is non-separability:

$$a \oplus b = \log\!\left(e^a + e^b\right) \neq f(a) + g(b).$$

The smooth ridge near $a \approx b$ couples the two inputs; it cannot be written as $f(a) + g(b)$ for univariate $f$ and $g$. If we could approximate or replace this with a separable operation, we’d gain query-independent updates.
Approximation 1 — Tropical limit (Max-plus semiring)
Take the temperature $\beta \to \infty$:

$$\lim_{\beta \to \infty} \frac{1}{\beta} \log\!\left(e^{\beta a} + e^{\beta b}\right) = \max(a, b).$$
The tropical semiring (max-plus) is associative, separable, and defines the hard-attention limit.
Approximation 2 — Polynomial / mean-field expansion
For $|a - b|$ small,

$$\log\!\left(e^a + e^b\right) = \frac{a + b}{2} + \log 2 + \frac{(a - b)^2}{8} + O\!\big((a - b)^4\big).$$

The leading terms are separable; the quadratic residual can be approximated by a low-rank kernel.
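A quick numeric check of this expansion for nearly equal inputs, and of how badly the separable part alone does when the inputs are far apart:

```python
import numpy as np

a, b = 1.00, 1.02                                    # |a - b| small
exact = np.logaddexp(a, b)                           # log(e^a + e^b)
approx = (a + b) / 2 + np.log(2) + (a - b) ** 2 / 8  # separable + quadratic residual

# Far apart, the separable part alone is a poor approximation:
far_gap = abs(np.logaddexp(0.0, 5.0) - ((0.0 + 5.0) / 2 + np.log(2)))
```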
Approximation 3 — Parametric semiring interpolation
Define a family of semirings:

$$a \oplus_\beta b = \frac{1}{\beta} \log\!\left(\frac{e^{\beta a} + e^{\beta b}}{2}\right),$$

where $\beta$ controls curvature: $\beta \to \infty$ yields max, $\beta \to 0$ yields the arithmetic mean.
Learning $\beta$ per layer or head could allow models to interpolate between smooth and sharp attention.
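A sketch of this family as a binary operation, using the averaged form so the $\beta \to 0$ limit is finite (the normalization by $1/2$ is an assumption on my part; the unnormalized version differs by a $\log 2 / \beta$ offset):

```python
import numpy as np

def oplus_beta(a, b, beta):
    """(1/beta) * log((e^{beta a} + e^{beta b}) / 2), stabilized."""
    m = beta * max(a, b)
    return (m + np.log(0.5 * np.exp(beta * a - m)
                       + 0.5 * np.exp(beta * b - m))) / beta
```

Large $\beta$ sharpens the aggregation toward the max; small $\beta$ smooths it toward the mean.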
Approximation 4 — Learned separable decomposition
Because $\log\!\left(e^a + e^b\right) = \max(a, b) + \log\!\left(1 + e^{-|a - b|}\right)$, we can approximate the ridge function $g(a, b) = \log\!\left(1 + e^{-|a - b|}\right)$ as a low-rank expansion:

$$g(a, b) \approx \sum_{r=1}^{R} u_r(a)\, w_r(b).$$

This leads to a learned, separable semiring—essentially a feature map for the LSE surface itself.
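A small experiment suggesting the ridge term really is well approximated at low rank: sample it on a grid and truncate its SVD (the grid range and rank here are arbitrary choices):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 64)
G = np.log1p(np.exp(-np.abs(x[:, None] - x[None, :])))   # ridge g(a, b) on a grid
U, s, Vt = np.linalg.svd(G)
r = 8
G_r = (U[:, :r] * s[:r]) @ Vt[:r]                        # sum of r separable products
rel_err = np.linalg.norm(G - G_r) / np.linalg.norm(G)    # relative Frobenius error
```

A learned decomposition would replace the SVD factors with trainable functions $u_r$ and $w_r$.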
4. The Expectation Semiring: Differentiable and Linearizable
The expectation semiring [Eisner, 2002; Goodman, 1999] defines pairs $(w, s)$ of a weight and a weighted value, with:

$$(w_1, s_1) \oplus (w_2, s_2) = (w_1 + w_2,\; s_1 + s_2), \qquad (w_1, s_1) \otimes (w_2, s_2) = (w_1 w_2,\; w_1 s_2 + w_2 s_1).$$

Folding $\oplus$ over a sequence of pairs $(w_j, w_j v_j)$ and normalizing yields the weighted average $\bar{v} = \frac{\sum_j w_j v_j}{\sum_j w_j}$, which corresponds exactly to the attention expectation:

$$o_i = \frac{\sum_j e^{q_i \cdot k_j} v_j}{\sum_j e^{q_i \cdot k_j}}, \qquad w_j = e^{q_i \cdot k_j}.$$
This semiring generalizes easily to geometric aggregation (replace $v_j$ with $\log v_j$). When paired with separable weight functions—like kernel approximations of $w_j$—it produces linear-time expectation-semiring attention.
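A sketch of this fold in NumPy, checked against plain softmax attention for one query (names are illustrative):

```python
import numpy as np
from functools import reduce

def eplus(x, y):
    """Expectation-semiring 'addition': add weights and weighted values."""
    return (x[0] + y[0], x[1] + y[1])

rng = np.random.default_rng(2)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))

logits = K @ q
w = np.exp(logits - logits.max())                # stabilized attention weights
pairs = [(w[j], w[j] * V[j]) for j in range(len(w))]
Z, num = reduce(eplus, pairs)
out = num / Z                                    # the attention expectation
```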
5. Designing New Semirings for Attention
We can generalize the LSE semiring via a monotone warp $\psi$:

$$a \oplus_\psi b = \psi^{-1}\!\big(\psi(a) + \psi(b)\big), \qquad a \otimes b = a + b.$$

Properties:
- Commutative, associative, differentiable if $\psi$ is.
- Includes LSE when $\psi(x) = e^x$.
To make this linear-time, choose $\psi$ such that $\psi(q \cdot k)$ admits a feature-map factorization:

$$\psi(q \cdot k) \approx \phi(q)^\top \phi(k).$$
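The warped operation is a one-liner given $\psi$ and its inverse; a sketch, with $\psi = \exp$ recovering log-sum-exp exactly:

```python
import numpy as np

def warped_oplus(a, b, psi, psi_inv):
    """a (+)_psi b = psi_inv(psi(a) + psi(b))."""
    return psi_inv(psi(a) + psi(b))

# psi = exp recovers the LSE semiring exactly:
lse = warped_oplus(1.0, 2.0, np.exp, np.log)

# A softplus warp, psi(x) = log(1 + e^x), with its closed-form inverse:
softplus = lambda x: np.log1p(np.exp(x))
softplus_inv = lambda y: np.log(np.expm1(y))
sp = warped_oplus(1.0, 2.0, softplus, softplus_inv)
```

Associativity holds because $\psi$ maps the fold to ordinary addition: $\psi(a \oplus_\psi b) = \psi(a) + \psi(b)$.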
Examples of separable $\psi$
- Mixture of exponentials (MoE-LSE): $\psi(x) = \sum_m \alpha_m e^{\beta_m x}$ with $\alpha_m > 0$. This extends Performer with multiple exponential scales $\beta_m$ and mixing weights $\alpha_m$. Each component can be approximated by its own feature map, and the mixture combines them additively: $\psi(q \cdot k) \approx \sum_m \alpha_m\, \phi_m(q)^\top \phi_m(k)$.
The inverse $\psi^{-1}$ can be approximated with a small neural or Newton solver. Learning the mixture coefficients $(\alpha_m, \beta_m)$ yields a differentiable, adaptive semiring that smoothly interpolates between exponential kernels of varying sharpness.
- Softplus warp: $\psi(x) = \log\!\left(1 + e^{\beta x}\right)$, tunable curvature.
- Polynomial-on-tanh: $\psi(x) = \big(1 + \tanh(x)\big)^p$, bounded and stable.
This MoE-based semiring is particularly appealing: it’s differentiable, expressive, and efficiently separable. It generalizes the softmax ($\psi(x) = e^x$) while preserving associativity and allowing learned curvature.
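A sketch of the MoE-LSE warp with a bisection inverse standing in for the small Newton solver mentioned above (the specific $\alpha$, $\beta$ values are arbitrary placeholders for learned parameters):

```python
import numpy as np

alphas = np.array([0.7, 0.3])    # mixing weights (would be learned)
betas = np.array([1.0, 2.0])     # exponential scales (would be learned)

def psi(x):
    """MoE-LSE warp: psi(x) = sum_m alpha_m * exp(beta_m * x)."""
    return float(np.sum(alphas * np.exp(betas * x)))

def psi_inv(y, lo=-40.0, hi=40.0, iters=100):
    """psi is strictly increasing, so invert it by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if psi(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def moe_oplus(a, b):
    return psi_inv(psi(a) + psi(b))
```

Since the fold happens in $\psi$-space, associativity is preserved up to the tolerance of the inverse solver.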
6. Relation to Other Architectures
- Performer — kernel linearization of softmax [Choromanski et al., 2020].
- HMM / CRF — log-semiring dynamic programs [Eisner, 2002].
- RNN Semiring View — many recurrent updates can be written as semiring folds [Goodman, 1999].
- Manifest AI (2024–2025) — Symmetric Power Transformer [6] and the follow-up Conformal Symmetric Power work [7, 9] explore symmetry-preserving linearizations closely related to the $\psi$-semiring and MoE-LSE approach. For polynomial-kernel sketching, see PolySketchFormer [8].
References
[1] Choromanski, K. et al. (2020). Rethinking Attention with Performers. arXiv:2009.14794. https://arxiv.org/abs/2009.14794
[2] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236. https://arxiv.org/abs/2006.16236
[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135
[4] Goodman, J. (1999). Semiring Parsing. Computational Linguistics, 25(4), 573–606. https://aclanthology.org/J99-4004.pdf
[5] Eisner, J. (2002). Parameter Estimation for Probabilistic Finite-State Transducers. Proceedings of ACL 2002. https://aclanthology.org/P02-1001/
[6] Manifest AI. (2024, Aug 15). Symmetric Power Transformers. https://manifestai.com/articles/symmetric-power-transformers/
[7] Kumar, S., Buckman, J., Gelada, C., & Zhang, S. (2025). Conformal Transformations for Symmetric Power Transformers. arXiv:2503.03269. https://arxiv.org/abs/2503.03269
[8] Kacham, P., Mirrokni, V., & Zhong, P. (2024). PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels. ICML 2024 (PMLR Vol. 235). https://arxiv.org/abs/2310.01655 OpenReview: https://openreview.net/forum?id=YkCjojDG3l
[9] Manifest AI. (2024, Dec 10). Improving Symmetric Power Transformers with Conformal Gating. https://manifestai.com/articles/optimizing-symmetric-power-transformers/