Unfolding Attention: How Log-Space, Semirings, and Separability Reveal a Path to Linear-Time Transformers
By Anooj P.
Transformer attention is usually presented as a compact, almost innocent equation:

Attention(Q, K, V) = softmax(QKᵀ / √d) V
But behind this compact form lies a deep algebraic and geometric structure, one that determines both the expressiveness and the computational cost of modern Transformers.
This essay explores these hidden structures. By examining attention in log-space, interpreting it through semirings, and understanding the role of separability, we see how attention can be reformulated into a structure that behaves like a Transformer but computes like an RNN.
This reframing suggests a path toward linear-time, softmax-like attention, potentially combining the best properties of RNNs, kernel machines, and probabilistic dynamic programming.
1. Why Examining Attention in Log-Space Matters
Consider the softmax denominator for a single query vector q:

Z(q) = Σ_j exp(qᵀk_j)

Taking a log gives:

log Z(q) = LSE(qᵀk_1, ..., qᵀk_n)

where the log-sum-exp operator is

LSE(x_1, ..., x_n) = log Σ_i exp(x_i)
The geometry of LSE is the key.
If you graph z = LSE(x, y) = log(e^x + e^y), you see:

- For x >> y, the surface hugs the plane z = x
- For y >> x, it hugs the plane z = y
- Near x = y, there is a smooth "ridge", the locus where the operator has high curvature

This ridge prevents the operator from decomposing into a sum of separate functions of its inputs:

LSE(x, y) ≠ f(x) + g(y)

That single fact, non-separability, forces the full O(n²) computation in self-attention.
Understanding this leads directly to why attention is expensive, and what kind of changes might make it cheaper.
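Both claims, numerical stability in log-space and non-separability, are easy to check directly. A minimal NumPy sketch (the `lse` helper is mine, not a library function):

```python
import numpy as np

def lse(xs):
    """Numerically stable log-sum-exp: log(sum(exp(x_i))), shifted by the max."""
    xs = np.asarray(xs, dtype=float)
    m = np.max(xs)
    return m + np.log(np.sum(np.exp(xs - m)))

# Stability: the naive computation overflows for large scores; LSE does not.
scores = np.array([1000.0, 1000.5])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(scores)))
print(np.isinf(naive))   # True: exp(1000) overflows
print(lse(scores))       # finite, ~1000.97

# Non-separability: if LSE(x, y) = f(x) + g(y) held, then
# LSE(x1,y1) + LSE(x2,y2) would equal LSE(x1,y2) + LSE(x2,y1) for all inputs.
x1, y1, x2, y2 = 0.0, 0.0, 3.0, 3.0
lhs = lse([x1, y1]) + lse([x2, y2])
rhs = lse([x1, y2]) + lse([x2, y1])
print(abs(lhs - rhs))  # clearly nonzero: no additive decomposition exists
```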
2. Separability: The Bridge Between Expressiveness and Compute Cost
A kernel K(q, k) is separable if it decomposes as:

K(q, k) = φ(q)ᵀ φ(k)

When this holds, the entire sequence of keys can be compressed into query-independent states:

Z = Σ_j φ(k_j),    S = Σ_j φ(k_j) v_jᵀ

Then for any query:

o(q) = (φ(q)ᵀ S) / (φ(q)ᵀ Z)

This is O(n), linear in the sequence length.

This trick forms the basis of Performer-style linear Transformers. The limitation is that the exact kernel exp(qᵀk) is not separable into any finite-dimensional φ.
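A small NumPy sketch of the trick, using Performer-style positive random features as a stand-in for φ (the feature map, dimensions, and scalings below are illustrative assumptions, not the full variance-reduced Performer construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 64, 16, 1024  # sequence length, model dim, feature dim (illustrative)

# Positive random features: phi(q) @ phi(k) ≈ exp(q @ k) in expectation over W.
W = rng.normal(size=(r, d))
def phi(x):
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(r)

K = rng.normal(size=(n, d)) / np.sqrt(d)  # keys
V = rng.normal(size=(n, d))               # values
q = rng.normal(size=d) / np.sqrt(d)       # one query

# Compress all keys ONCE, independently of any query: O(n) total work.
Z = sum(phi(k) for k in K)                          # normalizer state
S = sum(np.outer(phi(k), v) for k, v in zip(K, V))  # value state

out_linear = (phi(q) @ S) / (phi(q) @ Z)  # O(1) per query after compression

# Reference: exact softmax attention for the same query.
w = np.exp(K @ q)
out_exact = (w / w.sum()) @ V
print(np.max(np.abs(out_linear - out_exact)))  # small approximation error
```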
Thus we search for a new algebra (or a modified softmax) that preserves the behavior while becoming separable.
3. Semiring Linearity vs. Computational Linearity
In the log-sum-exp semiring:

- addition is x ⊕ y = LSE(x, y) = log(e^x + e^y)
- multiplication is x ⊗ y = x + y

Softmax attention's log-normalizer becomes:

log Z(q) = (qᵀk_1) ⊕ (qᵀk_2) ⊕ ... ⊕ (qᵀk_n)

The softmax is linear in this algebra.
This is the same algebra that powers:
- HMM forward-backward
- CRF partition functions
- Dynamic programming under uncertainty
However:
Semiring linearity ≠ computational linearity.
Even though the ⊕ operator is associative, we must still evaluate the n terms qᵀk_j for each query, leading to the familiar O(n²) behavior.
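The gap between the two notions of linearity can be made concrete: the fold below is perfectly ⊕-associative, yet it still touches every key once per query. A minimal NumPy sketch (the `oplus` helper is mine):

```python
import numpy as np

def oplus(x, y):
    """Semiring addition: stable two-argument log-sum-exp."""
    m = max(x, y)
    return m + np.log(np.exp(x - m) + np.exp(y - m))

# Associativity (semiring linearity):
x, y, z = 0.3, -1.2, 2.7
print(oplus(oplus(x, y), z) - oplus(x, oplus(y, z)))  # ~0

# Semiring-linear evaluation of the softmax normalizer for ONE query:
rng = np.random.default_rng(1)
K = rng.normal(size=(8, 4))
q = rng.normal(size=4)

logZ = -np.inf  # the semiring's additive identity
for k in K:     # one full pass over all the keys...
    logZ = oplus(logZ, q @ k)
# ...which must be repeated for every query: associativity alone does not
# let us pre-aggregate the keys into a q-independent summary.
print(logZ - np.log(np.sum(np.exp(K @ q))))  # ~0: matches the direct normalizer
```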
We therefore need an operator that is:
- semiring-linear
- and separable
- so that pre-aggregated state can be reused
This leads us to linearizable semirings.
4. Linearizable Semirings: Soft-Max-Plus, MoE-LSE, and Log-Exp-Mean
We seek a replacement operator for LSE that:
- is differentiable
- approximates softmax behavior
- is associative (semiring-like)
- is separable or approximately separable
We describe three such families.
4.1 Soft-Max-Plus Semiring
Define:

x ⊕_τ y = max(x, y) + τ σ(−|x − y| / τ)

where σ is a smooth saturating function. (With σ(t) = log(1 + e^t) and τ = 1, this recovers exact LSE.)

As τ → 0:

x ⊕_τ y → max(x, y)

which is the tropical semiring, perfectly separable.
This yields a softmax-like operator that remains nearly separable, allowing approximate linear-time accumulation.
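A quick numerical check of this family, using the saturating choice σ(t) = log(1 + e^t) (the temperature parameterization τ is a reconstruction for illustration):

```python
import numpy as np

def soft_max_plus(x, y, tau):
    """max(x, y) plus a smooth saturating correction.
    With sigma(t) = log(1 + e^t), tau = 1 recovers exact LSE(x, y);
    tau -> 0 recovers tropical (max-plus) addition."""
    return max(x, y) + tau * np.log1p(np.exp(-abs(x - y) / tau))

x, y = 1.0, 1.5
for tau in [1.0, 0.1, 0.01]:
    print(tau, soft_max_plus(x, y, tau))
# tau = 1 matches log(e^x + e^y) exactly; small tau approaches max(x, y) = 1.5
```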
4.2 Mixture-of-Exponentials LSE (MoE-LSE)
Define a monotone warp:

ψ(s) = Σ_r c_r exp(λ_r s),    c_r > 0

Semiring addition becomes:

x ⊕ y = ψ⁻¹(ψ(x) + ψ(y))

The key is that, applied to an attention score,

ψ(qᵀk) = Σ_r c_r exp(λ_r qᵀk)

is a mixture of exponentials. Each component exp(λ_r qᵀk) is separable via standard random features:

exp(λ_r qᵀk) ≈ φ_r(q)ᵀ φ_r(k)
Thus:
- separable
- learnable curvature
- compatible with linear accumulation
- softmax-like behavior
This is arguably one of the most promising paths to exact linearization.
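The component-wise separability claim can be sketched numerically. Below, each exp(λ_r qᵀk) is approximated with positive random features on rescaled vectors, since exp(λ qᵀk) = exp((√λ q)ᵀ(√λ k)); the mixture weights and rates are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 8, 512
c = np.array([0.7, 0.3])    # mixture weights (illustrative values)
lam = np.array([0.5, 1.5])  # mixture rates   (illustrative values)

def psi(s):
    """psi(s) = sum_r c_r * exp(lam_r * s), evaluated directly."""
    return float(np.sum(c * np.exp(lam * s)))

# Positive random features on the rescaled vectors sqrt(lam) * x.
W = rng.normal(size=(r, d))
def feat(x, l):
    xs = np.sqrt(l) * x
    return np.exp(W @ xs - xs @ xs / 2.0) / np.sqrt(r)

q = 0.3 * rng.normal(size=d)
k = 0.3 * rng.normal(size=d)

exact = psi(q @ k)
approx = sum(c_r * feat(q, l) @ feat(k, l) for c_r, l in zip(c, lam))
print(exact, approx)  # close: the mixture is separable component by component
```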
4.3 Log-Exp-Mean Semiring
Define the power mean:

M_p(x_1, ..., x_n) = (1/p) log( (1/n) Σ_i exp(p x_i) )

As the parameter p varies:

- p → ∞: max_i x_i
- p = 1: log-exp-mean (softmax-like)
- p → 0: arithmetic mean (1/n) Σ_i x_i

Because exp(p qᵀk) admits separable feature maps, this entire family becomes linearizable.
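The three limits are easy to verify numerically; a stabilized sketch (max-shifted, the same trick as LSE):

```python
import numpy as np

def log_exp_mean(xs, p):
    """Power-mean family: (1/p) * log( mean( exp(p * x_i) ) ),
    shifted by the max for numerical stability."""
    xs = np.asarray(xs, dtype=float)
    m = np.max(xs)
    return m + np.log(np.mean(np.exp(p * (xs - m)))) / p

xs = [0.0, 1.0, 2.0]
print(log_exp_mean(xs, 100.0))  # ≈ max(xs) = 2.0
print(log_exp_mean(xs, 1.0))    # softmax-like, between the mean and the max
print(log_exp_mean(xs, 1e-6))   # ≈ arithmetic mean = 1.0
```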
5. Why Linearizable Semirings Produce RNN-Like Computation
If the kernel becomes separable, K(q, k) ≈ φ(q)ᵀ φ(k), then we may build the following recurrences:

Z_t = Z_{t−1} + φ(k_t)
S_t = S_{t−1} + φ(k_t) v_tᵀ

Given a query q_t:

o_t = (φ(q_t)ᵀ S_t) / (φ(q_t)ᵀ Z_t)

This is literally an RNN hidden-state update:

- state update: linear, additive
- state dimension: fixed (feature-mapped rank), independent of sequence length
- output: nonlinear, query-conditioned readout
Thus we recover:
- the long-range context modeling of attention
- the efficiency of an RNN
- the stability of log-domain arithmetic
This is the dream: softmax expressive power at RNN cost.
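The recurrence above can be checked end to end: the streaming, fixed-state update reproduces batch causal linear attention exactly. A sketch, where φ is an arbitrary fixed positive feature map (an assumption for illustration; the equivalence holds for any fixed φ):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 32, 8, 64
W = rng.normal(size=(r, d))
def phi(x):  # any fixed positive feature map works for this equivalence
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(r)

K = 0.3 * rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
Q = 0.3 * rng.normal(size=(n, d))

# Causal, RNN-style recurrence: fixed-size state, one additive update per step.
Z = np.zeros(r)        # normalizer state
S = np.zeros((r, d))   # value state
outs = []
for t in range(n):
    Z = Z + phi(K[t])                 # state update: linear, additive
    S = S + np.outer(phi(K[t]), V[t])
    f = phi(Q[t])
    outs.append((f @ S) / (f @ Z))    # nonlinear, query-conditioned readout
outs = np.array(outs)

# The same outputs, recomputed non-recurrently with a causal mask: the
# recurrence is just a streaming evaluation of separable (linear) attention.
for t in [0, n // 2, n - 1]:
    f = phi(Q[t])
    Zt = sum(phi(k) for k in K[: t + 1])
    St = sum(np.outer(phi(k), v) for k, v in zip(K[: t + 1], V[: t + 1]))
    assert np.allclose(outs[t], (f @ St) / (f @ Zt))
print(outs.shape)  # one d-dimensional output per position
```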
6. Closing Thoughts
By shifting attention into log-space and interpreting it through the lens of semirings, we gain a deeper understanding of:
- why softmax is expressive
- why softmax is expensive
- why separability matters
- how alternative semiring operations (Soft-Max-Plus, MoE-LSE, Log-Exp-Mean) can preserve softmax-like behavior
- and how these operators create Transformer layers that compute like RNNs
This perspective reveals a broader landscape where we can redesign the core algebra of attention itself: an emerging area that unifies classical probabilistic models, kernel methods, and modern deep learning architectures.
Visual Intuitions: Diagrams (WIP)
Diagram 1: The Folded Geometry of Log-Sum-Exp
z = log(e^x + e^y)

        z
        ^
        |                        /   (sheet: z ≈ y when y >> x)
        |                      /
        |                    /
        |                  /
        |                /
        |              /
        |            /   <-- smooth ridge near x = y
        |          /
        |        /
        |      /                     (sheet: z ≈ x when x >> y)
        +-----------------------------> x
         \
          \
           v  y
- Far from the diagonal x = y, the surface is nearly planar (z ≈ x or z ≈ y).
- Near x = y, the ridge has high curvature, preventing a decomposition LSE(x, y) = f(x) + g(y).
- This non-separability is the geometric source of the quadratic cost of exact softmax attention.
Diagram 2: Separable vs. Non-Separable Kernels
Non-separable exponential kernel:        Separable feature-map kernel:

K(q, k) = exp(qᵀk)                       K(q, k) ≈ φ(q)ᵀ φ(k)

q ──►[ qᵀk, exp ]──► weight              q ──►[ φ(q) ]──┐
k ──►                                    k ──►[ φ(k) ]──┴──► weight = φ(q)ᵀφ(k)

- No finite-dimensional summary          - Keys summarized as Z = Σ φ(k_j)
  of {k_j} works for all q.              - Reusable across queries q.
Separable kernels allow us to compress all keys into query-independent statistics and reuse them for every query, enabling linear-time attention.
Diagram 3: Linearizable Semiring Pipeline (MoE-LSE Example)
Query q, key k

      s = qᵀk
        │
        ▼
      ψ(s) = Σ_r c_r exp(λ_r s)      (mixture of exponentials)
        │
        ▼
      ψ(s) ≈ φ(q)ᵀ φ(k)              (separable approximation)
        │
   ┌────┴──────────────────┐
   ▼                       ▼
accumulate Z = Σ φ(k_j)   accumulate S = Σ φ(k_j) v_jᵀ

At query time:

q ──► φ(q)
        │
        ├──► Ẑ(q) = φ(q)ᵀ Z
        └──► Ŝ(q) = φ(q)ᵀ S

o(q) = Ŝ(q) / Ẑ(q)
Here, ψ defines a semiring addition with exponential-mixture curvature. Because ψ is separable component by component, we can maintain query-independent states Z and S and compute attention in linear time.
Diagram 4: RNN-Style Recurrence Hidden Inside Linear Attention
Time t-1:                     Time t:
Z_{t-1}, S_{t-1}              k_t, v_t
        │                        │
        └──────── update ────────┘
                  ▼
        Z_t = Z_{t-1} + φ(k_t)
        S_t = S_{t-1} + φ(k_t) v_tᵀ

At query time:

q_t ──► φ(q_t)
          │
          ├──► Ẑ_t = φ(q_t)ᵀ Z_t
          └──► Ŝ_t = φ(q_t)ᵀ S_t

o_t = Ŝ_t / Ẑ_t
This is structurally an RNN:
- (Z_t, S_t) is the hidden state.
- The update is linear-time and additive.
- The readout o_t is a nonlinear function of q_t and the state.
Linearizable semirings give us Transformer-like attention that computes like an RNN.
Relationship to the Main Semiring-Attention Blog
This preface is intended as an intuitive and geometric introduction to the more technical post:
"Rethinking Attention Through Semirings: Toward Linear-Time Log-Domain Transformers"
That companion post dives deeper into:
- formal semiring definitions,
- expectation semiring and log-domain recurrences,
- linear-time state updates,
- and connections to Performer, HMMs, CRFs, and Manifest AI's symmetric power / polynomial-sketch transformers.
You can read this preface as the conceptual "front porch" to that more algebraically detailed article.
References
[1] Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, Ł., Belanger, D., Colwell, L., & Weller, A. (2020). Rethinking Attention with Performers. arXiv:2009.14794. https://arxiv.org/abs/2009.14794
[2] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236. https://arxiv.org/abs/2006.16236
[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135
[4] Goodman, J. (1999). Semiring Parsing. Computational Linguistics, 25(4), 573–606. https://aclanthology.org/J99-4004.pdf
[5] Eisner, J. (2002). Parameter Estimation for Probabilistic Finite-State Transducers. Proceedings of ACL 2002. https://aclanthology.org/P02-1001/
[6] Manifest AI. (2024, Aug 15). Symmetric Power Transformers. https://manifestai.com/articles/symmetric-power-transformers/
[7] Kumar, S., Buckman, J., Gelada, C., & Zhang, S. (2025). Conformal Transformations for Symmetric Power Transformers. arXiv:2503.03269. https://arxiv.org/abs/2503.03269
[8] Kacham, P., Rao, A. N., Chen, X., Lin, W., Wang, S.-P., Seidel, K., Wang, X., Sun, L., Ye, S., & Zhang, X. (2024). Fast Transformers via Sketching Polynomial Kernels (PolySketchFormer). ICML 2024 (PMLR Vol. 235). https://arxiv.org/abs/2310.01655 OpenReview: https://openreview.net/forum?id=YkCjojDG3l
[9] Manifest AI. (2024, Dec 10). Improving Symmetric Power Transformers with Conformal Gating. https://manifestai.com/articles/optimizing-symmetric-power-transformers/