Cindered Thoughts

Unfolding Attention: How Log-Space, Semirings, and Separability Reveal a Path to Linear-Time Transformers

By Anooj P.

Transformer attention is usually presented as a compact, almost innocent equation:

$$\mathrm{Attn}(q, K, V) = \frac{e^{qK^\top}}{\mathbf{1}^\top e^{qK^\top}}\, V.$$

But behind this compact form lies a deep algebraic and geometric structure, one that determines both the expressiveness and the computational cost of modern Transformers.

This essay explores these hidden structures. By examining attention in log-space, interpreting it through semirings, and understanding the role of separability, we see how attention can be reformulated into a structure that behaves like a Transformer but computes like an RNN.

This reframing suggests a path toward linear-time, softmax-like attention, potentially combining the best properties of RNNs, kernel machines, and probabilistic dynamic programming.


1. Why Examining Attention in Log-Space Matters

Consider the softmax denominator for a single query vector $q_t$:

$$Z_t = \sum_{j \le t} e^{q_t^\top k_j}.$$

Taking a log gives:

$$\log Z_t = \operatorname{LSE}_{j \le t}\big(q_t^\top k_j\big),$$

where the log-sum-exp operator is

$$\operatorname{LSE}(a, b) = \log\big(e^a + e^b\big).$$

The geometry of LSE is the key.

If you graph $z = \log(e^x + e^y)$, you see two nearly planar sheets, $z \approx x$ where $x \gg y$ and $z \approx y$ where $y \gg x$, joined by a smooth ridge along the diagonal $x = y$ (sketched in Diagram 1 below).

This ridge prevents the operator from decomposing into a sum of separate functions of its inputs:

$$\log(e^x + e^y) \neq f(x) + g(y).$$

That single fact, non-separability, forces the full $O(N^2)$ computation in self-attention.

Understanding this leads directly to why attention is expensive, and what kind of changes might make it cheaper.
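A quick numerical check makes the non-separability concrete. Any function of the form $f(x) + g(y)$ has an identically zero mixed partial $\partial^2 z / \partial x \partial y$, while LSE does not (at the ridge $x = y$ its mixed partial is $-1/4$). A minimal sketch in plain Python, using finite differences:

```python
import math

def lse(x, y):
    # log-sum-exp of two scalars
    return math.log(math.exp(x) + math.exp(y))

def mixed_partial(f, x, y, h=1e-4):
    # central finite-difference estimate of d^2 f / dx dy
    return (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4 * h * h)

# A separable sum f(x) + g(y) has zero mixed partial everywhere:
sep = lambda x, y: x**2 + math.sin(y)
assert abs(mixed_partial(sep, 0.3, -0.7)) < 1e-5

# LSE has a genuinely nonzero mixed partial near the ridge x = y:
assert abs(mixed_partial(lse, 0.0, 0.0) + 0.25) < 1e-3
```

The nonzero cross-term is exactly the curvature of the ridge: it is what couples $x$ and $y$ and blocks any decomposition into independent per-argument summaries.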


2. Separability: The Bridge Between Expressiveness and Compute Cost

A kernel $K(q, k)$ is separable if it decomposes as:

$$K(q, k) = \sum_{r=1}^{R} \phi_r(q)\, \psi_r(k) = \phi(q)^\top \psi(k).$$

When this holds, the entire sequence of keys can be compressed into query-independent states:

$$Z = \sum_j \psi(k_j), \qquad S = \sum_j \psi(k_j)\, v_j^\top.$$

Then for any query:

$$\hat{Z}(q) = \phi(q)^\top Z, \qquad \hat{S}(q) = \phi(q)^\top S, \qquad o(q) = \frac{\hat{S}(q)}{\hat{Z}(q)}.$$

This is linear in the sequence length.

This trick forms the basis of Performer-style linear Transformers. The limitation is that the exact kernel $e^{q^\top k}$ is not separable.

Thus we search for a new algebra (or a modified softmax) that preserves the behavior while becoming separable.
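The mechanics of this compression are easy to verify with a kernel that *is* exactly separable. The sketch below (a toy numpy example, not the exponential kernel itself) uses the polynomial kernel $K(q,k) = (q^\top k)^2$, whose exact feature map is the vectorized outer product; it checks that the linear-time path reproduces the quadratic-time one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 50
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, 1))

def phi(x):
    # exact feature map for the separable kernel K(q, k) = (q.k)^2
    return np.outer(x, x).ravel()

# Quadratic path: materialize the full n x n kernel matrix.
W = (Q @ K.T) ** 2
out_quad = (W @ V) / W.sum(axis=1, keepdims=True)

# Linear path: compress all keys into query-independent states Z, S.
Z = sum(phi(k) for k in K)                          # shape (d*d,)
S = sum(np.outer(phi(k), v) for k, v in zip(K, V))  # shape (d*d, 1)
out_lin = np.stack([phi(q) @ S / (phi(q) @ Z) for q in Q])

assert np.allclose(out_quad, out_lin)
```

The key structural point: `Z` and `S` are built once from the keys and values and then reused for every query, so total cost is linear in `n` (at the price of a $d^2$-dimensional state).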


3. Semiring Linearity vs. Computational Linearity

In the log-sum-exp semiring:

- addition is $a \oplus b = \log(e^a + e^b)$ (i.e., LSE),
- multiplication is $a \otimes b = a + b$ (ordinary addition),
- with identities $-\infty$ for $\oplus$ and $0$ for $\otimes$.

Softmax attention becomes:

$$\log Z_t = (q_t^\top k_1) \oplus (q_t^\top k_2) \oplus \cdots \oplus (q_t^\top k_t).$$

The softmax is linear in this algebra.

This is the same algebra that powers:

- the forward algorithm for hidden Markov models,
- semiring parsing of probabilistic grammars,
- and log-domain dynamic programming over probabilistic finite-state transducers.

However:

Semiring linearity ≠ computational linearity.

Even though the operator is associative, we must still evaluate the $q_t^\top k_j$ terms for each query, leading to the familiar $O(N^2)$ behavior.

We therefore need an operator that is:

- associative, so prefix states can be accumulated in a single pass, and
- separable, so those states do not depend on the query.

This leads us to linearizable semirings.
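To see the gap concretely, here is a minimal Python sketch of LSE-semiring accumulation. The fold is associative and numerically stable, but the scores being folded (stand-ins for $q_t^\top k_j$) are query-dependent, which is precisely where the quadratic cost hides:

```python
import math
from functools import reduce

def oplus(a, b):
    """LSE semiring addition: a (+) b = log(e^a + e^b), stabilized."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

scores = [0.5, -1.2, 3.0, 0.1]       # stand-ins for q_t . k_j
logZ = reduce(oplus, scores)         # associative left fold
direct = math.log(sum(math.exp(s) for s in scores))
assert abs(logZ - direct) < 1e-9

# Associativity: any bracketing agrees.
left = oplus(oplus(scores[0], scores[1]), scores[2])
right = oplus(scores[0], oplus(scores[1], scores[2]))
assert abs(left - right) < 1e-9
```

Associativity buys us parallel or streaming evaluation of the fold, but each new query needs a fresh list of scores, so the fold alone does not escape $O(N^2)$.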


4. Linearizable Semirings: Soft-Max-Plus, MoE-LSE, and Log-Exp-Mean

We seek a replacement operator for LSE that:

  1. is differentiable
  2. approximates softmax behavior
  3. is associative (semiring-like)
  4. is separable or approximately separable

We describe three such families.


4.1 Soft-Max-Plus Semiring

Define:

$$a \oplus_\tau b = \max(a, b) + \tau\, \sigma(|a - b|),$$

where $\sigma$ is a smooth saturating function. For $\sigma(d) = \log(1 + e^{-d})$ and $\tau = 1$, this recovers LSE exactly, since $\operatorname{LSE}(a, b) = \max(a, b) + \log(1 + e^{-|a - b|})$.

As $\tau \to 0$:

$$a \oplus_\tau b \to \max(a, b),$$

which is addition in the tropical semiring, perfectly separable.

This yields a softmax-like operator that remains nearly separable, allowing approximate linear-time accumulation.
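A short sanity check on this family, assuming the particular saturating choice $\sigma(d) = \log(1 + e^{-d})$ (one natural option, not the only one): at $\tau = 1$ the operator coincides with LSE, and as $\tau \to 0$ it collapses to the tropical max.

```python
import math

def softmaxplus(a, b, tau):
    # a (+)_tau b = max(a, b) + tau * sigma(|a - b|),
    # with the assumed sigma(d) = log(1 + e^{-d})
    return max(a, b) + tau * math.log1p(math.exp(-abs(a - b)))

lse = math.log(math.exp(1.0) + math.exp(2.5))

# tau = 1 with this sigma recovers exact log-sum-exp:
assert abs(softmaxplus(1.0, 2.5, 1.0) - lse) < 1e-12

# tau -> 0 collapses to the tropical max:
assert abs(softmaxplus(1.0, 2.5, 1e-6) - 2.5) < 1e-3
```

So $\tau$ acts as a dial between exact softmax behavior and the perfectly separable max-plus limit.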


4.2 Mixture-of-Exponentials LSE (MoE-LSE)

Define a monotone warp:

$$\rho(a) = \sum_{r=1}^{R} c_r e^{\lambda_r a}, \qquad c_r, \lambda_r > 0.$$

Semiring addition becomes:

$$a \oplus_\rho b = \rho^{-1}\big(\rho(a) + \rho(b)\big).$$

The key is:

$$\rho(q^\top k) = \sum_{r=1}^{R} c_r e^{\lambda_r q^\top k},$$

a mixture of exponentials. Each $e^{\lambda_r q^\top k}$ is separable via standard random features:

$$e^{\lambda_r q^\top k} \approx \phi_{\lambda_r}(q)^\top \phi_{\lambda_r}(k).$$

Thus:

- $\rho(q^\top k)$ decomposes into $R$ separable exponential kernels,
- so sums of $\rho$-warped scores can be maintained as query-independent prefix states,
- and $\oplus_\rho$ accumulation runs in time linear in the sequence length.

This is arguably one of the most promising paths to exact linearization.
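The random-feature step for a single mixture component can be sketched with Performer-style positive features. Everything here is illustrative: the vectors, $\lambda$, feature count `R`, seed, and Monte Carlo tolerance are all assumptions, and the estimate is stochastic.

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.2, -0.1, 0.3, 0.05])
k = np.array([0.1, 0.2, -0.2, 0.15])
lam = 0.7      # one mixture component lambda_r (assumed value)
R = 20000      # number of random features (assumed)

# Positive random features: for y = sqrt(lam) * x,
#   exp(y_q . y_k) = E_w[ exp(w.y_q - |y_q|^2/2) * exp(w.y_k - |y_k|^2/2) ],
# with w ~ N(0, I), so the feature dot product estimates exp(lam * q.k).
W = rng.normal(size=(R, q.size))

def feat(x):
    y = np.sqrt(lam) * x
    return np.exp(W @ y - y @ y / 2) / np.sqrt(R)

approx = feat(q) @ feat(k)
exact = np.exp(lam * (q @ k))
assert abs(approx - exact) / exact < 0.02   # Monte Carlo, seed/R dependent
```

Summing such feature maps over the $R$ mixture components (with weights $c_r$) gives a separable approximation to $\rho(q^\top k)$ itself.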


4.3 Log-Exp-Mean Semiring

Define the power mean:

$$M_p(a, b) = \frac{1}{p} \log\!\left(\frac{e^{pa} + e^{pb}}{2}\right).$$

As the parameter $p$ varies, $M_p$ interpolates between familiar operators:

- $p \to \infty$: $M_p(a, b) \to \max(a, b)$ (the tropical limit),
- $p = 1$: $M_1(a, b) = \operatorname{LSE}(a, b) - \log 2$ (a shifted LSE),
- $p \to 0$: $M_p(a, b) \to \tfrac{a + b}{2}$ (the arithmetic mean).

Because $e^{p\, q^\top k}$ admits separable feature maps, this entire family becomes linearizable.
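The interpolation is straightforward to confirm numerically (a minimal sketch with a stabilized implementation of $M_p$):

```python
import math

def M(p, a, b):
    """Power mean M_p(a, b) = (1/p) log((e^{pa} + e^{pb}) / 2), stabilized."""
    m = max(a, b)
    return m + math.log((math.exp(p * (a - m)) + math.exp(p * (b - m))) / 2) / p

a, b = 0.3, 1.1

# large p: approaches the tropical max
assert abs(M(50.0, a, b) - max(a, b)) < 0.02

# p -> 0: approaches the arithmetic mean
assert abs(M(1e-6, a, b) - (a + b) / 2) < 1e-3

# p = 1: exactly LSE shifted by log 2
lse = math.log(math.exp(a) + math.exp(b))
assert abs(M(1.0, a, b) - (lse - math.log(2))) < 1e-12
```

The stabilization (factoring out the max before exponentiating) matters in practice: for large $p$ the raw formula overflows long before the limit behavior is visible.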


5. Why Linearizable Semirings Produce RNN-Like Computation

If the kernel becomes separable, then we may build the following recurrences:

$$Z_t = Z_{t-1} + \psi(k_t), \qquad S_t = S_{t-1} + \psi(k_t)\, v_t^\top.$$

Given a query $q_t$:

$$\hat{Z}(q_t) = \phi(q_t)^\top Z_t, \qquad \hat{S}(q_t) = \phi(q_t)^\top S_t, \qquad o_t = \frac{\hat{S}(q_t)}{\hat{Z}(q_t)}.$$

This is literally an RNN hidden-state update: the fixed-size pair $(Z_t, S_t)$ is advanced additively as each token arrives, with no dependence on future queries.

Thus we recover:

- $O(N)$ total time and constant state per step at inference,
- with training still parallelizable across the sequence, as in linear-attention Transformers.

This is the dream: softmax expressive power at RNN cost.
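The recurrences above can be run token by token. This sketch again borrows the exactly separable toy kernel $(q^\top k)^2$ (an assumption for verifiability, not a softmax replacement) and checks each streamed output against the quadratic causal computation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

def phi(x):
    # exact separable features for the toy kernel K(q, k) = (q.k)^2
    return np.outer(x, x).ravel()

Z = np.zeros(d * d)          # running Z_t = sum_j psi(k_j)
S = np.zeros((d * d, d))     # running S_t = sum_j psi(k_j) v_j^T
ks, vs = [], []
for _ in range(20):          # stream tokens; state update is O(1) per step
    k = rng.normal(size=d)
    v = rng.normal(size=d)
    q = rng.normal(size=d)
    Z += phi(k)
    S += np.outer(phi(k), v)
    o = phi(q) @ S / (phi(q) @ Z)          # causal attention output o_t

    # reference: quadratic causal attention over the full history
    ks.append(k); vs.append(v)
    w = np.array([(q @ kj) ** 2 for kj in ks])
    ref = (w @ np.array(vs)) / w.sum()
    assert np.allclose(o, ref)
```

Only `(Z, S)` persists between steps; the history lists exist solely to verify the recurrence against the explicit quadratic formula.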


6. Closing Thoughts

By shifting attention into log-space and interpreting it through the lens of semirings, we gain a deeper understanding of:

- why exact softmax attention costs $O(N^2)$ (the non-separability of LSE),
- how separability compresses keys into query-independent, reusable states,
- and which modified operators could restore softmax-like behavior at linear cost.

This perspective reveals a broader landscape where we can redesign the core algebra of attention itself, an emerging area that unifies classical probabilistic models, kernel methods, and modern deep learning architectures.


Visual Intuitions: Diagrams (WIP)

Diagram 1: The Folded Geometry of Log-Sum-Exp

z = log(e^x + e^y)

      z
      ^
      |                /  (sheet: z ≈ y when y >> x)
      |               /
      |              /
      |             /
      |            /
      |           /
      |          /  <-- smooth ridge near x = y
      |         /
      |        /
      |       /  (sheet: z ≈ x when x >> y)
      +-----------------------------> x
                   \
                    \
                     v y

Diagram 2: Separable vs. Non-Separable Kernels

Non-separable exponential kernel:       Separable feature-map kernel:

  K(q, k) = exp(qᵀk)                     K(q, k) ≈ φ(q)ᵀ φ(k)

  q ──►[ ⋅ᵀk , exp ]──► weight           q ──►[ φ(q) ]───┐
                                         k ──►[ φ(k) ]──┴─► weight = φ(q)ᵀφ(k)

- No finite-dimensional summary         - Keys summarized as Z = Σ φ(k_j)
  of {k_j} works for all q.             - Reusable across queries q.

Separable kernels allow us to compress all keys into query-independent statistics and reuse them for every query, enabling linear-time attention.


Diagram 3: Linearizable Semiring Pipeline (MoE-LSE Example)

Query q, key k

   s = qᵀk
   │
   ▼
  ρ(s) = Σ_r c_r exp(λ_r s)     (mixture of exponentials)
   │
   ▼
  ρ(s) ≈ φ(q)ᵀ φ(k)             (separable approximation)
   │
   ├──────────────┐
   ▼              ▼
 accumulate Z = Σ φ(k_j)        accumulate S = Σ φ(k_j) v_jᵀ

At query time:

   q ──► φ(q)
           │
           ├──► Ẑ(q) = φ(q)ᵀ Z
           └──► Ŝ(q) = φ(q)ᵀ S

   o(q) = Ŝ(q) / Ẑ(q)

Here, ρ defines a semiring addition $a \oplus_\rho b = \rho^{-1}(\rho(a) + \rho(b))$ with exponential-mixture curvature. Because $\rho(q^\top k)$ is separable, we can maintain query-independent states $(Z, S)$ and compute attention in linear time.


Diagram 4: RNN-Style Recurrence Hidden Inside Linear Attention

Time t-1:                          Time t:

  Z_{t-1}, S_{t-1}                 k_t, v_t
        │                             │
        ├───────────── update ────────┘
        ▼
  Z_t = Z_{t-1} + ψ(k_t)
  S_t = S_{t-1} + ψ(k_t) v_tᵀ

At query time:

  q_t ──► φ(q_t)
             │
             ├──► Ẑ_t = φ(q_t)ᵀ Z_t
             └──► Ŝ_t = φ(q_t)ᵀ S_t

  o_t = Ŝ_t / Ẑ_t

This is structurally an RNN:

- hidden state: $(Z_t, S_t)$,
- input at step $t$: $(k_t, v_t)$,
- output: $o_t$.

Linearizable semirings give us Transformer-like attention that computes like an RNN.


Relationship to the Main Semiring-Attention Blog

This preface is intended as an intuitive and geometric introduction to the more technical post:

“Rethinking Attention Through Semirings: Toward Linear-Time Log-Domain Transformers”

That companion post dives deeper into the semiring formalism, the separability conditions, and the linearization constructions sketched here.

You can read this preface as the conceptual “front porch” to that more algebraically detailed article.


References

[1] Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, Ł., Belanger, D., Colwell, L., & Weller, A. (2020). Rethinking Attention with Performers. arXiv:2009.14794. https://arxiv.org/abs/2009.14794

[2] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236. https://arxiv.org/abs/2006.16236

[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135

[4] Goodman, J. (1999). Semiring Parsing. Computational Linguistics, 25(4), 573–606. https://aclanthology.org/J99-4004.pdf

[5] Eisner, J. (2002). Parameter Estimation for Probabilistic Finite-State Transducers. Proceedings of ACL 2002. https://aclanthology.org/P02-1001/

[6] Manifest AI. (2024, Aug 15). Symmetric Power Transformers. https://manifestai.com/articles/symmetric-power-transformers/

[7] Kumar, S., Buckman, J., Gelada, C., & Zhang, S. (2025). Conformal Transformations for Symmetric Power Transformers. arXiv:2503.03269. https://arxiv.org/abs/2503.03269

[8] Kacham, P., Rao, A. N., Chen, X., Lin, W., Wang, S.-P., Seidel, K., Wang, X., Sun, L., Ye, S., & Zhang, X. (2024). Fast Transformers via Sketching Polynomial Kernels (PolySketchFormer). ICML 2024 (PMLR Vol. 235). https://arxiv.org/abs/2310.01655 OpenReview: https://openreview.net/forum?id=YkCjojDG3l

[9] Manifest AI. (2024, Dec 10). Improving Symmetric Power Transformers with Conformal Gating. https://manifestai.com/articles/optimizing-symmetric-power-transformers/