Unfolding Attention: How Log-Space, Semirings, and Separability Reveal a Path to Linear-Time Transformers
By Anooj P.
Transformer attention is usually presented as a compact, almost innocent equation:

Attention(Q, K, V) = softmax(QKᵀ / √d) V
But behind this compact form lies a deep algebraic and geometric structure, one that determines both the expressiveness and the computational cost of modern Transformers.
This essay explores these hidden structures. By examining attention in log-space, interpreting it through semirings, and understanding the role of separability, we see how attention can be reformulated into a structure that behaves like a Transformer but computes like an RNN.
This reframing suggests a path toward linear-time, softmax-like attention, potentially combining the best properties of RNNs, kernel machines, and probabilistic dynamic programming.
1. Why Examining Attention in Log-Space Matters
Consider the softmax denominator for a single query vector q:

Z(q) = Σ_j exp(qᵀk_j)

Taking a log gives:

log Z(q) = LSE(qᵀk_1, ..., qᵀk_n)

where the log-sum-exp operator is

LSE(x_1, ..., x_n) = log Σ_i exp(x_i)
The geometry of LSE is the key.
If you graph z = LSE(x, y) = log(e^x + e^y), you see:

- For x >> y, the surface hugs the plane z = x
- For y >> x, it hugs the plane z = y
- Near x = y, there is a smooth "ridge", the locus where the operator has high curvature

This ridge prevents the operator from decomposing into a sum of separate functions of its inputs:

LSE(x, y) ≠ f(x) + g(y)

That single fact, non-separability, forces the full O(n²) computation in self-attention.
Understanding this leads directly to why attention is expensive, and what kind of changes might make it cheaper.
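Both claims, numerical stability in log-space and non-separability, are easy to check directly. A minimal NumPy sketch (the `lse` helper is mine, not a library function):

```python
import numpy as np

def lse(xs):
    """Numerically stable log-sum-exp: log(sum(exp(x_i))), shifted by the max."""
    xs = np.asarray(xs, dtype=float)
    m = np.max(xs)
    return m + np.log(np.sum(np.exp(xs - m)))

# Stability: the naive computation overflows for large scores; LSE does not.
scores = np.array([1000.0, 1000.5])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(scores)))
print(np.isinf(naive))   # True: exp(1000) overflows
print(lse(scores))       # finite, ~1000.97

# Non-separability: if LSE(x, y) = f(x) + g(y) held, then
# LSE(x1,y1) + LSE(x2,y2) would equal LSE(x1,y2) + LSE(x2,y1) for all inputs.
x1, y1, x2, y2 = 0.0, 0.0, 3.0, 3.0
lhs = lse([x1, y1]) + lse([x2, y2])
rhs = lse([x1, y2]) + lse([x2, y1])
print(abs(lhs - rhs))  # clearly nonzero: no additive decomposition exists
```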
2. Separability: The Bridge Between Expressiveness and Compute Cost
A kernel K(q, k) is separable if it decomposes as:

K(q, k) = φ(q)ᵀ φ(k)

When this holds, the entire sequence of keys can be compressed into query-independent states:

Z = Σ_j φ(k_j),    S = Σ_j φ(k_j) v_jᵀ

Then for any query:

o(q) = (φ(q)ᵀ S) / (φ(q)ᵀ Z)

This is O(n), linear in the sequence length.

This trick forms the basis of Performer-style linear Transformers. The limitation is that the exact kernel exp(qᵀk) is not separable into any finite-dimensional φ.
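A small NumPy sketch of the trick, using Performer-style positive random features as a stand-in for φ (the feature map, dimensions, and scalings below are illustrative assumptions, not the full variance-reduced Performer construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 64, 16, 1024  # sequence length, model dim, feature dim (illustrative)

# Positive random features: phi(q) @ phi(k) ≈ exp(q @ k) in expectation over W.
W = rng.normal(size=(r, d))
def phi(x):
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(r)

K = rng.normal(size=(n, d)) / np.sqrt(d)  # keys
V = rng.normal(size=(n, d))               # values
q = rng.normal(size=d) / np.sqrt(d)       # one query

# Compress all keys ONCE, independently of any query: O(n) total work.
Z = sum(phi(k) for k in K)                          # normalizer state
S = sum(np.outer(phi(k), v) for k, v in zip(K, V))  # value state

out_linear = (phi(q) @ S) / (phi(q) @ Z)  # O(1) per query after compression

# Reference: exact softmax attention for the same query.
w = np.exp(K @ q)
out_exact = (w / w.sum()) @ V
print(np.max(np.abs(out_linear - out_exact)))  # small approximation error
```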
Thus we search for a new algebra (or a modified softmax) that preserves the behavior while becoming separable.
3. Semiring Linearity vs. Computational Linearity
In the log-sum-exp semiring:

- addition is x ⊕ y = LSE(x, y) = log(e^x + e^y)
- multiplication is x ⊗ y = x + y

Softmax attention's log-normalizer becomes:

log Z(q) = (qᵀk_1) ⊕ (qᵀk_2) ⊕ ... ⊕ (qᵀk_n)

The softmax is linear in this algebra.
This is the same algebra that powers:
- HMM forward-backward
- CRF partition functions
- Dynamic programming under uncertainty
However:
Semiring linearity ≠ computational linearity.
Even though the ⊕ operator is associative, we must still evaluate the n terms qᵀk_j for each query, leading to the familiar O(n²) behavior.
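The gap between the two notions of linearity can be made concrete: the fold below is perfectly ⊕-associative, yet it still touches every key once per query. A minimal NumPy sketch (the `oplus` helper is mine):

```python
import numpy as np

def oplus(x, y):
    """Semiring addition: stable two-argument log-sum-exp."""
    m = max(x, y)
    return m + np.log(np.exp(x - m) + np.exp(y - m))

# Associativity (semiring linearity):
x, y, z = 0.3, -1.2, 2.7
print(oplus(oplus(x, y), z) - oplus(x, oplus(y, z)))  # ~0

# Semiring-linear evaluation of the softmax normalizer for ONE query:
rng = np.random.default_rng(1)
K = rng.normal(size=(8, 4))
q = rng.normal(size=4)

logZ = -np.inf  # the semiring's additive identity
for k in K:     # one full pass over all the keys...
    logZ = oplus(logZ, q @ k)
# ...which must be repeated for every query: associativity alone does not
# let us pre-aggregate the keys into a q-independent summary.
print(logZ - np.log(np.sum(np.exp(K @ q))))  # ~0: matches the direct normalizer
```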
We therefore need an operator that is:
- semiring-linear
- and separable
- so that pre-aggregated state can be reused
This leads us to linearizable semirings.
4. Linearizable Semirings: Soft-Max-Plus, MoE-LSE, and Log-Exp-Mean
We seek a replacement operator for LSE that:
- is differentiable
- approximates softmax behavior
- is associative (semiring-like)
- is separable or approximately separable
We describe three such families.
4.1 Soft-Max-Plus Semiring
Define:

x ⊕_τ y = max(x, y) + τ σ(−|x − y| / τ)

where σ is a smooth saturating function. (With σ(t) = log(1 + e^t) and τ = 1, this recovers exact LSE.)

As τ → 0:

x ⊕_τ y → max(x, y)

which is the tropical semiring, perfectly separable.
This yields a softmax-like operator that remains nearly separable, allowing approximate linear-time accumulation.
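A quick numerical check of this family, using the saturating choice σ(t) = log(1 + e^t) (the temperature parameterization τ is a reconstruction for illustration):

```python
import numpy as np

def soft_max_plus(x, y, tau):
    """max(x, y) plus a smooth saturating correction.
    With sigma(t) = log(1 + e^t), tau = 1 recovers exact LSE(x, y);
    tau -> 0 recovers tropical (max-plus) addition."""
    return max(x, y) + tau * np.log1p(np.exp(-abs(x - y) / tau))

x, y = 1.0, 1.5
for tau in [1.0, 0.1, 0.01]:
    print(tau, soft_max_plus(x, y, tau))
# tau = 1 matches log(e^x + e^y) exactly; small tau approaches max(x, y) = 1.5
```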
4.2 Mixture-of-Exponentials LSE (MoE-LSE)
Define a monotone warp:

ψ(s) = Σ_r c_r exp(λ_r s),    c_r > 0

Semiring addition becomes:

x ⊕ y = ψ⁻¹(ψ(x) + ψ(y))

The key is that, applied to an attention score,

ψ(qᵀk) = Σ_r c_r exp(λ_r qᵀk)

is a mixture of exponentials. Each component exp(λ_r qᵀk) is separable via standard random features:

exp(λ_r qᵀk) ≈ φ_r(q)ᵀ φ_r(k)
Thus:
- separable
- learnable curvature
- compatible with linear accumulation
- softmax-like behavior
This is arguably one of the most promising paths to exact linearization.
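The component-wise separability claim can be sketched numerically. Below, each exp(λ_r qᵀk) is approximated with positive random features on rescaled vectors, since exp(λ qᵀk) = exp((√λ q)ᵀ(√λ k)); the mixture weights and rates are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 8, 512
c = np.array([0.7, 0.3])    # mixture weights (illustrative values)
lam = np.array([0.5, 1.5])  # mixture rates   (illustrative values)

def psi(s):
    """psi(s) = sum_r c_r * exp(lam_r * s), evaluated directly."""
    return float(np.sum(c * np.exp(lam * s)))

# Positive random features on the rescaled vectors sqrt(lam) * x.
W = rng.normal(size=(r, d))
def feat(x, l):
    xs = np.sqrt(l) * x
    return np.exp(W @ xs - xs @ xs / 2.0) / np.sqrt(r)

q = 0.3 * rng.normal(size=d)
k = 0.3 * rng.normal(size=d)

exact = psi(q @ k)
approx = sum(c_r * feat(q, l) @ feat(k, l) for c_r, l in zip(c, lam))
print(exact, approx)  # close: the mixture is separable component by component
```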
4.3 Log-Exp-Mean Semiring
Define the power mean:

M_p(x_1, ..., x_n) = (1/p) log( (1/n) Σ_i exp(p x_i) )

As the parameter p varies:

- p → ∞: max_i x_i
- p = 1: log-exp-mean (softmax-like)
- p → 0: arithmetic mean (1/n) Σ_i x_i

Because exp(p qᵀk) admits separable feature maps, this entire family becomes linearizable.
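The three limits are easy to verify numerically; a stabilized sketch (max-shifted, the same trick as LSE):

```python
import numpy as np

def log_exp_mean(xs, p):
    """Power-mean family: (1/p) * log( mean( exp(p * x_i) ) ),
    shifted by the max for numerical stability."""
    xs = np.asarray(xs, dtype=float)
    m = np.max(xs)
    return m + np.log(np.mean(np.exp(p * (xs - m)))) / p

xs = [0.0, 1.0, 2.0]
print(log_exp_mean(xs, 100.0))  # ≈ max(xs) = 2.0
print(log_exp_mean(xs, 1.0))    # softmax-like, between the mean and the max
print(log_exp_mean(xs, 1e-6))   # ≈ arithmetic mean = 1.0
```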
5. Why Linearizable Semirings Produce RNN-Like Computation
If the kernel becomes separable, K(q, k) ≈ φ(q)ᵀ φ(k), then we may build the following recurrences:

Z_t = Z_{t−1} + φ(k_t)
S_t = S_{t−1} + φ(k_t) v_tᵀ

Given a query q_t:

o_t = (φ(q_t)ᵀ S_t) / (φ(q_t)ᵀ Z_t)

This is literally an RNN hidden-state update:

- state update: linear, additive
- state dimension: fixed (feature-mapped rank), independent of sequence length
- output: nonlinear, query-conditioned readout
Thus we recover:
- the long-range context modeling of attention
- the efficiency of an RNN
- the stability of log-domain arithmetic
This is the dream: softmax expressive power at RNN cost.
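The recurrence above can be checked end to end: the streaming, fixed-state update reproduces batch causal linear attention exactly. A sketch, where φ is an arbitrary fixed positive feature map (an assumption for illustration; the equivalence holds for any fixed φ):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 32, 8, 64
W = rng.normal(size=(r, d))
def phi(x):  # any fixed positive feature map works for this equivalence
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(r)

K = 0.3 * rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
Q = 0.3 * rng.normal(size=(n, d))

# Causal, RNN-style recurrence: fixed-size state, one additive update per step.
Z = np.zeros(r)        # normalizer state
S = np.zeros((r, d))   # value state
outs = []
for t in range(n):
    Z = Z + phi(K[t])                 # state update: linear, additive
    S = S + np.outer(phi(K[t]), V[t])
    f = phi(Q[t])
    outs.append((f @ S) / (f @ Z))    # nonlinear, query-conditioned readout
outs = np.array(outs)

# The same outputs, recomputed non-recurrently with a causal mask: the
# recurrence is just a streaming evaluation of separable (linear) attention.
for t in [0, n // 2, n - 1]:
    f = phi(Q[t])
    Zt = sum(phi(k) for k in K[: t + 1])
    St = sum(np.outer(phi(k), v) for k, v in zip(K[: t + 1], V[: t + 1]))
    assert np.allclose(outs[t], (f @ St) / (f @ Zt))
print(outs.shape)  # one d-dimensional output per position
```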
6. Closing Thoughts
By shifting attention into log-space and interpreting it through the lens of semirings, we gain a deeper understanding of:
- why softmax is expressive
- why softmax is expensive
- why separability matters
- how alternative semiring operations (Soft-Max-Plus, MoE-LSE, Log-Exp-Mean) can preserve softmax-like behavior
- and how these operators create Transformer layers that compute like RNNs
This perspective reveals a broader landscape where we can redesign the core algebra of attention itself: an emerging area that unifies classical probabilistic models, kernel methods, and modern deep learning architectures.
Visual Intuitions: Diagrams (WIP)
Diagram 1: The Folded Geometry of Log-Sum-Exp
z = log(e^x + e^y)

        z
        ^
        |                        /   (sheet: z ≈ y when y >> x)
        |                      /
        |                    /
        |                  /
        |                /
        |              /
        |            /   <-- smooth ridge near x = y
        |          /
        |        /
        |      /                     (sheet: z ≈ x when x >> y)
        +-----------------------------> x
         \
          \
           v  y
- Far from the diagonal x = y, the surface is nearly planar (z ≈ x or z ≈ y).
- Near x = y, the ridge has high curvature, preventing a decomposition LSE(x, y) = f(x) + g(y).
- This non-separability is the geometric source of the quadratic cost of exact softmax attention.
Diagram 2: Separable vs. Non-Separable Kernels
Non-separable exponential kernel:        Separable feature-map kernel:

K(q, k) = exp(qᵀk)                       K(q, k) ≈ φ(q)ᵀ φ(k)

q ──►[ qᵀk, exp ]──► weight              q ──►[ φ(q) ]──┐
k ──►                                    k ──►[ φ(k) ]──┴──► weight = φ(q)ᵀφ(k)

- No finite-dimensional summary          - Keys summarized as Z = Σ φ(k_j)
  of {k_j} works for all q.              - Reusable across queries q.
Separable kernels allow us to compress all keys into query-independent statistics and reuse them for every query, enabling linear-time attention.
Diagram 3: Linearizable Semiring Pipeline (MoE-LSE Example)
Query q, key k

      s = qᵀk
        │
        ▼
      ψ(s) = Σ_r c_r exp(λ_r s)      (mixture of exponentials)
        │
        ▼
      ψ(s) ≈ φ(q)ᵀ φ(k)              (separable approximation)
        │
   ┌────┴──────────────────┐
   ▼                       ▼
accumulate Z = Σ φ(k_j)   accumulate S = Σ φ(k_j) v_jᵀ

At query time:

q ──► φ(q)
        │
        ├──► Ẑ(q) = φ(q)ᵀ Z
        └──► Ŝ(q) = φ(q)ᵀ S

o(q) = Ŝ(q) / Ẑ(q)
Here, ψ defines a semiring addition with exponential-mixture curvature. Because ψ is separable component by component, we can maintain query-independent states Z and S and compute attention in linear time.
Diagram 4: RNN-Style Recurrence Hidden Inside Linear Attention
Time t-1:                     Time t:
Z_{t-1}, S_{t-1}              k_t, v_t
        │                        │
        └──────── update ────────┘
                  ▼
        Z_t = Z_{t-1} + φ(k_t)
        S_t = S_{t-1} + φ(k_t) v_tᵀ

At query time:

q_t ──► φ(q_t)
          │
          ├──► Ẑ_t = φ(q_t)ᵀ Z_t
          └──► Ŝ_t = φ(q_t)ᵀ S_t

o_t = Ŝ_t / Ẑ_t
This is structurally an RNN:
- (Z_t, S_t) is the hidden state.
- The update is linear-time and additive.
- The readout o_t is a nonlinear function of q_t and the state.
Linearizable semirings give us Transformer-like attention that computes like an RNN.
Relationship to the Main Semiring-Attention Blog
This preface is intended as an intuitive and geometric introduction to the more technical post:
"Rethinking Attention Through Semirings: Toward Linear-Time Log-Domain Transformers"
That companion post dives deeper into:
- formal semiring definitions,
- expectation semiring and log-domain recurrences,
- linear-time state updates,
- and connections to Performer, HMMs, CRFs, and Manifest AI's symmetric power / polynomial-sketch transformers.
You can read this preface as the conceptual "front porch" to that more algebraically detailed article.
References
[1] Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, Ł., Belanger, D., Colwell, L., & Weller, A. (2020). Rethinking Attention with Performers. arXiv:2009.14794. https://arxiv.org/abs/2009.14794
[2] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236. https://arxiv.org/abs/2006.16236
[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135
[4] Goodman, J. (1999). Semiring Parsing. Computational Linguistics, 25(4), 573–606. https://aclanthology.org/J99-4004.pdf
[5] Eisner, J. (2002). Parameter Estimation for Probabilistic Finite-State Transducers. Proceedings of ACL 2002. https://aclanthology.org/P02-1001/
[6] Manifest AI. (2024, Aug 15). Symmetric Power Transformers. https://manifestai.com/articles/symmetric-power-transformers/
[7] Kumar, S., Buckman, J., Gelada, C., & Zhang, S. (2025). Conformal Transformations for Symmetric Power Transformers. arXiv:2503.03269. https://arxiv.org/abs/2503.03269
[8] Kacham, P., Rao, A. N., Chen, X., Lin, W., Wang, S.-P., Seidel, K., Wang, X., Sun, L., Ye, S., & Zhang, X. (2024). Fast Transformers via Sketching Polynomial Kernels (PolySketchFormer). ICML 2024 (PMLR Vol. 235). https://arxiv.org/abs/2310.01655 OpenReview: https://openreview.net/forum?id=YkCjojDG3l
[9] Manifest AI. (2024, Dec 10). Improving Symmetric Power Transformers with Conformal Gating. https://manifestai.com/articles/optimizing-symmetric-power-transformers/