Cindered Thoughts

Rethinking Attention Through Semirings: Toward Linear-Time Log-Domain Transformers

By Anooj Patel


This post builds on intuition that is worked through in:

Unfolding Attention: How Log-Space, Semirings, and Separability Reveal a Path to Linear-Time Transformers


Transformers are defined by the self-attention mechanism—a kernel of extraordinary power and cost. Standard attention computes weighted averages of value vectors V with weights derived from exponentiated query-key interactions:

$$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right), \qquad A_{tj} = \frac{e^{q_t \cdot k_j/\sqrt{d}}}{\sum_{j'} e^{q_t \cdot k_{j'}/\sqrt{d}}}, \qquad O = AV.$$

This operation is nonlinear, non-separable, and quadratic in sequence length. Over the years, the community has attacked each of these properties, yielding families of linear-time transformers such as Performer [Choromanski et al., 2020], Linear Transformers [Katharopoulos et al., 2020], and FlashAttention [Dao et al., 2022].

But there is a subtler, algebraic perspective that invites us to see attention through the lens of semirings.


1. The Log-Sum-Exp Semiring: Attention as Algebra

The softmax normalization can be expressed in the log-sum-exp (LSE) semiring, defined by:

$$a \oplus b = \log\big(e^{a} + e^{b}\big), \qquad a \otimes b = a + b, \qquad \bar{0} = -\infty, \qquad \bar{1} = 0.$$

This semiring has been used for decades in dynamic programming and probabilistic inference—most notably in HMM forward-backward and CRF computations [Eisner, 2002].

In this semiring, the self-attention output for a query $q_t$ becomes:

$$\log o_t = \operatorname{LSE}_j\big(q_t \cdot k_j + \log v_j\big) - \operatorname{LSE}_j\big(q_t \cdot k_j\big).$$

Here, softmax attention is linear within the semiring algebra (its $\oplus$ and $\otimes$ are associative and distributive) but nonlinear in the reals. Computing it exactly still costs $O(N^2)$, though working in the log domain gives excellent numerical stability.
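As a concrete check, here is a minimal NumPy sketch of attention computed entirely in the log domain. It assumes elementwise-positive values $V$ so that $\log V$ is defined (signed values would need an extra sign track); the `lse` helper and the shapes are illustrative choices, not from the derivation above.

```python
import numpy as np

def lse(x, axis):
    # Numerically stable log-sum-exp along an axis
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def log_domain_attention(Q, K, V):
    """log o_t = LSE_j(q_t.k_j + log v_j) - LSE_j(q_t.k_j), per value dim.

    Assumes V > 0 elementwise so that log V is defined.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                      # (N, N)
    log_num = lse(scores[:, :, None] + np.log(V)[None], axis=1)  # (N, d_v)
    log_den = lse(scores, axis=1)                                # (N,)
    return np.exp(log_num - log_den[:, None])

# Agreement with ordinary softmax attention
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
V = rng.uniform(0.1, 1.0, size=(6, 3))          # positive values
S = Q @ K.T / np.sqrt(4)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
assert np.allclose(log_domain_attention(Q, K, V), A @ V)
```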


2. Linearization: Kernel vs. Semiring Perspectives

There are two distinct routes to making attention linear-time:

| Method | Key Idea | Example |
| --- | --- | --- |
| Kernel-based | Approximate the exponential kernel: $e^{q \cdot k} \approx \phi(q)^\top \phi(k)$ | Performer, Linear Transformer |
| Semiring-based | Change the underlying algebra so exponentials become additive | Log-sum-exp semiring, tropical semiring, expectation semiring |

Kernel linearization (Performer)

Performer and related models introduce a positive feature map $\phi$ such that:

$$e^{q \cdot k} \approx \phi(q)^\top \phi(k).$$

Then, query-independent states can be accumulated:

$$Z_t = \sum_{j \le t} \phi(k_j), \qquad S_t = \sum_{j \le t} \phi(k_j)\, v_j^\top, \qquad o_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top Z_t}.$$

These query-independent states are what make attention linear-time: they summarize all keys and values in a finite-dimensional space.
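A minimal sketch of this accumulated-state computation, using the $\mathrm{elu}(x)+1$ positive feature map of Katharopoulos et al. as a stand-in for $\phi$ (Performer would use random exponential features instead):

```python
import numpy as np

def phi(x):
    # Positive feature map elu(x) + 1 (Katharopoulos et al.)
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Causal kernel attention from query-independent running state Z, S."""
    N, d = Q.shape
    Z = np.zeros(d)                    # running sum of phi(k_j)
    S = np.zeros((d, V.shape[1]))      # running sum of phi(k_j) v_j^T
    out = np.empty_like(V)
    for t in range(N):
        Z = Z + phi(K[t])
        S = S + np.outer(phi(K[t]), V[t])
        out[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ Z)
    return out

# Matches explicit masked kernel attention with the same feature map
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
W = (phi(Q) @ phi(K).T) * np.tril(np.ones((5, 5)))
W = W / W.sum(axis=1, keepdims=True)
assert np.allclose(causal_linear_attention(Q, K, V), W @ V)
```

Each step touches only the fixed-size state, so the whole pass is linear in $N$.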

Semiring linearization

In the LSE semiring, causal attention for a fixed query $q_t$ can be computed recursively over key positions $j \le t$:

$$D_j = \log\big(e^{D_{j-1}} + e^{q_t \cdot k_j}\big), \qquad N_j = \log\big(e^{N_{j-1}} + e^{q_t \cdot k_j + \log v_j}\big), \qquad \log o_t = N_t - D_t.$$

This recurrence is exact, differentiable, and numerically stable, but it remains $O(N^2)$: the state depends on $q_t$, so each new query requires fresh inner products $q_t \cdot k_j$ over all previous keys.

Semiring linearity does not imply computational linearity.
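The recurrence can be checked numerically. This sketch rebuilds the $(N, D)$ state for every query, which is exactly why it stays $O(N^2)$; positive $V$ is again assumed so $\log V$ exists:

```python
import numpy as np

def lse2(a, b):
    # Binary ⊕ in the LSE semiring, numerically stable
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def causal_lse_attention(Q, K, V):
    """Exact causal softmax attention via the LSE-semiring recurrence.

    The state (N_j, D_j) depends on q_t, so it is rebuilt for every
    query: exact and stable, but still O(N^2). Assumes V > 0.
    """
    T = Q.shape[0]
    out = np.empty_like(V)
    for t in range(T):
        D = -np.inf                              # semiring zero
        Nacc = np.full(V.shape[1], -np.inf)
        for j in range(t + 1):
            s = Q[t] @ K[j]
            D = lse2(D, s)
            Nacc = lse2(Nacc, s + np.log(V[j]))
        out[t] = np.exp(Nacc - D)
    return out

# Reference: masked softmax attention
rng = np.random.default_rng(2)
T, d = 5, 3
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))
V = rng.uniform(0.2, 1.5, size=(T, 2))
S = Q @ K.T
S = np.where(np.tril(np.ones((T, T), dtype=bool)), S, -np.inf)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
assert np.allclose(causal_lse_attention(Q, K, V), A @ V)
```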


3. Toward Linear-Time Semirings: Separable Approximations to LSE

The obstacle to reusing state in log-semiring attention is non-separability:

$$\log\big(e^{x} + e^{y}\big) = \max(x, y) + \log\big(1 + e^{-|x - y|}\big).$$

The smooth ridge near $x = y$ couples the two inputs; it cannot be written as $f(x) + g(y)$. If we could approximate or replace this with a separable operation, we'd gain query-independent updates.

Approximation 1 — Tropical limit (Max-plus semiring)

Take the temperature τ0:

aτb=τlog(ea/τ+eb/τ)max(a,b).

The tropical semiring (max-plus) is associative, separable, and defines the hard-attention limit.
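A quick numerical check of the limit, with an illustrative score vector: the smooth $\oplus_\tau$ collapses to $\max$, and the corresponding attention weights collapse to a one-hot argmax (hard attention).

```python
import numpy as np

def oplus_tau(a, b, tau):
    # a ⊕_tau b = tau * log(e^{a/tau} + e^{b/tau}), computed stably
    m = np.maximum(a, b)
    return m + tau * np.log(np.exp((a - m) / tau) + np.exp((b - m) / tau))

# The smooth ⊕ collapses to max as tau -> 0 ...
assert abs(oplus_tau(1.3, 0.7, 0.01) - 1.3) < 1e-6

# ... and temperature-scaled attention weights become one-hot at the argmax
scores = np.array([0.2, 1.5, 1.1, -0.3])
w = np.exp((scores - scores.max()) / 0.01)
w /= w.sum()
assert np.allclose(w, np.array([0.0, 1.0, 0.0, 0.0]), atol=1e-12)
```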

Approximation 2 — Polynomial / mean-field expansion

For $|x - y|$ small,

$$\log\big(e^{x} + e^{y}\big) \approx \log 2 + \frac{x + y}{2} + \frac{(x - y)^{2}}{8}.$$

The term $(x + y)/2$ is separable (and $\log 2$ is a constant); the quadratic residual $(x - y)^2/8$ can be approximated by a low-rank kernel.
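The quality of the expansion is easy to probe directly; the tolerances below are illustrative:

```python
import numpy as np

def lse(x, y):
    m = np.maximum(x, y)
    return m + np.log1p(np.exp(-np.abs(x - y)))

def quad(x, y):
    # log 2 + (x + y)/2 + (x - y)^2 / 8
    return np.log(2.0) + (x + y) / 2.0 + (x - y) ** 2 / 8.0

# Tight near the ridge x ≈ y, where the expansion is taken ...
assert abs(lse(1.0, 1.1) - quad(1.0, 1.1)) < 1e-4
# ... but degrades as |x - y| grows (the next Taylor term is -(x-y)^4/192)
assert abs(lse(0.0, 4.0) - quad(0.0, 4.0)) > 0.5
```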

Approximation 3 — Parametric semiring interpolation

Define a family of semirings:

$$a \oplus_\alpha b = \frac{1}{\alpha} \log\big(e^{\alpha a} + e^{\alpha b}\big),$$

where $\alpha$ controls curvature: $\alpha \to \infty$ yields $\max$, while $\alpha \to 0$ yields the arithmetic mean (after removing the additive $\log 2 / \alpha$ offset).

Learning α per layer or head could allow models to interpolate between smooth and sharp attention.
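A sketch of the $\alpha$-family and its two limits, with arbitrary scalar inputs; note the $\log 2/\alpha$ constant that must be subtracted to expose the mean limit:

```python
import numpy as np

def oplus_alpha(a, b, alpha):
    # a ⊕_α b = (1/α) log(e^{αa} + e^{αb}), computed stably
    m = max(a, b)
    return m + np.log(np.exp(alpha * (a - m)) + np.exp(alpha * (b - m))) / alpha

a, b = 0.4, 1.0
# alpha -> inf: sharp (max / hard-attention limit)
assert abs(oplus_alpha(a, b, 50.0) - max(a, b)) < 1e-6
# alpha -> 0: smooth (arithmetic mean, once the log(2)/alpha offset is removed)
assert abs(oplus_alpha(a, b, 1e-3) - np.log(2) / 1e-3 - (a + b) / 2) < 1e-3
```

In a model, $\alpha$ would be a learnable scalar per layer or head, trained by ordinary backpropagation through this expression.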

Approximation 4 — Learned separable decomposition

Because $\mathrm{LSE}(x, y) = \max(x, y) + h(x - y)$ with ridge function $h(\Delta) = \log\big(1 + e^{-|\Delta|}\big)$, we can approximate $h$ by a low-rank separable expansion:

$$h(x - y) \approx \sum_{r=1}^{R} \alpha_r f_r(x)\, g_r(y).$$

This leads to a learned, separable semiring—essentially a feature-map for the LSE surface itself.
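One can probe how low-rank the ridge actually is by sampling $h(x - y) = \log(1 + e^{-|x-y|})$ on a grid and truncating its SVD; the grid, ranks, and error thresholds below are illustrative choices:

```python
import numpy as np

# Sample the ridge h(x - y) on a grid and measure how well rank-R
# truncations (sums of separable terms) reproduce it.
x = np.linspace(-5.0, 5.0, 200)
H = np.log1p(np.exp(-np.abs(x[:, None] - x[None, :])))
U, s, Vt = np.linalg.svd(H)

def rel_err(R):
    # Relative Frobenius error of the best rank-R approximation
    H_R = (U[:, :R] * s[:R]) @ Vt[:R]
    return np.linalg.norm(H - H_R) / np.linalg.norm(H)

# A modest number of separable terms captures most of the surface;
# additional terms keep helping because of the kink along x = y.
assert rel_err(8) < 0.25
assert rel_err(32) < 0.05
assert rel_err(32) < rel_err(8)
```

The kink at $\Delta = 0$ is what keeps the exact rank from being tiny; a learned smooth surrogate for $h$ would trade a little accuracy there for a much shorter expansion.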


4. The Expectation Semiring: Differentiable and Linearizable

The expectation semiring [Eisner, 2002; Goodman, 1999] defines pairs $(w, s)$ with:

$$(w_1, s_1) \oplus (w_2, s_2) = (w_1 + w_2,\; s_1 + s_2), \qquad (w_1, s_1) \otimes (w_2, s_2) = (w_1 w_2,\; w_1 s_2 + w_2 s_1).$$

Folding over a sequence yields the weighted average $S/W$, which corresponds exactly to the attention expectation:

$$o_t = \frac{\sum_j e^{q_t \cdot k_j}\, v_j}{\sum_j e^{q_t \cdot k_j}}.$$

This semiring generalizes easily to geometric aggregation (replace $v_j$ with $\log v_j$). When paired with a separable $\rho(q \cdot k)$, such as a kernel approximation, it produces linear-time expectation-semiring attention.
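A minimal sketch of the $\oplus$-fold; $\otimes$ (which combines weights along paths) is not needed for this flat aggregation, so only $\oplus$ appears:

```python
import numpy as np

def e_oplus(p, q):
    # ⊕ in the expectation semiring: add weights and weighted sums
    return (p[0] + q[0], p[1] + q[1])

def attention_expectation(q, K, V):
    """Fold pairs (e^{q.k_j}, e^{q.k_j} v_j) with ⊕, then read off S / W."""
    W, S = 0.0, np.zeros(V.shape[1])
    for k, v in zip(K, V):
        w = np.exp(q @ k)
        W, S = e_oplus((W, S), (w, w * v))
    return S / W

# The fold reproduces the softmax expectation
rng = np.random.default_rng(3)
K, V = rng.normal(size=(7, 4)), rng.normal(size=(7, 2))
q = rng.normal(size=4)
w = np.exp(K @ q)
assert np.allclose(attention_expectation(q, K, V), (w @ V) / w.sum())
```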


5. Designing New Semirings for Attention

We can generalize the LSE semiring via a monotone warp $\rho$:

$$a \oplus_\rho b = \rho^{-1}\big(\rho(a) + \rho(b)\big), \qquad a \otimes b = a + b.$$

Properties:

- $\oplus_\rho$ is associative and commutative for any strictly monotone (hence invertible) $\rho$.
- $\rho(x) = e^{x}$ recovers the LSE semiring.

To make this linear-time, choose $\rho$ such that $\rho(q \cdot k)$ admits a feature-map factorization:

$$\rho(q \cdot k) \approx \phi(q)^\top \phi(k).$$

Examples of separable ρ

A natural family is a mixture of exponentials, $\rho(x) = \sum_{r=1}^{R} c_r e^{\lambda_r x}$: each term factorizes through an exponential-kernel feature map $\phi_{\lambda_r}$, and the mixture corresponds to concatenation:

$$\phi(q) = \big[\sqrt{c_1}\,\phi_{\lambda_1}(q) \;\big|\; \cdots \;\big|\; \sqrt{c_R}\,\phi_{\lambda_R}(q)\big].$$

The inverse $\rho^{-1}$ can be approximated with a small neural network or a Newton solver. Learning the mixture coefficients $c_r, \lambda_r$ yields a differentiable, adaptive semiring that smoothly interpolates between exponential kernels of varying sharpness.

This MoE-based semiring is particularly appealing: it's differentiable, expressive, and efficiently separable. It generalizes the softmax ($R = 1$) while preserving associativity and allowing learned curvature.
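A sketch of the Newton-solver route to $\rho^{-1}$, using an arbitrary two-component mixture as a stand-in for learned $(c_r, \lambda_r)$:

```python
import numpy as np

# Illustrative mixture-of-exponentials warp rho(x) = sum_r c_r e^{lambda_r x};
# the coefficients below are arbitrary stand-ins for learned values.
c = np.array([0.6, 0.4])
lam = np.array([1.0, 3.0])

def rho(x):
    return float(np.sum(c * np.exp(lam * x)))

def rho_inv(y, x0=0.0, iters=40):
    # Newton's method for rho(x) = y; rho is strictly increasing and
    # convex (c, lam > 0), so the iteration converges for y in its range.
    x = x0
    for _ in range(iters):
        x -= (rho(x) - y) / float(np.sum(c * lam * np.exp(lam * x)))
    return x

def oplus_rho(a, b):
    return rho_inv(rho(a) + rho(b))

# Defining property rho(a ⊕ b) = rho(a) + rho(b), and associativity
assert abs(rho(oplus_rho(0.3, 0.9)) - (rho(0.3) + rho(0.9))) < 1e-9
assert abs(oplus_rho(oplus_rho(0.1, 0.5), 0.9)
           - oplus_rho(0.1, oplus_rho(0.5, 0.9))) < 1e-9
```

In training, the Newton loop is differentiable through its final iterate (or via the implicit function theorem), so the coefficients can be learned end to end.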


6. Relation to Other Architectures


References

[1] Choromanski, K. et al. (2020). Rethinking Attention with Performers. arXiv:2009.14794. https://arxiv.org/abs/2009.14794

[2] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236. https://arxiv.org/abs/2006.16236

[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135

[4] Goodman, J. (1999). Semiring Parsing. Computational Linguistics, 25(4), 573–606. https://aclanthology.org/J99-4004.pdf

[5] Eisner, J. (2002). Parameter Estimation for Probabilistic Finite-State Transducers. Proceedings of ACL 2002. https://aclanthology.org/P02-1001/

[6] Manifest AI. (2024, Aug 15). Symmetric Power Transformers. https://manifestai.com/articles/symmetric-power-transformers/

[7] Kumar, S., Buckman, J., Gelada, C., & Zhang, S. (2025). Conformal Transformations for Symmetric Power Transformers. arXiv:2503.03269. https://arxiv.org/abs/2503.03269

[8] Kacham, P., Mirrokni, V., & Zhong, P. (2024). PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels. ICML 2024 (PMLR Vol. 235). https://arxiv.org/abs/2310.01655 OpenReview: https://openreview.net/forum?id=YkCjojDG3l

[9] Manifest AI. (2024, Dec 10). Improving Symmetric Power Transformers with Conformal Gating. https://manifestai.com/articles/optimizing-symmetric-power-transformers/


#Attention #ML #Semiring