Cindered Thoughts

Rethinking Attention Through Semirings: Toward Linear-Time Log-Domain Transformers

By Anooj Patel


This post builds on intuition that is worked through in:

Unfolding Attention: How Log-Space, Semirings, and Separability Reveal a Path to Linear-Time Transformers


Transformers are defined by the self-attention mechanism—a kernel of extraordinary power and cost. Standard attention computes weighted averages of value vectors V with weights derived from exponentiated query-key interactions:

$$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right), \qquad A_{tj} = \frac{e^{q_t \cdot k_j/\sqrt{d}}}{\sum_{j'} e^{q_t \cdot k_{j'}/\sqrt{d}}}, \qquad O = AV.$$

This operation is nonlinear, non-separable, and quadratic in sequence length. Over the years, the community has attacked each of these properties, yielding families of linear-time transformers such as Performer [Choromanski et al., 2020], Linear Transformers [Katharopoulos et al., 2020], and FlashAttention [Dao et al., 2022].

But there is a subtler, algebraic perspective that invites us to see attention through the lens of semirings.


1. The Log-Sum-Exp Semiring: Attention as Algebra

The softmax normalization can be expressed in the log-sum-exp (LSE) semiring, defined by:

$$a \oplus b = \log\big(e^{a} + e^{b}\big), \qquad a \otimes b = a + b, \qquad \bar{0} = -\infty, \qquad \bar{1} = 0.$$

This semiring has been used for decades in dynamic programming and probabilistic inference—most notably in HMM forward-backward and CRF computations [Eisner, 2002].

In this semiring, the self-attention output for a query $q_t$ becomes:

$$\log o_t = \operatorname{LSE}_j\big(q_t \cdot k_j + \log v_j\big) - \operatorname{LSE}_j\big(q_t \cdot k_j\big).$$

Here, softmax attention is linear within the semiring algebra (its $\oplus$ and $\otimes$ are associative and distributive) but nonlinear in the reals. Computing it exactly still costs $O(N^2)$, though working in the log domain gives excellent numerical stability.
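As a concrete check, here is a minimal NumPy sketch of attention computed entirely in the log domain. It assumes elementwise-positive values $V$ so that $\log V$ is defined (signed values would need an extra sign track); the `lse` helper and the shapes are illustrative choices, not from the derivation above.

```python
import numpy as np

def lse(x, axis):
    # Numerically stable log-sum-exp along an axis
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def log_domain_attention(Q, K, V):
    """log o_t = LSE_j(q_t.k_j + log v_j) - LSE_j(q_t.k_j), per value dim.

    Assumes V > 0 elementwise so that log V is defined.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                      # (N, N)
    log_num = lse(scores[:, :, None] + np.log(V)[None], axis=1)  # (N, d_v)
    log_den = lse(scores, axis=1)                                # (N,)
    return np.exp(log_num - log_den[:, None])

# Agreement with ordinary softmax attention
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
V = rng.uniform(0.1, 1.0, size=(6, 3))          # positive values
S = Q @ K.T / np.sqrt(4)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
assert np.allclose(log_domain_attention(Q, K, V), A @ V)
```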


2. Linearization: Kernel vs. Semiring Perspectives

There are two distinct routes to making attention linear-time:

| Method | Key Idea | Example |
| --- | --- | --- |
| Kernel-based | Approximate the exponential kernel: $e^{q \cdot k} \approx \phi(q)^\top \phi(k)$ | Performer, Linear Transformer |
| Semiring-based | Change the underlying algebra so exponentials become additive | Log-sum-exp semiring, tropical semiring, expectation semiring |

Kernel linearization (Performer)

Performer and related models introduce a positive feature map $\phi$ such that:

$$e^{q \cdot k} \approx \phi(q)^\top \phi(k).$$

Then, query-independent states can be accumulated:

$$Z_t = \sum_{j \le t} \phi(k_j), \qquad S_t = \sum_{j \le t} \phi(k_j)\, v_j^\top, \qquad o_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top Z_t}.$$

These query-independent states are what make attention linear-time: they summarize all keys and values in a finite-dimensional space.
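A minimal sketch of this accumulated-state computation, using the $\mathrm{elu}(x)+1$ positive feature map of Katharopoulos et al. as a stand-in for $\phi$ (Performer would use random exponential features instead):

```python
import numpy as np

def phi(x):
    # Positive feature map elu(x) + 1 (Katharopoulos et al.)
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Causal kernel attention from query-independent running state Z, S."""
    N, d = Q.shape
    Z = np.zeros(d)                    # running sum of phi(k_j)
    S = np.zeros((d, V.shape[1]))      # running sum of phi(k_j) v_j^T
    out = np.empty_like(V)
    for t in range(N):
        Z = Z + phi(K[t])
        S = S + np.outer(phi(K[t]), V[t])
        out[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ Z)
    return out

# Matches explicit masked kernel attention with the same feature map
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
W = (phi(Q) @ phi(K).T) * np.tril(np.ones((5, 5)))
W = W / W.sum(axis=1, keepdims=True)
assert np.allclose(causal_linear_attention(Q, K, V), W @ V)
```

Each step touches only the fixed-size state, so the whole pass is linear in $N$.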

Semiring linearization

In the LSE semiring, causal attention for a fixed query $q_t$ can be computed recursively over key positions $j \le t$:

$$D_j = \log\big(e^{D_{j-1}} + e^{q_t \cdot k_j}\big), \qquad N_j = \log\big(e^{N_{j-1}} + e^{q_t \cdot k_j + \log v_j}\big), \qquad \log o_t = N_t - D_t.$$

This recurrence is exact, differentiable, and numerically stable, but it remains $O(N^2)$: the state depends on $q_t$, so each new query requires fresh inner products $q_t \cdot k_j$ over all previous keys.

Semiring linearity does not imply computational linearity.
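The recurrence can be checked numerically. This sketch rebuilds the $(N, D)$ state for every query, which is exactly why it stays $O(N^2)$; positive $V$ is again assumed so $\log V$ exists:

```python
import numpy as np

def lse2(a, b):
    # Binary ⊕ in the LSE semiring, numerically stable
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def causal_lse_attention(Q, K, V):
    """Exact causal softmax attention via the LSE-semiring recurrence.

    The state (N_j, D_j) depends on q_t, so it is rebuilt for every
    query: exact and stable, but still O(N^2). Assumes V > 0.
    """
    T = Q.shape[0]
    out = np.empty_like(V)
    for t in range(T):
        D = -np.inf                              # semiring zero
        Nacc = np.full(V.shape[1], -np.inf)
        for j in range(t + 1):
            s = Q[t] @ K[j]
            D = lse2(D, s)
            Nacc = lse2(Nacc, s + np.log(V[j]))
        out[t] = np.exp(Nacc - D)
    return out

# Reference: masked softmax attention
rng = np.random.default_rng(2)
T, d = 5, 3
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))
V = rng.uniform(0.2, 1.5, size=(T, 2))
S = Q @ K.T
S = np.where(np.tril(np.ones((T, T), dtype=bool)), S, -np.inf)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
assert np.allclose(causal_lse_attention(Q, K, V), A @ V)
```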


3. Toward Linear-Time Semirings: Separable Approximations to LSE

The obstacle to reusing state in log-semiring attention is non-separability:

$$\log\big(e^{x} + e^{y}\big) = \max(x, y) + \log\big(1 + e^{-|x - y|}\big).$$

The smooth ridge near $x = y$ couples the two inputs; it cannot be written as $f(x) + g(y)$. If we could approximate or replace this with a separable operation, we'd gain query-independent updates.

Approximation 1 — Tropical limit (Max-plus semiring)

Take the temperature τ0:

aτb=τlog(ea/τ+eb/τ)max(a,b).

The tropical semiring (max-plus) is associative, separable, and defines the hard-attention limit.
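A quick numerical check of the limit, with an illustrative score vector: the smooth $\oplus_\tau$ collapses to $\max$, and the corresponding attention weights collapse to a one-hot argmax (hard attention).

```python
import numpy as np

def oplus_tau(a, b, tau):
    # a ⊕_tau b = tau * log(e^{a/tau} + e^{b/tau}), computed stably
    m = np.maximum(a, b)
    return m + tau * np.log(np.exp((a - m) / tau) + np.exp((b - m) / tau))

# The smooth ⊕ collapses to max as tau -> 0 ...
assert abs(oplus_tau(1.3, 0.7, 0.01) - 1.3) < 1e-6

# ... and temperature-scaled attention weights become one-hot at the argmax
scores = np.array([0.2, 1.5, 1.1, -0.3])
w = np.exp((scores - scores.max()) / 0.01)
w /= w.sum()
assert np.allclose(w, np.array([0.0, 1.0, 0.0, 0.0]), atol=1e-12)
```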

Approximation 2 — Polynomial / mean-field expansion

For $|x - y|$ small,

$$\log\big(e^{x} + e^{y}\big) \approx \log 2 + \frac{x + y}{2} + \frac{(x - y)^{2}}{8}.$$

The term $(x + y)/2$ is separable (and $\log 2$ is a constant); the quadratic residual $(x - y)^2/8$ can be approximated by a low-rank kernel.
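The quality of the expansion is easy to probe directly; the tolerances below are illustrative:

```python
import numpy as np

def lse(x, y):
    m = np.maximum(x, y)
    return m + np.log1p(np.exp(-np.abs(x - y)))

def quad(x, y):
    # log 2 + (x + y)/2 + (x - y)^2 / 8
    return np.log(2.0) + (x + y) / 2.0 + (x - y) ** 2 / 8.0

# Tight near the ridge x ≈ y, where the expansion is taken ...
assert abs(lse(1.0, 1.1) - quad(1.0, 1.1)) < 1e-4
# ... but degrades as |x - y| grows (the next Taylor term is -(x-y)^4/192)
assert abs(lse(0.0, 4.0) - quad(0.0, 4.0)) > 0.5
```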

Approximation 3 — Parametric semiring interpolation

Define a family of semirings:

$$a \oplus_\alpha b = \frac{1}{\alpha} \log\big(e^{\alpha a} + e^{\alpha b}\big),$$

where $\alpha$ controls curvature: $\alpha \to \infty$ yields $\max$, while $\alpha \to 0$ yields the arithmetic mean (after removing the additive $\log 2 / \alpha$ offset).

Learning α per layer or head could allow models to interpolate between smooth and sharp attention.
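A sketch of the $\alpha$-family and its two limits, with arbitrary scalar inputs; note the $\log 2/\alpha$ constant that must be subtracted to expose the mean limit:

```python
import numpy as np

def oplus_alpha(a, b, alpha):
    # a ⊕_α b = (1/α) log(e^{αa} + e^{αb}), computed stably
    m = max(a, b)
    return m + np.log(np.exp(alpha * (a - m)) + np.exp(alpha * (b - m))) / alpha

a, b = 0.4, 1.0
# alpha -> inf: sharp (max / hard-attention limit)
assert abs(oplus_alpha(a, b, 50.0) - max(a, b)) < 1e-6
# alpha -> 0: smooth (arithmetic mean, once the log(2)/alpha offset is removed)
assert abs(oplus_alpha(a, b, 1e-3) - np.log(2) / 1e-3 - (a + b) / 2) < 1e-3
```

In a model, $\alpha$ would be a learnable scalar per layer or head, trained by ordinary backpropagation through this expression.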

Approximation 4 — Learned separable decomposition

Because $\mathrm{LSE}(x, y) = \max(x, y) + h(x - y)$ with ridge function $h(\Delta) = \log\big(1 + e^{-|\Delta|}\big)$, we can approximate $h$ by a low-rank separable expansion:

$$h(x - y) \approx \sum_{r=1}^{R} \alpha_r f_r(x)\, g_r(y).$$

This leads to a learned, separable semiring—essentially a feature-map for the LSE surface itself.
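One can probe how low-rank the ridge actually is by sampling $h(x - y) = \log(1 + e^{-|x-y|})$ on a grid and truncating its SVD; the grid, ranks, and error thresholds below are illustrative choices:

```python
import numpy as np

# Sample the ridge h(x - y) on a grid and measure how well rank-R
# truncations (sums of separable terms) reproduce it.
x = np.linspace(-5.0, 5.0, 200)
H = np.log1p(np.exp(-np.abs(x[:, None] - x[None, :])))
U, s, Vt = np.linalg.svd(H)

def rel_err(R):
    # Relative Frobenius error of the best rank-R approximation
    H_R = (U[:, :R] * s[:R]) @ Vt[:R]
    return np.linalg.norm(H - H_R) / np.linalg.norm(H)

# A modest number of separable terms captures most of the surface;
# additional terms keep helping because of the kink along x = y.
assert rel_err(8) < 0.25
assert rel_err(32) < 0.05
assert rel_err(32) < rel_err(8)
```

The kink at $\Delta = 0$ is what keeps the exact rank from being tiny; a learned smooth surrogate for $h$ would trade a little accuracy there for a much shorter expansion.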


4. The Expectation Semiring: Differentiable and Linearizable

The expectation semiring [Eisner, 2002; Goodman, 1999] defines pairs $(w, s)$ with:

$$(w_1, s_1) \oplus (w_2, s_2) = (w_1 + w_2,\; s_1 + s_2), \qquad (w_1, s_1) \otimes (w_2, s_2) = (w_1 w_2,\; w_1 s_2 + w_2 s_1).$$

Folding over a sequence yields the weighted average $S/W$, which corresponds exactly to the attention expectation:

$$o_t = \frac{\sum_j e^{q_t \cdot k_j}\, v_j}{\sum_j e^{q_t \cdot k_j}}.$$

This semiring generalizes easily to geometric aggregation (replace $v_j$ with $\log v_j$). When paired with a separable $\rho(q \cdot k)$, such as a kernel approximation, it produces linear-time expectation-semiring attention.
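A minimal sketch of the $\oplus$-fold; $\otimes$ (which combines weights along paths) is not needed for this flat aggregation, so only $\oplus$ appears:

```python
import numpy as np

def e_oplus(p, q):
    # ⊕ in the expectation semiring: add weights and weighted sums
    return (p[0] + q[0], p[1] + q[1])

def attention_expectation(q, K, V):
    """Fold pairs (e^{q.k_j}, e^{q.k_j} v_j) with ⊕, then read off S / W."""
    W, S = 0.0, np.zeros(V.shape[1])
    for k, v in zip(K, V):
        w = np.exp(q @ k)
        W, S = e_oplus((W, S), (w, w * v))
    return S / W

# The fold reproduces the softmax expectation
rng = np.random.default_rng(3)
K, V = rng.normal(size=(7, 4)), rng.normal(size=(7, 2))
q = rng.normal(size=4)
w = np.exp(K @ q)
assert np.allclose(attention_expectation(q, K, V), (w @ V) / w.sum())
```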


5. Designing New Semirings for Attention

We can generalize the LSE semiring via a monotone warp $\rho$:

$$a \oplus_\rho b = \rho^{-1}\big(\rho(a) + \rho(b)\big), \qquad a \otimes b = a + b.$$

Properties:

- $\oplus_\rho$ is associative and commutative for any strictly monotone (hence invertible) $\rho$.
- $\rho(x) = e^{x}$ recovers the LSE semiring.

To make this linear-time, choose $\rho$ such that $\rho(q \cdot k)$ admits a feature-map factorization:

$$\rho(q \cdot k) \approx \phi(q)^\top \phi(k).$$

Examples of separable ρ

A natural family is a mixture of exponentials, $\rho(x) = \sum_{r=1}^{R} c_r e^{\lambda_r x}$: each term factorizes through an exponential-kernel feature map $\phi_{\lambda_r}$, and the mixture corresponds to concatenation:

$$\phi(q) = \big[\sqrt{c_1}\,\phi_{\lambda_1}(q) \;\big|\; \cdots \;\big|\; \sqrt{c_R}\,\phi_{\lambda_R}(q)\big].$$

The inverse $\rho^{-1}$ can be approximated with a small neural network or a Newton solver. Learning the mixture coefficients $c_r, \lambda_r$ yields a differentiable, adaptive semiring that smoothly interpolates between exponential kernels of varying sharpness.

This MoE-based semiring is particularly appealing: it's differentiable, expressive, and efficiently separable. It generalizes the softmax ($R = 1$) while preserving associativity and allowing learned curvature.
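A sketch of the Newton-solver route to $\rho^{-1}$, using an arbitrary two-component mixture as a stand-in for learned $(c_r, \lambda_r)$:

```python
import numpy as np

# Illustrative mixture-of-exponentials warp rho(x) = sum_r c_r e^{lambda_r x};
# the coefficients below are arbitrary stand-ins for learned values.
c = np.array([0.6, 0.4])
lam = np.array([1.0, 3.0])

def rho(x):
    return float(np.sum(c * np.exp(lam * x)))

def rho_inv(y, x0=0.0, iters=40):
    # Newton's method for rho(x) = y; rho is strictly increasing and
    # convex (c, lam > 0), so the iteration converges for y in its range.
    x = x0
    for _ in range(iters):
        x -= (rho(x) - y) / float(np.sum(c * lam * np.exp(lam * x)))
    return x

def oplus_rho(a, b):
    return rho_inv(rho(a) + rho(b))

# Defining property rho(a ⊕ b) = rho(a) + rho(b), and associativity
assert abs(rho(oplus_rho(0.3, 0.9)) - (rho(0.3) + rho(0.9))) < 1e-9
assert abs(oplus_rho(oplus_rho(0.1, 0.5), 0.9)
           - oplus_rho(0.1, oplus_rho(0.5, 0.9))) < 1e-9
```

In training, the Newton loop is differentiable through its final iterate (or via the implicit function theorem), so the coefficients can be learned end to end.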


6. Relation to Other Architectures


References

[1] Choromanski, K. et al. (2020). Rethinking Attention with Performers. arXiv:2009.14794. https://arxiv.org/abs/2009.14794

[2] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236. https://arxiv.org/abs/2006.16236

[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135

[4] Goodman, J. (1999). Semiring Parsing. Computational Linguistics, 25(4), 573–606. https://aclanthology.org/J99-4004.pdf

[5] Eisner, J. (2002). Parameter Estimation for Probabilistic Finite-State Transducers. Proceedings of ACL 2002. https://aclanthology.org/P02-1001/

[6] Manifest AI. (2024, Aug 15). Symmetric Power Transformers. https://manifestai.com/articles/symmetric-power-transformers/

[7] Kumar, S., Buckman, J., Gelada, C., & Zhang, S. (2025). Conformal Transformations for Symmetric Power Transformers. arXiv:2503.03269. https://arxiv.org/abs/2503.03269

[8] Kacham, P., Mirrokni, V., & Zhong, P. (2024). PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels. ICML 2024 (PMLR Vol. 235). https://arxiv.org/abs/2310.01655 OpenReview: https://openreview.net/forum?id=YkCjojDG3l

[9] Manifest AI. (2024, Dec 10). Improving Symmetric Power Transformers with Conformal Gating. https://manifestai.com/articles/optimizing-symmetric-power-transformers/


#Attention #ML #Semiring