HTTP RateLimit headers
https://dotat.at/@/2026-01-13-http-ratelimit.html
There is an IETF draft that aims to standardize RateLimit header
fields for HTTP. A RateLimit header in a successful response
can inform a client when it might expect to be throttled, so it can
avoid 429 Too Many Requests errors. Servers can also send
RateLimit headers in 429 errors to make the response more
informative.
The draft is in reasonably good shape. However, as written, it seems to require (or at least assume) that the server uses bad quota-reset rate limit algorithms. Quota-reset algorithms encourage clients into cyclic burst-pause behaviour; the draft has several paragraphs discussing this problem.
However, if we consider that RateLimit headers are supposed to tell
the client what acceptable behaviour looks like, they can be used with
any rate limit algorithm. (And it isn't too hard to rephrase the draft
so that it is written in terms of client behaviour instead of server
behaviour.)
When a client has more work to do than will fit in a single window's
quota, linear rate limit algorithms such as GCRA encourage the client
to smooth out its requests nicely. In this article I'll describe how a
server can use a linear rate limit algorithm with HTTP RateLimit
headers.
spec summary
The draft specifies two headers:
RateLimit-Policy: describes input parameters to a rate limit algorithm, which the server chooses based on the request in some unspecified way. The policies are expected to be largely static for a particular client. The parameters are:
- the name of the policy
- pk, the partition key
- q, the quota
- w, the window
- qu, the quota units

RateLimit: describes which policies the server applied to this request, and the output results of the rate limit algorithm. The results are likely to vary per request depending on client behaviour or server load, etc. The results are:
- the name of the policy
- pk, the partition key
- r, the remaining quota
- t, the remaining window
Both headers can list multiple named policies.
To obey a policy, the client should not make more than q requests
within w seconds. When it receives a response containing a
RateLimit header, the client should not make more than r requests
in the following t seconds.
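For example, a server might send something like the following (illustrative values; the field syntax is the structured-fields form used by recent draft revisions, so check the draft for the exact details):

    RateLimit-Policy: "burst";q=100;w=60, "daily";q=1000;w=86400
    RateLimit: "burst";r=40;t=24, "daily";r=900;t=43200

This would tell the client to make at most 40 more requests in the next 24 seconds, and at most 900 more in the next 12 hours.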
policy parameters
The client's maximum sustained rate is
rate = quota / window
On average the interval between requests at the maximum rate is
interval = 1 / rate
= window / quota
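For example, with an illustrative quota of 100 requests and a window of 60 seconds, the maximum sustained rate is 100 / 60 ≈ 1.67 requests per second, and the average interval between requests at that rate is 60 / 100 = 0.6 seconds.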
The draft establishes a standard collection of possible quota units, which determine how the cost of a request is counted.
qu="requests"cost = 1qu="content-bytes"cost = size of contentqu="concurrent-requests"(not applicable to this article)
The partition key pk identifies where the server keeps the
per-client rate limit state. Typically a server will have a relatively
small fixed set of rate limit policies, and a large dynamic collection
of rate limit states.
linear rate limit algorithm
This linear rate limit algorithm is a variant of GCRA adapted to
use HTTP RateLimit header terminology.
The per-client rate limit state is just a "not-before" timestamp. Requests are allowed if they occur after that point in time. The not-before timestamp is normally in the recent past, and requests increase it towards and possibly (when the client is over its limit) beyond the present time.
The output results contain r and t values for the RateLimit
header. These must be whole numbers, though internally the algorithm
needs subsecond precision because there can be many requests per
second.
Get the client's rate limit state. New clients are given a default not-before time in the distant past.
    time = state[pk] or 0

Ensure the not-before time is within the sliding window. The minimum controls the client's burst size quota; the maximum controls the penalty applied to abusive clients. This clamp() also protects against wild clock resets.

    time = clamp(now - window, time, now)

Spend the nominal time consumed by the request.

    time += interval * cost

The request is allowed if it occurs after the not-before time. The difference between the times determines the remaining quota and remaining window. Round the quota down and the window up to whole numbers, so the client is given an underestimate of its service limit.

    if now > time
        state[pk] = time
        d = now - time
        r = floor(d * rate)
        t = ceil(d)
        return ALLOW(r, t)

The request is denied if there is not enough time available. (The commented-out line is discussed below.) The remaining quota is zero, and the remaining window is the not-before time, rounded up as before.

    if time >= now
        # state[pk] = time
        r = 0
        t = ceil(time - now)
        return DENY(r, t)
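For concreteness, here is a minimal runnable sketch of the steps above in Python. The policy numbers, the in-memory dict, and the use of time.time() are illustrative assumptions, not part of the draft.

    import math
    import time as clock

    QUOTA = 100                 # q: requests allowed per window (illustrative)
    WINDOW = 60.0               # w: window length in seconds (illustrative)
    RATE = QUOTA / WINDOW       # maximum sustained rate, requests per second
    INTERVAL = WINDOW / QUOTA   # nominal time consumed by one request

    state = {}                  # pk -> not-before timestamp

    def rate_limit(pk, cost=1, now=None):
        now = clock.time() if now is None else now
        # new clients get a default not-before time in the distant past
        time = state.get(pk, 0.0)
        # clamp into the sliding window; also guards against wild clock resets
        time = max(now - WINDOW, min(time, now))
        # spend the nominal time consumed by this request
        time += INTERVAL * cost
        if now > time:
            # allowed: record the new not-before time and report the
            # remaining quota (rounded down) and remaining window (rounded up)
            state[pk] = time
            d = now - time
            return ("ALLOW", math.floor(d * RATE), math.ceil(d))
        else:
            # denied: remaining quota is zero; uncomment the next line to make
            # denied requests count against the client (see below)
            # state[pk] = time
            return ("DENY", 0, math.ceil(time - now))

The r and t values in the return tuple can go straight into the RateLimit header.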
That's the core of the algorithm. There are a few refinements to consider, to do with strictness and cleanliness.
Over-limit clients can be treated more or less leniently:
- If the server does not update state[pk] when denying a request (like the commented-out line above), requests from the client will be allowed at the permitted rate regardless of how frequently the client makes requests: denied requests are not counted.
- Alternatively, if the server sets state[pk] = time when denying a request, all requests from a client will be denied until its attempts slow down below the permitted rate.
- To apply an extra penalty time-out to clients that make requests within the not-before time, the server can increase the upper bound on the clamp() into the future, as sketched below.
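For the penalty variant, a minimal sketch (reusing the Python sketch above; PENALTY is a hypothetical extra time-out, not something from the draft):

    PENALTY = 30.0  # hypothetical extra time-out, in seconds

    # inside rate_limit(), raise the upper bound of the clamp:
    time = max(now - WINDOW, min(time, now + PENALTY))

Note that in the sketch above the penalty only has an effect if the deny branch also updates state[pk], since that is the only way the stored not-before time can move past the present.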
Finally, it's a good idea to have a background task that reclaims unnecessary state belonging to idle clients:
    for pk, time in state
        if time < now - window
            delete state[pk]
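In Python terms (continuing the sketch above, and copying the items so we can delete from the dict while iterating over it):

    def reclaim(now=None):
        # drop rate limit state for clients that have been idle
        # for longer than one window
        now = clock.time() if now is None else now
        for pk, time in list(state.items()):
            if time < now - WINDOW:
                del state[pk]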
other rate limiters
The following discussion is mostly about explaining my claims in the introductory paragraphs at the start of this article.
There is a family of linear rate limit algorithms: GCRA behaves the same as a leaky bucket or a token bucket rate limiter (when they are implemented properly -- there's no need for a separate periodic leak / refill task), but GCRA uses less space per client (it doesn't need an explicit bucket alongside the timestamp) and its code is simpler.
The HTTP RateLimit draft is based on algorithms that reset the quota
and window at a predictable time regardless of client behaviour. (The
draft allows other algorithms, but most of the text makes this
assumption.) After a client makes a burst of requests, it has to wait
until the reset time, then it is allowed to make another burst of
requests.
Another class of algorithm is based on keeping a log of client activity within a sliding window in the recent past. (This is expensive and wasteful!) Sliding-window algorithms allow new requests when old requests age out, so they encourage clients to repeat past behaviour. If a client starts with a burst then it can burst again in each subsequent window.
By contrast, when a linear rate limiter allows a client to make a fast burst of requests, it shrinks the remaining window and then keeps the effective window small while the client is busy.
Smaller quotas and windows restrict the burst size, so they require smoother behaviour than larger quotas and windows, even though the ratio quota / window describes the same maximum rate.
Automatically shrinking the effective window reduces the permitted burstiness of subsequent requests. As a result, linear rate limiters smooth out the interval between requests, as well as controlling the number of requests per window.
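To illustrate, here is what the Python sketch from earlier does when a client bursts (the costs and timestamps are made up for the example):

    now = 1000.0
    for i in range(4):
        print(rate_limit("bursty-client", cost=30, now=now))
    # prints something like:
    #   ('ALLOW', 70, 42)
    #   ('ALLOW', 40, 24)
    #   ('ALLOW', 10, 6)
    #   ('DENY', 0, 12)

The fourth request overdraws the quota, and the DENY tells the client how long to back off.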
Other algorithms can make clients behave smoothly too.
A linear rate limiter indirectly assesses a client's request rate by relating a request counter to the passage of time. By contrast, an exponential rate limiter directly measures a client's average rate and allows or denies requests by comparing the rate to the limit.
An exponential rate limiter can be used with HTTP RateLimit
headers in a similar manner to a linear rate limiter. That previous
article explains how to calculate the remaining window (called
t_next) when a request is denied. When a request is allowed, the
output results for a RateLimit header can be calculated as follows.
(Beware, this snippet uses an awkward mixture of variable names from
this article and the previous article.)
    if r_now < quota
        # remember the measured rate and when it was measured
        state[pk].time = t_now
        state[pk].rate = r_now
        # headroom between the measured rate and the quota
        a = quota - r_now
        # round down / up so the client gets an underestimate, as before
        r = floor(a)
        t = ceil(a * window / quota)
        return ALLOW(r, t)
The RateLimit header calculation for the exponential rate limiter is
just the reciprocal of the linear rate limiter. It has the same
smoothing effect for the same reasons.
I expect it's possible to retrofit dynamic window shrinking on to other rate limit algorithms. But (as I wrote yesterday) it's better to use a simple GCRA linear rate limiter.
PS. The terminology in this article is different from how
I previously described rate limit algorithms: in the past
I used rate = limit / period where the HTTP RateLimit draft uses
quota / window.