HTTP RateLimit headers
https://dotat.at/@/2026-01-13-http-ratelimit.html
There is an IETF draft that aims to standardize RateLimit header
fields for HTTP. A RateLimit header in a successful response
can inform a client when it might expect to be throttled, so it can
avoid 429 Too Many Requests errors. Servers can also send
RateLimit headers in 429 errors to make the response more
informative.
The draft is in reasonably good shape. However, as written, it seems to require (or at least assume) that the server uses bad quota-reset rate limit algorithms. Quota-reset algorithms encourage clients into cyclic burst-pause behaviour; the draft has several paragraphs discussing this problem.
However, if we consider that RateLimit headers are supposed to tell
the client what acceptable behaviour looks like, they can be used with
any rate limit algorithm. (And it isn't too hard to rephrase the draft
so that it is written in terms of client behaviour instead of server
behaviour.)
When a client has more work to do than will fit in a single window's
quota, linear rate limit algorithms such as GCRA encourage the client
to smooth out its requests nicely. In this article I'll describe how a
server can use a linear rate limit algorithm with HTTP RateLimit
headers.
spec summary
The draft specifies two headers:
RateLimit-Policy: describes input parameters to a rate limit algorithm, which the server chooses based on the request in some unspecified way. The policies are expected to be largely static for a particular client. The parameters are:
- the name of the policy
- pk, the partition key
- q, the quota
- w, the window
- qu, the quota units

RateLimit: describes which policies the server applied to this request, and the output results of the rate limit algorithm. The results are likely to vary per request depending on client behaviour or server load, etc. The results are:
- the name of the policy
- pk, the partition key
- r, the remaining quota
- t, the remaining window
Both headers can list multiple named policies.
To obey a policy, the client should not make more than q requests
within w seconds. When it receives a response containing a
RateLimit header, the client should not make more than r requests
in the following t seconds.
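For example, a server might send something like the following (illustrative values; the field syntax is the structured-fields form used by recent draft revisions, so check the draft for the exact details):

    RateLimit-Policy: "burst";q=100;w=60, "daily";q=1000;w=86400
    RateLimit: "burst";r=40;t=24, "daily";r=900;t=43200

This would tell the client to make at most 40 more requests in the next 24 seconds, and at most 900 more in the next 12 hours.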
policy parameters
The client's maximum sustained rate is
rate = quota / window
On average the interval between requests at the maximum rate is
interval = 1 / rate
= window / quota
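For example, with an illustrative quota of 100 requests and a window of 60 seconds, the maximum sustained rate is 100 / 60 ≈ 1.67 requests per second, and the average interval between requests at that rate is 60 / 100 = 0.6 seconds.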
The draft establishes a standard collection of possible quota units, which determine how the cost of a request is counted.
qu="requests"cost = 1qu="content-bytes"cost = size of contentqu="concurrent-requests"(not applicable to this article)
The partition key pk identifies where the server keeps the
per-client rate limit state. Typically a server will have a relatively
small fixed set of rate limit policies, and a large dynamic collection
of rate limit states.
linear rate limit algorithm
This linear rate limit algorithm is a variant of GCRA adapted to
use HTTP RateLimit header terminology.
The per-client rate limit state is just a "not-before" timestamp. Requests are allowed if they occur after that point in time. The not-before timestamp is normally in the recent past, and requests increase it towards and possibly (when the client is over its limit) beyond the present time.
The output results contain r and t values for the RateLimit
header. These must be whole numbers, though internally the algorithm
needs subsecond precision because there can be many requests per
second.
Get the client's rate limit state. New clients are given a default not-before time in the distant past.
    time = state[pk] or 0

Ensure the not-before time is within the sliding window. The minimum controls the client's burst size quota; the maximum controls the penalty applied to abusive clients. This clamp() also protects against wild clock resets.

    time = clamp(now - window, time, now)

Spend the nominal time consumed by the request.

    time += interval * cost

The request is allowed if it occurs after the not-before time. The difference between the times determines the remaining quota and remaining window. Round the quota down and the window up to whole numbers, so the client is given an underestimate of its service limit.

    if now > time
        state[pk] = time
        d = now - time
        r = floor(d * rate)
        t = ceil(d)
        return ALLOW(r, t)

The request is denied if there is not enough time available. (The commented-out line is discussed below.) The remaining quota is zero, and the remaining window is the not-before time, rounded up as before.

    if time >= now
        # state[pk] = time
        r = 0
        t = ceil(time - now)
        return DENY(r, t)
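For concreteness, here is a minimal runnable sketch of the steps above in Python. The policy numbers, the in-memory dict, and the use of time.time() are illustrative assumptions, not part of the draft.

    import math
    import time as clock

    QUOTA = 100                 # q: requests allowed per window (illustrative)
    WINDOW = 60.0               # w: window length in seconds (illustrative)
    RATE = QUOTA / WINDOW       # maximum sustained rate, requests per second
    INTERVAL = WINDOW / QUOTA   # nominal time consumed by one request

    state = {}                  # pk -> not-before timestamp

    def rate_limit(pk, cost=1, now=None):
        now = clock.time() if now is None else now
        # new clients get a default not-before time in the distant past
        time = state.get(pk, 0.0)
        # clamp into the sliding window; also guards against wild clock resets
        time = max(now - WINDOW, min(time, now))
        # spend the nominal time consumed by this request
        time += INTERVAL * cost
        if now > time:
            # allowed: record the new not-before time and report the
            # remaining quota (rounded down) and remaining window (rounded up)
            state[pk] = time
            d = now - time
            return ("ALLOW", math.floor(d * RATE), math.ceil(d))
        else:
            # denied: remaining quota is zero; uncomment the next line to make
            # denied requests count against the client (see below)
            # state[pk] = time
            return ("DENY", 0, math.ceil(time - now))

The r and t values in the return tuple can go straight into the RateLimit header.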
That's the core of the algorithm. There are a few refinements to consider, to do with strictness and cleanliness.
Over-limit clients can be treated more or less leniently:
- If the server does not update state[pk] when denying a request (like the commented-out line above), requests from the client will be allowed at the permitted rate regardless of how frequently the client makes requests: denied requests are not counted.
- Alternatively, if the server sets state[pk] = time when denying a request, all requests from a client will be denied until its attempts slow down below the permitted rate.
- To apply an extra penalty time-out to clients that make requests within the not-before time, the server can increase the upper bound on the clamp() into the future, as sketched below.
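For the penalty variant, a minimal sketch (reusing the Python sketch above; PENALTY is a hypothetical extra time-out, not something from the draft):

    PENALTY = 30.0  # hypothetical extra time-out, in seconds

    # inside rate_limit(), raise the upper bound of the clamp:
    time = max(now - WINDOW, min(time, now + PENALTY))

Note that in the sketch above the penalty only has an effect if the deny branch also updates state[pk], since that is the only way the stored not-before time can move past the present.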
Finally, it's a good idea to have a background task that reclaims unnecessary state belonging to idle clients:
    for pk, time in state
        if time < now - window
            delete state[pk]
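In Python terms (continuing the sketch above, and copying the items so we can delete from the dict while iterating over it):

    def reclaim(now=None):
        # drop rate limit state for clients that have been idle
        # for longer than one window
        now = clock.time() if now is None else now
        for pk, time in list(state.items()):
            if time < now - WINDOW:
                del state[pk]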
other rate limiters
The following discussion is mostly about explaining my claims in the introductory paragraphs at the start of this article.
There is a family of linear rate limit algorithms: GCRA behaves the same as a leaky bucket or a token bucket rate limiter (when they are implemented properly -- there's no need for a separate periodic leak / refill task), but GCRA uses less space per client (it doesn't need an explicit bucket alongside the timestamp) and its code is simpler.
The HTTP RateLimit draft is based on algorithms that reset the quota
and window at a predictable time regardless of client behaviour. (The
draft allows other algorithms, but most of the text makes this
assumption.) After a client makes a burst of requests, it has to wait
until the reset time, then it is allowed to make another burst of
requests.
Another class of algorithm is based on keeping a log of client activity within a sliding window in the recent past. (This is expensive and wasteful!) Sliding-window algorithms allow new requests when old requests age out, so they encourage clients to repeat past behaviour. If a client starts with a burst then it can burst again in each subsequent window.
By contrast, when a linear rate limiter allows a client to make a fast burst of requests, it shrinks the remaining window and then keeps the effective window small while the client is busy.
Smaller quotas and windows restrict the burst size, so they require smoother behaviour than larger quotas and windows, even though the ratio quota / window describes the same maximum rate.
Automatically shrinking the effective window reduces the permitted burstiness of subsequent requests. As a result, linear rate limiters smooth out the interval between requests, as well as controlling the number of requests per window.
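To illustrate, here is what the Python sketch from earlier does when a client bursts (the costs and timestamps are made up for the example):

    now = 1000.0
    for i in range(4):
        print(rate_limit("bursty-client", cost=30, now=now))
    # prints something like:
    #   ('ALLOW', 70, 42)
    #   ('ALLOW', 40, 24)
    #   ('ALLOW', 10, 6)
    #   ('DENY', 0, 12)

The fourth request overdraws the quota, and the DENY tells the client how long to back off.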
Other algorithms can make clients behave smoothly too.
A linear rate limiter indirectly assesses a client's request rate by relating a request counter to the passage of time. By contrast, an exponential rate limiter directly measures a client's average rate and allows or denies requests by comparing the rate to the limit.
An exponential rate limiter can be used with HTTP RateLimit
headers in a similar manner to a linear rate limiter. That previous
article explains how to calculate the remaining window (called
t_next) when a request is denied. When a request is allowed, the
output results for a RateLimit header can be calculated as follows.
(Beware, this snippet uses an awkward mixture of variable names from
this article and the previous article.)
    if r_now < quota
        # remember the measured rate and when it was measured
        state[pk].time = t_now
        state[pk].rate = r_now
        # headroom between the measured rate and the quota
        a = quota - r_now
        # round down / up so the client gets an underestimate, as before
        r = floor(a)
        t = ceil(a * window / quota)
        return ALLOW(r, t)
The RateLimit header calculation for the exponential rate limiter is
just the reciprocal of the linear rate limiter. It has the same
smoothing effect for the same reasons.
I expect it's possible to retrofit dynamic window shrinking on to other rate limit algorithms. But (as I wrote yesterday) it's better to use a simple GCRA linear rate limiter.
PS. The terminology in this article is different from how
I previously described rate limit algorithms: in the past
I used rate = limit / period where the HTTP RateLimit draft uses
quota / window.