There are two common ways to page database results. The simplest and most common is to use SQL's offset. The slightly more complicated option is to use a "cursor". Each approach has trade-offs. Offset-based paging can be slow for large offsets and, in most cases, requires a second select count(*) query. Cursor-based paging is more complicated to implement and can't be used to jump to a specific page (you're largely limited to prev- and next-style paging).
You're probably already familiar with offset-based paging. You get a "page" and "perpage" value, and you calculate the SQL limit and offset values.
select * from users
order by username
limit 20 offset 0
Assuming a 1-based page, the offset is calculated as (page - 1) * perpage.
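The calculation is a one-liner; here's a minimal sketch (the function name is mine):

```python
def page_to_limit_offset(page, perpage):
    # Guard against bad input: treat anything below page 1 as page 1.
    page = max(page, 1)
    # A 1-based page means page 1 starts at offset 0.
    return perpage, (page - 1) * perpage

# page 1 of 20 per page -> limit 20, offset 0
# page 2 of 20 per page -> limit 20, offset 20
```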
Cursor paging relies on a where filter to get rows that come after (or before) a specific "cursor". This is easier to understand if we assume we're sorting on a unique value (this isn't required for cursor-based paging, it just makes the example easier to follow).
Imagine we have these usernames: duncan, ghanima, jessica, leto, paul, vladimir. Using offset-based paging with a limit of 2, we can get the 2nd page using:
select username from users
order by username
limit 2 offset 2
Using cursor-based paging, our query would look like:
select username from users
where username > 'ghanima'
order by username
limit 2
Here ghanima is the cursor. Both SQL statements return jessica and leto. To get the next page, the cursor would become leto. To get the previous page, the cursor would become jessica AND we'd change our comparison to < (i.e. username < 'jessica').
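Here's a runnable sketch of this example using SQLite's in-memory database (the table and column names mirror the ones above). One detail worth noting: to fetch the previous page you generally sort descending and then reverse the rows, so you get exactly the rows immediately before the cursor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (username text primary key)")
conn.executemany("insert into users values (?)",
    [("duncan",), ("ghanima",), ("jessica",), ("leto",), ("paul",), ("vladimir",)])

# Next page: everything after the cursor 'ghanima', ascending.
next_page = conn.execute(
    "select username from users where username > ? order by username limit 2",
    ("ghanima",)).fetchall()
# -> [('jessica',), ('leto',)]

# Previous page: flip the comparison AND the sort, then reverse the
# fetched rows so they read in ascending order again.
prev_page = conn.execute(
    "select username from users where username < ? order by username desc limit 2",
    ("jessica",)).fetchall()[::-1]
# -> [('duncan',), ('ghanima',)]
```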
In the real world, you'll usually be sorting your results on a non-unique column. Whether you're using an offset or a cursor, you should include a 2nd column in your order by (e.g. order by date, id) as a tie-breaker to ensure consistent ordering. This tie-breaking column is also what lets our cursor keep working. Our where clause now becomes:
where (date > $1 or (date = $1 and id > $2))
where $1 is the date and $2 is the id of the last row on the previous page.
Consider this data:
id | date
1 | 2023-07-28
2 | 2023-07-28
3 | 2023-07-28
4 | 2023-07-29
5 | 2023-07-29
6 | 2023-07-30
If we only use the date for the cursor, our first page is fetched with:
select * from users
where date > '2023-07-08'
order by date, id
limit 2
Our first page will return ids 1 and 2, making 2023-07-28 the cursor for the next page. But where date > '2023-07-28' only matches ids 4, 5 and 6, so our second page will return ids 4 and 5. The third row gets skipped!
Hence we need to include the tie-breaking unique column:
select * from users
where (date > '2023-07-28' or (date = '2023-07-28' and id > 2))
order by date, id
limit 2
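To see the difference, here's the same data in an in-memory SQLite database, paging with a date-only cursor versus the compound (date, id) cursor (a sketch; the table and values match the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer primary key, date text)")
conn.executemany("insert into users values (?, ?)", [
    (1, "2023-07-28"), (2, "2023-07-28"), (3, "2023-07-28"),
    (4, "2023-07-29"), (5, "2023-07-29"), (6, "2023-07-30"),
])

# Page 1 ended at (date='2023-07-28', id=2).
# Date-only cursor: id 3 shares the cursor's date, so it gets skipped.
skipped = conn.execute(
    "select id from users where date > ? order by date, id limit 2",
    ("2023-07-28",)).fetchall()
# -> [(4,), (5,)]

# Compound cursor: rows with the same date but a higher id are included.
correct = conn.execute(
    """select id from users
       where (date > ? or (date = ? and id > ?))
       order by date, id limit 2""",
    ("2023-07-28", "2023-07-28", 2)).fetchall()
# -> [(3,), (4,)]
```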
Of course, you have to pass the "prev" and "next" cursors to the client. As far as the client is concerned, they're just opaque values to include when they want to go forwards or backwards. A simple approach is to glue the fields of your cursor together (maybe with an _), e.g. 2023-07-28_2. I'd probably go even further and encode the "prev" or "next" direction into the cursor, something like: n_2023-07-28_2. Now even more implementation details are hidden from the client.
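One way to build such a cursor (this particular encoding scheme is my own; all that matters is that the value is opaque to the client):

```python
import base64

def encode_cursor(direction, date, row_id):
    # direction is 'n' (next) or 'p' (prev); glue the fields with '_',
    # then base64 the result so the client sees a single opaque token.
    raw = f"{direction}_{date}_{row_id}"
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(token):
    direction, date, row_id = base64.urlsafe_b64decode(token).decode().split("_")
    return direction, date, int(row_id)
```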
In case you're wondering, cursors do work in conjunction with other where filters.
Why is all of this necessary? The issue with offset is that it requires an O(N) scan, where N is the size of the offset. Even if other parts of your query can use an index, you still need to skip over N rows. For example, say we have this query:
select *
from drive_stats
where date >= '2020-01-01'
order by date
limit 100 offset 100000
An index on date can be used, but only to find the first row where date >= '2020-01-01'. You don't want the first row though, you want the 100,000th, and the only way to get to it is to iterate through the rows one at a time. A cursor lets you use that index to jump directly to the record you're after.
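You can see the index being picked up with SQLite's explain query plan (a sketch against a toy table; the real drive_stats data is obviously much bigger):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table drive_stats (id integer primary key, date text)")
conn.execute("create index drive_stats_date on drive_stats (date)")

# The plan shows the drive_stats_date index satisfying both the filter
# and the order by. But the offset still has to be applied by walking
# past 100,000 rows after the index locates the starting point; a
# cursor's where clause starts exactly where the previous page ended.
plan = conn.execute(
    "explain query plan select * from drive_stats "
    "where date >= '2020-01-01' order by date limit 100 offset 100000"
).fetchall()
print(plan)
```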
But I was curious, just how much of a performance difference are we talking about? This is a very informal test, but I re-used the Backblaze Data Dumps that I used for my DuckDB-Proxy demo and tried a few different statements.
The full dump has a bit over 50 million rows. With SQLite, a select statement with a cursor (on an indexed column) is consistently 1ms. It's only at an offset of around 20,000 that things get consistently slower, at 2ms. At 100,000, it's around 8ms. At 1 million, it's 45ms. At 20 million, it's 420ms. And at 50 million, it's around 1 second.
Similarly with PostgreSQL, using a cursor we get a consistent result of around 1ms. Any offset less than 1,000 is equally fast. At 10,000, it takes 3ms. At 50,000, it takes 10ms. At 1 million, it takes 75ms. At 20 million, it takes 1.3 seconds. And at 50 million, it's 4.3 seconds.
Obviously, you should make your own measurements with your own data and your own access patterns. But unless you're regularly dealing with large offsets or high loads, offset-based paging is still pretty fast.
If you were building a catalog system, it might make sense to use offset-based paging in the backend CMS, where users might appreciate being able to jump to specific pages. For the busier public-facing side, you could use cursors, particularly if you're focusing on mobile users, where cursors fit the "page by scroll" paradigm.