
Using CTID Based Pagination for Data Cleanups in PostgreSQL

October 29, 2024
~5 mins

When dealing with very large PostgreSQL tables (we’re talking 15TB+), even routine maintenance like archiving old data can become surprisingly challenging, despite having good indexes. I recently ran into this while trying to clean up very old data on a very large legacy table.

The Problem

My initial approach used standard ID-based pagination. Imagine a query like this:

DELETE FROM large_table
WHERE id IN (
  SELECT id
  FROM large_table
  WHERE created_at < '2023-04-01'
  ORDER BY id
  LIMIT 10000
);

While totally harmless on the surface, this kept timing out (10 min statement_timeout 😬). Why? Because the database in this case needs to:

  1. Scan the index to find old records
  2. Sort them by ID
  3. Perform the deletion
  4. Maintain indexes during deletion

While the deletion and index maintenance aren’t so much of an issue, on a 15TB table the lookup and sort steps become extremely expensive, even with good indexes in place.
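If you want to see where that time actually goes, inspecting the plan for the inner SELECT is a good first step. A quick sketch (note that EXPLAIN ANALYZE executes the query, so expect it to be slow on a table this size):

-- Show the actual plan and buffer usage for the inner SELECT feeding the DELETE
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM large_table
WHERE created_at < '2023-04-01'
ORDER BY id
LIMIT 10000;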

Enter CTID

PostgreSQL maintains a physical location identifier for each row called CTID. It’s a tuple of (page_number, row_number), where page_number is the heap page the row lives on and row_number is the row’s position within that page.

Here’s how you can see it:

SELECT ctid, id, created_at
FROM large_table
LIMIT 5;

-- Results might look like:
-- "(0,1)","1","2023-01-01"  -- Page 0, row 1
-- "(0,2)","2","2023-01-01"  -- Page 0, row 2
-- "(0,3)","3","2023-01-01"  -- Page 0, row 3
-- "(1,1)","4","2023-01-01"  -- Page 1, row 1
-- "(1,2)","5","2023-01-01"  -- Page 1, row 2

The interesting thing about CTID is that it represents the physical storage order. So I started to wonder - what if we could use it to process the table sequentially, page by page:

-- Process first 200,000 physical pages
WITH to_delete AS (
  SELECT ctid
  FROM large_table
  WHERE ctid BETWEEN '(0,0)'::tid AND '(200000,0)'::tid
    AND created_at < '2023-04-01'
  FOR UPDATE SKIP LOCKED
)
DELETE FROM large_table
WHERE ctid IN (SELECT ctid FROM to_delete);

Note that when we specify a CTID range like BETWEEN '(0,0)'::tid AND '(200000,0)'::tid, we’re actually selecting all rows from all positions within those pages. Each page can (and usually does) contain multiple rows, making this an efficient way to read chunks of the table sequentially.

And the benefit? These lookups are super quick and don’t put a lot of strain on the system like the previous query did. The cleanup task can slowly and steadily keep chipping away at the old data.
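One practical detail: if you want a rough idea of how many page ranges you’ll need to walk (rather than probing each range, as the script below does), the table’s page count is available from the catalogs. A quick sketch, assuming the table is named large_table as in the examples above:

-- Approximate page count from the catalogs (refreshed by VACUUM/ANALYZE)
SELECT relpages
FROM pg_class
WHERE relname = 'large_table';

-- Exact current page count, derived from the on-disk relation size
SELECT pg_relation_size('large_table') / current_setting('block_size')::int AS total_pages;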

This is not an ideal solution IMO (see trade-off below), but as a short term mitigation, it gets the job done.

The complete solution

Here’s how we can structure the full cleanup (Ruby / Rails based example).

UPDATE: Note that we use REPEATABLE READ isolation to handle CTID stability issues (thanks to Diego Coladas for this important insight) - CTIDs can change during updates or VACUUM FULL operations, so we need consistent snapshots to avoid missing rows:

page_size = 200_000  # Number of pages to process at once
current_page = 0
cutoff_date = '2023-04-01'
deleted_count = 0

ApplicationRecord.transaction(isolation: :repeatable_read) do  # Ensure consistent CTID snapshots
  ApplicationRecord.with_statement_timeout(TIMEOUT) do
    loop do
      delete_sql = <<~SQL
        WITH to_delete AS (
          SELECT ctid
          FROM large_table
          WHERE ctid BETWEEN '(#{current_page * page_size},0)'::tid
                        AND '(#{(current_page + 1) * page_size},0)'::tid
            AND created_at < '#{cutoff_date}'
          FOR UPDATE OF large_table SKIP LOCKED
        )
        DELETE FROM large_table
        WHERE ctid IN (SELECT ctid FROM to_delete)
        RETURNING id;
      SQL

      result = ActiveRecord::Base.connection.exec_query(delete_sql)
      deleted_count += result.rows.size

      current_page += 1

      # Check if there are any rows in next page range
      check_sql = <<~SQL
        SELECT EXISTS (
          SELECT 1
          FROM large_table
          WHERE ctid BETWEEN '(#{current_page * page_size},0)'::tid
                        AND '(#{(current_page + 1) * page_size},0)'::tid
          LIMIT 1
        );
      SQL

      has_more_rows = ActiveRecord::Base.connection.exec_query(check_sql).rows[0][0]
      break unless has_more_rows
    end
  end
end
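As an aside, with_statement_timeout and TIMEOUT above are app-specific helpers rather than stock Rails. If you’re scripting this in plain SQL instead, the rough equivalent is to cap the session’s statement runtime yourself, for example:

-- Rough plain-SQL equivalent of the app-specific timeout helper above
-- (the post uses a 10 min statement_timeout)
SET statement_timeout = '10min';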

The trade-off

This approach is overall slower than index-based deletion would be (if it worked). We’re doing a sequential scan of the entire table, which means we’re reading every page, even those without old records, which is exactly what the query planner and an index lookup would have saved us from.

However, it’s reliable, each batch is quick, and it doesn’t put strain on the system. Instead of timing out on large deletions, the cleanup now works through the table in small, predictable chunks.

For this specific use case of cleaning up old data, I am totally happy to trade speed for predictability and reliability. A slow-but-successful cleanup is better than a fast-but-always-failing one. Lastly, because this is a periodic job, it’s also OK if the page statistics aren’t fully up to date. We have the option to run an ANALYZE right before the job as well.
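For completeness, that pre-run refresh is a one-liner, and it also keeps catalog page counts like relpages reasonably current:

-- Refresh planner statistics for the table before the cleanup run
ANALYZE large_table;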

What about sequential scans?

You might wonder if simply forcing PostgreSQL to use sequential scans instead of index scans would solve the timeout issues:

SET enable_indexscan = off;
SET enable_seqscan = on;

-- Note: DELETE has no LIMIT clause in PostgreSQL, so batch via a subquery
DELETE FROM large_table
WHERE id IN (
  SELECT id
  FROM large_table
  WHERE created_at < '2023-04-01'
  LIMIT 10000
);

Sadly not. The database would still have to sequentially scan each row (much like our program does, only in one big unbounded statement), and depending on the number of rows and their sizes, this operation would still time out.

Parting notes

I am also using CTID based pagination in my new tool, pg_flo, to bulk copy and move data between PostgreSQL databases, so feel free to check it out and share your feedback.

Thank you for reading 😊

last modified October 30, 2024