PostgreSQL – Does Postgres read through the entire WAL when starting replication from a later point?


I have been investigating an inefficiency in our ETL pipeline, where on one side we have a Postgres database with logical replication slots.

A consumer keeps track of the LSN it has reached, and by design it sometimes disconnects. Upon reconnecting, it tells the database the LSN from which it would like to resume. However, the time it takes for the replication slot to start returning any content seems to depend on the number of WAL files currently on disk. It appears as if the replication slot parses through all of them to find the right position, which can sometimes take hours.

This behavior does not seem normal or necessary. Am I missing something?
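For concreteness, the consumer's reconnect flow looks roughly like the sketch below, using psycopg2's logical replication support. The DSN, slot name, resume LSN, and the `process()` handler are placeholders for illustration, not our actual pipeline code:

```python
# Minimal sketch of a consumer resuming from a stored LSN.
# Assumptions: the slot already exists, and the DSN / slot name /
# resume LSN passed to resume_consumer() are placeholders.

def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN like '0/18875A48' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def resume_consumer(dsn: str, slot: str, resume_lsn: str) -> None:
    # Import deferred so the pure helper above runs without the driver.
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        dsn, connection_factory=psycopg2.extras.LogicalReplicationConnection
    )
    cur = conn.cursor()
    # Ask the server to resume streaming from the LSN we stored earlier.
    cur.start_replication(slot_name=slot, decode=True, start_lsn=resume_lsn)

    def on_message(msg):
        process(msg.payload)  # hypothetical downstream handler
        # Acknowledge progress so the slot's flushed LSN advances.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(on_message)
```

It is after the `start_replication()` call that the delay described above occurs: `consume_stream()` can sit for a long time before the first message arrives.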

Best Answer

I think I found the explanation by looking at open file descriptors while running through some scenarios and waiting for the replication slot to start sending anything. It appears that Postgres does indeed parse a potentially large block of WAL before the replication slot sends its first message.

Transactions spanning several WAL files exacerbate the problem. If the slot's restart/flushed LSN falls in the middle of a long transaction, and you tell Postgres to resume from that point, it will first read all the WAL files that the transaction spans (potentially reaching far back before where you left off, and far ahead of the requested starting point).

In addition to Postgres reading more WAL files than it would with only small transactions, the files that are part of a long transaction seem to take order(s) of magnitude longer to process.
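The file names in the logs below follow directly from the LSNs: with the default 16 MB `wal_segment_size`, a segment holds 0x1000000 bytes of WAL, and the file name is the timeline ID followed by the segment number split into two 8-hex-digit halves. A small sketch of that mapping (assuming timeline 1 and the default segment size):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def lsn_to_int(lsn: str) -> int:
    """'0/179F4948' -> 64-bit integer LSN."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_file_name(lsn: str, timeline: int = 1) -> str:
    """Name of the WAL segment file containing the given LSN."""
    segno = lsn_to_int(lsn) // WAL_SEG_SIZE
    # Timeline (8 hex digits), then the segment number split into two
    # 8-hex-digit halves (256 segments per "xlogid" at 16 MB segments).
    return "%08X%08X%08X" % (timeline, segno // 0x100, segno % 0x100)

# The slot's flushed LSN from the first test run falls in segment ...17,
# which is exactly the first file Postgres opens in the log below:
print(wal_file_name("0/179F4948"))  # -> 000000010000000000000017
```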

Below are some excerpts from my test log, which include the output from a process monitoring osxfs's file descriptors.

Here the requested starting LSN falls in the middle of a long transaction (ending at 0/213AF688), which appears to be why these files take so long to process, and also why Postgres reads that far ahead before it sends the first message (0/1892F2A8):

00:03:52.839   DB current LSN: 0/389866A8, test_slot_1 flushed currently to: 0/179F4948
00:03:52.993   Replication slot test_slot_1 opened with starting LSN: 0/18875A48
00:03:53.160   000000010000000000000017 is open
00:04:06.363   000000010000000000000018 is open
00:04:31.311   000000010000000000000019 is open
00:04:49.656   00000001000000000000001A is open
00:05:01.984   00000001000000000000001B is open
00:05:14.989   00000001000000000000001C is open
00:05:26.570   00000001000000000000001D is open
00:05:38.112   00000001000000000000001E is open
00:05:51.576   00000001000000000000001F is open
00:06:05.497   000000010000000000000020 is open
00:06:26.949   000000010000000000000021 is open
00:06:32.220   Received first message after PT2M39.227S
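As a sanity check on the log above: decoding starts in the segment containing the slot's flushed LSN (0/179F4948, segment ...17) and has to run through the segment containing the transaction's end (0/213AF688, segment ...21). Assuming the default 16 MB segment size, that is eleven consecutive segments, matching the eleven files opened:

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def segno(lsn: str) -> int:
    """Segment number containing a textual LSN like '0/179F4948'."""
    hi, lo = lsn.split("/")
    return ((int(hi, 16) << 32) | int(lo, 16)) // WAL_SEG_SIZE

# From the flushed LSN to the long transaction's end LSN (timeline 1).
first, last = segno("0/179F4948"), segno("0/213AF688")
files = [f"{1:08X}{s // 0x100:08X}{s % 0x100:08X}" for s in range(first, last + 1)]
print(len(files))  # -> 11
print(files[0], files[-1])
```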

These WAL files, on the other hand, are full of single-statement transactions, and Postgres doesn't take long to go through them. It also doesn't seem to read any further ahead than the requested starting point.

23:30:40.633   DB current LSN: 0/389865C0, test_slot_3 flushed currently to: 0/26321EF0
23:30:40.757   Replication slot test_slot_3 opened with starting LSN: 0/37321EB8
23:30:40.847   000000010000000000000026 is open
23:30:41.522   000000010000000000000027 is open
23:30:42.525   000000010000000000000028 is open
23:30:43.395   000000010000000000000029 is open
23:30:44.278   00000001000000000000002A is open
23:30:45.271   00000001000000000000002B is open
23:30:46.143   00000001000000000000002C is open
23:30:47.137   00000001000000000000002D is open
23:30:47.764   00000001000000000000002E is open
23:30:47.891   000000010000000000000031 is open
23:30:48.023   000000010000000000000033 is open
23:30:48.151   000000010000000000000035 is open
23:30:48.279   000000010000000000000037 is open
23:30:48.289   Received first message after PT7.532S

What remains unclear is why there is such a big difference in how fast Postgres can get through the files it needs to read, depending on whether they are part of a single long transaction or full of smaller ones.

I also wish it were possible to control this behavior with some slot option; I get the feeling it is being overly cautious for my specific use case.