Does an SSD page size > filesystem block size increase SSD write amplification?


After reading up on SSD technology, there's a question I can't seem to answer.

Judging by this AnandTech review, most current SSDs use a page size of at least 8KB. As far as I understand, this means that nothing smaller than 8KB can be written to the SSD's NAND at once. Apparently, this helps for random writes as well as other things.

However, many current filesystems use a 4KB block size. Does this mean that, for a single 4K write, the SSD write amplification is at least 2, since a whole 8KB page is written? What about random 4K writes?

Best Answer

I'm not 100% sure, but I have been looking into the same question. I believe it can have an effect, but that also depends on the physical and logical block sizes reported by the disk. If it does, it would affect both performance and durability.

The filesystem block size is the smallest addressable unit for the filesystem. Filesystem requests are passed to the device driver, which fetches the data from the disk. The device driver converts them into device block requests based on the logical/physical block sizes of the disk. On Linux you can see these in e.g. /sys/block/sda/queue/logical_block_size and /sys/block/sda/queue/physical_block_size.
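To check what a disk reports, something like the following should work. It is only a minimal sketch: the function name `read_block_sizes` and the `sysfs_root` parameter are my own, and it assumes a Linux system that exposes the two sysfs files mentioned above.

```python
from pathlib import Path


def read_block_sizes(device, sysfs_root="/sys/block"):
    """Return the logical and physical block sizes (in bytes) reported for a device.

    `device` is a kernel block device name such as "sda". A value of None
    means the corresponding sysfs file was not found.
    """
    queue = Path(sysfs_root) / device / "queue"
    sizes = {}
    for name in ("logical_block_size", "physical_block_size"):
        path = queue / name
        # Each file holds a single decimal number followed by a newline.
        sizes[name] = int(path.read_text().strip()) if path.exists() else None
    return sizes
```

Calling `read_block_sizes("sda")` returns a dict with both values. Note that these are the sizes the disk reports to the OS, which are not necessarily the NAND page size.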

If the device block size is smaller than the device's page size, then requests will be broken down further no matter how large they are to start with. That seems like suboptimal behaviour that should be fixed first; I think it is not uncommon to see a 512B block size reported by a modern SSD with 4/8KiB pages. See Changing sector size on Samsung 840 SSDs.

If the device block size is the same as the page size, say 8KiB, then it seems likely that two 4KiB requests from the filesystem would result in an unnecessary second request, which would be especially bad in the case of writes.
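To put rough numbers on this, here is a back-of-the-envelope model of the worst case. It is an assumption-laden sketch, not a description of how any real controller behaves: it assumes every flush programs whole pages, and it ignores garbage collection, wear levelling and compression. The function name and the `coalesced_writes` parameter are made up for illustration.

```python
def naive_write_amplification(write_bytes, page_bytes, coalesced_writes=1):
    """Estimate NAND bytes programmed per host byte written.

    Assumes the controller buffers `coalesced_writes` host writes of
    `write_bytes` each, then programs them as whole pages of `page_bytes`.
    """
    if write_bytes <= 0 or page_bytes <= 0 or coalesced_writes < 1:
        raise ValueError("sizes must be positive and coalesced_writes >= 1")
    host_bytes = write_bytes * coalesced_writes
    # Integer ceiling division: number of whole pages needed for the data.
    pages = -(-host_bytes // page_bytes)
    return pages * page_bytes / host_bytes


# A lone 4KiB write on 8KiB pages programs a full page.
print(naive_write_amplification(4096, 8192))      # 2.0
# Two 4KiB writes merged by a write cache fill one page exactly.
print(naive_write_amplification(4096, 8192, 2))   # 1.0
```

Under this model, the factor of 2 only applies if nothing merges the two 4KiB writes before they reach the NAND, which is why the caching behaviour matters so much.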

What is not clear to me is the extent to which either of these effects is mitigated by controller or OS caching. jon's answer does not provide much detail or evidence. It sounds pretty likely that reads would be cached, since that would be a risk-free performance and durability optimisation. Write caching is normally enabled for SSDs too, but it is optional, so the outcome then depends on the cache policy and flush interval. On Windows there seems to be some confusion around the write caching options (the accepted answer to What does "Write cache buffer flushing" mean doesn't really explain the difference between the two settings, and other articles also conflate them).