Linux filesystem with inodes close together on the disk


I'd like to make ls -laR /media/myfs on Linux as fast as possible. I'll have 1 million files on the filesystem, 2TB of total file size, and some directories containing as many as 10,000 files. Which filesystem should I use and how should I configure it?

As far as I understand, ls -laR is slow because it has to stat(2) each inode (i.e. 1 million stat(2) calls), and since inodes are distributed randomly on the disk, each stat(2) needs one disk seek.
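
To make that cost visible, here is a minimal Python sketch of the access pattern, not what ls does internally. The function name walk_and_stat is made up for illustration. It issues one lstat per directory entry and sums the sizes of regular files, which is roughly the work ls -laR has to do:

```python
import os
import stat


def walk_and_stat(root):
    """Walk root recursively, calling lstat once per directory entry.

    Returns (number_of_lstat_calls, total_bytes_in_regular_files).
    """
    calls = 0
    total = 0
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                # On Linux this is one lstat(2) per entry, i.e. one inode read.
                st = entry.stat(follow_symlinks=False)
                calls += 1
                if stat.S_ISREG(st.st_mode):
                    total += st.st_size
                elif stat.S_ISDIR(st.st_mode):
                    stack.append(entry.path)
    return calls, total
```

With a cold cache, the call count is a rough upper bound on the number of inode reads, since several inodes can share one disk block.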

Here are some solutions I had in mind, none of which I am satisfied with:

  • Create the filesystem on an SSD, because the seek operations on SSDs are fast. This wouldn't work, because a 2TB SSD doesn't exist, or it's prohibitively expensive.

  • Create a filesystem which spans two block devices: an SSD and a disk; the disk contains file data, and the SSD contains all the metadata (including directory entries, inodes and POSIX extended attributes). Is there a filesystem which supports this? Would it survive a system crash (power outage)?

  • Use find /media/myfs on ext2, ext3 or ext4, instead of ls -laR /media/myfs, because the former can take advantage of the d_type field (see the getdents(2) man page), so it doesn't have to stat. Unfortunately, this doesn't meet my requirements, because I need all file sizes as well, which find /media/myfs doesn't print.

  • Use a filesystem, such as VFAT, which stores inodes in the directory entries. I'd love this one, but VFAT is not reliable and flexible enough for me, and I don't know of any other filesystem which does that. Do you? Of course, storing inodes in the directory entries wouldn't work for files with a link count more than 1, but that's not a problem since I have only a few dozen such files in my use case.

  • Adjust some settings in /proc or sysctl so that inodes are locked to system memory forever. This would not speed up the first ls -laR /media/myfs, but it would make all subsequent invocations amazingly fast. How can I do this? I don't like this idea, because it doesn't speed up the first invocation, which currently takes 30 minutes. Also I'd like to lock the POSIX extended attributes in memory as well. What do I have to do for that?

  • Use a filesystem which has an online defragmentation tool, which can be instructed to relocate inodes to the beginning of the block device. Once the relocation is done, I can run dd if=/dev/sdb of=/dev/null bs=1M count=256 to get the beginning of the block device fetched to the kernel in-memory cache without seeking, and then the stat(2) operations would be fast, because they read from the cache. Is there a way to lock those inodes and/or blocks into memory once they have been read? Which filesystem has such a defragmentation tool?
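
The d_type point from the find bullet above can be sketched in Python, whose os.scandir exposes the entry type taken from the directory entry itself. The function name list_tree_without_stat is hypothetical. On filesystems that fill in d_type, the is_symlink/is_dir/is_file checks below normally need no stat call; on filesystems that report DT_UNKNOWN, Python falls back to a stat. File sizes are deliberately absent, because they live in the inode:

```python
import os


def list_tree_without_stat(root):
    """Yield (path, kind) for everything under root using directory entries only.

    kind is one of "dir", "file", "symlink" or "other". No size is reported,
    because the size is stored in the inode, not in the directory entry.
    """
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_symlink():
                    kind = "symlink"
                elif entry.is_dir(follow_symlinks=False):
                    kind = "dir"
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    kind = "file"
                else:
                    kind = "other"
                yield entry.path, kind
```

This is enough to print names and recurse, which is all a plain find needs, but not enough for the size column of ls -l.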

Best Answer

I'll trade you my answer to your question for your answer to mine: What knobs have to be fiddled in /proc or /sys to keep all the inodes in memory?

Now for my answer to your question:

I'm struggling with a similar-ish issue, where I'm trying to get ls -l to work quickly over NFS for a directory with a few thousand files when the server is heavily loaded.

A NetApp performs the task brilliantly; everything else I've tried so far doesn't.

Researching this, I've found a few filesystems that separate metadata from data, but they all have some shortcomings:

  • dualfs: Has some patches available for 2.4.19 but not much else.
  • lustre: ls -l is a worst-case scenario because all the metadata except the file size is stored on the metadata server.
  • QFS for Solaris, StorNext/Xsan: Not known for great metadata performance without a substantial investment.

So that won't help (unless you can revive dualfs).

The best answer in your case is to increase your spindle count as much as possible. The ugliest - but cheapest and most practical - way to do this is to get a few-years-old enterprise-class JBOD (or two) and a Fibre Channel card off eBay. If you look hard, you should be able to keep your costs under $500 or so. The search terms "146gb" and "73gb" will be of great help. You should be able to convince a seller to make a deal on something like this, since they've got a bunch of them sitting around and hardly any interested buyers.

Set up a RAID-0 stripe across all the drives. Back up your data religiously, because one or two of the drives will inevitably fail. Use tar for the backup instead of cp or rsync so that the receiving single drive won't have to deal with the millions of inodes.
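
If you'd rather script the backup than run tar by hand, a minimal sketch with Python's standard tarfile module looks like this. The function name archive_tree and both paths are placeholders, and the archive must live outside the directory being backed up. The point is the same as with tar: the receiving drive stores one large file instead of a million small ones.

```python
import tarfile


def archive_tree(src_dir, archive_path):
    """Write everything under src_dir into one uncompressed tar file.

    archive_path must be outside src_dir, otherwise the archive would
    try to include itself.
    """
    with tarfile.open(archive_path, "w") as tar:
        # arcname="." stores paths relative to src_dir instead of absolute ones.
        tar.add(src_dir, arcname=".")
```

For example, archive_tree("/media/myfs", "/mnt/backup/myfs.tar") would write a single archive to a backup drive mounted at the assumed path /mnt/backup.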

This is the single cheapest way I've found (at this particular historical moment, anyway) to increase IOPS for filesystems in the 2-4TB range.
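
As a back-of-the-envelope check, using the 30 minutes and 1 million files from the question, here is a small Python sketch. The function names are made up, and the scaling model is an assumption: it treats the lookups as spread evenly over the drives and issued concurrently, which a single-threaded ls may not achieve, so read the result as a best case rather than a prediction.

```python
def per_stat_ms(total_seconds, n_files):
    """Average wall-clock time per stat call, in milliseconds."""
    return total_seconds * 1000.0 / n_files


def ideal_minutes(total_seconds, spindles):
    """Best-case run time in minutes if the seeks split evenly across spindles.

    Assumption: perfect parallelism and no other bottleneck.
    """
    return total_seconds / spindles / 60.0


# 30 minutes for 1 million files works out to about 1.8 ms per stat.
print(per_stat_ms(30 * 60, 1_000_000))
# Best case with 8 drives in the stripe (8 is an arbitrary example).
print(ideal_minutes(30 * 60, 8))
```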

Hope that helps - or is at least interesting!
