LVM Cache Modes and Behavior


In a previous blog post, I showed how to set up an encrypted Linux system across an SSD and an HDD. (Technically, you can set this up with any kind of disk, but I’ll use “SSD” to refer to the “cache device” and “HDD” for the “origin device”.) I still use this setup on my desktop PC, but I’ve now gained some more experience specifically with the SSD+HDD caching part of it, and wanted to discuss some aspects of it here. (The encryption part has just been smooth sailing, as far as I remember; nothing to report there.)

Cache modes

There are two main cache modes that can be used in this cached setup, and the default cache mode actually depends on which “layer” of the system you’re looking at.

writeback is the default mode if you’re using the kernel’s dm-cache feature directly. In this mode, modified blocks are written to the SSD, and reported as complete as soon as the SSD write finishes. The block is marked as “dirty” in the metadata (which also resides on the SSD), but not necessarily written to the HDD at all (but see below for more on that).

writethrough is the default mode in lvmcache, the caching component of LVM that is built on top of dm-cache. (Since we’re using LVM rather than dm-cache directly, writethrough is the effective default for our purposes.) In this mode, modified blocks are written to both the SSD and the HDD, and only reported as complete when both writes have finished. Blocks that are already “clean” remain so.
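
For illustration, this is roughly what overriding the default looks like when attaching a cache in the first place. This is only a sketch: RootVG/cryptroot is the volume from my setup, and CachePool is a hypothetical cache pool LV that would have to exist already:

# lvconvert --type cache --cachepool RootVG/CachePool --cachemode writethrough RootVG/cryptroot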

I haven’t been able to find out why LVM chooses a different default cache mode. My guess is that it’s a “conservative” or “safe” choice: in writeback mode, losing the SSD means potentially losing data. Since the LVM maintainers don’t know how much you trust the SSD (could it break at any moment?), defaulting to the cache mode that always has the latest data on the HDD is safer.

For my use case, I’m not particularly worried about failure of the SSD (at least, not any more than I’m worried about failure of the HDD – and I have backups in any case), so I changed the cache mode to writeback, and it’s been working fairly well. If you’re also happy to put some trust into your SSD, then I recommend you consider using writeback mode too.

Changing cache mode

Changing the cache mode takes a single, relatively simple command:

# lvchange --cachemode writeback RootVG/cryptroot

I did this from an Arch Linux live system (on a USB stick), but as far as I now understand, there’s actually no need to do that. LVM doesn’t have a concept of volumes being “open” or “closed” or “live” or whatever (unlike cryptsetup, which I think is where I got this idea from), so you should also be able to do this while your system is running as usual.
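
To double-check which mode an LV is currently using, you can ask lvs to report it (as far as I know, cache_mode is one of the standard lvs reporting fields):

# lvs -o+cache_mode RootVG/cryptroot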

The actual change of the cache mode happens very quickly – when I rerun that command above, i.e. reset the cache mode to what it already is, it finishes in half a second. However, LVM will first insist that the cache is fully clean before it changes the cache mode, and wait for any dirty blocks to be cleaned (written to the HDD) before it proceeds. This can take quite a while; to see why, let’s discuss when blocks become clean and dirty.

When blocks become clean

If you’re using the writeback mode, then blocks will be written to the SSD and marked as “dirty”. They become marked as “clean” once they’ve been written to the HDD as well, but when does that happen? The documentation on this question isn’t entirely clear, in my opinion. The kernel dm-cache docs say that writes will go only to the cache, suggesting that dirty blocks will accumulate until the whole SSD is dirty, and they are never cleaned unless explicitly requested in some way. On the other hand, the LVM docs say that writeback delays writing data blocks from the cache back to the origin LV, so that you would expect the blocks to become clean after some unknown period of time.

In practice, the dirty cache blocks on my system usually stay in the single or low double digits, and quite often there are 0 dirty cache blocks. (You can see the number with sudo lvs -ocache_dirty_blocks.) So it seems like blocks are cleaned automatically; I haven’t been able to figure out whether this is default kernel behavior (and the kernel docs are misleading), or whether LVM initiates this behavior in some way. But in any case, dirty blocks become clean over time automatically, often within seconds.
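
If you want to watch that number change in (almost) real time, something like this does the trick (run as root, with the volume from above):

# watch -n 5 lvs -o cache_dirty_blocks RootVG/cryptroot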

When blocks become dirty

In writethrough mode, as mentioned above, blocks never become dirty in the first place, and if any blocks are already dirty for whatever reason, they get gradually cleaned as well, as we’ve just seen. In writeback mode, blocks temporarily become dirty when they’re written, until they’re automatically cleaned at a later time, usually fairly soon.

However, there is one other condition that causes blocks to become dirty: after a system crash, every block becomes dirty. As the kernel documentation explains, the dirty state of blocks changes so often that writing it to disk every time would not be feasible. Therefore, when the system boots and encounters a dm-cache device that wasn’t shut down cleanly, it has no way of knowing which blocks’ dirty state was lost in the crash; it has to assume that every single block on the SSD is dirty.

This is fairly catastrophic for I/O performance. With the whole cache device dirty, any time the kernel wants to promote a block to the cache, it first has to evict a different block, and because that block is dirty, it first needs to be written back to the HDD. The only way out of this situation is to clean all the blocks, which the kernel dutifully sets out to do, but now we’re talking about cleaning hundreds of thousands of blocks! Until this is finished – which will take at least several hours, and possibly days – your system performance will suffer.
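
Incidentally, you can also see the kernel’s own view of the cache via dmsetup (for the LV RootVG/cryptroot, the device-mapper name is RootVG-cryptroot). The dirty-block count is one of the numeric fields in the status line; the exact field layout is described in the kernel’s dm-cache documentation:

# dmsetup status RootVG-cryptroot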

Behavior after a system crash

If the situation described above happens to you, there’s no way around it: you will have a bad time using your computer. When a simple sync takes ten minutes, you’ll have wholly new opportunities to discover what actions on your computer block on disk I/O.

I don’t have a “magic bullet” for this problem, and I suspect there isn’t one. You just have to let the kernel work through the backlog of dirty blocks, and the only way you can help is to leave your system running, instead of powering it down or putting it into standby or something similar. (I left it running overnight from an Arch Linux USB stick, instead of booted into the normal system, figuring that this way there would be less other I/O load; but to be honest, this probably made no significant difference.) As mentioned above, you can run sudo lvs -ocache_dirty_blocks to see how many dirty blocks are left; you can try to calculate an estimated remaining time based on how the number decreases over a period of time, which in my experience works well enough for a rough estimate at least (“it should be done by mid-day tomorrow”, estimated on the previous evening, turned out to be correct).
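
Here’s a rough sketch of how that estimate could be automated. The volume name is the one from above, the sampling interval is arbitrary, and the linear extrapolation is obviously naive; run it as root:

#!/bin/sh
# Very rough ETA: sample the dirty-block count twice and extrapolate.
LV=RootVG/cryptroot
INTERVAL=600  # seconds between the two samples
before=$(lvs --noheadings -o cache_dirty_blocks "$LV" | tr -d ' ')
sleep "$INTERVAL"
after=$(lvs --noheadings -o cache_dirty_blocks "$LV" | tr -d ' ')
cleaned=$(( before - after ))
if [ "$cleaned" -gt 0 ]; then
    echo "roughly $(( after * INTERVAL / cleaned / 3600 )) hour(s) remaining"
else
    echo "no measurable progress; try a longer interval"
fi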

However, you can prepare for this situation when you put together your system hardware, by recognizing that the time required to clean the whole cache is directly proportional to the size of the SSD: the larger the SSD, the longer it will take to write its entire contents back to the HDD. My current system has a 4 TB HDD with a 500 GB SSD; I now think that this SSD is rather too large, and 250 GB or 128 GB would probably have been enough. (Of course, this also depends on how often you expect your system to crash.)
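
As a back-of-envelope illustration, assuming a sustained writeback rate of around 20 MB/s (just a guess on my part; cleaning causes largely random writes on the HDD, so the real rate depends heavily on your drive):

500 GB / 20 MB/s = 25,000 s ≈ 7 hours
128 GB / 20 MB/s = 6,400 s ≈ 1.8 hours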

Summary

On the whole, I would still recommend this SSD+HDD setup for a desktop PC. However, looking at current SSD prices, I think within a few years it will become a sensible option to only have one or more large SSDs, with no HDD at all. At that point, we won’t need dm-cache / lvmcache anymore, and can hopefully stop worrying about dirty cache blocks.