| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ================================= |
| dm-pcache — Persistent Cache |
| ================================= |
| |
| *Author: Dongsheng Yang <dongsheng.yang@linux.dev>* |
| |
| This document describes *dm-pcache*, a Device-Mapper target that lets a |
| byte-addressable *DAX* (persistent-memory, “pmem”) region act as a |
| high-performance, crash-persistent cache in front of a slower block |
| device. The code lives in `drivers/md/dm-pcache/`. |
| |
| Quick feature summary |
| ===================== |
| |
| * *Write-back* caching (only mode currently supported). |
| * *16 MiB segments* allocated on the pmem device. |
| * *Data CRC32* verification (optional, per cache). |
| * Crash-safe: every metadata structure is duplicated (`PCACHE_META_INDEX_MAX |
| == 2`) and protected with CRC+sequence numbers. |
| * *Multi-tree indexing* (indexing trees sharded by logical address) for high PMem parallelism |
| * Pure *DAX path* I/O – no extra BIO round-trips |
| * *Log-structured write-back* that preserves backend crash-consistency |
| |
| |
| Constructor |
| =========== |
| |
| :: |
| |
| pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>] |
| |
| ========================= ==================================================== |
| ``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…). |
| All metadata *and* cached blocks are stored here. |
| |
| ``backing_dev`` The slow block device to be cached. |
| |
| ``cache_mode`` Optional, Only ``writeback`` is accepted at the |
| moment. |
| |
| ``data_crc`` Optional, default to ``false`` |
| |
| * ``true`` – store CRC32 for every cached entry |
| and verify on reads |
| * ``false`` – skip CRC (faster) |
| ========================= ==================================================== |
| |
| Example |
| ------- |
| |
| .. code-block:: shell |
| |
| dmsetup create pcache_sdb --table \ |
| "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" |
| |
| The first time a pmem device is used, dm-pcache formats it automatically |
| (super-block, cache_info, etc.). |
| |
| |
| Status line |
| =========== |
| |
| ``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints: |
| |
| :: |
| |
| <sb_flags> <seg_total> <cache_segs> <segs_used> \ |
| <gc_percent> <cache_flags> \ |
| <key_head_seg>:<key_head_off> \ |
| <dirty_tail_seg>:<dirty_tail_off> \ |
| <key_tail_seg>:<key_tail_off> |
| |
| Field meanings |
| -------------- |
| |
| =============================== ============================================= |
| ``sb_flags`` Super-block flags (e.g. endian marker). |
| |
| ``seg_total`` Number of physical *pmem* segments. |
| |
| ``cache_segs`` Number of segments used for cache. |
| |
| ``segs_used`` Segments currently allocated (bitmap weight). |
| |
| ``gc_percent`` Current GC high-water mark (0-90). |
| |
| ``cache_flags`` Bit 0 – DATA_CRC enabled |
| Bit 1 – INIT_DONE (cache initialised) |
| Bits 2-5 – cache mode (0 == WB). |
| |
| ``key_head`` Where new key-sets are being written. |
| |
| ``dirty_tail`` First dirty key-set that still needs |
| write-back to the backing device. |
| |
| ``key_tail`` First key-set that may be reclaimed by GC. |
| =============================== ============================================= |
| |
| |
| Messages |
| ======== |
| |
| *Change GC trigger* |
| |
| :: |
| |
| dmsetup message <dev> 0 gc_percent <0-90> |
| |
| |
| Theory of operation |
| =================== |
| |
| Sub-devices |
| ----------- |
| |
| ==================== ========================================================= |
| backing_dev Any block device (SSD/HDD/loop/LVM, etc.). |
| cache_dev DAX device; must expose direct-access memory. |
| ==================== ========================================================= |
| |
| Segments and key-sets |
| --------------------- |
| |
| * The pmem space is divided into *16 MiB segments*. |
| * Each write allocates space from a per-CPU *data_head* inside a segment. |
| * A *cache-key* records a logical range on the origin and where it lives |
| inside pmem (segment + offset + generation). |
| * 128 keys form a *key-set* (kset); ksets are written sequentially in pmem |
| and are themselves crash-safe (CRC). |
| * The pair *(key_tail, dirty_tail)* delimit clean/dirty and live/dead ksets. |
| |
| Write-back |
| ---------- |
| |
| Dirty keys are queued into a tree; a background worker copies data |
| back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the |
| upper layers forces an immediate metadata commit. |
| |
| Garbage collection |
| ------------------ |
| |
| GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks |
| from *key_tail*, frees segments whose every key has been invalidated, and |
| advances *key_tail*. |
| |
| CRC verification |
| ---------------- |
| |
| If ``data_crc is enabled`` dm-pcache computes a CRC32 over every cached data |
| range when it is inserted and stores it in the on-media key. Reads |
| validate the CRC before copying to the caller. |
| |
| |
| Failure handling |
| ================ |
| |
| * *pmem media errors* – all metadata copies are read with |
| ``copy_mc_to_kernel``; an uncorrectable error logs and aborts initialisation. |
| * *Cache full* – if no free segment can be found, writes return ``-EBUSY``; |
| dm-pcache retries internally (request deferral). |
| * *System crash* – on attach, the driver replays ksets from *key_tail* to |
| rebuild the in-core trees; every segment’s generation guards against |
| use-after-free keys. |
| |
| |
| Limitations & TODO |
| ================== |
| |
| * Only *write-back* mode; other modes planned. |
| * Only FIFO cache invalidate; other (LRU, ARC...) planned. |
| * Table reload is not supported currently. |
| * Discard planned. |
| |
| |
| Example workflow |
| ================ |
| |
| .. code-block:: shell |
| |
| # 1. Create devices |
| dmsetup create pcache_sdb --table \ |
| "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" |
| |
| # 2. Put a filesystem on top |
| mkfs.ext4 /dev/mapper/pcache_sdb |
| mount /dev/mapper/pcache_sdb /mnt |
| |
| # 3. Tune GC threshold to 80 % |
| dmsetup message pcache_sdb 0 gc_percent 80 |
| |
| # 4. Observe status |
| watch -n1 'dmsetup status pcache_sdb' |
| |
| # 5. Shutdown |
| umount /mnt |
| dmsetup remove pcache_sdb |
| |
| |
| ``dm-pcache`` is under active development; feedback, bug reports and patches |
| are very welcome! |