)]}' { "commit": "c901b88adf2d1d6a61d9cb23f9e6833c500c602d", "tree": "95832e510fc2e6ab747bd019df2d84f889059feb", "parents": [ "5613b8354d5a23b499623ccd9ff84573ccb19393" ], "author": { "name": "Mike Christie", "email": "mchristi@redhat.com", "time": "Mon Nov 11 18:19:00 2019 -0600" }, "committer": { "name": "Christian Brauner", "email": "christian.brauner@ubuntu.com", "time": "Wed Jan 29 12:33:56 2020 +0100" }, "message": "prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim\n\nThere are several storage drivers like dm-multipath, iscsi, tcmu-runner,\nand nbd that have userspace components that can run in the IO path. For\nexample, iscsi and nbd\u0027s userspace daemons may need to recreate a socket\nand/or send IO on it, and dm-multipath\u0027s daemon multipathd may need to\nsend SG IO or read/write IO to figure out the state of paths and re-set\nthem up.\n\nIn the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the\nmemalloc_*_save/restore functions to control the allocation behavior,\nbut for userspace we would end up hitting an allocation that ended up\nwriting data back to the same device we are trying to allocate for.\nThe device is then in a state of deadlock, because to execute IO the\ndevice needs to allocate memory, but to allocate memory the memory\nlayers want to execute IO to the device.\n\nHere is an example with nbd using a local userspace daemon that performs\nnetwork IO to a remote server. We are using XFS on top of the nbd device,\nbut it can happen with any FS or other modules layered on top of the nbd\ndevice that can write out data to free memory. Here a nbd daemon helper\nthread, msgr-worker-1, is performing a write/sendmsg on a socket to execute\na request. This kicks off a reclaim operation which results in a WRITE to\nthe nbd device and the nbd thread calling back into the mm layer.\n\n[ 1626.609191] msgr-worker-1 D 0 1026 1 0x00004000\n[ 1626.609193] Call Trace:\n[ 1626.609195] ? __schedule+0x29b/0x630\n[ 1626.609197] ? 
wait_for_completion+0xe0/0x170\n[ 1626.609198] schedule+0x30/0xb0\n[ 1626.609200] schedule_timeout+0x1f6/0x2f0\n[ 1626.609202] ? blk_finish_plug+0x21/0x2e\n[ 1626.609204] ? _xfs_buf_ioapply+0x2e6/0x410\n[ 1626.609206] ? wait_for_completion+0xe0/0x170\n[ 1626.609208] wait_for_completion+0x108/0x170\n[ 1626.609210] ? wake_up_q+0x70/0x70\n[ 1626.609212] ? __xfs_buf_submit+0x12e/0x250\n[ 1626.609214] ? xfs_bwrite+0x25/0x60\n[ 1626.609215] xfs_buf_iowait+0x22/0xf0\n[ 1626.609218] __xfs_buf_submit+0x12e/0x250\n[ 1626.609220] xfs_bwrite+0x25/0x60\n[ 1626.609222] xfs_reclaim_inode+0x2e8/0x310\n[ 1626.609224] xfs_reclaim_inodes_ag+0x1b6/0x300\n[ 1626.609227] xfs_reclaim_inodes_nr+0x31/0x40\n[ 1626.609228] super_cache_scan+0x152/0x1a0\n[ 1626.609231] do_shrink_slab+0x12c/0x2d0\n[ 1626.609233] shrink_slab+0x9c/0x2a0\n[ 1626.609235] shrink_node+0xd7/0x470\n[ 1626.609237] do_try_to_free_pages+0xbf/0x380\n[ 1626.609240] try_to_free_pages+0xd9/0x1f0\n[ 1626.609245] __alloc_pages_slowpath+0x3a4/0xd30\n[ 1626.609251] ? ___slab_alloc+0x238/0x560\n[ 1626.609254] __alloc_pages_nodemask+0x30c/0x350\n[ 1626.609259] skb_page_frag_refill+0x97/0xd0\n[ 1626.609274] sk_page_frag_refill+0x1d/0x80\n[ 1626.609279] tcp_sendmsg_locked+0x2bb/0xdd0\n[ 1626.609304] tcp_sendmsg+0x27/0x40\n[ 1626.609307] sock_sendmsg+0x54/0x60\n[ 1626.609308] ___sys_sendmsg+0x29f/0x320\n[ 1626.609313] ? sock_poll+0x66/0xb0\n[ 1626.609318] ? ep_item_poll.isra.15+0x40/0xc0\n[ 1626.609320] ? ep_send_events_proc+0xe6/0x230\n[ 1626.609322] ? hrtimer_try_to_cancel+0x54/0xf0\n[ 1626.609324] ? ep_read_events_proc+0xc0/0xc0\n[ 1626.609326] ? _raw_write_unlock_irq+0xa/0x20\n[ 1626.609327] ? ep_scan_ready_list.constprop.19+0x218/0x230\n[ 1626.609329] ? __hrtimer_init+0xb0/0xb0\n[ 1626.609331] ? _raw_spin_unlock_irq+0xa/0x20\n[ 1626.609334] ? ep_poll+0x26c/0x4a0\n[ 1626.609337] ? tcp_tsq_write.part.54+0xa0/0xa0\n[ 1626.609339] ? release_sock+0x43/0x90\n[ 1626.609341] ? 
_raw_spin_unlock_bh+0xa/0x20\n[ 1626.609342] __sys_sendmsg+0x47/0x80\n[ 1626.609347] do_syscall_64+0x5f/0x1c0\n[ 1626.609349] ? prepare_exit_to_usermode+0x75/0xa0\n[ 1626.609351] entry_SYSCALL_64_after_hwframe+0x44/0xa9\n\nThis patch adds a new prctl command that daemons can use after they have\ndone their initial setup, and before they start to do allocations that\nare in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE\nflags so both userspace block and FS threads can use it to avoid the\nallocation recursion and try to avoid being throttled while\nwriting out data to free up memory.\n\nSigned-off-by: Mike Christie \u003cmchristi@redhat.com\u003e\nAcked-by: Michal Hocko \u003cmhocko@suse.com\u003e\nTested-by: Masato Suzuki \u003cmasato.suzuki@wdc.com\u003e\nReviewed-by: Damien Le Moal \u003cdamien.lemoal@wdc.com\u003e\nReviewed-by: Bart Van Assche \u003cbvanassche@acm.org\u003e\nReviewed-by: Dave Chinner \u003cdchinner@redhat.com\u003e\nReviewed-by: Darrick J. Wong \u003cdarrick.wong@oracle.com\u003e\nLink: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com\nSigned-off-by: Christian Brauner \u003cchristian.brauner@ubuntu.com\u003e\n", "tree_diff": [ { "type": "modify", "old_id": "240fdb9a60f6851f4f129b2fbb1af5775841abcb", "old_mode": 33188, "old_path": "include/uapi/linux/capability.h", "new_id": "272dc69fa0801efa74c199a7414b936f453d9043", "new_mode": 33188, "new_path": "include/uapi/linux/capability.h" }, { "type": "modify", "old_id": "7da1b37b27aa5b75fb89b79f0d9f193e5021a911", "old_mode": 33188, "old_path": "include/uapi/linux/prctl.h", "new_id": "07b4f8131e362bdc815f37cea0c9067a9464f256", "new_mode": 33188, "new_path": "include/uapi/linux/prctl.h" }, { "type": "modify", "old_id": "a9331f101883c13aa4027f399b9624b3bc62d730", "old_mode": 33188, "old_path": "kernel/sys.c", "new_id": "f9bc5c303e3f42be77cb88c5b4a630f765165ee2", "new_mode": 33188, "new_path": "kernel/sys.c" } ] }