Commit Graph

11660 Commits

Author SHA1 Message Date
Josef Bacik
9c8237021b btrfs: handle errors from btrfs_dec_ref() properly
commit 5eb178f373b4f16f3b42d55ff88fc94dd95b93b1 upstream.

In walk_up_proc() we BUG_ON(ret) from btrfs_dec_ref().  This is
incorrect, we have proper error handling here, return the error.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jianqi Ren <jianqi.ren.cn@windriver.com>
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10 14:33:42 +02:00
Filipe Manana
791d0082c1 btrfs: fix hole expansion when writing at an offset beyond EOF
commit da2dccd7451de62b175fb8f0808d644959e964c7 upstream.

At btrfs_write_check() if our file's i_size is not sector size aligned and
we have a write that starts at an offset larger than the i_size that falls
within the same page of the i_size, then we end up not zeroing the file
range [i_size, write_offset).

The code is this:

    start_pos = round_down(pos, fs_info->sectorsize);
    oldsize = i_size_read(inode);
    if (start_pos > oldsize) {
        /* Expand hole size to cover write data, preventing empty gap */
        loff_t end_pos = round_up(pos + count, fs_info->sectorsize);

        ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
        if (ret)
            return ret;
    }

So if our file's i_size is 90269 bytes and a write at offset 90365 bytes
comes in, we get 'start_pos' set to 90112 bytes, which is less than the
i_size and therefore we don't zero out the range [90269, 90365) by
calling btrfs_cont_expand().

This is an old bug introduced in commit 9036c10208 ("Btrfs: update hole
handling v2"), from 2008, and the buggy code got moved around over the
years.

Fix this by discarding 'start_pos' and comparing against the write offset
('pos') without any alignment.

This bug was recently exposed by test case generic/363 which tests this
scenario by polluting ranges beyond EOF with an mmap write and than verify
that after a file increases we get zeroes for the range which is supposed
to be a hole and not what we wrote with the previous mmaped write.

We're only seeing this exposed now because generic/363 used to run only
on xfs until last Sunday's fstests update.

The test was failing like this:

   $ ./check generic/363
   FSTYP         -- btrfs
   PLATFORM      -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb  3 12:28:46 WET 2025
   MKFS_OPTIONS  -- /dev/sdc
   MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

   generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad)
#      --- tests/generic/363.out	2025-02-05 15:31:14.013646509 +0000
#      +++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad	2025-02-05 17:25:33.112630781 +0000
       @@ -1 +1,46 @@
        QA output created by 363
       +READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk
       +OFFSET      GOOD    BAD     RANGE
       +0x1609d     0x0000  0x3104  0x0
       +operation# (mod 256) for the bad data may be 4
       +0x1609e     0x0000  0x0472  0x1
       +operation# (mod 256) for the bad data may be 4
       ...
       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad'  to see the entire diff)
   Ran: generic/363
   Failures: generic/363
   Failed 1 of 1 tests

Fixes: 9036c10208 ("Btrfs: update hole handling v2")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 13:50:06 +01:00
Filipe Manana
0c7ba1d65c btrfs: avoid monopolizing a core when activating a swap file
commit 2c8507c63f5498d4ee4af404a8e44ceae4345056 upstream.

This commit re-attempts the backport of the change to the linux-6.1.y
branch. Commit bb8e287f59 ("btrfs: avoid monopolizing a core when
activating a swap file") on this branch was reverted.

During swap activation we iterate over the extents of a file and we can
have many thousands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 13:49:55 +01:00
Koichiro Den
8c481939fb Revert "btrfs: avoid monopolizing a core when activating a swap file"
This reverts commit bb8e287f59.

The backport for linux-6.1.y, commit bb8e287f59 ("btrfs: avoid
monopolizing a core when activating a swap file"), inserted
cond_resched() in the wrong location.

Revert it now; a subsequent commit will re-backport the original patch.

Fixes: bb8e287f59 ("btrfs: avoid monopolizing a core when activating a swap file") # linux-6.1.y
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 13:49:55 +01:00
Josef Bacik
f8b7f725ca btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handling
[ Upstream commit 6a4730b325aaa48f7a5d5ba97aff0a955e2d9cec ]

This BUG_ON is meant to catch backref cache problems, but these can
arise from either bugs in the backref cache or corruption in the extent
tree.  Fix it to be a proper error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21 13:49:29 +01:00
Hao-ran Zheng
db4223632a btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()
[ Upstream commit 5324c4e10e9c2ce307a037e904c0d9671d7137d9 ]

A data race occurs when the function `insert_ordered_extent_file_extent()`
and the function `btrfs_inode_safe_disk_i_size_write()` are executed
concurrently. The function `insert_ordered_extent_file_extent()` is not
locked when reading inode->disk_i_size, causing
`btrfs_inode_safe_disk_i_size_write()` to cause data competition when
writing inode->disk_i_size, thus affecting the value of `modify_tree`.

The specific call stack that appears during testing is as follows:

  ============DATA_RACE============
   btrfs_drop_extents+0x89a/0xa060 [btrfs]
   insert_reserved_file_extent+0xb54/0x2960 [btrfs]
   insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
   btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
   btrfs_finish_ordered_io+0x37/0x60 [btrfs]
   finish_ordered_fn+0x3e/0x50 [btrfs]
   btrfs_work_helper+0x9c9/0x27a0 [btrfs]
   process_scheduled_works+0x716/0xf10
   worker_thread+0xb6a/0x1190
   kthread+0x292/0x330
   ret_from_fork+0x4d/0x80
   ret_from_fork_asm+0x1a/0x30
  ============OTHER_INFO============
   btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
   btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
   btrfs_finish_ordered_io+0x37/0x60 [btrfs]
   finish_ordered_fn+0x3e/0x50 [btrfs]
   btrfs_work_helper+0x9c9/0x27a0 [btrfs]
   process_scheduled_works+0x716/0xf10
   worker_thread+0xb6a/0x1190
   kthread+0x292/0x330
   ret_from_fork+0x4d/0x80
   ret_from_fork_asm+0x1a/0x30
  =================================

The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
EOF, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks.

Since this data race does not affect the correctness of the function,
it is a harmless data race, use data_race() to check inode->disk_i_size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21 13:49:29 +01:00
Filipe Manana
6ba4663ada btrfs: fix use-after-free when attempting to join an aborted transaction
[ Upstream commit e2f0943cf37305dbdeaf9846e3c941451bcdef63 ]

When we are trying to join the current transaction and if it's aborted,
we read its 'aborted' field after unlocking fs_info->trans_lock and
without holding any extra reference count on it. This means that a
concurrent task that is aborting the transaction may free the transaction
before we read its 'aborted' field, leading to a use-after-free.

Fix this by reading the 'aborted' field while holding fs_info->trans_lock
since any freeing task must first acquire that lock and set
fs_info->running_transaction to NULL before freeing the transaction.

This was reported by syzbot and Dmitry with the following stack traces
from KASAN:

   ==================================================================
   BUG: KASAN: slab-use-after-free in join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
   Read of size 4 at addr ffff888011839024 by task kworker/u4:9/1128

   CPU: 0 UID: 0 PID: 1128 Comm: kworker/u4:9 Not tainted 6.13.0-rc7-syzkaller-00019-gc45323b7560e #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Workqueue: events_unbound btrfs_async_reclaim_data_space
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    print_address_description mm/kasan/report.c:378 [inline]
    print_report+0x169/0x550 mm/kasan/report.c:489
    kasan_report+0x143/0x180 mm/kasan/report.c:602
    join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
    start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
    flush_space+0x448/0xcf0 fs/btrfs/space-info.c:803
    btrfs_async_reclaim_data_space+0x159/0x510 fs/btrfs/space-info.c:1321
    process_one_work kernel/workqueue.c:3236 [inline]
    process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317
    worker_thread+0x870/0xd30 kernel/workqueue.c:3398
    kthread+0x2f0/0x390 kernel/kthread.c:389
    ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    </TASK>

   Allocated by task 5315:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
    __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
    kasan_kmalloc include/linux/kasan.h:260 [inline]
    __kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4329
    kmalloc_noprof include/linux/slab.h:901 [inline]
    join_transaction+0x144/0xda0 fs/btrfs/transaction.c:308
    start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
    btrfs_create_common+0x1b2/0x2e0 fs/btrfs/inode.c:6572
    lookup_open fs/namei.c:3649 [inline]
    open_last_lookups fs/namei.c:3748 [inline]
    path_openat+0x1c03/0x3590 fs/namei.c:3984
    do_filp_open+0x27f/0x4e0 fs/namei.c:4014
    do_sys_openat2+0x13e/0x1d0 fs/open.c:1402
    do_sys_open fs/open.c:1417 [inline]
    __do_sys_creat fs/open.c:1495 [inline]
    __se_sys_creat fs/open.c:1489 [inline]
    __x64_sys_creat+0x123/0x170 fs/open.c:1489
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   Freed by task 5336:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
    poison_slab_object mm/kasan/common.c:247 [inline]
    __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
    kasan_slab_free include/linux/kasan.h:233 [inline]
    slab_free_hook mm/slub.c:2353 [inline]
    slab_free mm/slub.c:4613 [inline]
    kfree+0x196/0x430 mm/slub.c:4761
    cleanup_transaction fs/btrfs/transaction.c:2063 [inline]
    btrfs_commit_transaction+0x2c97/0x3720 fs/btrfs/transaction.c:2598
    insert_balance_item+0x1284/0x20b0 fs/btrfs/volumes.c:3757
    btrfs_balance+0x992/0x10c0 fs/btrfs/volumes.c:4633
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:906 [inline]
    __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   The buggy address belongs to the object at ffff888011839000
    which belongs to the cache kmalloc-2k of size 2048
   The buggy address is located 36 bytes inside of
    freed 2048-byte region [ffff888011839000, ffff888011839800)

   The buggy address belongs to the physical page:
   page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11838
   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
   flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
   page_type: f5(slab)
   raw: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
   raw: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
   head: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
   head: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
   head: 00fff00000000003 ffffea0000460e01 ffffffffffffffff 0000000000000000
   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
   page dumped because: kasan: bad access detected
   page_owner tracks the page as allocated
   page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 57, tgid 57 (kworker/0:2), ts 67248182943, free_ts 67229742023
    set_page_owner include/linux/page_owner.h:32 [inline]
    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1558
    prep_new_page mm/page_alloc.c:1566 [inline]
    get_page_from_freelist+0x365c/0x37a0 mm/page_alloc.c:3476
    __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4753
    alloc_pages_mpol_noprof+0x3e1/0x780 mm/mempolicy.c:2269
    alloc_slab_page+0x6a/0x110 mm/slub.c:2423
    allocate_slab+0x5a/0x2b0 mm/slub.c:2589
    new_slab mm/slub.c:2642 [inline]
    ___slab_alloc+0xc27/0x14a0 mm/slub.c:3830
    __slab_alloc+0x58/0xa0 mm/slub.c:3920
    __slab_alloc_node mm/slub.c:3995 [inline]
    slab_alloc_node mm/slub.c:4156 [inline]
    __do_kmalloc_node mm/slub.c:4297 [inline]
    __kmalloc_node_track_caller_noprof+0x2e9/0x4c0 mm/slub.c:4317
    kmalloc_reserve+0x111/0x2a0 net/core/skbuff.c:609
    __alloc_skb+0x1f3/0x440 net/core/skbuff.c:678
    alloc_skb include/linux/skbuff.h:1323 [inline]
    alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612
    sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2884
    sock_alloc_send_skb include/net/sock.h:1803 [inline]
    mld_newpack+0x1c3/0xaf0 net/ipv6/mcast.c:1747
    add_grhead net/ipv6/mcast.c:1850 [inline]
    add_grec+0x1492/0x19a0 net/ipv6/mcast.c:1988
    mld_send_cr net/ipv6/mcast.c:2114 [inline]
    mld_ifc_work+0x691/0xd90 net/ipv6/mcast.c:2651
   page last free pid 5300 tgid 5300 stack trace:
    reset_page_owner include/linux/page_owner.h:25 [inline]
    free_pages_prepare mm/page_alloc.c:1127 [inline]
    free_unref_page+0xd3f/0x1010 mm/page_alloc.c:2659
    __slab_free+0x2c2/0x380 mm/slub.c:4524
    qlink_free mm/kasan/quarantine.c:163 [inline]
    qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
    kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
    __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
    kasan_slab_alloc include/linux/kasan.h:250 [inline]
    slab_post_alloc_hook mm/slub.c:4119 [inline]
    slab_alloc_node mm/slub.c:4168 [inline]
    __do_kmalloc_node mm/slub.c:4297 [inline]
    __kmalloc_noprof+0x236/0x4c0 mm/slub.c:4310
    kmalloc_noprof include/linux/slab.h:905 [inline]
    kzalloc_noprof include/linux/slab.h:1037 [inline]
    fib_create_info+0xc14/0x25b0 net/ipv4/fib_semantics.c:1435
    fib_table_insert+0x1f6/0x1f20 net/ipv4/fib_trie.c:1231
    fib_magic+0x3d8/0x620 net/ipv4/fib_frontend.c:1112
    fib_add_ifaddr+0x40c/0x5e0 net/ipv4/fib_frontend.c:1156
    fib_netdev_event+0x375/0x490 net/ipv4/fib_frontend.c:1494
    notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
    __dev_notify_flags+0x207/0x400
    dev_change_flags+0xf0/0x1a0 net/core/dev.c:9045
    do_setlink+0xc90/0x4210 net/core/rtnetlink.c:3109
    rtnl_changelink net/core/rtnetlink.c:3723 [inline]
    __rtnl_newlink net/core/rtnetlink.c:3875 [inline]
    rtnl_newlink+0x1bb6/0x2210 net/core/rtnetlink.c:4012

   Memory state around the buggy address:
    ffff888011838f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff888011838f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   >ffff888011839000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                  ^
    ffff888011839080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888011839100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ==================================================================

Reported-by: syzbot+45212e9d87a98c3f5b42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/678e7da5.050a0220.303755.007c.GAE@google.com/
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/linux-btrfs/CACT4Y+ZFBdo7pT8L2AzM=vegZwjp-wNkVJZQf0Ta3vZqtExaSw@mail.gmail.com/
Fixes: 871383be59 ("btrfs: add missing unlocks to transaction abort paths")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21 13:49:29 +01:00
Qu Wenruo
ca3aa28ea6 btrfs: output the reason for open_ctree() failure
commit d0f038104fa37380e2a725e669508e43d0c503e9 upstream.

There is a recent ML report that mounting a large fs backed by hardware
RAID56 controller (with one device missing) took too much time, and
systemd seems to kill the mount attempt.

In that case, the only error message is:

  BTRFS error (device sdj): open_ctree failed

There is no reason on why the failure happened, making it very hard to
understand the reason.

At least output the error number (in the particular case it should be
-EINTR) to provide some clue.

Link: https://lore.kernel.org/linux-btrfs/9b9c4d2810abcca2f9f76e32220ed9a90febb235.camel@scientia.org/
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 13:49:28 +01:00
Filipe Manana
1ea629e7bb btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
[ Upstream commit f10bef73fb355e3fc85e63a50386798be68ff486 ]

During the unmount path, at close_ctree(), we first stop the cleaner
kthread, using kthread_stop() which frees the associated task_struct, and
then stop and destroy all the work queues. However after we stopped the
cleaner we may still have a worker from the delalloc_workers queue running
inode.c:submit_compressed_extents(), which calls btrfs_add_delayed_iput(),
which in turn tries to wake up the cleaner kthread - which was already
destroyed before, resulting in a use-after-free on the task_struct.

Syzbot reported this with the following stack traces:

  BUG: KASAN: slab-use-after-free in __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
  Read of size 8 at addr ffff8880259d2818 by task kworker/u8:3/52

  CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.13.0-rc1-syzkaller-00002-gcdd30ebb1b9f #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Workqueue: btrfs-delalloc btrfs_work_helper
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:378 [inline]
   print_report+0x169/0x550 mm/kasan/report.c:489
   kasan_report+0x143/0x180 mm/kasan/report.c:602
   __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
   class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
   try_to_wake_up+0xc2/0x1470 kernel/sched/core.c:4205
   submit_compressed_extents+0xdf/0x16e0 fs/btrfs/inode.c:1615
   run_ordered_work fs/btrfs/async-thread.c:288 [inline]
   btrfs_work_helper+0x96f/0xc40 fs/btrfs/async-thread.c:324
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
   </TASK>

  Allocated by task 2:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   unpoison_slab_object mm/kasan/common.c:319 [inline]
   __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
   kasan_slab_alloc include/linux/kasan.h:250 [inline]
   slab_post_alloc_hook mm/slub.c:4104 [inline]
   slab_alloc_node mm/slub.c:4153 [inline]
   kmem_cache_alloc_node_noprof+0x1d9/0x380 mm/slub.c:4205
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
   copy_process+0x5d1/0x3d50 kernel/fork.c:2225
   kernel_clone+0x223/0x870 kernel/fork.c:2807
   kernel_thread+0x1bc/0x240 kernel/fork.c:2869
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:767
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Freed by task 24:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
   poison_slab_object mm/kasan/common.c:247 [inline]
   __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
   kasan_slab_free include/linux/kasan.h:233 [inline]
   slab_free_hook mm/slub.c:2338 [inline]
   slab_free mm/slub.c:4598 [inline]
   kmem_cache_free+0x195/0x410 mm/slub.c:4700
   put_task_struct include/linux/sched/task.h:144 [inline]
   delayed_put_task_struct+0x125/0x300 kernel/exit.c:227
   rcu_do_batch kernel/rcu/tree.c:2567 [inline]
   rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
   handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:554
   run_ksoftirqd+0xca/0x130 kernel/softirq.c:943
   smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Last potentially related work creation:
   kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
   __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:544
   __call_rcu_common kernel/rcu/tree.c:3086 [inline]
   call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
   context_switch kernel/sched/core.c:5372 [inline]
   __schedule+0x1803/0x4be0 kernel/sched/core.c:6756
   __schedule_loop kernel/sched/core.c:6833 [inline]
   schedule+0x14b/0x320 kernel/sched/core.c:6848
   schedule_timeout+0xb0/0x290 kernel/time/sleep_timeout.c:75
   do_wait_for_common kernel/sched/completion.c:95 [inline]
   __wait_for_common kernel/sched/completion.c:116 [inline]
   wait_for_common kernel/sched/completion.c:127 [inline]
   wait_for_completion+0x355/0x620 kernel/sched/completion.c:148
   kthread_stop+0x19e/0x640 kernel/kthread.c:712
   close_ctree+0x524/0xd60 fs/btrfs/disk-io.c:4328
   generic_shutdown_super+0x139/0x2d0 fs/super.c:642
   kill_anon_super+0x3b/0x70 fs/super.c:1237
   btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2112
   deactivate_locked_super+0xc4/0x130 fs/super.c:473
   cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373
   task_work_run+0x24f/0x310 kernel/task_work.c:239
   ptrace_notify+0x2d2/0x380 kernel/signal.c:2503
   ptrace_report_syscall include/linux/ptrace.h:415 [inline]
   ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
   syscall_exit_work+0xc7/0x1d0 kernel/entry/common.c:173
   syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
   __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
   syscall_exit_to_user_mode+0x24a/0x340 kernel/entry/common.c:218
   do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  The buggy address belongs to the object at ffff8880259d1e00
   which belongs to the cache task_struct of size 7424
  The buggy address is located 2584 bytes inside of
   freed 7424-byte region [ffff8880259d1e00, ffff8880259d3b00)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x259d0
  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
  memcg:ffff88802f4b56c1
  flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
  page_type: f5(slab)
  raw: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
  raw: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
  head: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
  head: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
  head: 00fff00000000003 ffffea0000967401 ffffffffffffffff 0000000000000000
  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  page_owner tracks the page as allocated
  page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 12, tgid 12 (kworker/u8:1), ts 7328037942, free_ts 0
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1556
   prep_new_page mm/page_alloc.c:1564 [inline]
   get_page_from_freelist+0x3651/0x37a0 mm/page_alloc.c:3474
   __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4751
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   alloc_slab_page+0x6a/0x140 mm/slub.c:2408
   allocate_slab+0x5a/0x2f0 mm/slub.c:2574
   new_slab mm/slub.c:2627 [inline]
   ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3815
   __slab_alloc+0x58/0xa0 mm/slub.c:3905
   __slab_alloc_node mm/slub.c:3980 [inline]
   slab_alloc_node mm/slub.c:4141 [inline]
   kmem_cache_alloc_node_noprof+0x269/0x380 mm/slub.c:4205
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
   copy_process+0x5d1/0x3d50 kernel/fork.c:2225
   kernel_clone+0x223/0x870 kernel/fork.c:2807
   user_mode_thread+0x132/0x1a0 kernel/fork.c:2885
   call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
  page_owner free stack trace missing

  Memory state around the buggy address:
   ffff8880259d2700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880259d2780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880259d2800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880259d2880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880259d2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Fix this by flushing the delalloc workers queue before stopping the
cleaner kthread.

Reported-by: syzbot+b7cf50a0c173770dcb14@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/674ed7e8.050a0220.48a03.0031.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09 13:30:03 +01:00
Filipe Manana
66376f1a73 btrfs: fix use-after-free when COWing tree bock and tracing is enabled
[ Upstream commit 44f52bbe96dfdbe4aca3818a2534520082a07040 ]

When a COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.

Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.

Reported-by: syzbot+8517da8635307182c8a5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@google.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09 13:30:03 +01:00
Filipe Manana
9caa207980 btrfs: rename and export __btrfs_cow_block()
[ Upstream commit 95f93bc4cbcac6121a5ee85cd5019ee8e7447e0b ]

Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is
to allow to move defrag specific code out of ctree.c and into defrag.c in
one of the next patches.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 44f52bbe96df ("btrfs: fix use-after-free when COWing tree bock and tracing is enabled")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09 13:30:03 +01:00
Johannes Thumshirn
a40de0330a btrfs: fix use-after-free in btrfs_encoded_read_endio()
commit 05b36b04d74a517d6675bf2f90829ff1ac7e28dc upstream.

Shinichiro reported the following use-after free that sometimes is
happening in our CI system when running fstests' btrfs/284 on a TCMU
runner device:

  BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
  Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219

  CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
  Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
  Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6e/0xa0
   ? lock_release+0x708/0x780
   print_report+0x174/0x505
   ? lock_release+0x708/0x780
   ? __virt_addr_valid+0x224/0x410
   ? lock_release+0x708/0x780
   kasan_report+0xda/0x1b0
   ? lock_release+0x708/0x780
   ? __wake_up+0x44/0x60
   lock_release+0x708/0x780
   ? __pfx_lock_release+0x10/0x10
   ? __pfx_do_raw_spin_lock+0x10/0x10
   ? lock_is_held_type+0x9a/0x110
   _raw_spin_unlock_irqrestore+0x1f/0x60
   __wake_up+0x44/0x60
   btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
   btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
   ? lock_release+0x1b0/0x780
   ? trace_lock_acquire+0x12f/0x1a0
   ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
   ? process_one_work+0x7e3/0x1460
   ? lock_acquire+0x31/0xc0
   ? process_one_work+0x7e3/0x1460
   process_one_work+0x85c/0x1460
   ? __pfx_process_one_work+0x10/0x10
   ? assign_work+0x16c/0x240
   worker_thread+0x5e6/0xfc0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x2c3/0x3a0
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x31/0x70
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

  Allocated by task 3661:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   __kasan_kmalloc+0xaa/0xb0
   btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
   send_extent_data+0xf0f/0x24a0 [btrfs]
   process_extent+0x48a/0x1830 [btrfs]
   changed_cb+0x178b/0x2ea0 [btrfs]
   btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
   _btrfs_ioctl_send+0x117/0x330 [btrfs]
   btrfs_ioctl+0x184a/0x60a0 [btrfs]
   __x64_sys_ioctl+0x12e/0x1a0
   do_syscall_64+0x95/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

  Freed by task 3661:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   kasan_save_free_info+0x3b/0x70
   __kasan_slab_free+0x4f/0x70
   kfree+0x143/0x490
   btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
   send_extent_data+0xf0f/0x24a0 [btrfs]
   process_extent+0x48a/0x1830 [btrfs]
   changed_cb+0x178b/0x2ea0 [btrfs]
   btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
   _btrfs_ioctl_send+0x117/0x330 [btrfs]
   btrfs_ioctl+0x184a/0x60a0 [btrfs]
   __x64_sys_ioctl+0x12e/0x1a0
   do_syscall_64+0x95/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

  The buggy address belongs to the object at ffff888106a83f00
   which belongs to the cache kmalloc-rnd-07-96 of size 96
  The buggy address is located 24 bytes inside of
   freed 96-byte region [ffff888106a83f00, ffff888106a83f60)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
  flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
  page_type: f5(slab)
  raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
  raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
   ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
  >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                              ^
   ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
   ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ==================================================================

Further analyzing the trace and the crash dump's vmcore file shows that
the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
the wait_queue that is in the private data passed to the end_io handler.

Commit 4ff47df40447 ("btrfs: move priv off stack in
btrfs_encoded_read_regular_fill_pages()") moved 'struct
btrfs_encoded_read_private' off the stack.

Before that commit one can see a corruption of the private data when
analyzing the vmcore after a crash:

*(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
	.wait = (wait_queue_head_t){
		.lock = (spinlock_t){
			.rlock = (struct raw_spinlock){
				.raw_lock = (arch_spinlock_t){
					.val = (atomic_t){
						.counter = (int)-2005885696,
					},
					.locked = (u8)0,
					.pending = (u8)157,
					.locked_pending = (u16)40192,
					.tail = (u16)34928,
				},
				.magic = (unsigned int)536325682,
				.owner_cpu = (unsigned int)29,
				.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
				.dep_map = (struct lockdep_map){
					.key = (struct lock_class_key *)0xffff8881575a3b6c,
					.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
					.name = (const char *)0xffff88815626f100 = "",
					.wait_type_outer = (u8)37,
					.wait_type_inner = (u8)178,
					.lock_type = (u8)154,
				},
			},
			.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
			.dep_map = (struct lockdep_map){
				.key = (struct lock_class_key *)0xffff8881575a3b6c,
				.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
				.name = (const char *)0xffff88815626f100 = "",
				.wait_type_outer = (u8)37,
				.wait_type_inner = (u8)178,
				.lock_type = (u8)154,
			},
		},
		.head = (struct list_head){
			.next = (struct list_head *)0x112cca,
			.prev = (struct list_head *)0x47,
		},
	},
	.pending = (atomic_t){
		.counter = (int)-1491499288,
	},
	.status = (blk_status_t)130,
}

Here we can see several indicators of in-memory data corruption, e.g. the
large negative atomic values of ->pending or
->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus
pointer values for ->wait->head.

To fix this, change atomic_dec_return() to atomic_dec_and_test() to fix the
corruption, as atomic_dec_return() is defined as two instructions on
x86_64, whereas atomic_dec_and_test() is defined as a single atomic
operation. This can lead to a situation where counter value is already
decremented but the if statement in btrfs_encoded_read_endio() is not
completely processed, i.e. the 0 test has not completed. If another thread
continues executing btrfs_encoded_read_regular_fill_pages() the
atomic_dec_return() there can see an already updated ->pending counter and
continues by freeing the private data. Continuing in the endio handler the
test for 0 succeeds and the wait_queue is woken up, resulting in a
use-after-free.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alva Lan <alvalan9@foxmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-09 13:29:56 +01:00
Qu Wenruo
459ef4a242 btrfs: sysfs: fix direct super block member reads
commit fca432e73db2bec0fdbfbf6d98d3ebcd5388a977 upstream.

The following sysfs entries are reading super block member directly,
which can have a different endian and cause wrong values:

- sys/fs/btrfs/<uuid>/nodesize
- sys/fs/btrfs/<uuid>/sectorsize
- sys/fs/btrfs/<uuid>/clone_alignment

Thankfully those values (nodesize and sectorsize) are always aligned
inside the btrfs_super_block, so it won't trigger unaligned read errors,
just endian problems.

Fix them by using the native cached members instead.

Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
CC: stable@vger.kernel.org
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-02 10:30:55 +01:00
Filipe Manana
bb8e287f59 btrfs: avoid monopolizing a core when activating a swap file
commit 2c8507c63f5498d4ee4af404a8e44ceae4345056 upstream.

During swap activation we iterate over the extents of a file and we can
have many thousands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-02 10:30:55 +01:00
Qu Wenruo
0323e6b113 btrfs: tree-checker: reject inline extent items with 0 ref count
commit dfb92681a19e1d5172420baa242806414b3eff6f upstream.

[BUG]
There is a bug report in the mailing list where btrfs_run_delayed_refs()
failed to drop the ref count for logical 25870311358464 num_bytes
2113536.

The involved leaf dump looks like this:

  item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
    extent refs 1 gen 84178 flags 1
    ref#0: shared data backref parent 32399126528000 count 0 <<<
    ref#1: shared data backref parent 31808973717504 count 1

Notice the count number is 0.

[CAUSE]
There is no concrete evidence yet, but considering 0 -> 1 is also a
single bit flipped, it's possible that hardware memory bitflip is
involved, causing the on-disk extent tree to be corrupted.

[FIX]
To prevent us reading such corrupted extent item, or writing such
damaged extent item back to disk, enhance the handling of
BTRFS_EXTENT_DATA_REF_KEY and BTRFS_SHARED_DATA_REF_KEY keys for both
inlined and key items, to detect such 0 ref count and reject them.

CC: stable@vger.kernel.org # 5.4+
Link: https://lore.kernel.org/linux-btrfs/7c69dd49-c346-4806-86e7-e6f863a66f48@app.fastmail.com/
Reported-by: Frankie Fisher <frankie@terrorise.me.uk>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-27 13:52:59 +01:00
Filipe Manana
b15437fb26 btrfs: fix missing snapshot drew unlock when root is dead during swap activation
[ Upstream commit 9c803c474c6c002d8ade68ebe99026cc39c37f85 ]

When activating a swap file we acquire the root's snapshot drew lock and
then check if the root is dead, failing and returning with -EPERM if it's
dead but without unlocking the root's snapshot lock. Fix this by adding
the missing unlock.

Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 19:54:50 +01:00
Boris Burkov
bcea29dff6 btrfs: do not clear read-only when adding sprout device
[ Upstream commit 70958a949d852cbecc3d46127bf0b24786df0130 ]

If you follow the seed/sprout wiki, it suggests the following workflow:

btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt

The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the users perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.

As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.

I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.

A new fstest I have written reproduces the bug and confirms the fix.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 19:54:37 +01:00
Qu Wenruo
72c49143fb btrfs: avoid unnecessary device path update for the same device
[ Upstream commit 2e8b6bc0ab41ce41e6dfcc204b6cc01d5abbc952 ]

[PROBLEM]
It is very common for udev to trigger device scan, and every time a
mounted btrfs device got re-scan from different soft links, we will get
some of unnecessary device path updates, this is especially common
for LVM based storage:

 # lvs
  scratch1 test -wi-ao---- 10.00g
  scratch2 test -wi-a----- 10.00g
  scratch3 test -wi-a----- 10.00g
  scratch4 test -wi-a----- 10.00g
  scratch5 test -wi-a----- 10.00g
  test     test -wi-a----- 10.00g

 # mkfs.btrfs -f /dev/test/scratch1
 # mount /dev/test/scratch1 /mnt/btrfs
 # dmesg -c
 [  205.705234] BTRFS: device fsid 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 devid 1 transid 6 /dev/mapper/test-scratch1 (253:4) scanned by mount (1154)
 [  205.710864] BTRFS info (device dm-4): first mount of filesystem 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9
 [  205.711923] BTRFS info (device dm-4): using crc32c (crc32c-intel) checksum algorithm
 [  205.713856] BTRFS info (device dm-4): using free-space-tree
 [  205.722324] BTRFS info (device dm-4): checking UUID tree

So far so good, but even if we just touched any soft link of
"dm-4", we will get quite some unnecessary device path updates.

 # touch /dev/mapper/test-scratch1
 # dmesg -c
 [  469.295796] BTRFS info: devid 1 device path /dev/mapper/test-scratch1 changed to /dev/dm-4 scanned by (udev-worker) (1221)
 [  469.300494] BTRFS info: devid 1 device path /dev/dm-4 changed to /dev/mapper/test-scratch1 scanned by (udev-worker) (1221)

Such device path rename is unnecessary and can lead to random path
change due to the udev race.

[CAUSE]
Inside device_list_add(), we are using a very primitive way checking if
the device has changed, strcmp().

Which can never handle links well, no matter if it's hard or soft links.

So every different link of the same device will be treated as a different
device, causing the unnecessary device path update.

[FIX]
Introduce a helper, is_same_device(), and use path_equal() to properly
detect the same block device.
So that the different soft links won't trigger the rename race.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641
Reported-by: Fabian Vogt <fvogt@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 19:54:37 +01:00
Josef Bacik
44a2c518ab btrfs: don't BUG_ON on ENOMEM from btrfs_lookup_extent_info() in walk_down_proc()
commit a580fb2c3479d993556e1c31b237c9e5be4944a3 upstream.

We handle errors here properly, ENOMEM isn't fatal, return the error.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Keerthana K <keerthana.kalyanasundaram@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-14 19:54:17 +01:00
Filipe Manana
6370db28af btrfs: ref-verify: fix use-after-free after invalid ref action
[ Upstream commit 7c4e39f9d2af4abaf82ca0e315d1fd340456620f ]

At btrfs_ref_tree_mod() after we successfully inserted the new ref entry
(local variable 'ref') into the respective block entry's rbtree (local
variable 'be'), if we find an unexpected action of BTRFS_DROP_DELAYED_REF,
we error out and free the ref entry without removing it from the block
entry's rbtree. Then in the error path of btrfs_ref_tree_mod() we call
btrfs_free_ref_cache(), which iterates over all block entries and then
calls free_block_entry() for each one, and there we will trigger a
use-after-free when we are called against the block entry to which we
added the freed ref entry to its rbtree, since the rbtree still points
to the block entry, as we didn't remove it from the rbtree before freeing
it in the error path at btrfs_ref_tree_mod(). Fix this by removing the
new ref entry from the rbtree before freeing it.

Syzbot report this with the following stack traces:

   BTRFS error (device loop0 state EA):   Ref action 2, root 5, ref_root 0, parent 8564736, owner 0, offset 0, num_refs 18446744073709551615
      __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
      update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
      btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
      btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
      btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
      btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4314
      btrfs_insert_empty_item fs/btrfs/ctree.h:669 [inline]
      btrfs_insert_orphan_item+0x1f1/0x320 fs/btrfs/orphan.c:23
      btrfs_orphan_add+0x6d/0x1a0 fs/btrfs/inode.c:3482
      btrfs_unlink+0x267/0x350 fs/btrfs/inode.c:4293
      vfs_unlink+0x365/0x650 fs/namei.c:4469
      do_unlinkat+0x4ae/0x830 fs/namei.c:4533
      __do_sys_unlinkat fs/namei.c:4576 [inline]
      __se_sys_unlinkat fs/namei.c:4569 [inline]
      __x64_sys_unlinkat+0xcc/0xf0 fs/namei.c:4569
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
      entry_SYSCALL_64_after_hwframe+0x77/0x7f
   BTRFS error (device loop0 state EA):   Ref action 1, root 5, ref_root 5, parent 0, owner 260, offset 0, num_refs 1
      __btrfs_mod_ref+0x76b/0xac0 fs/btrfs/extent-tree.c:2521
      update_ref_for_cow+0x96a/0x11f0
      btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
      btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
      btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
      btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
      __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
      btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
      __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
      __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
      btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
      prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
      relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
      btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
      btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
      __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
      btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
   BTRFS error (device loop0 state EA):   Ref action 2, root 5, ref_root 0, parent 8564736, owner 0, offset 0, num_refs 18446744073709551615
      __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
      update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
      btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
      btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
      btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
      btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
      __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
      btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
      __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
      __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
      btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
      prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
      relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
      btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
      btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
      __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
      btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
   ==================================================================
   BUG: KASAN: slab-use-after-free in rb_first+0x69/0x70 lib/rbtree.c:473
   Read of size 8 at addr ffff888042d1af38 by task syz.0.0/5329

   CPU: 0 UID: 0 PID: 5329 Comm: syz.0.0 Not tainted 6.12.0-rc7-syzkaller #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    print_address_description mm/kasan/report.c:377 [inline]
    print_report+0x169/0x550 mm/kasan/report.c:488
    kasan_report+0x143/0x180 mm/kasan/report.c:601
    rb_first+0x69/0x70 lib/rbtree.c:473
    free_block_entry+0x78/0x230 fs/btrfs/ref-verify.c:248
    btrfs_free_ref_cache+0xa3/0x100 fs/btrfs/ref-verify.c:917
    btrfs_ref_tree_mod+0x139f/0x15e0 fs/btrfs/ref-verify.c:898
    btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
    __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
    update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
    btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
    btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
    btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
    btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
    __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
    btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
    __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
    __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
    btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
    prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
    relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:907 [inline]
    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f996df7e719
   RSP: 002b:00007f996ede7038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f996e135f80 RCX: 00007f996df7e719
   RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000004
   RBP: 00007f996dff139e R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000000000 R14: 00007f996e135f80 R15: 00007fff79f32e68
    </TASK>

   Allocated by task 5329:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
    __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
    kasan_kmalloc include/linux/kasan.h:257 [inline]
    __kmalloc_cache_noprof+0x19c/0x2c0 mm/slub.c:4295
    kmalloc_noprof include/linux/slab.h:878 [inline]
    kzalloc_noprof include/linux/slab.h:1014 [inline]
    btrfs_ref_tree_mod+0x264/0x15e0 fs/btrfs/ref-verify.c:701
    btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
    __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
    update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
    btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
    btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
    btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
    btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
    __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
    btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
    __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
    __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
    btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
    prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
    relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:907 [inline]
    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   Freed by task 5329:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
    poison_slab_object mm/kasan/common.c:247 [inline]
    __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
    kasan_slab_free include/linux/kasan.h:230 [inline]
    slab_free_hook mm/slub.c:2342 [inline]
    slab_free mm/slub.c:4579 [inline]
    kfree+0x1a0/0x440 mm/slub.c:4727
    btrfs_ref_tree_mod+0x136c/0x15e0
    btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
    __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
    update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
    btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
    btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
    btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
    btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
    __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
    btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
    __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
    __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
    btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
    prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
    relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:907 [inline]
    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   The buggy address belongs to the object at ffff888042d1af00
    which belongs to the cache kmalloc-64 of size 64
   The buggy address is located 56 bytes inside of
    freed 64-byte region [ffff888042d1af00, ffff888042d1af40)

   The buggy address belongs to the physical page:
   page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x42d1a
   anon flags: 0x4fff00000000000(node=1|zone=1|lastcpupid=0x7ff)
   page_type: f5(slab)
   raw: 04fff00000000000 ffff88801ac418c0 0000000000000000 dead000000000001
   raw: 0000000000000000 0000000000200020 00000001f5000000 0000000000000000
   page dumped because: kasan: bad access detected
   page_owner tracks the page as allocated
   page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 5055, tgid 5055 (dhcpcd-run-hook), ts 40377240074, free_ts 40376848335
    set_page_owner include/linux/page_owner.h:32 [inline]
    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1541
    prep_new_page mm/page_alloc.c:1549 [inline]
    get_page_from_freelist+0x3649/0x3790 mm/page_alloc.c:3459
    __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4735
    alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
    alloc_slab_page+0x6a/0x140 mm/slub.c:2412
    allocate_slab+0x5a/0x2f0 mm/slub.c:2578
    new_slab mm/slub.c:2631 [inline]
    ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3818
    __slab_alloc+0x58/0xa0 mm/slub.c:3908
    __slab_alloc_node mm/slub.c:3961 [inline]
    slab_alloc_node mm/slub.c:4122 [inline]
    __do_kmalloc_node mm/slub.c:4263 [inline]
    __kmalloc_noprof+0x25a/0x400 mm/slub.c:4276
    kmalloc_noprof include/linux/slab.h:882 [inline]
    kzalloc_noprof include/linux/slab.h:1014 [inline]
    tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
    tomoyo_encode+0x26f/0x540 security/tomoyo/realpath.c:80
    tomoyo_realpath_from_path+0x59e/0x5e0 security/tomoyo/realpath.c:283
    tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
    tomoyo_check_open_permission+0x255/0x500 security/tomoyo/file.c:771
    security_file_open+0x777/0x990 security/security.c:3109
    do_dentry_open+0x369/0x1460 fs/open.c:945
    vfs_open+0x3e/0x330 fs/open.c:1088
    do_open fs/namei.c:3774 [inline]
    path_openat+0x2c84/0x3590 fs/namei.c:3933
   page last free pid 5055 tgid 5055 stack trace:
    reset_page_owner include/linux/page_owner.h:25 [inline]
    free_pages_prepare mm/page_alloc.c:1112 [inline]
    free_unref_page+0xcfb/0xf20 mm/page_alloc.c:2642
    free_pipe_info+0x300/0x390 fs/pipe.c:860
    put_pipe_info fs/pipe.c:719 [inline]
    pipe_release+0x245/0x320 fs/pipe.c:742
    __fput+0x23f/0x880 fs/file_table.c:431
    __do_sys_close fs/open.c:1567 [inline]
    __se_sys_close fs/open.c:1552 [inline]
    __x64_sys_close+0x7f/0x110 fs/open.c:1552
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   Memory state around the buggy address:
    ffff888042d1ae00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    ffff888042d1ae80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
   >ffff888042d1af00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                                           ^
    ffff888042d1af80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
    ffff888042d1b000: 00 00 00 00 00 fc fc 00 00 00 00 00 fc fc 00 00

Reported-by: syzbot+7325f164162e200000c1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/673723eb.050a0220.1324f8.00a8.GAE@google.com/T/#u
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 19:54:10 +01:00
Lizhi Xu
db66fb87c2 btrfs: add a sanity check for btrfs root in btrfs_search_slot()
[ Upstream commit 3ed51857a50f530ac7a1482e069dfbd1298558d4 ]

Syzbot reports a null-ptr-deref in btrfs_search_slot().

The reproducer is using rescue=ibadroots, and the extent tree root is
corrupted thus the extent tree is NULL.

When scrub tries to search the extent tree to gather the needed extent
info, btrfs_search_slot() doesn't check if the target root is NULL or
not, resulting the null-ptr-deref.

Add sanity check for btrfs root before using it in btrfs_search_slot().

Reported-by: syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com
Fixes: 42437a6386 ("btrfs: introduce mount option rescue=ignorebadroots")
Link: https://syzkaller.appspot.com/bug?extid=3030e17bd57a73d39bd7
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 19:54:10 +01:00
ChenXiaoSong
1dfc86bea8 btrfs: add might_sleep() annotations
[ Upstream commit a4c853af0c511d7e0f7cb306bbc8a4f1dbdb64ca ]

Add annotations to functions that might sleep due to allocations or IO
and could be called from various contexts. In case of btrfs_search_slot
it's not obvious why it would sleep:

    btrfs_search_slot
      setup_nodes_for_search
        reada_for_balance
          btrfs_readahead_node_child
            btrfs_readahead_tree_block
              btrfs_find_create_tree_block
                alloc_extent_buffer
                  kmem_cache_zalloc
                    /* allocate memory non-atomically, might sleep */
                    kmem_cache_alloc(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO)
              read_extent_buffer_pages
                submit_extent_page
                  /* disk IO, might sleep */
                  submit_one_bio

Other examples where the sleeping could happen is in 3 places might
sleep in update_qgroup_limit_item(), as shown below:

  update_qgroup_limit_item
    btrfs_alloc_path
      /* allocate memory non-atomically, might sleep */
      kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS)

Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 3ed51857a50f ("btrfs: add a sanity check for btrfs root in btrfs_search_slot()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 19:54:10 +01:00
Filipe Manana
dcbcd9c88e btrfs: don't loop for nowait writes when checking for cross references
[ Upstream commit ed67f2a913a4f0fc505db29805c41dd07d3cb356 ]

When checking for delayed refs when verifying if there are cross
references for a data extent, we stop if the path has nowait set and we
can't try lock the delayed ref head's mutex, returning -EAGAIN with the
goal of making a write fallback to a blocking context. However we ignore
the -EAGAIN at btrfs_cross_ref_exist() when check_delayed_ref() returns
it, and keep looping instead of immediately returning the -EAGAIN to the
caller.

Fix this by not looping if we get -EAGAIN and we have a nowait path.

Fixes: 26ce911446 ("btrfs: make can_nocow_extent nowait compatible")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 19:54:10 +01:00
Boris Burkov
945559be6e btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
commit 74e97958121aa1f5854da6effba70143f051b0cd upstream.

Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.

However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.

The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:

1. any failure before record_root_in_trans is called successfully
   results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
   transaction owns freeing the reservation.

This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.

Fixes: e85fde5162 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
[Xiangyu: BP to fix CVE-2024-35956, due to 6.1 btrfs_subvolume_release_metadata()
defined in ctree.h, modified the header file name from root-tree.h to ctree.h]
Signed-off-by: Xiangyu Chen <xiangyu.chen@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-14 19:53:56 +01:00
Filipe Manana
2cb1a73d1d btrfs: reinitialize delayed ref list after deleting it from the list
commit c9a75ec45f1111ef530ab186c2a7684d0a0c9245 upstream.

At insert_delayed_ref() if we need to update the action of an existing
ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's
ref_add_list using list_del(), which leaves the ref's add_list member
not reinitialized, as list_del() sets the next and prev members of the
list to LIST_POISON1 and LIST_POISON2, respectively.

If later we end up calling drop_delayed_ref() against the ref, which can
happen during merging or when destroying delayed refs due to a transaction
abort, we can trigger a crash since at drop_delayed_ref() we call
list_empty() against the ref's add_list, which returns false since
the list was not reinitialized after the list_del() and as a consequence
we call list_del() again at drop_delayed_ref(). This results in an
invalid list access since the next and prev members are set to poison
pointers, resulting in a splat if CONFIG_LIST_HARDENED and
CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences
otherwise.

So fix this by deleting from the list with list_del_init() instead.

Fixes: 1d57ee9416 ("btrfs: improve delayed refs iterations")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-14 13:15:17 +01:00
Naohiro Aota
deee4bd713 btrfs: zoned: fix zone unusable accounting for freed reserved extent
commit bf9821ba4792a0d9a2e72803ae7b4341faf3d532 upstream.

When btrfs reserves an extent and does not use it (e.g, by an error), it
calls btrfs_free_reserved_extent() to free the reserved extent. In the
process, it calls btrfs_add_free_space() and then it accounts the region
bytes as block_group->zone_unusable.

However, it leaves the space_info->bytes_zone_unusable side not updated. As
a result, ENOSPC can happen while a space_info reservation succeeded. The
reservation is fine because the freed region is not added in
space_info->bytes_zone_unusable, leaving that space as "free". OTOH,
corresponding block group counts it as zone_unusable and its allocation
pointer is not rewound, we cannot allocate an extent from that block group.
That will also negate space_info's async/sync reclaim process, and cause an
ENOSPC error from the extent allocation process.

Fix that by returning the space to space_info->bytes_zone_unusable.
Ideally, since a bio is not submitted for this reserved region, we should
return the space to free space and rewind the allocation pointer. But, it
needs rework on extent allocation handling, so let it work in this way for
now.

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-01 01:56:06 +01:00
Yue Haibing
8a43e8aed9 btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
commit 75f49c3dc7b7423d3734f2e4dabe3dac8d064338 upstream.

The ret may be zero in btrfs_search_dir_index_item() and should not
passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to
this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0.

This fixes smatch warnings:

fs/btrfs/dir-item.c:353
  btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'

Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-01 01:56:06 +01:00
Roi Martin
b37de9491f btrfs: fix uninitialized pointer free on read_alloc_one_name() error
commit 2ab5e243c2266c841e0f6904fad1514b18eaf510 upstream.

The function read_alloc_one_name() does not initialize the name field of
the passed fscrypt_str struct if kmalloc fails to allocate the
corresponding buffer.  Thus, it is not guaranteed that
fscrypt_str.name is initialized when freeing it.

This is a follow-up to the linked patch that fixes the remaining
instances of the bug introduced by commit e43eec81c516 ("btrfs: use
struct qstr instead of name and namelen pairs").

Link: https://lore.kernel.org/linux-btrfs/20241009080833.1355894-1-jroi.martin@gmail.com/
Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Roi Martin <jroi.martin@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-22 15:56:39 +02:00
Roi Martin
12cf028381 btrfs: fix uninitialized pointer free in add_inode_ref()
commit 66691c6e2f18d2aa4b22ffb624b9bdc97e9979e4 upstream.

The add_inode_ref() function does not initialize the "name" struct when
it is declared.  If any of the following calls to "read_one_inode()
returns NULL,

	dir = read_one_inode(root, parent_objectid);
	if (!dir) {
		ret = -ENOENT;
		goto out;
	}

	inode = read_one_inode(root, inode_objectid);
	if (!inode) {
		ret = -EIO;
		goto out;
	}

then "name.name" would be freed on "out" before being initialized.

out:
	...
	kfree(name.name);

This issue was reported by Coverity with CID 1526744.

Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Roi Martin <jroi.martin@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-22 15:56:39 +02:00
Filipe Manana
8409539dbb btrfs: zoned: fix missing RCU locking in error message when loading zone info
[ Upstream commit fe4cd7ed128fe82ab9fe4f9fc8a73d4467699787 ]

At btrfs_load_zone_info() we have an error path that is dereferencing
the name of a device which is a RCU string but we are not holding a RCU
read lock, which is incorrect.

Fix this by using btrfs_err_in_rcu() instead of btrfs_err().

The problem is there since commit 08e11a3db0 ("btrfs: zoned: load zone's
allocation offset"), back then at btrfs_load_block_group_zone_info() but
then later on that code was factored out into the helper
btrfs_load_zone_info() by commit 09a46725cc84 ("btrfs: zoned: factor out
per-zone logic from btrfs_load_block_group_zone_info").

Fixes: 08e11a3db0 ("btrfs: zoned: load zone's allocation offset")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-17 15:22:22 +02:00
Filipe Manana
9da40aea63 btrfs: wait for fixup workers before stopping cleaner kthread during umount
commit 41fd1e94066a815a7ab0a7025359e9b40e4b3576 upstream.

During unmount, at close_ctree(), we have the following steps in this order:

1) Park the cleaner kthread - this doesn't destroy the kthread, it basically
   halts its execution (wake ups against it work but do nothing);

2) We stop the cleaner kthread - this results in freeing the respective
   struct task_struct;

3) We call btrfs_stop_all_workers() which waits for any jobs running in all
   the work queues and then free the work queues.

Syzbot reported a case where a fixup worker resulted in a crash when doing
a delayed iput on its inode while attempting to wake up the cleaner at
btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread
was already freed. This can happen during unmount because we don't wait
for any fixup workers still running before we call kthread_stop() against
the cleaner kthread, which stops and free all its resources.

Fix this by waiting for any fixup workers at close_ctree() before we call
kthread_stop() against the cleaner and run pending delayed iputs.

The stack traces reported by syzbot were the following:

  BUG: KASAN: slab-use-after-free in __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
  Read of size 8 at addr ffff8880272a8a18 by task kworker/u8:3/52

  CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.12.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Workqueue: btrfs-fixup btrfs_work_helper
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:377 [inline]
   print_report+0x169/0x550 mm/kasan/report.c:488
   kasan_report+0x143/0x180 mm/kasan/report.c:601
   __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
   class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
   try_to_wake_up+0xb0/0x1480 kernel/sched/core.c:4154
   btrfs_writepage_fixup_worker+0xc16/0xdf0 fs/btrfs/inode.c:2842
   btrfs_work_helper+0x390/0xc50 fs/btrfs/async-thread.c:314
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
   </TASK>

  Allocated by task 2:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   unpoison_slab_object mm/kasan/common.c:319 [inline]
   __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
   kasan_slab_alloc include/linux/kasan.h:247 [inline]
   slab_post_alloc_hook mm/slub.c:4086 [inline]
   slab_alloc_node mm/slub.c:4135 [inline]
   kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4187
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
   copy_process+0x5d1/0x3d50 kernel/fork.c:2206
   kernel_clone+0x223/0x880 kernel/fork.c:2787
   kernel_thread+0x1bc/0x240 kernel/fork.c:2849
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:765
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Freed by task 61:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
   poison_slab_object mm/kasan/common.c:247 [inline]
   __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
   kasan_slab_free include/linux/kasan.h:230 [inline]
   slab_free_hook mm/slub.c:2343 [inline]
   slab_free mm/slub.c:4580 [inline]
   kmem_cache_free+0x1a2/0x420 mm/slub.c:4682
   put_task_struct include/linux/sched/task.h:144 [inline]
   delayed_put_task_struct+0x125/0x300 kernel/exit.c:228
   rcu_do_batch kernel/rcu/tree.c:2567 [inline]
   rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
   handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
   __do_softirq kernel/softirq.c:588 [inline]
   invoke_softirq kernel/softirq.c:428 [inline]
   __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
   irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
   instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1037 [inline]
   sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1037
   asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702

  Last potentially related work creation:
   kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
   __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541
   __call_rcu_common kernel/rcu/tree.c:3086 [inline]
   call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
   context_switch kernel/sched/core.c:5318 [inline]
   __schedule+0x184b/0x4ae0 kernel/sched/core.c:6675
   schedule_idle+0x56/0x90 kernel/sched/core.c:6793
   do_idle+0x56a/0x5d0 kernel/sched/idle.c:354
   cpu_startup_entry+0x42/0x60 kernel/sched/idle.c:424
   start_secondary+0x102/0x110 arch/x86/kernel/smpboot.c:314
   common_startup_64+0x13e/0x147

  The buggy address belongs to the object at ffff8880272a8000
   which belongs to the cache task_struct of size 7424
  The buggy address is located 2584 bytes inside of
   freed 7424-byte region [ffff8880272a8000, ffff8880272a9d00)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x272a8
  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
  flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
  page_type: f5(slab)
  raw: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
  head: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
  head: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
  head: 00fff00000000003 ffffea00009caa01 ffffffffffffffff 0000000000000000
  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  page_owner tracks the page as allocated
  page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 2, tgid 2 (kthreadd), ts 71247381401, free_ts 71214998153
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
   prep_new_page mm/page_alloc.c:1545 [inline]
   get_page_from_freelist+0x3039/0x3180 mm/page_alloc.c:3457
   __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4733
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   alloc_slab_page+0x6a/0x120 mm/slub.c:2413
   allocate_slab+0x5a/0x2f0 mm/slub.c:2579
   new_slab mm/slub.c:2632 [inline]
   ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3819
   __slab_alloc+0x58/0xa0 mm/slub.c:3909
   __slab_alloc_node mm/slub.c:3962 [inline]
   slab_alloc_node mm/slub.c:4123 [inline]
   kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4187
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
   copy_process+0x5d1/0x3d50 kernel/fork.c:2206
   kernel_clone+0x223/0x880 kernel/fork.c:2787
   kernel_thread+0x1bc/0x240 kernel/fork.c:2849
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:765
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
  page last free pid 5230 tgid 5230 stack trace:
   reset_page_owner include/linux/page_owner.h:25 [inline]
   free_pages_prepare mm/page_alloc.c:1108 [inline]
   free_unref_page+0xcd0/0xf00 mm/page_alloc.c:2638
   discard_slab mm/slub.c:2678 [inline]
   __put_partials+0xeb/0x130 mm/slub.c:3146
   put_cpu_partial+0x17c/0x250 mm/slub.c:3221
   __slab_free+0x2ea/0x3d0 mm/slub.c:4450
   qlink_free mm/kasan/quarantine.c:163 [inline]
   qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
   kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
   __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
   kasan_slab_alloc include/linux/kasan.h:247 [inline]
   slab_post_alloc_hook mm/slub.c:4086 [inline]
   slab_alloc_node mm/slub.c:4135 [inline]
   kmem_cache_alloc_noprof+0x135/0x2a0 mm/slub.c:4142
   getname_flags+0xb7/0x540 fs/namei.c:139
   do_sys_openat2+0xd2/0x1d0 fs/open.c:1409
   do_sys_open fs/open.c:1430 [inline]
   __do_sys_openat fs/open.c:1446 [inline]
   __se_sys_openat fs/open.c:1441 [inline]
   __x64_sys_openat+0x247/0x2a0 fs/open.c:1441
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Memory state around the buggy address:
   ffff8880272a8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880272a8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880272a8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880272a8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880272a8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Reported-by: syzbot+8aaf2df2ef0164ffe1fb@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/66fb36b1.050a0220.aab67.003b.GAE@google.com/
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17 15:22:02 +02:00
Filipe Manana
c71eb38a04 btrfs: send: fix invalid clone operation for file that got its size decreased
commit fa630df665aa9ddce3a96ce7b54e10a38e4d2a2b upstream.

During an incremental send we may end up sending an invalid clone
operation, for the last extent of a file which ends at an unaligned offset
that matches the final i_size of the file in the send snapshot, in case
the file had its initial size (the size in the parent snapshot) decreased
in the send snapshot. In this case the destination will fail to apply the
clone operation because its end offset is not sector size aligned and it
ends before the current size of the file.

Sending the truncate operation always happens when we finish processing an
inode, after we process all its extents (and xattrs, names, etc). So fix
this by ensuring the file has a valid size before we send a clone
operation for an unaligned extent that ends at the final i_size of the
file. The size we truncate to matches the start offset of the clone range
but it could be any value between that start offset and the final size of
the file since the clone operation will expand the i_size if the current
size is smaller than the end offset. The start offset of the range was
chosen because it's always sector size aligned and avoids a truncation
into the middle of a page, which results in dirtying the page due to
filling part of it with zeroes and then making the clone operation at the
receiver trigger IO.

The following test reproduces the issue:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Create a file with a size of 256K + 5 bytes, having two extents, one
  # with a size of 128K and another one with a size of 128K + 5 bytes.
  last_ext_size=$((128 * 1024 + 5))
  xfs_io -f -d -c "pwrite -S 0xab -b 128K 0 128K" \
         -c "pwrite -S 0xcd -b $last_ext_size 128K $last_ext_size" \
         $MNT/foo

  # Another file which we will later clone foo into, but initially with
  # a larger size than foo.
  xfs_io -f -c "pwrite -S 0xef 0 1M" $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap1

  # Now resize bar and clone foo into it.
  xfs_io -c "truncate 0" \
         -c "reflink $MNT/foo" $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap2

  rm -f /tmp/send-full /tmp/send-inc
  btrfs send -f /tmp/send-full $MNT/snap1
  btrfs send -p $MNT/snap1 -f /tmp/send-inc $MNT/snap2

  umount $MNT
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  btrfs receive -f /tmp/send-full $MNT
  btrfs receive -f /tmp/send-inc $MNT

  umount $MNT

Running it before this patch:

  $ ./test.sh
  (...)
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to bar: Invalid argument

A test case for fstests will be sent soon.

Reported-by: Ben Millwood <thebenmachine@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJhrHS2z+WViO2h=ojYvBPDLsATwLbg+7JaNCyYomv0fUxEpQQ@mail.gmail.com/
Fixes: 46a6e10a1ab1 ("btrfs: send: allow cloning non-aligned extent if it ends at i_size")
CC: stable@vger.kernel.org # 6.11
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17 15:22:02 +02:00
Qu Wenruo
d13249c0df btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
commit c3b47f49e83197e8dffd023ec568403bcdbb774b upstream.

[BUG]
Syzbot reported a NULL pointer dereference with the following crash:

  FAULT_INJECTION: forcing a failure.
   start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676
   prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642
   relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678
  ...
  BTRFS info (device loop0): balance: ended with status: -12
  Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667]
  RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926
  Call Trace:
   <TASK>
   commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496
   btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430
   del_balance_item fs/btrfs/volumes.c:3678 [inline]
   reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742
   btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574
   btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:907 [inline]
   __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

[CAUSE]
The allocation failure happens at the start_transaction() inside
prepare_to_relocate(), and during the error handling we call
unset_reloc_control(), which makes fs_info->balance_ctl to be NULL.

Then we continue the error path cleanup in btrfs_balance() by calling
reset_balance_state() which will call del_balance_item() to fully delete
the balance item in the root tree.

However during the small window between set_reloc_contrl() and
unset_reloc_control(), we can have a subvolume tree update and created a
reloc_root for that subvolume.

Then we go into the final btrfs_commit_transaction() of
del_balance_item(), and into btrfs_update_reloc_root() inside
commit_fs_roots().

That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer
dereference.

[FIX]
Just add extra check on fs_info->reloc_ctl inside
btrfs_update_reloc_root(), before checking
fs_info->reloc_ctl->merge_reloc_tree.

That DEAD_RELOC_TREE handling is to prevent further modification to the
reloc tree during merge stage, but since there is no reloc_ctl at all,
we do not need to bother that.

Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17 15:22:01 +02:00
Filipe Manana
563df8b411 btrfs: calculate the right space for delayed refs when updating global reserve
commit f8f210dc84709804c9f952297f2bfafa6ea6b4bd upstream.

When updating the global block reserve, we account for the 6 items needed
by an unlink operation and the 6 delayed references for each one of those
items. However the calculation for the delayed references is not correct
in case we have the free space tree enabled, as in that case we need to
touch the free space tree as well and therefore need twice the number of
bytes. So use the btrfs_calc_delayed_ref_bytes() helper to calculate the
number of bytes need for the delayed references at
btrfs_update_global_block_rsv().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Diogo: this patch has been cherry-picked from the original commit;
conflicts included lack of a define (picked from commit 5630e2bcfe223)
and lack of btrfs_calc_delayed_ref_bytes (picked from commit 0e55a54502b97)
- changed const struct -> struct for compatibility.]
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Jeff Layton
5b7e3e9c2c btrfs: update target inode's ctime on unlink
[ Upstream commit 3bc2ac2f8f0b78a13140fc72022771efe0c9b778 ]

Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.

According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:

"Upon successful completion, unlink() shall mark for update the last data
 modification and last file status change timestamps of the parent
 directory. Also, if the file's link count is not 0, the last file status
 change timestamp of the file shall be marked for update."

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-18 19:23:02 +02:00
Filipe Manana
cd3087582e btrfs: fix race between direct IO write and fsync when using same fd
commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream.

If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:

1) Attempt a fsync without holding the inode's lock, triggering an
   assertion failures when assertions are enabled;

2) Do an invalid memory access from the fsync task because the file private
   points to memory allocated on stack by the direct IO task and it may be
   used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes,
   while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread
   doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the
   file's private to an on stack allocated private with the member
   'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private
   structure associated to the file which has 'fsync_skip_inode_lock' set
   to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to
   NULL since it had no prior private and our private was stack allocated.
   Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
   assertion that checks the inode's VFS lock is held fails, since task B
   never locked it and task A has already unlocked it.

The stack trace produced is the following:

   assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:983!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
   CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
   Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
   RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
   Code: 50 d6 86 c0 e8 (...)
   RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
   RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
   RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
   R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
   R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
   FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
   Call Trace:
    <TASK>
    ? __die_body.cold+0x14/0x24
    ? die+0x2e/0x50
    ? do_trap+0xca/0x110
    ? do_error_trap+0x6a/0x90
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? exc_invalid_op+0x50/0x70
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? asm_exc_invalid_op+0x1a/0x20
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? __seccomp_filter+0x31d/0x4f0
    __x64_sys_fdatasync+0x4f/0x90
    do_syscall_64+0x82/0x160
    ? do_futex+0xcb/0x190
    ? __x64_sys_futex+0x10e/0x1d0
    ? switch_fpu_return+0x4f/0xd0
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.

This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

   $ cat repro.c
   /* Get the O_DIRECT definition. */
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>
   #include <stdint.h>
   #include <fcntl.h>
   #include <errno.h>
   #include <string.h>
   #include <pthread.h>

   static int fd;

   static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
   {
       while (count > 0) {
           ssize_t ret;

           ret = pwrite(fd, buf, count, offset);
           if (ret < 0) {
               if (errno == EINTR)
                   continue;
               return ret;
           }
           count -= ret;
           buf += ret;
       }
       return 0;
   }

   static void *fsync_loop(void *arg)
   {
       while (1) {
           int ret;

           ret = fsync(fd);
           if (ret != 0) {
               perror("Fsync failed");
               exit(6);
           }
       }
   }

   int main(int argc, char *argv[])
   {
       long pagesize;
       void *write_buf;
       pthread_t fsyncer;
       int ret;

       if (argc != 2) {
           fprintf(stderr, "Use: %s <file path>\n", argv[0]);
           return 1;
       }

       fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
       if (fd == -1) {
           perror("Failed to open/create file");
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       if (pagesize == -1) {
           perror("Failed to get page size");
           return 2;
       }

       ret = posix_memalign(&write_buf, pagesize, pagesize);
       if (ret) {
           perror("Failed to allocate buffer");
           return 3;
       }

       ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
       if (ret != 0) {
           fprintf(stderr, "Failed to create writer thread: %d\n", ret);
           return 4;
       }

       while (1) {
           ret = do_write(fd, write_buf, pagesize, 0);
           if (ret != 0) {
               perror("Write failed");
               exit(5);
           }
       }

       return 0;
   }

   $ mkfs.btrfs -f /dev/sdi
   $ mount /dev/sdi /mnt/sdi
   $ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12 11:10:29 +02:00
David Sterba
995057208d btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
[ Upstream commit b8e947e9f64cac9df85a07672b658df5b2bcff07 ]

Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.

   >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
   +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5603:9
   >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
   +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5674:5

   m68k-gcc8/m68k-allmodconfig
   mips-gcc8/mips-allmodconfig
   powerpc-gcc5/powerpc-all{mod,yes}config
   powerpc-gcc5/ppc64_defconfig

Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-12 11:10:24 +02:00
Filipe Manana
0fbac73a97 btrfs: replace BUG_ON() with error handling at update_ref_for_cow()
[ Upstream commit b56329a782314fde5b61058e2a25097af7ccb675 ]

Instead of a BUG_ON() just return an error, log an error message and
abort the transaction in case we find an extent buffer belonging to the
relocation tree that doesn't have the full backref flag set. This is
unexpected and should never happen (save for bugs or a potential bad
memory).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-12 11:10:24 +02:00
Josef Bacik
9cc887ac24 btrfs: clean up our handling of refs == 0 in snapshot delete
[ Upstream commit b8ccef048354074a548f108e51d0557d6adfd3a3 ]

In reada we BUG_ON(refs == 0), which could be unkind since we aren't
holding a lock on the extent leaf and thus could get a transient
incorrect answer.  In walk_down_proc we also BUG_ON(refs == 0), which
could happen if we have extent tree corruption.  Change that to return
-EUCLEAN.  In do_walk_down() we catch this case and handle it correctly,
however we return -EIO, which -EUCLEAN is a more appropriate error code.
Finally in walk_up_proc we have the same BUG_ON(refs == 0), so convert
that to proper error handling.  Also adjust the error message so we can
actually do something with the information.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-12 11:10:24 +02:00
Josef Bacik
75a428e1fd btrfs: replace BUG_ON with ASSERT in walk_down_proc()
[ Upstream commit 1f9d44c0a12730a24f8bb75c5e1102207413cc9b ]

We have a couple of areas where we check to make sure the tree block is
locked before looking up or messing with references.  This is old code
so it has this as BUG_ON().  Convert this to ASSERT() for developers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-12 11:10:24 +02:00
Filipe Manana
c1cc3326e2 btrfs: fix extent map use-after-free when adding pages to compressed bio
commit 8e7860543a94784d744c7ce34b78a2e11beefa5c upstream.

At add_ra_bio_pages() we are accessing the extent map to calculate
'add_size' after we dropped our reference on the extent map, resulting
in a use-after-free. Fix this by computing 'add_size' before dropping our
extent map reference.

Reported-by: syzbot+853d80cba98ce1157ae6@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000038144061c6d18f2@google.com/
Fixes: 6a40491020 ("btrfs: subpage: make add_ra_bio_pages() compatible")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Brennan Lamoreaux <brennan.lamoreaux@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-04 13:25:00 +02:00
Josef Bacik
91ca04cc39 btrfs: run delayed iputs when flushing delalloc
commit 2d3447261031503b181dacc549fe65ffe2d93d65 upstream.

We have transient failures with btrfs/301, specifically in the part
where we do

  for i in $(seq 0 10); do
	  write 50m to file
	  rm -f file
  done

Sometimes this will result in a transient quota error, and it's because
sometimes we start writeback on the file which results in a delayed
iput, and thus the rm doesn't actually clean the file up.  When we're
flushing the quota space we need to run the delayed iputs to make sure
all the unlinks that we think have completed have actually completed.
This removes the small window where we could fail to find enough space
in our quota.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-04 13:24:55 +02:00
Filipe Manana
0660f6c69c btrfs: send: allow cloning non-aligned extent if it ends at i_size
[ Upstream commit 46a6e10a1ab16cc71d4a3cab73e79aabadd6b8ea ]

If we a find that an extent is shared but its end offset is not sector
size aligned, then we don't clone it and issue write operations instead.
This is because the reflink (remap_file_range) operation does not allow
to clone unaligned ranges, except if the end offset of the range matches
the i_size of the source and destination files (and the start offset is
sector size aligned).

While this is not incorrect because send can only guarantee that a file
has the same data in the source and destination snapshots, it's not
optimal and generates confusion and surprising behaviour for users.

For example, running this test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Use a file size not aligned to any possible sector size.
  file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes
  dd if=/dev/random of=$MNT/foo bs=$file_size count=1
  cp --reflink=always $MNT/foo $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap
  rm -f /tmp/send-test
  btrfs send -f /tmp/send-test $MNT/snap

  umount $MNT
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  btrfs receive -vv -f /tmp/send-test $MNT

  xfs_io -r -c "fiemap -v" $MNT/snap/bar

  umount $MNT

Gives the following result:

  (...)
  mkfile o258-7-0
  rename o258-7-0 -> bar
  write bar - offset=0 length=49152
  write bar - offset=49152 length=49152
  write bar - offset=98304 length=49152
  write bar - offset=147456 length=49152
  write bar - offset=196608 length=49152
  write bar - offset=245760 length=49152
  write bar - offset=294912 length=49152
  write bar - offset=344064 length=49152
  write bar - offset=393216 length=49152
  write bar - offset=442368 length=49152
  write bar - offset=491520 length=49152
  write bar - offset=540672 length=49152
  write bar - offset=589824 length=49152
  write bar - offset=638976 length=49152
  write bar - offset=688128 length=49152
  write bar - offset=737280 length=49152
  write bar - offset=786432 length=49152
  write bar - offset=835584 length=49152
  write bar - offset=884736 length=49152
  write bar - offset=933888 length=49152
  write bar - offset=983040 length=49152
  write bar - offset=1032192 length=16389
  chown bar - uid=0, gid=0
  chmod bar - mode=0644
  utimes bar
  utimes
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7
  /mnt/sdi/snap/bar:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..2055]:       26624..28679      2056   0x1

There's no clone operation to clone extents from the file foo into file
bar and fiemap confirms there's no shared flag (0x2000).

So update send_write_or_clone() so that it proceeds with cloning if the
source and destination ranges end at the i_size of the respective files.

After this changes the result of the test is:

  (...)
  mkfile o258-7-0
  rename o258-7-0 -> bar
  clone bar - source=foo source offset=0 offset=0 length=1048581
  chown bar - uid=0, gid=0
  chmod bar - mode=0644
  utimes bar
  utimes
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7
  /mnt/sdi/snap/bar:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..2055]:       26624..28679      2056 0x2001

A test case for fstests will also follow up soon.

Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:42 +02:00
David Sterba
95eda3e46c btrfs: replace sb::s_blocksize by fs_info::sectorsize
[ Upstream commit 4e00422ee62663e31e611d7de4d2c4aa3f8555f2 ]

The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.

Replace all uses, in most cases it's fewer pointer indirections.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 46a6e10a1ab1 ("btrfs: send: allow cloning non-aligned extent if it ends at i_size")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:42 +02:00
David Sterba
c2adaaad83 btrfs: delete pointless BUG_ON check on quota root in btrfs_qgroup_account_extent()
[ Upstream commit f40a3ea94881f668084f68f6b9931486b1606db0 ]

The BUG_ON is deep in the qgroup code where we can expect that it
exists. A NULL pointer would cause a crash.

It was added long ago in 550d7a2ed5 ("btrfs: qgroup: Add new qgroup
calculation function btrfs_qgroup_account_extents()."). It maybe made
sense back then as the quota enable/disable state machine was not that
robust as it is nowadays, so we can just delete it.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:38 +02:00
David Sterba
2df142f7b3 btrfs: change BUG_ON to assertion in tree_move_down()
[ Upstream commit 56f335e043ae73c32dbb70ba95488845dc0f1e6e ]

There's only one caller of tree_move_down() that does not pass level 0
so the assertion is better suited here.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:38 +02:00
David Sterba
c5384273ce btrfs: send: handle unexpected data in header buffer in begin_cmd()
[ Upstream commit e80e3f732cf53c64b0d811e1581470d67f6c3228 ]

Change BUG_ON to a proper error handling in the unlikely case of seeing
data when the command is started. This is supposed to be reset when the
command is finished (send_cmd, send_encoded_extent).

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:37 +02:00
David Sterba
8695535e3e btrfs: handle invalid root reference found in may_destroy_subvol()
[ Upstream commit 6fbc6f4ac1f4907da4fc674251527e7dc79ffbf6 ]

The may_destroy_subvol() looks up a root by a key, allowing to do an
inexact search when key->offset is -1.  It's never expected to find such
item, as it would break the allowed range of a root id.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:37 +02:00
David Sterba
d483da03a6 btrfs: tests: allocate dummy fs_info and root in test_find_delalloc()
[ Upstream commit b2136cc288fce2f24a92f3d656531b2d50ebec5a ]

Allocate fs_info and root to have a valid fs_info pointer in case it's
dereferenced by a helper outside of tests, like find_lock_delalloc_range().

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:37 +02:00
David Sterba
1f19860ad2 btrfs: change BUG_ON to assertion when checking for delayed_node root
[ Upstream commit be73f4448b607e6b7ce41cd8ef2214fdf6e7986f ]

The pointer to root is initialized in btrfs_init_delayed_node(), no need
to check for it again. Change the BUG_ON to assertion.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29 17:30:37 +02:00