a1.2
1648 Commits
9caa207980
btrfs: rename and export __btrfs_cow_block()
[ Upstream commit 95f93bc4cbcac6121a5ee85cd5019ee8e7447e0b ]

Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This allows moving defrag-specific code out of ctree.c and into defrag.c in one of the next patches.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 44f52bbe96df ("btrfs: fix use-after-free when COWing tree bock and tracing is enabled")
Signed-off-by: Sasha Levin <sashal@kernel.org>

---

945559be6e
btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
commit 74e97958121aa1f5854da6effba70143f051b0cd upstream.
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated with a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.
However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated with a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.
The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:
1. any failure before record_root_in_trans is called successfully
results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
transaction owns freeing the reservation.
This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.

Fixes:
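
The lifecycle rule enforced here can be captured in a short sketch. The following is a minimal userspace model, not the kernel code: record_root_in_trans() mirrors the function named above, while the other helpers are hypothetical stand-ins for the qgroup reservation calls.

```c
#include <stdio.h>

/* Simulate a failure before the root is recorded in the transaction. */
static int record_root_in_trans(void) { return -1; }

static void qgroup_free_prealloc(void)       { puts("freed PREALLOC rsv"); }
static void qgroup_convert_to_pertrans(void) { puts("converted to PERTRANS"); }
static void transaction_end(void)            { puts("transaction frees PERTRANS in bulk"); }

static int do_subvolume_op(void)
{
	/* A PREALLOC qgroup metadata reservation is held at this point. */
	int ret = record_root_in_trans();

	if (ret) {
		/* Property 1: any failure before record_root_in_trans()
		 * succeeds means we free the PREALLOC rsv ourselves. */
		qgroup_free_prealloc();
		return ret;
	}
	/* Property 2: the root is recorded, so convert to PERTRANS;
	 * from here on the transaction owns freeing the reservation. */
	qgroup_convert_to_pertrans();
	return 0;
}

int main(void)
{
	if (do_subvolume_op() == 0)
		transaction_end();
	return 0;
}
```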

---

cd3087582e
btrfs: fix race between direct IO write and fsync when using same fd
commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream.

If we have 2 threads that are using the same file descriptor and one of them is doing direct IO writes while the other is doing fsync, we have a race where we can end up either:

1) Attempting an fsync without holding the inode's lock, triggering an assertion failure when assertions are enabled;

2) Doing an invalid memory access from the fsync task because the file private points to memory allocated on the stack by the direct IO task and it may be used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes, while the other calls fsync using the same file descriptor;

4) Call task A the thread doing direct IO writes and task B the thread doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the file's private to an on-stack allocated private with the member 'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private structure associated to the file which has 'fsync_skip_inode_lock' set to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to NULL since it had no prior private and our private was stack allocated. Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the assertion that checks the inode's VFS lock is held fails, since task B never locked it and task A has already unlocked it.

The stack trace produced is the following:

```
assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
 <TASK>
 ? __die_body.cold+0x14/0x24
 ? die+0x2e/0x50
 ? do_trap+0xca/0x110
 ? do_error_trap+0x6a/0x90
 ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
 ? exc_invalid_op+0x50/0x70
 ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
 ? asm_exc_invalid_op+0x1a/0x20
 ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
 ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
 btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
 ? __seccomp_filter+0x31d/0x4f0
 __x64_sys_fdatasync+0x4f/0x90
 do_syscall_64+0x82/0x160
 ? do_futex+0xcb/0x190
 ? __x64_sys_futex+0x10e/0x1d0
 ? switch_fpu_return+0x4f/0xd0
 ? syscall_exit_to_user_mode+0x72/0x220
 ? do_syscall_64+0x8e/0x160
 ? syscall_exit_to_user_mode+0x72/0x220
 ? do_syscall_64+0x8e/0x160
 ? syscall_exit_to_user_mode+0x72/0x220
 ? do_syscall_64+0x8e/0x160
 ? syscall_exit_to_user_mode+0x72/0x220
 ? do_syscall_64+0x8e/0x160
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
```

Another problem here is if task B grabs the private pointer and then uses it after task A has finished: since the private was allocated on the stack of task A, it results in some invalid memory access with a hard to predict result.

This issue, triggering the assertion, was observed with QEMU workloads by two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync that it should skip locking the inode, and instead pass this information through a special value stored in current->journal_info. This is safe because in the relevant section of the direct IO write path we are not holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

$ cat repro.c

```c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>

static int fd;

static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
	while (count > 0) {
		ssize_t ret;

		ret = pwrite(fd, buf, count, offset);
		if (ret < 0) {
			if (errno == EINTR)
				continue;
			return ret;
		}
		count -= ret;
		buf += ret;
	}
	return 0;
}

static void *fsync_loop(void *arg)
{
	while (1) {
		int ret;

		ret = fsync(fd);
		if (ret != 0) {
			perror("Fsync failed");
			exit(6);
		}
	}
}

int main(int argc, char *argv[])
{
	long pagesize;
	void *write_buf;
	pthread_t fsyncer;
	int ret;

	if (argc != 2) {
		fprintf(stderr, "Use: %s <file path>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
	if (fd == -1) {
		perror("Failed to open/create file");
		return 1;
	}

	pagesize = sysconf(_SC_PAGE_SIZE);
	if (pagesize == -1) {
		perror("Failed to get page size");
		return 2;
	}

	ret = posix_memalign(&write_buf, pagesize, pagesize);
	if (ret) {
		perror("Failed to allocate buffer");
		return 3;
	}

	ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
	if (ret != 0) {
		fprintf(stderr, "Failed to create writer thread: %d\n", ret);
		return 4;
	}

	while (1) {
		ret = do_write(fd, write_buf, pagesize, 0);
		if (ret != 0) {
			perror("Write failed");
			exit(5);
		}
	}

	return 0;
}
```

```
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo
```

Usually the race is triggered within less than 1 second. A test case for fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
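
The core idea of the fix, moving the "skip the inode lock" hint out of the shared file private and into task-local state, can be modeled in a few lines. This is a hedged sketch under the assumption that a thread-local pointer stands in for current->journal_info; DIO_SYNC_SENTINEL and the other names are illustrative, not the kernel's.

```c
#include <stdio.h>

#define DIO_SYNC_SENTINEL ((void *)1)

/* Models current->journal_info: visible only to the current task. */
static _Thread_local void *journal_info;

static void sync_file(void)
{
	/* Only a value set by this very task can be seen here, so a
	 * concurrent fsync from another thread can never observe it. */
	if (journal_info == DIO_SYNC_SENTINEL)
		puts("fsync: skipping inode lock (same-task direct IO)");
	else
		puts("fsync: taking inode lock");
}

static void direct_write(void)
{
	journal_info = DIO_SYNC_SENTINEL;	/* no transaction handle here */
	sync_file();				/* sync invoked from this task */
	journal_info = NULL;
}

int main(void)
{
	direct_write();		/* skips the lock */
	sync_file();		/* a different context: takes the lock */
	return 0;
}
```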

---

6cae8d04d8
btrfs: fix corruption after buffer fault in during direct IO append write
commit 939b656bc8ab203fdbde26ccac22bcb7f0985be5 upstream.

During an append (O_APPEND write flag) direct IO write, if the input buffer was not previously faulted in, we can corrupt the file in a way that the final size is unexpected and it includes an unexpected hole.

The problem happens like this:

1) We have an empty file, with size 0, for example;

2) We do an O_APPEND direct IO with a length of 4096 bytes and the input buffer is not currently faulted in;

3) We enter btrfs_direct_write(), lock the inode and call generic_write_checks(), which calls generic_write_checks_count(), and that function sets the iocb position to 0 with the following code:

	if (iocb->ki_flags & IOCB_APPEND)
		iocb->ki_pos = i_size_read(inode);

4) We call btrfs_dio_write() and enter into iomap, which will end up calling btrfs_dio_iomap_begin() and that calls btrfs_get_blocks_direct_write(), where we update the i_size of the inode to 4096 bytes;

5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access the page of the write input buffer (at iomap_dio_bio_iter(), with a call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets returned to btrfs at btrfs_direct_write() via btrfs_dio_write();

6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode, fault in the write buffer and then jump to the label 'relock';

7) We lock again the inode, do all the necessary checks again and call again generic_write_checks(), which calls generic_write_checks_count() again, and there we set the iocb's position to 4K, which is the current i_size of the inode, with the following code pointed above:

	if (iocb->ki_flags & IOCB_APPEND)
		iocb->ki_pos = i_size_read(inode);

8) Then we go again to btrfs_dio_write() and enter iomap and the write succeeds, but it wrote to the file range [4K, 8K), leaving a hole in the [0, 4K) range and an i_size of 8K, which goes against the expectations of having the data written to the range [0, 4K) and getting an i_size of 4K.

Fix this by not unlocking the inode before faulting in the input buffer, in case we get -EFAULT or an incomplete write, and by not jumping to the 'relock' label after faulting in the buffer - instead jump to a location immediately before calling iomap, skipping all the write checks and relocking. This solves this problem and it's fine even in case the input buffer is memory mapped to the same file range: since only holding the range locked in the inode's io tree can cause a deadlock, it's safe to keep the inode lock (VFS lock), as was fixed and described in commit
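
The effect of re-running the write checks after i_size has already grown can be shown with a few lines of arithmetic. A minimal sketch, assuming the 4K append into an empty file from the steps above; all variables are illustrative.

```c
#include <stdio.h>

int main(void)
{
	long i_size = 0, len = 4096;

	/* First attempt: O_APPEND sets pos = i_size = 0, then the iomap
	 * begin callback grows i_size to 4K before -EFAULT is hit. */
	long pos = i_size;
	i_size = pos + len;

	/* Buggy retry: jumping back to the write checks recomputes
	 * pos from the already-updated i_size. */
	pos = i_size;
	printf("retry writes [%ld, %ld), leaving a hole at [0, %ld)\n",
	       pos, pos + len, pos);

	/* The fix faults the buffer in without redoing the write checks,
	 * so the retry keeps pos = 0 and no hole is created. */
	return 0;
}
```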

---

2aa515b5b5
btrfs: fix infinite directory reads
commit 9b378f6ad48cfa195ed868db9123c09ee7ec5ea2 upstream.

The readdir implementation currently always processes up to the last index it finds. This however can result in an infinite loop if the directory has a large number of entries such that they won't all fit in the given buffer passed to the readdir callback, that is, dir_emit() returns a non-zero value. Because in that case readdir() will be called again, and if in the meanwhile new directory entries were added and we still can't put all the remaining entries in the buffer, we keep repeating this over and over.

The following C program and test script reproduce the problem:

$ cat /mnt/readdir_prog.c

```c
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
	DIR *dir = opendir(".");
	struct dirent *dd;

	while ((dd = readdir(dir))) {
		printf("%s\n", dd->d_name);
		rename(dd->d_name, "TEMPFILE");
		rename("TEMPFILE", dd->d_name);
	}
	closedir(dir);
}
```

$ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c

$ cat test.sh

```bash
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV &> /dev/null
#mkfs.xfs -f $DEV &> /dev/null
#mkfs.ext4 -F $DEV &> /dev/null

mount $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= 2000; i++)); do
	echo -n > $MNT/testdir/file_$i
done

cd $MNT/testdir
/mnt/readdir_prog

cd /mnt
umount $MNT
```

This behaviour is surprising to applications and it's unlike ext4, xfs, tmpfs, vfat and other filesystems, which always finish. In this case where new entries were added due to renames, some file names may be reported more than once, but this varies according to each filesystem - for example ext4 never reported the same file more than once while xfs reports the first 13 file names twice.

So change our readdir implementation to track the last index number when opendir() is called, and then make readdir() never process beyond that index number. This gives the same behaviour as ext4.

Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 5.15
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
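
The approach of the fix can be sketched as a snapshot of the directory's last index taken when the directory is opened. This is a self-contained model, not the btrfs implementation; every name below is made up for illustration.

```c
#include <stdio.h>

struct dir_ctx {
	unsigned long long last_index;	/* snapshot taken at opendir */
};

static unsigned long long next_dir_index = 5;	/* grows on create/rename */

static void model_opendir(struct dir_ctx *ctx)
{
	ctx->last_index = next_dir_index;
}

static int model_readdir(const struct dir_ctx *ctx, unsigned long long idx)
{
	if (idx > ctx->last_index)
		return 0;	/* entry added after opendir: stop, don't loop */
	printf("emit entry at index %llu\n", idx);
	return 1;
}

int main(void)
{
	struct dir_ctx ctx;
	unsigned long long idx = 0;

	model_opendir(&ctx);
	while (model_readdir(&ctx, idx)) {
		next_dir_index++;	/* a rename adds a new, higher index */
		idx++;
	}
	return 0;	/* terminates despite the growing directory */
}
```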

---

ef54db5b5d
btrfs: use struct fscrypt_str instead of struct qstr
[ Upstream commit 6db75318823a169e836a478ca57d6a7c0a156b77 ]

While struct qstr is more natural without fscrypt, since it's provided by dentries, struct fscrypt_str is provided by the fscrypt handlers processing dentries, and is thus more natural in the fscrypt world. Replace all of the struct qstr uses with struct fscrypt_str.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write")
Signed-off-by: Sasha Levin <sashal@kernel.org>

---

68ad364ec8
btrfs: setup qstr from dentrys using fscrypt helper
[ Upstream commit ab3c5c18e8fa3f8ea116016095d25adab466cd39 ]

Most places where we get a struct qstr, we are doing so from a dentry. With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt provides a helper to convert a dentry name to the appropriate disk name if necessary. Convert each of the dentry name accesses to use fscrypt_setup_filename(), then convert the resulting fscrypt_name back to an unencrypted qstr. This does not work for nokey names, but the specific locations that could spawn nokey names are noted.

At present, since there are no encrypted directories, nothing goes down the filename encryption paths.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write")
Signed-off-by: Sasha Levin <sashal@kernel.org>

---

1cf474cd47
btrfs: use struct qstr instead of name and namelen pairs
[ Upstream commit e43eec81c5167b655b72c781b0e75e62a05e415e ]

Many functions throughout btrfs take name buffer and name length arguments. Most of these functions at the highest level are usually called with these arguments extracted from a supplied dentry's name. But the entire name can be passed instead, making each function a little more elegant.

Each function whose arguments are currently the name and length extracted from a dentry is herein converted to instead take a pointer to the name in the dentry. The couple of calls made without a struct dentry are converted to create an appropriate qstr to pass in. Additionally, every function which is only called with a name/len extracted directly from a qstr is also converted.

This change has a positive effect on stack consumption: the frame size of many functions is reduced. It will also be used in the future for fscrypt-related structures.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write")
Signed-off-by: Sasha Levin <sashal@kernel.org>

---

ea34b8bcc7
btrfs: rename BTRFS_FS_NO_OVERCOMMIT to BTRFS_FS_ACTIVE_ZONE_TRACKING
[ Upstream commit bf1f1fec2724a33b67ec12032402ea75f2a83622 ]

This flag only gets set when we're doing active zone tracking, and we're going to need to use this flag for things related to this behavior. Rename the flag to represent what it actually means for the file system so it can be used in other ways and still make sense.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

---

c19bd0d897
btrfs: zoned: enable metadata over-commit for non-ZNS setup
[ Upstream commit 85e79ec7b78f863178ca488fd8cb5b3de6347756 ]

The commit

---

8bb808c6ad
btrfs: don't print stack trace when transaction is aborted due to ENOMEM
Add ENOMEM among the error codes that don't print a stack trace on transaction abort. We've got several reports from syzbot, which detects the stack traces as errors even though they are caused by limiting memory. As this is an artificial condition we don't need to know where exactly the error happens; the abort and error cleanup will continue like e.g. for EIO.

As the transaction abort code needs to be inlined in a lot of code, the implementation cares about minimal bloat. The error codes are in a separate function and the WARN uses the condition directly. This increases the code size by 571 bytes on a release build. Alternatives considered: adding -ENOMEM among the errors directly increases the size by 2340 bytes; various attempts to combine the WARN and helper calls increase it by 700 or more bytes.

Example syzbot reports (error -12):

- https://syzkaller.appspot.com/bug?extid=5244d35be7f589cf093e
- https://syzkaller.appspot.com/bug?extid=9c37714c07194d816417

Signed-off-by: David Sterba <dsterba@suse.com>
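
A sketch of the out-of-line filter this describes: one small function owns the list of codes that suppress the stack trace, so the inlined abort path stays tiny. The helper name and the exact set of codes below are assumptions for illustration; the point of the commit is only that ENOMEM joins the quiet set.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static bool abort_should_print_stack(int error)
{
	switch (error) {
	case -EIO:
	case -EROFS:
	case -ENOMEM:	/* artificial memory pressure: no trace needed */
		return false;
	default:
		return true;
	}
}

int main(void)
{
	printf("print stack for -ENOMEM? %s\n",
	       abort_should_print_stack(-ENOMEM) ? "yes" : "no");
	printf("print stack for -EUCLEAN? %s\n",
	       abort_should_print_stack(-EUCLEAN) ? "yes" : "no");
	return 0;
}
```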

---

8184620ae2
btrfs: fix lost file sync on direct IO write with nowait and dsync iocb
When doing a direct IO write using an iocb with nowait and dsync set, we
end up not syncing the file once the write completes.
This is because we tell iomap to not call generic_write_sync(), which
would result in calling btrfs_sync_file(), in order to avoid a deadlock
since iomap can call it while we are holding the inode's lock and
btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens
only if the write happens synchronously, when iomap_dio_rw() calls
iomap_dio_complete() before it returns. Instead we do the sync ourselves
at btrfs_do_write_iter().
For a nowait write however we can end up not doing the sync ourselves
at btrfs_do_write_iter() because the write could have been queued, and
therefore we get -EIOCBQUEUED returned from iomap in that case. That makes
us skip the sync call at btrfs_do_write_iter(), as we don't do it for
any error returned from btrfs_direct_write(). We can't simply do the call
even if -EIOCBQUEUED is returned, since that would block the task waiting
for IO, both for the data since there are bios still in progress as well
as potentially blocking when joining a log transaction and when syncing
the log (writing log trees, super blocks, etc).
So let iomap do the sync call itself and in order to avoid deadlocks for
the case of synchronous writes (without nowait), use __iomap_dio_rw() and
have ourselves call iomap_dio_complete() after unlocking the inode.
A test case will later be sent for fstests, after this is fixed in Linus'
tree.

Fixes:
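
The ordering the fix establishes, submitting the direct IO while holding the inode lock but running the completion only after unlocking, can be modeled in userspace. A minimal sketch: the names below stand in for __iomap_dio_rw(), iomap_dio_complete() and the inode's VFS lock, and are not the kernel API.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

struct dio { int queued; };

/* Models __iomap_dio_rw(): starts the IO, which may be queued. */
static struct dio *submit_dio(struct dio *d)
{
	return d;
}

/* Models iomap_dio_complete(): may block and may sync the file, so it
 * must run without inode_lock held. Here it takes the lock itself the
 * way a sync would, which would deadlock if the caller still held it. */
static long complete_dio(struct dio *d)
{
	pthread_mutex_lock(&inode_lock);
	pthread_mutex_unlock(&inode_lock);
	return d->queued ? 0 : 4096;
}

int main(void)
{
	struct dio d = { .queued = 0 };
	struct dio *h;

	pthread_mutex_lock(&inode_lock);
	h = submit_dio(&d);
	pthread_mutex_unlock(&inode_lock);	/* unlock first ... */

	printf("wrote %ld bytes\n", complete_dio(h));	/* ... then complete */
	return 0;
}
```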

---

4c0c8cfc84
btrfs: move btrfs_drop_extent_cache() to extent_map.c
The function btrfs_drop_extent_cache() doesn't really belong at file.c because what it does is drop a range of extent maps for a file range. It directly allocates and manipulates extent maps, by dropping, splitting and replacing them in an extent map tree, so it should be located at extent_map.c, where all manipulations of an extent map tree and its extent maps are supposed to be done. So move it out of file.c and into extent_map.c.

Additionally do the following changes:

1) Rename it into btrfs_drop_extent_map_range(), as this makes it more clear about what it does. The term "cache" is a bit confusing as it's not widely used, "extent maps" or "extent mapping" is much more common;

2) Change its 'skip_pinned' argument from int to bool;

3) Turn several of its local variables from int to bool, since they are used as booleans;

4) Move the declaration of some variables out of the function's main scope and into the scopes where they are used;

5) Remove pointless assignment of false to 'modified' early in the while loop, as later that variable is set and it's not used before that second assignment;

6) Remove checks for NULL before calling free_extent_map().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

3050dfa63e
btrfs: remove stale prototype of btrfs_write_inode
This function no longer exists; it was removed in

---

80f9d24130
btrfs: make btrfs_check_nocow_lock nowait compatible
Now that all the helpers used by btrfs_check_nocow_lock handle nowait, add a nowait flag to btrfs_check_nocow_lock so it can be used by the write path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

26ce911446
btrfs: make can_nocow_extent nowait compatible
If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

857bc13f85
btrfs: implement a nowait option for tree searches
For NOWAIT IOCBs we'll need a way to tell search to not wait on locks or anything. Accomplish this by adding a path->nowait flag that will use trylocks and skip reading of metadata, returning -EAGAIN in either of these cases. For now we only need this for reads, so only the read side is handled. Add an ASSERT() to catch anybody trying to use this for writes so they know they'll have to implement the write side.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
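
A minimal model of the nowait search path: trylock instead of lock, -EAGAIN instead of blocking. Names are illustrative, not the kernel's.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct path { bool nowait; };

static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;

static int search_slot(struct path *p)
{
	if (p->nowait) {
		if (pthread_mutex_trylock(&node_lock) != 0)
			return -EAGAIN;	/* caller retries from a blocking context */
	} else {
		pthread_mutex_lock(&node_lock);
	}
	/* ... read the tree block; nowait would also skip metadata reads ... */
	pthread_mutex_unlock(&node_lock);
	return 0;
}

int main(void)
{
	struct path p = { .nowait = true };

	pthread_mutex_lock(&node_lock);			/* another task holds the node */
	printf("nowait search: %d\n", search_slot(&p));	/* -EAGAIN */
	pthread_mutex_unlock(&node_lock);
	printf("retry: %d\n", search_slot(&p));		/* 0 */
	return 0;
}
```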

---

d9d88fde56
btrfs: move the fs_info related helpers closer to fs_info in ctree.h
This is purely cosmetic, to make it straightforward to copy and paste the definition and helpers from ctree.h into fs.h. These are helpers that act directly on the fs_info, and were scattered throughout ctree.h. Move them directly below the fs_info definition to make it easier to move them later. This includes the exclop prototypes, which share an enum that's used in struct btrfs_fs_info as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

f119553fd3
btrfs: move btrfs_csum_ptr to inode.c
This helper is only used in inode.c, move it locally to that file instead of defining it in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

0e75f0054a
btrfs: move fs_info forward declarations to the top of ctree.h
In order to make it more straightforward to move the fs_info struct and its related structures, move the struct declarations to the top of ctree.h. This will make it easier to clean up after the fact.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

2103da3b0e
btrfs: move btrfs_swapfile_pin into volumes.h
This isn't a great spot for this, but one of the swapfile helper functions is in volumes.c, so move the struct to volumes.h. In the future when we have better separation of code there will be a more natural spot for this.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

c2e79e865b
btrfs: move btrfs_pinned_by_swapfile prototype into volumes.h
This is defined in volumes.c, move the prototype into volumes.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

43712116f8
btrfs: move btrfs_init_async_reclaim_work prototype to space-info.h
The code for this helper is in space-info.c, move the prototype to space-info.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

c29abab4f9
btrfs: move btrfs_full_stripe_locks_tree into block-group.h
This is actually embedded in struct btrfs_block_group, so move this definition to block-group.h, and then open-code the init of the tree where we init the rest of the block group instead of using a helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

16708a8898
btrfs: move btrfs_caching_type to block-group.h
This is a block group related definition, move it into block-group.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

6ea1a5264b
btrfs: remove btrfs_bit_radix_cachep declaration
btrfs_bit_radix_cachep has been removed since commit

---

011b46c304
btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()
Btrfs qgroups have a long history of bringing a performance penalty in btrfs_commit_transaction(). Although we tried our best to mitigate such impact, there is still an unsolved call site, btrfs_drop_snapshot(). This function will find the highest shared tree block and modify its extent ownership to do a subvolume/snapshot dropping. Such a change will affect the whole subtree, and cause tons of qgroup dirty extents and stall btrfs_commit_transaction().

To avoid this problem, here we introduce a new sysfs interface, /sys/fs/btrfs/<uuid>/qgroups/drop_subptree_threshold, to determine whether and at which level we should skip qgroup accounting for subtree dropping. The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go through qgroup accounting, to ensure qgroup numbers are kept as consistent as possible. For performance-sensitive cases, this adds a way to change the value to something more reasonable like 3, making any subtree at or higher than level 3 mark qgroups inconsistent and skip the accounting.

The cost is obvious: the qgroup numbers are no longer consistent, but at least performance is more reasonable, and users have the control.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
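
The policy can be sketched as a simple level check at the start of subtree accounting. Everything below, including the MAX_LEVEL stand-in for BTRFS_MAX_LEVEL, is illustrative rather than the kernel code.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_LEVEL 8	/* stands in for BTRFS_MAX_LEVEL */

static int drop_subptree_threshold = MAX_LEVEL;	/* default: always account */
static bool qgroup_inconsistent;

static bool should_skip_subtree_accounting(int level)
{
	if (level >= drop_subptree_threshold) {
		qgroup_inconsistent = true;	/* numbers no longer trustworthy */
		return true;			/* but the commit won't stall */
	}
	return false;
}

int main(void)
{
	drop_subptree_threshold = 3;	/* e.g. written via the sysfs file */
	printf("level 5 subtree: skip=%d\n", should_skip_subtree_accounting(5));
	printf("level 1 subtree: skip=%d\n", should_skip_subtree_accounting(1));
	printf("qgroups inconsistent: %d\n", qgroup_inconsistent);
	return 0;
}
```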

---

ac3c0d36a2
btrfs: make fiemap more efficient and accurate reporting extent sharedness
The current fiemap implementation does not scale very well with the number of extents a file has. This is both because the main algorithm to find out the extents has a high algorithmic complexity and because for each extent we have to check if it's shared. This second part, checking if an extent is shared, is significantly improved by the two previous patches in this patchset, while the first part is improved by this specific patch. Every now and then we get reports from users mentioning fiemap is too slow or even unusable for files with a very large number of extents, such as the two recent reports referred to by the Link tags at the bottom of this change log.

To understand why the part of finding which extents a file has is very inefficient, consider the example of doing a full ranged fiemap against a file that has over 100K extents (normal for example for a file with more than 10G of data and using compression, which limits the extent size to 128K). When we enter fiemap at extent_fiemap(), the following happens:

1) Before entering the main loop, we call get_extent_skip_holes() to get the first extent map. This leads us to btrfs_get_extent_fiemap(), which in turn calls btrfs_get_extent(), to find the first extent map that covers the file range [0, LLONG_MAX). btrfs_get_extent() will first search the inode's extent map tree, to see if we have an extent map there that covers the range. If it does not find one, then it will search the inode's subvolume b+tree for a fitting file extent item. After finding the file extent item, it will allocate an extent map, fill it in with information extracted from the file extent item, and add it to the inode's extent map tree (which requires a search for insertion in the tree).

2) Then we enter the main loop at extent_fiemap(), emit the details of the extent, and call again get_extent_skip_holes(), with a start offset matching the end of the extent map we previously processed. We end up at btrfs_get_extent() again, will search the extent map tree and then search the subvolume b+tree for a file extent item if we could not find an extent map in the extent tree. We allocate an extent map, fill it in with the details in the file extent item, and then insert it into the extent map tree (yet another search in this tree).

3) The second step is repeated over and over, until we have processed the whole file range. Each iteration ends at btrfs_get_extent(), which does a red black tree search on the extent map tree, then searches the subvolume b+tree, allocates an extent map and then does another search in the extent map tree in order to insert the extent map.

In the best scenario we have all the extent maps already in the extent tree, and so for each extent we do a single search on a red black tree, so we have a complexity of O(n log n).

In the worst scenario we don't have any extent map already loaded in the extent map tree, or have very few already there. In this case the complexity is much higher since we do:

- A red black tree search on the extent map tree, which has O(log n) complexity, initially very fast since the tree is empty or very small, but as we end up allocating extent maps and adding them to the tree when we don't find them there, each subsequent search on the tree gets slower, since it's getting bigger and bigger after each iteration.

- A search on the subvolume b+tree, also O(log n) complexity, but it has items for all inodes in the subvolume, not just items for our inode.
Plus on a filesystem with concurrent operations on other inodes, we can block doing the search due to lock contention on b+tree nodes/leaves.

- Allocate an extent map - this can block, and can also fail if we are under serious memory pressure.

- Do another search on the extent maps red black tree, with the goal of inserting the extent map we just allocated. Again, after every iteration this tree is getting bigger by 1 element, so after many iterations the searches are slower and slower.

- We will not need the allocated extent map anymore, so it's pointless to add it to the extent map tree. It's just wasting time and memory.

In short we end up searching the extent map tree multiple times, on a tree that is growing bigger and bigger after each iteration. And besides that we visit the same leaf of the subvolume b+tree many times, since a leaf with the default size of 16K can easily have more than 200 file extent items. This is very inefficient overall.

This patch changes the algorithm to instead iterate over the subvolume b+tree, visiting each leaf only once, and only searching in the extent map tree for file ranges that have holes or prealloc extents, in order to figure out if we have delalloc there. It will never allocate an extent map and add it to the extent map tree. This is very similar to what was previously done for the lseek hole and data seeking features.

Also, the current implementation relying on extent maps for figuring out which extents we have is not correct. This is because extent maps can be merged even if they represent different extents - we do this to minimize memory utilization and keep extent map trees smaller. For example if we have two extents that are contiguous on disk, once we load the two extent maps, they get merged into a single one - however if only one of the extents is shared, we end up reporting both as shared or both as not shared, which is incorrect. This reproducer triggers that bug:

```bash
$ cat fiemap-bug.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj

mkfs.btrfs -f $DEV
mount $DEV $MNT

# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
          -c "fsync" \
          -c "pwrite -S 0xcd 256K 256K" \
          -c "fsync" \
          $MNT/foo

# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo

umount $MNT
```

Running the reproducer:

```
$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     127:       3328..      3455:    128:             last,eof
/mnt/sdj/foo: 1 extent found
```

We end up reporting that we have a single 512K extent that is not shared, however we have two 256K extents, and the second one is shared. Changing the reproducer to clone instead the first extent into file 'bar' makes us report a single 512K extent that is shared, which is also incorrect since we have two 256K extents and only the first one is shared.

This patch is part of a larger patchset that is comprised of the following patches:

  btrfs: allow hole and data seeking to be interruptible
  btrfs: make hole and data seeking a lot more efficient
  btrfs: remove check for impossible block start for an extent map at fiemap
  btrfs: remove zero length check when entering fiemap
  btrfs: properly flush delalloc when entering fiemap
  btrfs: allow fiemap to be interruptible
  btrfs: rename btrfs_check_shared() to a more descriptive name
  btrfs: speedup checking for extent sharedness during fiemap
  btrfs: skip unnecessary extent buffer sharedness checks during fiemap
  btrfs: make fiemap more efficient and accurate reporting extent sharedness

The patchset was tested on a machine running a non-debug kernel (Debian's default config) and compared the tests below on a branch without the patchset versus the same branch with the whole patchset applied.

The following test for a large compressed file without holes:

```bash
$ cat fiemap-perf-test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar

umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT
```

Before patchset:

```
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
```

After patchset:

```
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1214 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 684 milliseconds (metadata cached)
```

That's a speedup of about 3x for both cases (no metadata cached and all metadata cached).

The test provided by Pavel (first Link tag at the bottom), which uses files with a large number of holes, was also used to measure the gains, and it consists of a small C program and a shell script to invoke it.

The C program is the following:

```c
$ cat pavels-test.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define FILE_INTERVAL (1<<13) /* 8Kb */

long long interval(struct timeval t1, struct timeval t2)
{
	long long val = 0;

	val += (t2.tv_usec - t1.tv_usec);
	val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
	return val;
}

int main(int argc, char **argv)
{
	struct fiemap fiemap = {};
	struct timeval t1, t2;
	char data = 'a';
	struct stat st;
	int fd, off, file_size = FILE_INTERVAL;

	if (argc != 3 && argc != 2) {
		printf("usage: %s <path> [size]\n", argv[0]);
		return 1;
	}

	if (argc == 3)
		file_size = atoi(argv[2]);
	if (file_size < FILE_INTERVAL)
		file_size = FILE_INTERVAL;
	file_size -= file_size % FILE_INTERVAL;

	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (off = 0; off < file_size; off += FILE_INTERVAL) {
		if (pwrite(fd, &data, 1, off) != 1) {
			perror("pwrite");
			close(fd);
			return 1;
		}
	}

	if (ftruncate(fd, file_size)) {
		perror("ftruncate");
		close(fd);
		return 1;
	}

	if (fstat(fd, &st) < 0) {
		perror("fstat");
		close(fd);
		return 1;
	}

	printf("size: %ld\n", st.st_size);
	printf("actual size: %ld\n", st.st_blocks * 512);

	fiemap.fm_length = FIEMAP_MAX_OFFSET;
	gettimeofday(&t1, NULL);
	if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
		perror("fiemap");
		close(fd);
		return 1;
	}
	gettimeofday(&t2, NULL);

	printf("fiemap: fm_mapped_extents = %d\n", fiemap.fm_mapped_extents);
	printf("time = %lld us\n", interval(t1, t2));

	close(fd);
	return 0;
}
```

$ gcc -o pavels_test pavels_test.c

And the wrapper shell script:

```bash
$ cat fiemap-pavels-test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f -O no-holes $DEV
mount $DEV $MNT

echo
echo "*********** 256M ***********"
echo

./pavels-test $MNT/testfile $((1 << 28))
echo
./pavels-test $MNT/testfile $((1 << 28))

echo
echo "*********** 512M ***********"
echo

./pavels-test $MNT/testfile $((1 << 29))
echo
./pavels-test $MNT/testfile $((1 << 29))

echo
echo "*********** 1G ***********"
echo

./pavels-test $MNT/testfile $((1 << 30))
echo
./pavels-test $MNT/testfile $((1 << 30))

umount $MNT
```

Running his reproducer before applying the patchset:

```
*********** 256M ***********

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time =
```

---

12a824dc67
btrfs: speedup checking for extent sharedness during fiemap
One of the most expensive tasks performed during fiemap is to check if an extent is shared. This task has two major steps:

1) Check if the data extent is shared. This implies checking the extent item in the extent tree, checking delayed references, etc. If we find the data extent is directly shared, we terminate immediately;

2) If the data extent is not directly shared (its extent item has a refcount of 1), then it may be shared if we have snapshots that share subtrees of the inode's subvolume b+tree. So we check if the leaf containing the file extent item is shared, then its parent node, then the parent node of the parent node, etc, until we reach the root node or we find one of them is shared - in which case we stop immediately.

During fiemap we process the extents of a file from left to right, from file offset 0 to EOF. This means that we iterate b+tree leaves from left to right, and has the implication that we keep repeating that second step above several times for the same b+tree path of the inode's subvolume b+tree.

For example, if we have two file extent items in leaf X, and the path to leaf X is A -> B -> C -> X, then when we try to determine if the data extent referenced by the first extent item is shared, we check if the data extent is shared - if it's not, then we check if leaf X is shared, if not, then we check if node C is shared, if not, then check if node B is shared, if not then check if node A is shared. When we move to the next file extent item, after determining the data extent is not shared, we repeat the checks for X, C, B and A - doing all the expensive searches in the extent tree, delayed refs, etc. If we have thousands of file extents, then we keep repeating the sharedness checks for the same paths over and over.

On a file that has no shared extents or only a small portion, it's easy to see that this scales terribly with the number of extents in the file and the sizes of the extent and subvolume b+trees.

This change eliminates the repeated sharedness check on extent buffers by caching the results of the last path used. The results can be used as long as no snapshots were created since they were cached (for not shared extent buffers) or no roots were dropped since they were cached (for shared extent buffers). This greatly reduces the time spent by fiemap for files with thousands of extents and/or large extent and subvolume b+trees.

Example performance test:

```bash
$ cat fiemap-perf-test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT
```

Before this patch:

```
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
```

After this patch:

```
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1646 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 698 milliseconds (metadata cached)
```

That's about 2.2x faster when no metadata is cached, and about 3x faster when all metadata is cached. On a real filesystem with many other files, data, directories, etc, the b+trees will be 2 or 3 levels higher, therefore this optimization will have a higher impact.

Several reports of slow fiemap show up often; the two Link tags below refer to two recent reports of such slowness. This patch, together with the next ones in the series, is meant to address that.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
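
The invalidation rule, cached results staying valid until a snapshot is created (for "not shared") or a root is dropped (for "shared"), can be modeled with two counters. All structures and names below are illustrative assumptions, not the kernel's.

```c
#include <stdbool.h>
#include <stdio.h>

struct fs_counters { unsigned long snapshots; unsigned long roots_dropped; };

struct shared_cache {
	unsigned long long bytenr;	/* extent buffer this entry describes */
	bool is_shared;
	struct fs_counters at;		/* counters when the result was cached */
	bool valid;
};

static bool cache_lookup(const struct shared_cache *c,
			 const struct fs_counters *now,
			 unsigned long long bytenr, bool *is_shared)
{
	if (!c->valid || c->bytenr != bytenr)
		return false;
	if (!c->is_shared && now->snapshots != c->at.snapshots)
		return false;	/* a new snapshot may have shared it */
	if (c->is_shared && now->roots_dropped != c->at.roots_dropped)
		return false;	/* a dropped root may have unshared it */
	*is_shared = c->is_shared;
	return true;
}

int main(void)
{
	struct fs_counters now = { 0, 0 };
	struct shared_cache c = { 1024, false, now, true };
	bool shared;

	printf("hit: %d\n", cache_lookup(&c, &now, 1024, &shared));	/* 1 */
	now.snapshots++;	/* snapshot created: cached "not shared" is stale */
	printf("hit: %d\n", cache_lookup(&c, &now, 1024, &shared));	/* 0 */
	return 0;
}
```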

---

1c56ab9919
btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2
The problem of long mount time caused by the block group item search has been known for some time, and the solution of a block group tree has been proposed. There is really no need to tie this feature to extent tree v2; just introduce a compat RO flag, BLOCK_GROUP_TREE, to correctly solve the problem.

All the code handling the block group root is already in the upstream kernel, thus this patch really only needs to introduce the new compat RO flag.

This patch introduces one extra artificial limitation on the block group tree feature: free space cache v2 and the no-holes feature must be enabled to use this new compat RO feature. This artificial requirement is mostly to reduce the test combinations, and can be a guideline for future features, to mostly rely on the latest default features.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

14033b08a0
btrfs: don't save block group root into super block
The extent tree v2 needs a new root for storing all block group items; the whole feature hasn't been finished yet, so we can afford to do some changes. My initial proposal years ago just added a new tree rootid, loaded from the tree root, just like what we did for the quota/free space tree/uuid/extent roots. But the extent tree v2 patches introduced a completely new way to store the block group tree root in the super block, which is arguably wasteful.

Currently there are only 3 trees stored in super blocks, and they all have their valid reasons:

- Chunk root: needed for bootstrap.

- Tree root: really the entry point for all trees.

- Log root: this is special as the log root has to be updated out of the existing transaction mechanism.

There is not even any reason to put the block group root into super blocks; the block group tree is updated at the same time as the old extent tree, so there is no need for extra bootstrap/out-of-transaction updates. So just move the block group root from the super block into the tree root.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

8e327b9c0d
btrfs: dump all space infos if we abort transaction due to ENOSPC
We have hit some transaction aborts due to -ENOSPC internally. Normally we should always reserve enough space for metadata for every transaction, thus hitting -ENOSPC should really indicate some case we didn't expect. But unfortunately current error reporting will only give a kernel warning and stack trace, not really helpful for debugging what's causing the problem. And the mount option debug_enospc can only help when the user can reproduce the problem, but under most cases such a transaction abort by -ENOSPC is really hard to reproduce.

So this patch will dump all space infos (data, metadata, system) when we abort the first transaction with -ENOSPC. This should at least provide some clue for us. An example of a dump would look like this:

```
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs]
<call trace skipped>
---[ end trace 0000000000000000 ]---
BTRFS info (device dm-1: state A): dumping space info:
BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full
BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0
BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016
BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232
BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728
BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left
BTRFS info (device dm-1: state EA): forced readonly
```

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

2bbc72f14f
btrfs: don't take a bio_counter reference for cloned bios
Stop grabbing an extra bio_counter reference for each clone bio in a mirrored write, and instead release the single original reference in btrfs_end_bioc once all the bios for a single btrfs_bio have completed, rather than at the end of btrfs_submit_bio once all bios have been submitted. This means the reference is now carried by the "upper" btrfs_bio only instead of each lower bio.

Also remove the now unused btrfs_bio_counter_inc_noblocked helper.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

---

fb731430be
btrfs: sysfs: show discard stats and tunables in non-debug build
When discard=async was introduced, there were also sysfs knobs and stats for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults have been set and so far seem to satisfy all users on a range of workloads. As there are not only tunables (like iops or kbps) but also stats tracking the amount of discardable bytes, these should be available when async discard is on (and otherwise are not).

The stats are moved from the per-fs debug directory, so they are under /sys/fs/btrfs/FSID/discard:

- discard_bitmap_bytes - amount of discarded bytes from data tracked as bitmaps
- discard_extent_bytes - dtto but as extents
- discard_bytes_saved -
- discardable_bytes - amount of bytes that can be discarded
- discardable_extents - number of extents to be discarded
- iops_limit - tunable limit of number of discard IOs to be issued
- kbps_limit - tunable limit of kilobytes per second issued as discard IO
- max_discard_size - tunable limit for size of one IO discard request

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

38622010a6
btrfs: send: add support for fs-verity
Preserve the fs-verity status of a btrfs file across send/recv.

There is no facility for installing the Merkle tree contents directly on the receiving filesystem, so we package up the parameters used to enable verity found in the verity descriptor. This gives the receive side enough information to properly enable verity again. Note that this means that receive will have to re-compute the whole Merkle tree, similar to how compression worked before encoded_write.

Since the file becomes read-only after verity is enabled, it is important that verity is added to the send stream after any file writes. Therefore, when we process a verity item, merely note that it happened, then actually create the command in the send stream during 'finish_inode_if_needed'.

This also creates V3 of the send stream format, without any format changes besides adding the new commands and attributes.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

---

d1f68ba069
btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent()
btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

5f4403e10f
btrfs: add lockdep annotations for the ordered extents wait event
This wait event is very similar to the pending ordered wait event in the sense that it occurs in a different context than the condition signaling for the event. The signaling occurs in btrfs_remove_ordered_extent() while the wait event is implemented in btrfs_start_ordered_extent() in fs/btrfs/ordered-data.c.

However, in this case a thread must not acquire the lockdep map for the ordered extents wait event when the ordered extent is related to a free space inode. That is because lockdep creates dependencies between locks acquired both in execution paths related to normal inodes and paths related to free space inodes, thus leading to false positives.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

8b53779eaa
btrfs: add lockdep annotations for pending_ordered wait event
In contrast to the num_writers and num_extwriters wait events, the condition for the pending ordered wait event is signaled in a different context from the wait event itself. The condition signaling occurs in btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait event is implemented in btrfs_commit_transaction() in fs/btrfs/transaction.c.

Thus the thread signaling the condition has to acquire the lockdep map as a reader at the start of btrfs_remove_ordered_extent() and release it after it has signaled the condition. In this case some dependencies might be left out due to the placement of the annotation, but it is better than no annotation at all.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

3e738c531a
btrfs: add lockdep annotations for transaction states wait events
Add lockdep annotations for the transaction states that have wait events:

1) TRANS_STATE_COMMIT_START
2) TRANS_STATE_UNBLOCKED
3) TRANS_STATE_SUPER_COMMITTED
4) TRANS_STATE_COMPLETED

The new macros introduced here to annotate the transaction states wait events have the same effect as the generic lockdep annotation macros.

With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START, the transaction thread has to acquire the lockdep maps for the transaction states as reader after the lockdep map for num_writers is released, so that lockdep does not complain.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

---

5a9ba6709f
btrfs: add lockdep annotations for num_extwriters wait event
Similarly to the num_writers wait event in fs/btrfs/transaction.c, add a lockdep annotation for the num_extwriters wait event.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
||
![]() |
e1489b4fe6 |
btrfs: add lockdep annotations for num_writers wait event
Annotate the num_writers wait event in fs/btrfs/transaction.c with lockdep in order to catch deadlocks involving this wait event. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
ab9a323f9a |
btrfs: add macros for annotating wait events with lockdep
Introduce four macros that are used to annotate wait events in btrfs code with lockdep:

1) btrfs_lockdep_init_map
2) btrfs_lockdep_acquire
3) btrfs_lockdep_release
4) btrfs_might_wait_for_event

The btrfs_lockdep_init_map macro is used to initialize a lockdep map. The btrfs_lockdep_<acquire,release> macros are used by threads to take the lockdep map as readers (shared lock) and release it, respectively. The btrfs_might_wait_for_event macro is used by threads to take the lockdep map as writers (exclusive lock) and release it.

In general, the lockdep annotation for wait events works as follows: the condition for a wait event can be modified and signaled at the same time by multiple threads. These threads hold the lockdep map as readers when they enter a context in which blocking would prevent signaling the condition. Frequently, this occurs when a thread violates a condition (lockdep map acquire), before restoring it and signaling it at a later point (lockdep map release). The threads that block on the wait event take the lockdep map as writers (exclusive lock). These threads have to block until all the threads that hold the lockdep map as readers signal the condition for the wait event and release the lockdep map.

The lockdep annotation is used to warn about potential deadlock scenarios that involve the threads that modify and signal the wait event condition and threads that block on the wait event. A simple example is illustrated below:

Without lockdep:

  TA                                TB
                                    cond = false
  lock(A)
  wait_event(w, cond)
  unlock(A)
                                    lock(A)
                                    cond = true
                                    signal(w)
                                    unlock(A)

With lockdep:

  TA                                TB
                                    rwsem_acquire_read(lockdep_map)
                                    cond = false
  lock(A)
  rwsem_acquire(lockdep_map)
  rwsem_release(lockdep_map)
  wait_event(w, cond)
  unlock(A)
                                    lock(A)
                                    cond = true
                                    signal(w)
                                    unlock(A)
                                    rwsem_release(lockdep_map)

In the second case, with the lockdep annotation, lockdep would warn about an ABBA deadlock, while the first case would just deadlock at some point. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
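A minimal, self-contained sketch of the same pattern using the kernel's generic lockdep primitives; the btrfs macros wrap essentially these calls, and the names here are illustrative:

```c
#include <linux/lockdep.h>

static struct lockdep_map wait_map;
static struct lock_class_key wait_key;

static void annotation_init(void)
{
        lockdep_init_map(&wait_map, "my_wait_event", &wait_key, 0);
}

/* Signaling side: hold the map as a reader from the point the condition
 * is violated until it has been restored and signaled. */
static void signaler_enter(void)
{
        rwsem_acquire_read(&wait_map, 0, 0, _THIS_IP_);
}

static void signaler_exit(void)
{
        rwsem_release(&wait_map, _THIS_IP_);
}

/* Waiting side: flag the blocking point as a writer acquisition right
 * before the actual wait_event() call. */
static void might_wait(void)
{
        rwsem_acquire(&wait_map, 0, 0, _THIS_IP_);
        rwsem_release(&wait_map, _THIS_IP_);
}
```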
||
![]() |
d5b81ced74 |
btrfs: zoned: fix API misuse of zone finish waiting
The commit |
||
![]() |
ced8ecf026 |
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a few symptoms:

1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree.

All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit |
||
![]() |
b40130b23c |
btrfs: fix lockdep splat with reloc root extent buffers
We have been hitting the following lockdep splat with btrfs/187 recently WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30 -> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2); *** DEADLOCK *** 7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? 
release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? __x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of reloc tree -> normal tree for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks. However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of normal tree -> reloc tree which is why lockdep complains. Additionally even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting in a block that has a owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root, and create a lockdep splat later on when we wander into that block from the fs root. Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merged. This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of normal tree -> reloc tree We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root. With this patch we no longer have the lockdep splat in btrfs/187. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
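The reset described in the last paragraphs can be sketched as follows, assuming a root state bit (here called BTRFS_ROOT_RESET_LOCKDEP_CLASS) set around the reloc search, and the existing btrfs_set_buffer_lockdep_class() helper:

```c
static void maybe_reset_lockdep_class(struct btrfs_root *root,
                                      struct extent_buffer *eb)
{
        /*
         * Only blocks reached while the reloc-search bit is set get
         * their lockdep class moved back to the owning tree's key.
         */
        if (test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state))
                btrfs_set_buffer_lockdep_class(root->root_key.objectid, eb,
                                               btrfs_header_level(eb));
}
```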
||
![]() |
81bd9328ab |
btrfs: fix repair of compressed extents
Currently the checksum of compressed extents is verified based on the compressed data and the lower btrfs_bio, but the actual repair process is driven by end_bio_extent_readpage on the upper btrfs_bio for the decompressed data. This has a bunch of issues, including not being able to properly communicate the failed mirror up in case that the I/O submission got preempted, a general loss of information about whether an error was an I/O error or a checksum verification failure, but most importantly that this design causes btrfs_clean_io_failure to eventually write back the uncompressed good data onto the disk sectors that are supposed to contain compressed data.

Fix this by moving the repair to the lower btrfs_bio. To do so, a fair amount of code has to be reshuffled:

a) the lower btrfs_bio now needs a valid csum pointer. The easiest way to achieve that is to pass NULL to btrfs_lookup_bio_sums and just use the btrfs_bio management of csums. For a compressed_bio that is split into multiple btrfs_bios this means additional memory allocations, but the code becomes a lot more regular.

b) checksum verification now runs directly on the lower btrfs_bio instead of the compressed_bio. This actually nicely simplifies the end I/O processing.

c) btrfs_repair_one_sector can't just look up the logical address for the file offset any more, as there is no corresponding relative offset that applies to both the file offset and the logical address for compressed extents. Instead require that the saved bvec_iter in the btrfs_bio is filled out for all read bios and use that, which again removes a fair amount of code. See the sketch after this list for what the iter saving amounts to.

Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com> |
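The bvec_iter requirement in item c) boils down to remembering the iterator at submission time, roughly like the following sketch; the btrfs_bio 'iter' field name is assumed from the description:

```c
static void btrfs_save_read_iter(struct bio *bio)
{
        struct btrfs_bio *bbio = btrfs_bio(bio);

        /*
         * bio->bi_iter advances as the bio completes; keep a copy so
         * the repair code can still map a file offset to a disk sector.
         */
        bbio->iter = bio->bi_iter;
}
```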
||
![]() |
7959bd4411 |
btrfs: remove the start argument to check_data_csum and export
Derive the value of start from the btrfs_bio now that ->file_offset is always valid. Also export and rename the function so it's available outside of inode.c as we'll need that soon. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
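In sketch form, the derivation replacing the removed argument looks like this (illustrative helper name):

```c
/* The file position of a sector within a read bio can now be computed
 * from the btrfs_bio instead of being threaded through as 'start'. */
static u64 sector_file_offset(struct bio *bio, u32 bio_offset)
{
        return btrfs_bio(bio)->file_offset + bio_offset;
}
```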
||
![]() |
2ce543f478 |
btrfs: zoned: wait until zone is finished when allocation didn't progress
When the allocation position doesn't advance, we cannot submit new IOs to
finish a block group, but there should be ongoing IOs that will finish a
block group. So, in that case, we wait for a zone to be finished and retry
the allocation after that.
Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
indicate that a zone finish has to happen before we can proceed. The flag
is set when the allocator detects it cannot activate a new block group,
and it is cleared once a zone is finished.
CC: stable@vger.kernel.org # 5.16+
Fixes:
|
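A sketch of the flag's lifecycle, assuming a dedicated waitqueue (called zone_finish_wait here) paired with it:

```c
/* Allocator: no progress was possible, wait for an ongoing zone finish. */
if (test_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags))
        wait_event(fs_info->zone_finish_wait,
                   !test_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags));

/* Zone finish completion: clear the flag and let allocators retry. */
clear_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
wake_up_all(&fs_info->zone_finish_wait);
```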
||
![]() |
7d7672bc5d |
btrfs: convert count_max_extents() to use fs_info->max_extent_size
If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
of extents needed, btrfs releases too much of the metadata reservation on
its way to write out the data.
Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, and fix the calculation of
the metadata reservation.
CC: stable@vger.kernel.org # 5.12+
Fixes:
|
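With that, the helper is essentially a per-filesystem version of the old constant-based computation, roughly:

```c
/* Number of extents needed to cover 'size' bytes, using the runtime
 * per-filesystem limit instead of the BTRFS_MAX_EXTENT_SIZE constant. */
static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
{
        return div_u64(size + fs_info->max_extent_size - 1,
                       fs_info->max_extent_size);
}
```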
||
![]() |
f7b12a62f0 |
btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
On a zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according to the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.
The metadata reservation is done at e.g. btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of
extents increases massively, the reserved metadata can run out.
The increase in the number of extents easily occurs on a zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And it causes the
following warning on a small RAM environment with metadata over-commit
disabled (done in the following patch).
[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---
To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for
regular vs zoned filesystems.
Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On a
zoned filesystem, it is set to fs_info->max_zone_append_size.
CC: stable@vger.kernel.org # 5.12+
Fixes:
|
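In sketch form, the initialization described above (placement in the mount path approximated):

```c
/* Default for regular filesystems, set at mount time. */
fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;

/* On zoned filesystems, writes are capped by the zone append limit. */
if (btrfs_is_zoned(fs_info))
        fs_info->max_extent_size = fs_info->max_zone_append_size;
```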