From e2497d4c7ed6bafc9116d97ecbf39c7d75205f38 Mon Sep 17 00:00:00 2001
From: g-pl
Date: Sat, 28 Jan 2017 16:43:23 +0100
Subject: [PATCH] f2fs: Sync with upstream f2fs-stable 3.4.y

Up to: "fscrypto: no support for v3.4"
http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-stable.git/commit/?h=linux-3.4.y&id=2220ac23c3c583321276c85cbfd7f6378abd8f94

. . .

f2fs: catch up to v4.4-rc1

commit beaa57dd986d4f398728c060692fc2452895cfd8
Author: Chao Yu
Date:   Thu Oct 22 18:24:12 2015 +0800

f2fs: fix to skip shrinking extent nodes

In f2fs_shrink_extent_tree we should stop the shrink flow if we have
already shrunk enough nodes in the extent cache.

Signed-off-by: Jaegeuk Kim

f2fs: Fix a system panic caused by f2fs_follow_link

In linux 3.10, we cannot make sure the return value of the nd_get_link
function is valid, so this patch adds a check before using it.

Signed-off-by: Yunlei He
Signed-off-by: Shuoran Liu
Signed-off-by: Jaegeuk Kim

f2fs: report error of f2fs_create_root_stats

f2fs_create_root_stats can fail due to no memory; report it to the user.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: commit atomic written page in LFS mode

We should always commit atomically written pages in LFS mode; otherwise
data will become corrupted if we encounter a sudden power cut after
partial pages were committed in IPU mode.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: support file defragment

This patch introduces a new ioctl F2FS_IOC_DEFRAGMENT to support file
defragmentation in a specified range of a regular file.

This ioctl can be used in very limited workloads: if the user expects
high sequential read performance in a randomly written file, this
interface can be used for defragmentation; after that the file can be
written as contiguously as possible in the device.

Meanwhile, it has a side effect: it will make holes in the segments
where the blocks were located originally, so it's better to trigger GC
to eliminate the fragments in those segments.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix memory leak of kobject in error path of fill_super

f2fs_sb_info::s_kobj should be released in the error path of
fill_super, otherwise it will lead to a memory leak.

This bug was found by kmemleak:

dmesg:
kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800838dc358 (size 8):
  comm "mount", pid 4154, jiffies 4297482839 (age 1911.412s)
  hex dump (first 8 bytes):
    7a 72 61 6d 31 00 ff ff                          zram1...
  backtrace:
    [] kmemleak_alloc+0x28/0x50
    [] __kmalloc_track_caller+0xef/0x1c0
    [] kstrdup+0x45/0x80
    [] kstrdup_const+0x28/0x30
    [] kvasprintf_const+0x63/0xa0
    [] kobject_set_name_vargs+0x3c/0xa0
    [] kobject_add_varg+0x25/0x60
    [] kobject_init_and_add+0x53/0x70
    [] f2fs_fill_super+0x9d9/0xc40 [f2fs]
    [] mount_bdev+0x192/0x1d0
    [] f2fs_mount+0x15/0x20 [f2fs]
    [] mount_fs+0x43/0x170
    [] vfs_kern_mount+0x76/0x160
    [] do_mount+0x258/0xdc0
    [] SyS_mount+0x7b/0xc0
    [] entry_SYSCALL_64_fastpath+0x12/0x6f
...

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to enable missing ioctl interfaces in ->compat_ioctl

In a 64-bit kernel, f2fs can support 32-bit ioctl system calls by
identifying the encoded command which is converted from the 32-bit one
to the 64-bit one in ->compat_ioctl. When we introduced new interfaces
in ->ioctl, we forgot to enable them in ->compat_ioctl, so enable them
for fixing.
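A minimal sketch of the pattern (the case list here is illustrative, not
the exact upstream diff):

	long f2fs_compat_ioctl(struct file *file, unsigned int cmd,
						unsigned long arg)
	{
		switch (cmd) {
		case F2FS_IOC32_GETFLAGS:
			cmd = F2FS_IOC_GETFLAGS;	/* remap 32-bit cmd to native */
			break;
		case F2FS_IOC32_SETFLAGS:
			cmd = F2FS_IOC_SETFLAGS;
			break;
		/* the fix: 64-bit-clean commands must be listed too, or they
		 * bounce with -ENOIOCTLCMD when issued by 32-bit userland */
		case F2FS_IOC_START_ATOMIC_WRITE:
		case F2FS_IOC_COMMIT_ATOMIC_WRITE:
		case F2FS_IOC_DEFRAGMENT:
			break;
		default:
			return -ENOIOCTLCMD;
		}
		return f2fs_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
	}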
Signed-off-by: Chao Yu
[Jaegeuk Kim: fix wrongly added spaces together]
Signed-off-by: Jaegeuk Kim

f2fs: fix to remove directory inode from dirty list

If the last dirty dentry page was written back in the reclaim path, we
should remove its directory inode from the global dirty list to avoid an
unnecessary flush for this inode when doing a checkpoint.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: optimize __find_rev_next_bit

1. Skip __reverse_ulong if the bitmap is empty.
2. Reduce branches and code.

According to my test, the performance of this new version is 5% higher
on an empty bitmap of 64 bytes, and remains about the same in the worst
scenario.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: clear page uptodate when dropping cache for atomic write

We should clear the uptodate flag for all atomically written pages when
we drop them; otherwise, before these cached pages are eventually
reclaimed or invalidated, we will see invalid data when hitting them
again.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to report error in f2fs_readdir

get_lock_data_page in f2fs_readdir can fail for a lot of reasons (e.g.
no memory or an IO error); it's better to report this kind of error to
the user rather than ignoring it.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid deadlock in f2fs_shrink_extent_tree

While handling extent trees, we can enter a reclaiming path at any time.
If it tries to release some extent nodes in the same extent tree,
write_lock(&et->lock) would hang. In order to avoid the deadlock, we can
just skip it.

Note that, if it is an unreferenced tree, we should get
write_lock(&et->lock) successfully and release all nodes therein.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: do not recover from previous remained wrong dnodes

If the device does not support discard, some obsolete dnodes can be
recovered by roll-forward. This patch enhances the recovery flow.

Signed-off-by: Jaegeuk Kim

f2fs: clean up error path in f2fs_readdir

No logic changes, just clean up the error path.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/dir.c

f2fs: clean up code with __has_cursum_space

Clean up the code in lookup_journal_in_cursum() with
__has_cursum_space().

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: clean up argument of recover_data

In recover_data, the value of the argument 'type' will be
CURSEG_WARM_NODE all the time; remove it for cleanup.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: kill f2fs_drop_largest_extent

For direct IO, f2fs only allocates a new address for a block which did
not exist on disk before, so its mapping info should not previously
exist in the extent cache, and we do not need to call
f2fs_drop_largest_extent to drop the related cache.

Since there are no more callers of f2fs_drop_largest_extent now, kill
it.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use sbi->blocks_per_seg to avoid unnecessary calculation

Use sbi->blocks_per_seg directly to avoid the unnecessary calculation of
1 << sbi->log_blocks_per_seg.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to convert inline inode in ->setattr

In commit 3c4541452748 ("f2fs: do not trim preallocated blocks when
truncating after i_size"), in order to follow the regulation:
"truncate(x) where x > i_size will not trim all blocks past i_size."
like other file systems, in ->setattr we invoked truncate_setsize
instead of f2fs_truncate to avoid unneeded block trimming in that case,
but forgot to call f2fs_convert_inline_inode to keep the inline data
conversion rule consistent. This patch fixes the code to convert inline
data if necessary.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: enhance the bit operation for SSR

This patch enhances the existing bit operation when f2fs allocates SSR
blocks.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: refactor f2fs_commit_super

Previously, f2fs_commit_super hacks bh->blocknr to write the broken
alternate superblock. Instead of that, we should use the correct logic
to retrieve its buffer head, locking it appropriately.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use lock_buffer when changing superblock

When modifying sb contents, we need to lock its buffer.

Signed-off-by: Jaegeuk Kim

f2fs: clean up node page updating flow

If read_node_page returns LOCKED_PAGE, in its caller it's better to:
a) skip the unneeded 'Update' flag and mapping info verification;
b) check the nid value stored in the footer structure of the node page.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/node.c

f2fs: write only the pages in range during defragment

@lend of filemap_write_and_wait_range is supposed to be an "offset in
bytes where the range ends (inclusive)". Subtract 1 to avoid writing an
extra page.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to update variable correctly when skip a unmapped block

map.m_len should be reduced after skipping a block.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: do more integrity verification for superblock

Do more sanity checks for the superblock during ->mount.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: add symbol to avoid any confusion with tools

This patch adds MAX_VOLUME_NAME to sync with f2fs-tools.

Signed-off-by: Jaegeuk Kim

f2fs: rename {add,remove,release}_dirty_inode to {add,remove,release}_ino_entry

remove_dirty_dir_inode will be renamed to remove_dirty_inode as a
generic function in a following patch for removing directory/regular/
symlink inodes from the global dirty list. Here, rename the ino
management related functions for readability, and also to avoid a name
conflict.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce dirty list node in inode info

Add a new dirty list node member to the inode info for linking the
inode to the global dirty list in the superblock, instead of the old
implementation which allocated slab cache memory as an entry for the
inode. It avoids memory pressure due to slab cache allocation, and also
makes the code cleaner.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce __remove_dirty_inode

Introduce __remove_dirty_inode to clean up the code in
remove_dirty_dir_inode.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to reset variable correctlly

f2fs_map_blocks will set m_flags and m_len to 0, so we don't need to
reset m_flags ourselves, but we have to reset m_len to the correct value
before using it again.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: backup raw_super in sbi

f2fs uses fields of the f2fs_super_block struct directly in a grabbed
buffer. Once the buffer happens to be destroyed (e.g. through dd), it
may have unpredictable effects on f2fs. This patch allocates an
additional buffer to store the data of the super block, rather than
using the grabbed block buffer directly.
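A minimal sketch of the idea (hypothetical placement in
read_raw_super_block(); error handling around it is omitted):

	/* keep a private, long-lived copy instead of pointing into the bh */
	raw_super = kzalloc(sizeof(struct f2fs_super_block), GFP_KERNEL);
	if (!raw_super)
		return -ENOMEM;
	memcpy(raw_super, bh->b_data + F2FS_SUPER_OFFSET, sizeof(*raw_super));
	brelse(bh);	/* the block buffer may now be destroyed safely */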
Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: don't grab super block buffer header all the time

We have already got one copy of the valid super block in memory, so do
not grab the buffer header of the super block all the time.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: relocate tracepoint of write_checkpoint

The tracepoint needs to be relocated to see exact trace logs.

Signed-off-by: Jaegeuk Kim

f2fs: introduce __f2fs_commit_super

Introduce __f2fs_commit_super to factor out the duplicated code in
f2fs_commit_super for cleanup.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: record dirty status of regular/symlink inode

Maintain regular/symlink inodes which have dirty pages in the global
dirty list and record their total dirty page count, like the way of
handling directory inodes.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce new option for controlling data flush

Add a new option 'data_flush' to enable the data flush functionality.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	Documentation/filesystems/f2fs.txt

f2fs: stat dirty regular/symlink inodes

Add stats for dirty regular and symlink inodes for showing in debugfs.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: support data flush in background

Previously, when finishing a checkpoint, we have persisted all fs meta
info including the meta inode, node inode, and dentry pages of directory
inodes, so after a sudden power cut, f2fs can recover from the last
checkpoint with a full directory structure.

But during the checkpoint, we didn't flush dirty pages of regular and
symlink inodes, so such dirty data still in memory will be lost at the
moment of power off.

In order to reduce the chance of lost data, this patch enables
f2fs_balance_fs_bg with the ability of data flushing. It will try to
flush user data before starting a checkpoint, so user data written after
the last checkpoint which may not be fsynced can be saved.

When we mount with the data_flush option, after every period of
cp_interval (configurable in sysfs: /sys/fs/f2fs/device/cp_interval)
seconds, user data can be flushed to the device once f2fs_balance_fs_bg
is called in the kworker thread or gc thread.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: optimize the flow of f2fs_map_blocks

Check map->m_len right after it changes to avoid excess calls to update
dnode_of_data.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: add a tracepoint for sync_dirty_inodes

This patch adds a tracepoint for sync_dirty_inodes.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use atomic variable for total_extent_tree

It would be better to use an atomic variable for total_extent_tree.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: speed up shrinking extent tree entries

If there are no candidates for shrinking slab entries, we don't need to
traverse any trees at all.

Reviewed-by: Chao Yu
[Jaegeuk Kim: fix missing initialization reported by Yunlei He]
Signed-off-by: Jaegeuk Kim

f2fs: check inline_data flag at converting time

We can check the inode's inline_data flag when calling to convert it.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid unnecessary f2fs_gc for dir operations

The f2fs_balance_fs doesn't need to cover the f2fs_new_inode or
f2fs_find_entry works.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/namei.c

f2fs: record node block allocation in dnode_of_data

This patch introduces recording node block allocation in dnode_of_data.
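A plausible shape of the change (hedged sketch; field placement and
surrounding members are abbreviated, only node_changed is the point):

	struct dnode_of_data {
		struct inode *inode;		/* vfs inode pointer */
		struct page *inode_page;	/* its inode page, NULL is possible */
		struct page *node_page;		/* cached direct node page */
		/* ... other fields omitted ... */
		bool node_changed;		/* set when a node block is newly allocated */
	};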
This information helps to figure out whether any node block is allocated
during specific file operations.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: reduce covered region of sbi->cp_rwsem in f2fs_map_blocks

Only cover sbi->cp_rwsem for one dnode page's allocation and
modification instead of multiple ones in f2fs_map_blocks; this reduces
the covered region of cp_rwsem, so we can avoid potentially long delays
for a concurrent checkpointer.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: call f2fs_balance_fs only when node was changed

If the user tries to update or read data, we don't need to call
f2fs_balance_fs, which triggers f2fs_gc and adds unnecessarily long
latency.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/file.c

f2fs: report error of do_checkpoint

do_checkpoint and write_checkpoint can fail for reasons like being
triggered on a readonly fs or encountering an IO error of the storage
device. So it's better to report such error info to the user, letting
the user be aware of the failure of the checkpoint.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: don't convert inline inode when inline_data option is disable

If the inline_data option is disabled, when truncating an inline inode
to a size which does not exceed the maximum inline size, we should not
convert the inline inode to a regular one, to avoid the overhead of the
synchronous conversion.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce prepare_write_begin to clean up

This patch adds prepare_write_begin to clean up f2fs_write_begin. The
major role of this function is to convert any inline_data and allocate
or find the block address.

Signed-off-by: Jaegeuk Kim

f2fs: return early when trying to read null nid

If get_node_page() gets a zero nid, we can return early without getting
a wrong page. For example, get_dnode_of_data() can try to do that.

Signed-off-by: Jaegeuk Kim

f2fs: avoid f2fs_lock_op in f2fs_write_begin

If f2fs_write_begin is going to update data, we can bypass calling
f2fs_lock_op() in order to avoid checkpoint latency in the write
syscall.

Signed-off-by: Jaegeuk Kim

f2fs: declare static function

The __f2fs_commit_super is static.

Signed-off-by: Jaegeuk Kim

f2fs: add missing f2fs_balance_fs in __recover_dot_dentries

__recover_dot_dentries will try to grab free space in storage, so fix
to add the missing f2fs_balance_fs here.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: let user being aware of IO error

Sometimes we stay quiet when an IO error occurs in a lower layer device,
so the user will not receive any error return value for some operations,
even though the operation did not actually succeed. This should be
avoided, so this patch reports such errors to the user.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/data.c

f2fs: fix bugs and simplify codes of f2fs_fiemap

Fix bugs:
1. len could be updated incorrectly when start+len is beyond isize.
2. If there is a hole consisting of more than two blocks, it could fail
   to add the FIEMAP_EXTENT_LAST flag for the last extent.
3. If there is an extent beyond isize, when we search extents in a
   range that ends at isize, it will also return the extent beyond
   isize, which is outside the range.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: add a max block check for get_data_block_bmap

This patch adds a max block check for get_data_block_bmap.

The Trinity test program sends a block number as a parameter into
ioctl_fibmap, which is then used in get_node_path(); when the block
number is larger than the f2fs max blocks, it triggers a kernel bug.
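A minimal sketch of the check (hedged; the exact upper-bound expression
differs across kernel versions):

	static int get_data_block_bmap(struct inode *inode, sector_t iblock,
				struct buffer_head *bh_result, int create)
	{
		/* reject block numbers beyond the f2fs max blocks */
		if (unlikely(iblock >= F2FS_I_SB(inode)->max_file_blocks))
			return -EFBIG;

		return __get_data_block(inode, iblock, bh_result, create,
						F2FS_GET_BLOCK_BMAP);
	}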
Signed-off-by: Yunlei He
Signed-off-by: Xue Liu
[Jaegeuk Kim: fix missing condition, pointed by Chao Yu]
Signed-off-by: Jaegeuk Kim

f2fs: clean up f2fs_ioc_write_checkpoint

Use f2fs_sync_fs to clean up the code in f2fs_ioc_write_checkpoint.

Signed-off-by: Chao Yu
[Jaegeuk Kim: remove unused err variable]
Signed-off-by: Jaegeuk Kim

f2fs: early check broken symlink length in the encrypted case

If a link is broken, its len is zero, and we don't need to move forward.

Signed-off-by: Jaegeuk Kim

f2fs: use i_size_read to get i_size

We need to use i_size_read() to get inode->i_size.

Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/data.c

f2fs: load largest extent all the time

Otherwise, we can get mismatched largest extent information.

One example is:
1. mount f2fs w/ extent_cache
2. make a small extent
3. umount
4. mount f2fs w/o extent_cache
5. update the largest extent
6. umount
7. mount f2fs w/ extent_cache
8. get the old extent made by #2

Signed-off-by: Jaegeuk Kim

f2fs: fix to skip recovering dot dentries in a readonly fs

If the filesystem is readonly, leave a user message instead of
recovering the inline dot inode.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix f2fs_ioc_abort_volatile_write

There are two rules to handle aborting volatile or atomic writes.
1. drop atomic writes - we don't need to keep any stale db data.
2. write journal data - we should keep the journal data with fsync for
   db recovery.

Signed-off-by: Jaegeuk Kim

f2fs: remove f2fs_bug_on in terms of max_depth

There is no report on this bug_on case, but if a malicious attacker
changed this field intentionally, we can just reset it to a MAX value.

Signed-off-by: Jaegeuk Kim

f2fs: write pending bios when cp_error is set

When testing ioc_shutdown, put_super can hang waiting for pages under
writeback, as follows.

INFO: task umount:2723 blocked for more than 120 seconds.
Tainted: G O 4.4.0-rc3+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D ffff88000859f9d8 0 2723 2110 0x00000000
 ffff88000859f9d8 0000000000000000 0000000000000000 ffffffff81e11540
 ffff880078c225c0 ffff8800085a0000 ffff88007fc17440 7fffffffffffffff
 ffffffff818239f0 ffff88000859fb48 ffff88000859f9f0 ffffffff8182310c
Call Trace:
 [] ? bit_wait+0x50/0x50
 [] schedule+0x3c/0x90
 [] schedule_timeout+0x2d9/0x430
 [] ? mark_held_locks+0x6f/0xa0
 [] ? ktime_get+0x7d/0x140
 [] ? bit_wait+0x50/0x50
 [] ? kvm_clock_get_cycles+0x25/0x30
 [] ? ktime_get+0xac/0x140
 [] ? bit_wait+0x50/0x50
 [] io_schedule_timeout+0xa4/0x110
 [] bit_wait_io+0x35/0x50
 [] __wait_on_bit+0x5d/0x90
 [] wait_on_page_bit+0xcb/0xf0
 [] ? autoremove_wake_function+0x40/0x40
 [] truncate_inode_pages_range+0x4bc/0x840
 [] truncate_inode_pages_final+0x4d/0x60
 [] f2fs_evict_inode+0x75/0x400 [f2fs]
 [] evict+0xbc/0x190
 [] iput+0x229/0x2c0
 [] f2fs_put_super+0x105/0x1a0 [f2fs]
 [] generic_shutdown_super+0x6a/0xf0
 [] kill_block_super+0x27/0x70
 [] kill_f2fs_super+0x20/0x30 [f2fs]
 [] deactivate_locked_super+0x43/0x70
 [] deactivate_super+0x5c/0x60
 [] cleanup_mnt+0x3f/0x90
 [] __cleanup_mnt+0x12/0x20
 [] task_work_run+0x73/0xa0
 [] exit_to_usermode_loop+0xcc/0xd0
 [] syscall_return_slowpath+0xcc/0xe0
 [] int_ret_from_sys_call+0x25/0x9f

Signed-off-by: Jaegeuk Kim

f2fs: use IPU for fdatasync

This patch fixes a missing IPU condition when fdatasync is called. With
this patch, fdatasync is able to avoid additional node writes for
recovery.

Signed-off-by: Jaegeuk Kim

f2fs: monitor zombie_tree count

This patch adds an entry to show the number of zombie extent_trees.
Signed-off-by: Jaegeuk Kim

f2fs: introduce zombie list for fast shrinking extent trees

This patch removes the refcount and, instead, adds a zombie_list to
shrink directly without a radix tree traversal.

Signed-off-by: Jaegeuk Kim

f2fs crypto: check CONFIG_F2FS_FS_XATTR for encrypted symlink

Add the missing CONFIG_F2FS_FS_XATTR check for encrypted symlink inodes
in order to avoid unneeded registration of ->{get,set,remove}xattr.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce max_file_blocks in sbi

Introduce max_file_blocks in sbi to store the max block index of a file
in f2fs; it can be used to avoid unneeded calculation of the max block
index at runtime.

Signed-off-by: Chao Yu
[Jaegeuk Kim: fix overflow of sbi->max_file_blocks]
Signed-off-by: Jaegeuk Kim

f2fs: cover more area with nat_tree_lock

There was a subtle bug in nat cache management which incurs wrong nid
allocation or wrong block addresses when try_to_free_nats is triggered
heavily. This patch enlarges the previous coverage of nat_tree_lock to
avoid the data race.

Signed-off-by: Jaegeuk Kim

Revert "f2fs: check the node block address of newly allocated nid"

The original issue is fixed by:
  f2fs: cover more area with nat_tree_lock

This reverts commit 24928634f81b1592e83b37dcd89ed45c28f12feb.

f2fs: read isize while holding i_mutex in fiemap

Make sure the isize we read doesn't change during the process.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: check node id earily when readaheading node page

Add a node id check in ra_node_page and get_node_page_ra, like
get_node_page.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce __get_node_page to reuse common code

There is duplicated code between get_node_page and get_node_page_ra;
introduce __get_node_page to hold the common parts of these two, and
implement get_node_page and get_node_page_ra by reusing
__get_node_page.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/node.c

f2fs: check the page status filled from disk

After reading a page, we need to check whether there is any error.

Signed-off-by: Jaegeuk Kim

f2fs: avoid unnecessary f2fs_balance_fs calls

Only when a node page is newly dirtied does it need to check whether we
need to do f2fs_gc.

Signed-off-by: Jaegeuk Kim

f2fs: remove redundant calls

This patch removes redundant calls.

Signed-off-by: Jaegeuk Kim

f2fs: clean up f2fs_balance_fs

This patch adds one parameter to clean up all the callers of
f2fs_balance_fs.

Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/namei.c

f2fs: recognize encrypted data in f2fs_fiemap

This patch fixes to teach f2fs_fiemap to recognize encrypted data.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use atomic type for node count in extent tree

1. Rename the field in struct extent_tree from count to node_cnt for
   readability.
2. Alter node_cnt to use an atomic type.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: skip releasing nodes in chindless extent tree

If there are no nodes in an extent tree, let's skip the releasing step
to avoid any overhead of grabbing/releasing the extent tree lock.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce time and interval facility

This patch adds time and interval arrays to store some timing variables.

Signed-off-by: Jaegeuk Kim

f2fs: detect idle time depending on user behavior

This patch records the last time the user requested filesystem
operations. This information is used later to detect whether the system
is idle or not.
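A minimal sketch of such a facility (hedged; it mirrors the helpers f2fs
later uses, but names and units may differ by version):

	static inline void f2fs_update_time(struct f2fs_sb_info *sbi, int type)
	{
		sbi->last_time[type] = jiffies;	/* stamp the latest user request */
	}

	static inline bool f2fs_time_over(struct f2fs_sb_info *sbi, int type)
	{
		struct timespec ts = {sbi->interval_time[type], 0};
		unsigned long interval = timespec_to_jiffies(&ts);

		/* idle if no request arrived within the configured interval */
		return time_after(jiffies, sbi->last_time[type] + interval);
	}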
Signed-off-by: Jaegeuk Kim

Conflicts:
	Documentation/ABI/testing/sysfs-fs-f2fs

f2fs: monitor the number of background checkpoint

This patch adds a counter to show the number of background checkpoints.

Signed-off-by: Jaegeuk Kim

f2fs: fix wrong memory condition check

This patch fixes a wrong decision for available_free_memory. The return
value is already set to false, so we should only consider the true
condition below.

Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/node.c

f2fs: should unset atomic flag after successful commit

If there is an error during commit, we should keep the flag in order to
abort it.

Signed-off-by: Jaegeuk Kim

f2fs: relocate is_merged_page

The operations in is_merged_page are related to the inner bio cache, so
move it to data.c.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/segment.c

f2fs: flush dirty nat entries when exceeding threshold

When testing f2fs with xfstests, generic/251 is stuck for a long time.
The case uses the series of steps below to obtain freshly released space
in the device, in order to prepare for the following fstrim test.

1. rm -rf /mnt/dir
2. mkdir /mnt/dir/
3. cp -axT `pwd`/ /mnt/dir/
4. goto 1

During the preparation step, all nat entries will be cached in the nat
cache; most of them are dirty entries with an invalid blkaddr, which
means the nodes related to these entries have been truncated and could
be reused after the dirty entries have been checkpointed.

However, no checkpoint was triggered, so nid allocators (e.g. mkdir,
creat) will run into the long journey of iterating all NAT pages,
looking for free nids in alloc_nid->build_free_nids.

Here, in f2fs_balance_fs_bg we give another chance to do a checkpoint
to flush nat entries, allowing them to be reused in the free nid cache
when the dirty entry count exceeds 10% of the max count.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: export dirty_nats_ratio in sysfs

This patch exports a new sysfs entry 'dirty_nat_ratio' to control the
threshold of dirty nat entries; if the current ratio exceeds the
configured threshold, a checkpoint will be triggered in
f2fs_balance_fs_bg for flushing dirty nats.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	Documentation/ABI/testing/sysfs-fs-f2fs

f2fs: correct search area in get_new_segment

get_new_segment starts from the current segment position and tries to
search for a free segment among its right neighbors located in the same
section. But previously our search area was set as [current segment,
max segment], which means we may have to search many more bits in the
free_segmap bitmap in some worse cases. So here we correct the search
area to [current segment, last segment in section] to avoid unnecessary
searching.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: remove needless condition check

This patch removes a needless condition variable.

Signed-off-by: Jaegeuk Kim

f2fs: use writepages->lock for WB_SYNC_ALL

If there are many writepages calls by multiple threads in the
background, we don't need to serialize them to merge all the bios,
since it's background work. In such a case, it's better to run
writepages concurrently.

Signed-off-by: Jaegeuk Kim

f2fs: fix to overcome inline_data floods

The scenario is:
1. create lots of node blocks
2. sync
3. write lots of inline_data
-> got a panic due to no free space

In that case, we should flush node blocks when writing inline_data in
#3, and trigger gc as well.

Signed-off-by: Jaegeuk Kim

f2fs: do f2fs_balance_fs when block is allocated

We should consider data block allocation as a trigger for
f2fs_balance_fs.
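A minimal sketch of the intent (hedged; placement and condition are
illustrative, not the exact upstream diff):

	/* a new data block consumed free space: give GC a chance to
	 * rebalance; reads and in-place updates skip this path */
	err = __allocate_data_block(&dn);
	if (!err)
		f2fs_balance_fs(sbi, dn.node_changed);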
Signed-off-by: Jaegeuk Kim

f2fs: avoid multiple node page writes due to inline_data

The scenario is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly

So, this patch tries to flush inline_data when flushing node blocks.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: don't need to sync node page at every time

In write_end, we don't need to sync the inode page every time. Instead,
we can expect that f2fs_write_inode will update it later.

Signed-off-by: Jaegeuk Kim

f2fs: avoid needless sync_inode_page when reading inline_data

In write_begin, if there is inline_data, f2fs loads it into the 0'th
data page. Since it's the read path, we don't need to sync its inode
page.

Signed-off-by: Jaegeuk Kim

f2fs: don't need to call set_page_dirty for io error

If end_io gets an error, we don't need to set the page as dirty, since
we already set f2fs_stop_checkpoint, which will not flush any data.

This will resolve the following warning.

======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
4.4.0+ #9 Tainted: G O
------------------------------------------------------
xfs_io/26773 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&sbi->inode_lock[i])->rlock){+.+...}, at: [] update_dirty_page+0x6f/0xd0 [f2fs]

and this task is already holding:
 (&(&q->__queue_lock)->rlock){-.-.-.}, at: [] blk_queue_bio+0x422/0x490
which would create a new lock dependency:
 (&(&q->__queue_lock)->rlock){-.-.-.} -> (&(&sbi->inode_lock[i])->rlock){+.+...}

Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/data.c

f2fs: enhance foreground GC

If we configure sections consisting of multiple segments, foreground GC
will do the garbage collection with the following approach:

	for each segment in victim section
		blk_start_plug
		for each valid block in segment
			write out by OPU method
		submit bio cache   <---
		blk_finish_plug    <---

There are two issues:
1) for most of the time, 'submit bio cache' will break the merging in
   the current bio buffer from writes of the next segments, making
   smaller bio submissions.
2) the block plug only covers IO submission in one segment, which
   reduces the opportunity of merging IOs in the plug across multiple
   segments.

So refactor the code into the structure below to strive for the biggest
opportunity of merging IOs:

	blk_start_plug
	for each segment in victim section
		for each valid block in segment
			write out by OPU method
	submit bio cache
	blk_finish_plug

Test method:
1. mkfs.f2fs -s 8 /dev/sdX
2. touch 32 files
3. write 2M data into each file
4. punch 1.5M data from offset 0 for each file
5. trigger foreground gc through ioctl

Before the patch, there are 40 bios submitted in total.

f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65776, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66016, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66256, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66496, size = 32768
----repeat for 8 times

After the patch, there are 35 bios submitted in total.
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
----repeat 34 times
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 73696, size = 16384

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use wait_for_stable_page to avoid contention

In write_begin, if the storage supports stable_page, we don't need to
wait for writeback to update its contents. This patch introduces the
use of wait_for_stable_page instead of wait_on_page_writeback.

Signed-off-by: Jaegeuk Kim

f2fs: delete unnecessary wait for page writeback

There is no need to wait for inline file page writeback since no one
uses the page, so this patch deletes the unnecessary wait.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim

f2fs: avoid unnecessary search while finding victim in gc

The variable nsearched in get_victim_by_default() indicates the number
of dirty segments we have already checked. There are 2 problems with
the way it is updated:
1. When p.ofs_unit is greater than 1, the victim we find consists of
   multiple segments, possibly more than 1 dirty segment, but nsearched
   always increases by 1.
2. If segments have been found but not chosen, nsearched won't
   increase. So even if we have checked all dirty segments, nsearched
   may still be less than p.max_search.

Both problems can cause unnecessary searching after all dirty segments
have already been checked.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: use wq_has_sleeper for cp_wait wait_queue

We need to use wq_has_sleeper, which includes an smp_mb, to handle
cp_wait concurrency.

Signed-off-by: Jaegeuk Kim

f2fs: reconstruct the code to free an extent_node

There are three steps to free an extent node:
1) list_del_init
2) __detach_extent_node
3) kmem_cache_free

In the path f2fs_destroy_extent_tree, the order to free a node is
1->2->3, but in the path f2fs_update_extent_tree_range, it is 2->1->3.
This patch makes the order consistent: 1->2->3.

This makes sense, since in the next patch we import a victim list in
the path shrink_extent_tree, and we can check whether an extent_node is
in the victim list by checking list_empty(). So it is necessary to put
1) first.

Signed-off-by: Hou Pengyang
Signed-off-by: Jaegeuk Kim

f2fs: move extent_node list operations being coupled with rbtree operation

This patch moves extent_node list operations to be handled together
with its rbtree operations.

Signed-off-by: Jaegeuk Kim

f2fs: don't set cached_en if it will be freed

If en has an empty list pointer, it will be freed soon, so we don't
need to set cached_en with it.

Signed-off-by: Jaegeuk Kim

f2fs: improve shrink performance of extent nodes

In the worst case, we need to scan the whole radix tree and every
rb-tree to free the victimized extent_nodes when shrinking.

Pengyang initially introduced a victim_list to record the victimized
extent_nodes, freeing these extent_nodes by just scanning a list.

Later, Chao Yu enhanced the original patch to improve the memory
footprint by removing the victim list.

The policy of lru list shrinking becomes:
1) lock lru list's lock
2) trylock extent tree's lock
3) remove extent node from lru list
4) unlock lru list's lock
5) do shrink
6) repeat 1) to 5)

Signed-off-by: Hou Pengyang
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: give scheduling point in shrinking path

It needs to give a chance to be rescheduled while shrinking slab
entries.

Signed-off-by: Jaegeuk Kim

f2fs: introduce lifetime write IO statistics

This patch introduces lifetime write IO statistics exposed to the sysfs
interface.
The write IO amount is obtained from the block layer, accumulated in the
file system and stored in the hot node summary of the checkpoint.

Signed-off-by: Shuoran Liu
Signed-off-by: Pengyang Hou
[Jaegeuk Kim: add sysfs documentation]
Signed-off-by: Jaegeuk Kim

f2fs: simplify f2fs_map_blocks

In f2fs_map_blocks, we use duplicated code to handle the first block's
mapping and the following blocks' mapping, which is unnecessary. This
patch simplifies f2fs_map_blocks to avoid the copied code.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: simplify __allocate_data_blocks

This patch uses the existing function f2fs_map_blocks to simplify the
implementation of __allocate_data_blocks.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: remove unneeded pointer conversion

There are redundant pointer conversions in the following call stack:
- at position a, inode is converted to f2fs_file_info.
- at position b, f2fs_file_info is converted back to inode again.

- truncate_blocks(inode,..)
 - fi = F2FS_I(inode)			---a
 - ADDRS_PER_PAGE(node_page, fi)
  - addrs_per_inode(fi)
   - inode = &fi->vfs_inode		---b
   - f2fs_has_inline_xattr(inode)
    - fi = F2FS_I(inode)
     - is_inode_flag_set(fi,..)

In order to avoid the unneeded conversion, alter ADDRS_PER_PAGE and
addrs_per_inode to accept a parameter of inode pointer type.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce get_next_page_offset to speed up SEEK_DATA

When seeking data in ->llseek, if we encounter a big hole which covers
several dnode pages, we will try to seek data from the index of the
page which is the first page of the next dnode page; at most we can
skip searching (ADDRS_PER_BLOCK - 1) pages.

However, it's still not efficient, because if our indirect/
double-indirect pointers are NULL, there are no dnode pages located in
the tree the indirect/double-indirect pointers point to, and it's not
necessary to search the whole region.

This patch introduces get_next_page_offset to calculate the next page
offset based on the current searching level and the max searching level
returned from get_dnode_of_data; with this, we can skip searching the
entire area where the indirect or double-indirect node block does not
exist.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: speed up handling holes in fiemap

This patch makes f2fs_map_blocks support returning the next potential
page offset which skips the hole region in the indirect tree of the
inode, and uses it to speed up fiemap in handling the big hole case.

Test method:
xfs_io -f /mnt/f2fs/file -c "pwrite 1099511627776 4096"
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"

Before:
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
/mnt/f2fs/file:
 EXT: FILE-OFFSET               BLOCK-RANGE   TOTAL      FLAGS
   0: [0..2147483647]:          hole          2147483648
   1: [2147483648..2147483655]: 81920..81927  8          0x1
real    3m3.518s
user    0m0.000s
sys     3m3.456s

After:
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
/mnt/f2fs/file:
 EXT: FILE-OFFSET               BLOCK-RANGE   TOTAL      FLAGS
   0: [0..2147483647]:          hole          2147483648
   1: [2147483648..2147483655]: 81920..81927  8          0x1
real    0m0.008s
user    0m0.000s
sys     0m0.008s

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix endianness of on-disk summary_footer

Signed-off-by: Sheng Yong
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: wait on page's writeback in writepages path

Like f2fs_write_cache_pages, let's do the same for node and meta pages.
Especially for node blocks, we should do this before marking their
fsync and dentry flags.
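A minimal sketch of the placement (hedged; helper names follow
fs/f2fs/node.h, and the surrounding writeback loop is abbreviated):

	lock_page(page);
	/* let any in-flight write on this node page finish first, so the
	 * fsync/dentry marks land on a stable page */
	f2fs_wait_on_page_writeback(page, NODE);
	if (IS_INODE(page))
		set_dentry_mark(page, need_dentry_mark(sbi, ino));
	set_fsync_mark(page, 1);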
Signed-off-by: Jaegeuk Kim

f2fs: flush bios to handle cp_error in put_super

Sometimes, if cp_error is set, there remain pages under writeback,
resulting in a kernel hang in put_super.

Signed-off-by: Jaegeuk Kim

f2fs: fix conflict on page->private usage

This patch fixes a conflict on the page->private value between
f2fs_trace_pid and atomic pages.

Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_submit_merged_bio_cond

f2fs uses a single bio buffer per data type (META/NODE/DATA) for
caching writes located at contiguous block addresses as far as
possible. After submitting, these writes may still be cached in the bio
buffer, so we have to flush cached writes in the bio buffer by calling
f2fs_submit_merged_bio.

Unfortunately, in the scenario of high concurrency, the bio buffer
could be flushed by someone else before we submit it, for the following
reasons:
a) there is no space in the bio buffer.
b) a request of a different type (SYNC, ASYNC) is added.
c) a discontinuous block address is added.

In this condition, f2fs_submit_merged_bio will be devastating, because
it can break the following merging of writes in the bio buffer,
splitting one big bio into two smaller ones.

This patch introduces f2fs_submit_merged_bio_cond which can do a
conditional submit with the bio buffer; before submitting, it judges
whether:
 - a page in the DATA type bio buffer matches the specified page;
 - a page in the DATA type bio buffer belongs to the specified inode;
 - a page in the NODE type bio buffer belongs to the specified inode;
If there is no eligible page in the bio buffer, we skip the submitting
step, gaining more chances to merge consecutive block IOs in the bio
cache.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix missing skip pages info

Fix the missing skipped pages info in the f2fs_writepages trace event.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim

f2fs: move dio preallocation into f2fs_file_write_iter

This patch moves the preallocation code for direct IOs into
f2fs_file_write_iter.

Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/data.c
	fs/f2fs/file.c

f2fs: preallocate blocks for buffered aio writes

This patch preallocates data blocks for buffered aio writes. With this
patch, we can avoid redundant locking and unlocking of node pages given
consecutive aio requests.

[For 3.10]
 - Add preallocating for generic_splice_write (sendfile) for
   xfstests/249, 285

Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/data.c

f2fs: increase i_size to avoid missing data

While finsert is dirtying pages, we should increase i_size right away.
Otherwise, the moved page can be dropped by the following
filemap_write_and_wait_range before i_size is updated. Especially, it
can be done by

	if ((page->index >= end_index + 1) || !offset)
		goto out;

in f2fs_write_data_page.

This should resolve the below xfstests/091 failure reported by Dave.
$ diff -u tests/generic/091.out /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad
--- tests/generic/091.out	2014-01-20 16:57:33.000000000 +1100
+++ /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad	2016-02-08 15:21:02.701375087 +1100
@@ -1,7 +1,18 @@
 QA output created by 091
 fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
+mapped writes DISABLED
+skipping insert range behind EOF
+skipping insert range behind EOF
+truncating to largest ever: 0x11e00
+dowrite: write: Invalid argument
+LOG DUMP (7 total operations):
+1( 1 mod 256): SKIPPED (no operation)
+2( 2 mod 256): SKIPPED (no operation)
+3( 3 mod 256): FALLOC 0x2e0f2 thru 0x3134a (0x3258 bytes) PAST_EOF
+4( 4 mod 256): SKIPPED (no operation)
+5( 5 mod 256): SKIPPED (no operation)
+6( 6 mod 256): TRUNCATE UP from 0x0 to 0x11e00
+7( 7 mod 256): WRITE 0x73400 thru 0x79fff (0x6c00 bytes) HOLE
+Log of operations saved to "/mnt/test/junk.fsxops"; replay with --replay-ops
+Correct content saved for comparison
+(maybe hexdump "/mnt/test/junk" vs "/mnt/test/junk.fsxgood")

Reported-by: Dave Chinner
Signed-off-by: Jaegeuk Kim

f2fs crypto: replace some BUG_ON()'s with error checks

This patch adopts:
  ext4 crypto: replace some BUG_ON()'s with error checks

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/crypto_key.c

f2fs crypto: fix spelling typo in comment

This patch adopts:
  ext4 crypto: fix spelling typo in comment

Signed-off-by: Laurent Navet
Signed-off-by: Jaegeuk Kim

f2fs crypto: f2fs_page_crypto() doesn't need a encryption context

This patch adopts:
  ext4 crypto: ext4_page_crypto() doesn't need a encryption context

Since ext4_page_crypto() doesn't need an encryption context (at least
not any more), this allows us to simplify a number of function
signatures and also allows us to avoid needing to allocate a context in
ext4_block_write_begin(). It also means we no longer need a separate
ext4_decrypt_one() function.

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

f2fs crypto: check for too-short encrypted file names

This patch adopts:
  ext4 crypto: check for too-short encrypted file names

An encrypted file name should never be shorter than 16 bytes, the AES
block size. The 3.10 crypto layer will oops and crash the kernel if
ciphertext shorter than the block size is passed to it.

Fortunately, in modern kernels the crypto layer will not crash the
kernel in this scenario, but nevertheless it represents a corrupted
directory, and we should detect it and mark the file system as
corrupted so that e2fsck can fix this.

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

f2fs crypto: add missing locking for keyring_key access

This patch adopts:
  ext4 crypto: add missing locking for keyring_key access

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/crypto_key.c

f2fs: use correct errno

This patch is to fix a misused error number.

Signed-off-by: Jaegeuk Kim

f2fs: avoid garbage lenghs in dentries

This patch fixes the code to eliminate garbage name lengths in dentries
in order to provide correct answers from readdir. For example, if a
valid dentry consists of:

	bitmap : 1 1 1 1
	len    : 32 0 x 0,

readdir can start with the second bit_pos having len = 0.
Or, it can start with the third bit_pos having garbage. In both cases,
we should avoid trying to fill dentries. So, this patch not only
removes any garbage length, but also avoids entering the zero-length
case in readdir.

Signed-off-by: Jaegeuk Kim

f2fs: split drop_inmem_pages from commit_inmem_pages

Split drop_inmem_pages from commit_inmem_pages for code readability,
and to prepare for the following modification.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: support revoking atomic written pages

f2fs supports atomic writes with the following semantics:
1. open db file
2. ioctl start atomic write
3. (write db file) * n
4. ioctl commit atomic write
5. close db file

With this flow we can avoid the file becoming corrupted on an abnormal
power cut, because we hold the data of the transaction in referenced
pages linked in the inmem_pages list of the inode, without setting them
dirty, so this data won't be persisted unless we commit it in step 4.

But we should still hold the journal db file in memory by using
volatile writes, because our semantics of 'atomic write support' are
incomplete: in step 4, we could fail to submit all dirty data of the
transaction, and once partial dirty data was committed to storage, then
after a checkpoint and an abnormal power cut, the db file will be
corrupted forever.

So this patch tries to improve the atomic write flow by adding a
revoking flow: once an inner error occurs during committing, it gives
another chance to try to revoke the partially submitted data of the
current transaction, which makes the commit operation more atomic.

If we're not lucky and the revoke operation fails, EAGAIN will be
reported to the user, suggesting to do the recovery with the held
journal file, or to retry the current transaction again.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs crypto: make sure the encryption info is initialized on opendir(2)

This patch syncs f2fs with commit 6bc445e0ff44 ("ext4 crypto: make sure
the encryption info is initialized on opendir(2)") from ext4.

Signed-off-by: Theodore Ts'o
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs crypto: handle unexpected lack of encryption keys

This patch syncs f2fs with commit abdd438b26b4 ("ext4 crypto: handle
unexpected lack of encryption keys") from ext4.

Fix up attempts by users to try to write to a file when they don't have
access to the encryption key.

Signed-off-by: Theodore Ts'o
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs crypto: avoid unneeded memory allocation when {en/de}crypting symlink

This patch adapts f2fs to the ext4 code; it removes unneeded memory
allocation in the creating/accessing paths of symlinks.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/crypto_fname.c

f2fs: introduce f2fs_journal struct to wrap journal info

Introduce a new structure f2fs_journal to wrap the journal info in
struct f2fs_summary_block for readability.

struct f2fs_journal {
	union {
		__le16 n_nats;
		__le16 n_sits;
	};
	union {
		struct nat_journal nat_j;
		struct sit_journal sit_j;
		struct f2fs_extra_info info;
	};
} __packed;

struct f2fs_summary_block {
	struct f2fs_summary entries[ENTRIES_IN_SUM];
	struct f2fs_journal journal;
	struct summary_footer footer;
} __packed;

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: enhance IO path with block plug

Try to use the block plug in more places as below, to let the process
cache bios as much as possible, in order to reduce the lock overhead of
the queue in the IO scheduler.
1) sync_meta_pages
2) ra_meta_pages
3) f2fs_balance_fs_bg

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: split journal cache from curseg cache

In the curseg cache, f2fs caches two different parts:
 - data of the current summary block, i.e. summary entries, footer info.
 - journal info, i.e. sparse nat/sit entries or io stat info.

With this approach,
1) it may cause higher lock contention when we access or update both
   parts of the cache, since we use the same mutex lock curseg_mutex to
   protect the cache.
2) the current summary block with the last journal info will be written
   back to the device as a normal summary block when flushing; however,
   we treat journal info as valid only in the current summary, so most
   normal summary blocks contain junk journal data, wasting the
   remaining space of the summary block.

So, in order to fix the above issues, we split the curseg cache into
two parts:
a) the current summary block, protected by the original mutex lock
   curseg_mutex
b) the journal cache, protected by a newly introduced r/w semaphore
   journal_rwsem

When loading the curseg cache during ->mount, we store summary info and
journal info in different caches; when doing a checkpoint, we combine
the data of the two caches into the current summary block for
persisting.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: reorder nat cache lock in cache_nat_entry

When looking up a nat entry in cache_nat_entry, if we fail to hit the
nat cache, we try to load nat entries a) from the journal of the
current segment cache or b) from NAT pages for updating. During the
process, the write lock of nat_tree_lock will be held to avoid
inconsistency between the nid cache and the nat cache caused by racing
among the nat entry shrinker, checkpointer, and nat entry updater.

But this way may be inefficient when updating the nat cache, because it
serializes access to the journal cache and the reading of NAT pages.

Here, we reorder the lock and update flow as below to enhance access
concurrency:

 - get_node_info
  - down_read(nat_tree_lock)
  - lookup nat cache --- hit -> unlock & return
  - lookup journal cache --- hit -> unlock & goto update
  - up_read(nat_tree_lock)
update:
  - down_write(nat_tree_lock)
  - cache_nat_entry
   - lookup nat cache --- nohit -> update
  - up_write(nat_tree_lock)

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: slightly reorganize read_raw_super_block

read_raw_super_block was introduced to help find the first valid
superblock. Commit da554e48caab ("f2fs: recovering broken superblock
during mount") changed the behaviour to read both of them and check
whether the recovery flag is needed or not. So the comment before this
function isn't consistent with what it actually does. Also, the
original code uses two labels to handle the error cases, which isn't
very readable. So this patch amends the comment and slightly
reorganizes the code.

Signed-off-by: Shawn Lin
Signed-off-by: Jaegeuk Kim

f2fs: move sanity checking of cp into get_valid_checkpoint

From the function name of get_valid_checkpoint, it seems to return a
valid cp or NULL for the caller to check. If no valid one is found,
f2fs_fill_super will print an error log. But if get_valid_checkpoint
returns one that looks valid (the return value indicates that it's
valid, while it is actually invalid after sanity checking), another
similar error log is printed. That seems strange; let's keep the sanity
checking inside the procedure of getting a valid cp.

Another improvement gained from this move is that, even when a large
volume is supported, we check the cp in advance and can skip the
following procedure if the sanity checking fails.
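A plausible sketch of the tail of get_valid_checkpoint() after the move
(hedged; the label name and surrounding variables are illustrative):

	/* copy the winning cp block into sbi->ckpt, then validate it
	 * before declaring the mount-time checkpoint usable */
	memcpy(sbi->ckpt, cp_block, blk_size);
	if (sanity_check_ckpt(sbi))
		goto free_fail_no_cp;	/* fail the mount early */
	return 0;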
Signed-off-by: Shawn Lin
Signed-off-by: Jaegeuk Kim

f2fs: detect error of update_dent_inode in ->rename

We should check and show the correct return value of update_dent_inode
in ->rename.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to delete old dirent in converted inline directory in ->rename

When running fstests/generic/068 on an inline_dentry-enabled f2fs, the
following oops dmesg is reported:

------------[ cut here ]------------
WARNING: CPU: 5 PID: 11841 at fs/inode.c:273 drop_nlink+0x49/0x50()
Modules linked in: f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
CPU: 5 PID: 11841 Comm: fsstress Tainted: G O 4.5.0-rc1 #45
Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013
 0000000000000111 ffff88009cdf7ae8 ffffffff813e5944 0000000000002e41
 0000000000000000 0000000000000111 0000000000000000 ffff88009cdf7b28
 ffffffff8106a587 ffff88009cdf7b58 ffff8804078fe180 ffff880374a64e00
Call Trace:
 [] dump_stack+0x48/0x64
 [] warn_slowpath_common+0x97/0xe0
 [] warn_slowpath_null+0x1a/0x20
 [] drop_nlink+0x49/0x50
 [] f2fs_rename2+0xe04/0x10c0 [f2fs]
 [] ? lock_two_nondirectories+0x81/0x90
 [] ? lockref_get+0x1d/0x30
 [] vfs_rename+0x2e0/0x640
 [] ? lookup_dcache+0x3b/0xd0
 [] ? update_fast_ctr+0x21/0x40
 [] ? security_path_rename+0xa2/0xd0
 [] SYSC_renameat2+0x4b6/0x540
 [] ? trace_hardirqs_off+0xd/0x10
 [] ? exit_to_usermode_loop+0x7a/0xd0
 [] ? int_ret_from_sys_call+0x52/0x9f
 [] ? trace_hardirqs_on_caller+0x100/0x1c0
 [] SyS_renameat2+0xe/0x10
 [] SyS_rename+0x1e/0x20
 [] entry_SYSCALL_64_fastpath+0x12/0x6f
---[ end trace 2b31e17995404e42 ]---

This is because: in the same inline directory, when we rename one file
from a source name to a target name which does not exist, once the
space of the inline dentry is not enough, inline conversion will be
triggered; after that, all data in the inline dentry will be moved to a
normal dentry page.

After attaching the new entry in the converted dentry page, we still
try to remove the old entry in the original inline dentry; since the
old entry has been moved, this obviously has no effect, and the old
entry remains in the converted dentry page.

Now we have two valid dentries pointing to the same inode, which has an
nlink value of 1; deleting them both triggers the above warning.

This issue can be reproduced easily with the following steps:
1. mount f2fs with the inline_dentry option
2. mkdir dir
3. touch 180 files named [001-180] in dir
4. rename dir/180 dir/181
5. rm dir/180 dir/181

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: reuse read_inline_data for f2fs_convert_inline_page

f2fs_convert_inline_page duplicates what read_inline_data already does
for copying out the inline data from the inode_page. We can use
read_inline_data instead to simplify the code.

Signed-off-by: Shawn Lin
Signed-off-by: Jaegeuk Kim

f2fs: remain last victim segment number ascending order

This patch avoids retaining an inefficient victim segment number
selected by a victim. For example, if all the dirty segments have the
same number of valid blocks, we can get the victim segments in
descending order due to keeping a wrong last segment number.

Signed-off-by: Jaegeuk Kim

f2fs: fix the wrong stat count of calling gc

With a partition formatted with multiple segments per section, we
counted the number of gc operations incorrectly.
e.g., for a partition with segs_per_sec = 4:

cat /sys/kernel/debug/f2fs/status
GC calls: 208 (BG: 7)
  - data segments : 104 (52)
  - node segments : 104 (24)

The GC call count should be (104 (data segs) + 104 (node segs)) / 4 =
52, rather than 208. Fix it.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: show more info about superblock recovery

This patch changes the code to show more info in the message log about
the recovery of a corrupted superblock during ->mount, e.g. the index
of the corrupted superblock and the result of the recovery.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: try to flush inode after merging inline data

When flushing node pages, if the current node page is an inline inode
page, we will try to merge inline data from the data page into the
inline inode page, then skip flushing the current node page; this
decreases the number of nodes to be flushed in a batch in this round,
which may lead to worse performance. This patch gives a chance to flush
just-merged inline inode pages for performance.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: trace old block address for CoWed page

This patch enables tracing the old block address of a CoWed page for
better debugging.

f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f0, oldaddr = 0xfe8ab, newaddr = 0xfee90 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f8, oldaddr = 0xfe8b0, newaddr = 0xfee91 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4fa, oldaddr = 0xfe8ae, newaddr = 0xfee92 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x96, oldaddr = 0xf049b, newaddr = 0x2bbe rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x97, oldaddr = 0xf049c, newaddr = 0x2bbf rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x98, oldaddr = 0xf049d, newaddr = 0x2bc0 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x47, oldaddr = 0xffffffff, newaddr = 0xf2631 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x48, oldaddr = 0xffffffff, newaddr = 0xf2632 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x49, oldaddr = 0xffffffff, newaddr = 0xf2633 rw = WRITE, type = DATA

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid hungtask problem caused by losing wake_up

The D state of wait_on_all_pages_writeback should be woken by the
function f2fs_write_end_io when all writeback pages have been
successfully written to the device. It's possible that the wake_up
comes between get_pages and io_schedule; in this case it will lose the
wake_up and remain in D state even if all pages have been written back
to the device, and finally the whole system will enter the hungtask
state.

	if (!get_pages(sbi, F2FS_WRITEBACK))
		break;
					<--------- wake_up
	io_schedule();

Signed-off-by: Yunlei He
Signed-off-by: Biao He
Signed-off-by: Jaegeuk Kim

f2fs: fix incorrect upper bound when iterating inode mapping tree

1. The inode mapping tree can index pages in the range [0, ULONG_MAX];
   however, in some places, f2fs only searches or iterates pages in the
   range [0, LONG_MAX], resulting in missed hits in the page cache.
2. filemap_fdatawait_range accepts range parameters in units of bytes,
   so the max range it covers should be [0, LLONG_MAX]; if we use
   [0, LONG_MAX] as the range for waiting on writeback, a big number of
   pages will not be covered.

This patch corrects the above two issues.
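A minimal sketch of the corrected bounds (hedged; the actual touched
call sites vary):

	/* page indexes span [0, ULONG_MAX] ... */
	pgoff_t end = ULONG_MAX;			/* was LONG_MAX */

	/* ... while byte offsets span [0, LLONG_MAX] */
	filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);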
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs crypto: fix incorrect positioning for GCing encrypted data page

For now, the flow of GCing an encrypted data page is:
1) try to grab a meta page in the meta inode's mapping with the index of the old block address of that data page
2) load the ciphertext data into the meta page
3) allocate a new block address
4) write the meta page to the new block address
5) update the block address pointer in the direct node page.

Other readers/writers use f2fs_wait_on_encrypted_page_writeback to check and wait on GCed encrypted data cached in meta pages being written back, in order to avoid inconsistency among the data page cache, the meta page cache and the on-disk data when updating. However, we will use the new block address updated in step 5) as an index to look up the meta page in the inner bio buffer. That is wrong, and we will never find the GCing meta page, since we used the old block address as the index of that page in step 1).

This patch fixes the issue by swapping the order of step 1) and step 3), and in step 1) grabbing the page with the index generated in step 3).

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/gc.c

f2fs: introduce f2fs_update_data_blkaddr for cleanup

Add a new helper f2fs_update_data_blkaddr to clean up redundant code.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_flush_merged_bios for cleanup

Add a new helper f2fs_flush_merged_bios to clean up redundant code.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid deadlock when merging inline data

When testing with fsstress, a kworker and a user thread were both blocked:

INFO: task kworker/u16:1:16580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1 D ffff8803f2595390 0 16580 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-251:0)
ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440
ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440
ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000
Call Trace:
[] schedule+0x29/0x70
[] rwsem_down_read_failed+0xa5/0xf9
[] call_rwsem_down_read_failed+0x14/0x30
[] f2fs_write_data_page+0x31b/0x420 [f2fs]
[] __f2fs_writepage+0x1a/0x50 [f2fs]
[] f2fs_write_data_pages+0xe0/0x290 [f2fs]
[] do_writepages+0x23/0x40
[] __writeback_single_inode+0x4e/0x250
[] writeback_sb_inodes+0x2c1/0x470
[] __writeback_inodes_wb+0x9e/0xd0
[] wb_writeback+0x1fb/0x2d0
[] wb_do_writeback+0x9c/0x220
[] bdi_writeback_workfn+0x72/0x1c0
[] process_one_work+0x1de/0x5b0
[] worker_thread+0x11f/0x3e0
[] kthread+0xde/0xf0
[] ret_from_fork+0x58/0x90

fsstress thread stack:
[] sleep_on_page+0xe/0x20
[] __lock_page+0x67/0x70
[] find_lock_page+0x50/0x80
[] find_or_create_page+0x3f/0xb0
[] sync_node_pages+0x259/0x810 [f2fs]
[] write_checkpoint+0x1a4/0xce0 [f2fs]
[] f2fs_sync_fs+0x7c/0xd0 [f2fs]
[] f2fs_sync_file+0x143/0x5f0 [f2fs]
[] vfs_fsync_range+0x2b/0x40
[] vfs_fsync+0x1c/0x20
[] do_fsync+0x41/0x70
[] SyS_fdatasync+0x13/0x20
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

The reason for this issue is:

CPU0:
- f2fs_write_data_pages
 - lock_page(page)
 - f2fs_write_data_page
  - f2fs_lock_op
   - down_read(sbi->cp_rwsem)

CPU1:
- f2fs_sync_fs
 - write_checkpoint
  - block_operations
   - f2fs_lock_all
    - down_write(sbi->cp_rwsem)
   - sync_node_pages
    - flush_inline_data
     - pagecache_get_page(page, GFP_LOCK)

This patch changes flush_inline_data to use trylock_page to fix this ABBA deadlock issue.
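A minimal sketch of the trylock idea under stated assumptions (the helper below is illustrative, not the verbatim patch):

	#include <linux/pagemap.h>

	/*
	 * Hedged sketch: back off with trylock_page() instead of sleeping in
	 * lock_page(), so a writer that already holds the page lock while
	 * waiting on cp_rwsem cannot deadlock with the checkpoint path.
	 */
	static void flush_inline_data_sketch(struct address_space *mapping,
					     pgoff_t index)
	{
		struct page *page = find_get_page(mapping, index);

		if (!page)
			return;
		if (trylock_page(page)) {
			/* ... merge inline data while the page is locked ... */
			unlock_page(page);
		}
		put_page(page);
	}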
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: recovery missing dot dentries in root directory

If f2fs was corrupted with missing dot dentries in the root directory, it needs to recover them. fsck.f2fs sets the F2FS_INLINE_DOTS flag in the directory inode when it detects missing dot dentries.

Signed-off-by: Xue Liu
Signed-off-by: Yong Sheng
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: mutex can't be used by down_write_nest_lock()

f2fs_lock_all() calls down_write_nest_lock() to acquire a rw_sem and check a mutex, but down_write_nest_lock() is designed for two rw_sems, according to the comment in include/linux/rwsem.h. And, other than f2fs, it is only called in mm/mmap.c, with two rw_sems. So it looks like it is used wrongly by f2fs. It also causes the below compile warning on -rt kernels:

In file included from fs/f2fs/xattr.c:25:0:
fs/f2fs/f2fs.h: In function 'f2fs_lock_all':
fs/f2fs/f2fs.h:962:34: warning: passing argument 2 of 'down_write_nest_lock' from incompatible pointer type [-Wincompatible-pointer-types]
 f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex);
 ^
fs/f2fs/f2fs.h:27:55: note: in definition of macro 'f2fs_down_write'
 #define f2fs_down_write(x, y) down_write_nest_lock(x, y)
 ^
In file included from include/linux/rwsem.h:22:0, from fs/f2fs/xattr.c:21:
include/linux/rwsem_rt.h:138:20: note: expected 'struct rw_semaphore *' but argument is of type 'struct mutex *'
 static inline void down_write_nest_lock(struct rw_semaphore *sem,

Signed-off-by: Yang Shi
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

fs crypto: move per-file encryption from f2fs tree to fs/crypto

This patch adds the renamed functions moved from the f2fs crypto files.

[Backporting to 3.10]
- Removed d_is_negative() in fscrypt_d_revalidate().

1. definitions for per-file encryption used by ext4 and f2fs.
2. crypto.c for encrypt/decrypt functions
 a. IO preparation:
  - fscrypt_get_ctx / fscrypt_release_ctx
 b. before IOs:
  - fscrypt_encrypt_page
  - fscrypt_decrypt_page
  - fscrypt_zeroout_range
 c. after IOs:
  - fscrypt_decrypt_bio_pages
  - fscrypt_pullback_bio_page
  - fscrypt_restore_control_page
3. policy.c supporting context management.
 a. For ioctls:
  - fscrypt_process_policy
  - fscrypt_get_policy
 b. For context permission
  - fscrypt_has_permitted_context
  - fscrypt_inherit_context
4. keyinfo.c to handle permissions
 - fscrypt_get_encryption_info
 - fscrypt_free_encryption_info
5. fname.c to support filename encryption
 a. general wrapper functions
  - fscrypt_fname_disk_to_usr
  - fscrypt_fname_usr_to_disk
  - fscrypt_setup_filename
  - fscrypt_free_filename
 b. specific filename handling functions
  - fscrypt_fname_alloc_buffer
  - fscrypt_fname_free_buffer
6. Makefile and Kconfig

Cc: Al Viro
Signed-off-by: Michael Halcrow
Signed-off-by: Ildar Muslukhov
Signed-off-by: Uday Savagaonkar
Signed-off-by: Theodore Ts'o
Signed-off-by: Arnd Bergmann
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/f2fs.h fs/f2fs/inode.c fs/f2fs/super.c include/linux/fs.h

f2fs: define not-set fallocate flags

This fixes build bugs in AOSP.

Signed-off-by: Jaegeuk Kim

f2fs crypto: sync ext4_lookup and ext4_file_open

This patch tries to catch up with the lookup and open policies in ext4.

Signed-off-by: Jaegeuk Kim

f2fs: modify the readahead method in ra_node_page()

ra_node_page() is used to read ahead one node page. Compared to a regular read, it's faster because it doesn't wait for IO completion.
But if it is called twice for reading the same block, and the IO request from the first call hasn't completed before the second call, the second call will have to wait until the read is over.

Here we use the approach of __do_page_cache_readahead() to solve this problem: it does nothing when someone else has already put the page in the mapping, and the status of the page should be assured by whoever put it there. This implementation also avoids altering the page reference count.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: use cryptoapi crc32 functions

The crc function is done bit by bit. Optimize this by using the cryptoapi crc32 function, which is backed by h/w acceleration.

Signed-off-by: Keith Mok
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/f2fs.h

f2fs: declare static functions

Just to avoid sparse warnings.

Signed-off-by: Jaegeuk Kim

f2fs: clean up opened code with f2fs_update_dentry

Just clean up open-coded logic with the existing function; no logic change.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid unneeded unlock_new_inode

During ->lookup, the I_NEW state of the inode has already been cleared in f2fs_iget, so in the error path we don't need to clear it again.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: add missing argument to f2fs_setxattr stub

The f2fs_setxattr() prototype for CONFIG_F2FS_FS_XATTR=n has been wrong for a long time, since 8ae8f1627f39 ("f2fs: support xattr security labels"), but there have never been any callers, so it did not matter. Now, the function gets called from f2fs_ioc_keyctl(), which causes a build failure:

fs/f2fs/file.c: In function 'f2fs_ioc_keyctl':
include/linux/stddef.h:7:14: error: passing argument 6 of 'f2fs_setxattr' makes integer from pointer without a cast [-Werror=int-conversion]
 #define NULL ((void *)0)
 ^
fs/f2fs/file.c:1599:27: note: in expansion of macro 'NULL'
 value, F2FS_KEY_SIZE, NULL, type);
 ^
In file included from ../fs/f2fs/file.c:29:0:
fs/f2fs/xattr.h:129:19: note: expected 'int' but argument is of type 'void *'
 static inline int f2fs_setxattr(struct inode *inode, int index,
 ^
fs/f2fs/file.c:1597:9: error: too many arguments to function 'f2fs_setxattr'
 return f2fs_setxattr(inode, F2FS_XATTR_INDEX_KEY,
 ^
In file included from ../fs/f2fs/file.c:29:0:
fs/f2fs/xattr.h:129:19: note: declared here
 static inline int f2fs_setxattr(struct inode *inode, int index,

This changes the prototype of the empty stub function to match that of the actual implementation. This will not make key management work when F2FS_FS_XATTR is disabled, but it gets it to build at least.

Signed-off-by: Arnd Bergmann
Signed-off-by: Jaegeuk Kim

f2fs: submit node page write bios when really required

If many threads call fsync with data writes, we don't need to flush every bio having node page writes. f2fs_wait_on_page_writeback will flush its bios when the page is really needed.

Signed-off-by: Jaegeuk Kim

f2fs/crypto: fix xts_tweak initialization

Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_" to just "FS" for preprocessor symbols). Because of the symbol renaming, it's a bit hard to see it as a file move: use git show -M30 0b81d07790726 to lower the rename detection to just 30% similarity and make git show the files as renamed (the header file won't be shown as a rename even then - since all it contains is symbol definitions, it looks almost completely different).
Even with the renames showing as renames, the diffs are not all that easy to read, since so much is just the renames. But Eric Biggers noticed that it's not just all renames: the initialization of the xts_tweak had been broken too, using the inode number rather than the page offset. That's not right - it makes the xts_tweak the same for all pages of each inode. It _might_ make sense to make the xts_tweak contain both the offset _and_ the inode number, but not just the inode number.

Reported-by: Eric Biggers
Cc: Jaegeuk Kim
Signed-off-by: Linus Torvalds

f2fs: cover large section in sanity check of super

This patch fixes the bug that the large-section case is not covered when checking the sanity of the superblock. If f2fs detects misalignment, it will fix the superblock during mount time, so it doesn't need to trigger fsck.f2fs further.

Reported-by: Matthias Prager
Reported-by: David Gnedt
Cc: stable 4.5+
Signed-off-by: Jaegeuk Kim

f2fs crypto: fix corrupted symlink in encrypted case

In the encrypted symlink case, we should check for a corrupted symname after decrypting it. Otherwise, we can report -ENOENT incorrectly if the encrypted symname starts with '\0'.

Cc: stable 4.5+
Signed-off-by: Jaegeuk Kim

f2fs: retrieve IO write stat from the right place

In the following patch, "f2fs: split journal cache from curseg cache", the journal cache is split from the curseg cache. So IO write statistics should be retrieved from the journal cache, not from curseg->sum_blk. Otherwise, it will read 0, and the stat is lost.

Signed-off-by: Shuoran Liu
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

fscrypto: use dget_parent() in fscrypt_d_revalidate()

This patch updates fscrypto along with the below ext4 crypto change.

Fixes: 3d43bcfef5f0 ("ext4 crypto: use dget_parent() in ext4_d_revalidate()")
Cc: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

Conflicts: fs/crypto/crypto.c

f2fs: use dget_parent and file_dentry in f2fs_file_open

This patch syncs with the below two ext4 crypto fixes. In 4.6-rc1, f2fs newly introduced accessing f_path.dentry, which crashes overlayfs. To fix this, we now need to use file_dentry() to access that field.

[Backport NOTE]
- Over 4.2, it should use file_dentry

Fixes: c0a37d487884 ("ext4: use file_dentry()")
Fixes: 9dd78d8c9a7b ("ext4: use dget_parent() in ext4_file_open()")
Cc: Miklos Szeredi
Cc: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

fscrypto: don't let data integrity writebacks fail with ENOMEM

This patch fixes the issue introduced by the ext4 crypto fix in the same manner. For f2fs, however, we flush the pending IOs and wait for a while to acquire free memory.

Fixes: c9af28fdd4492 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

Conflicts: fs/crypto/crypto.c

ext4/fscrypto: avoid RCU lookup in d_revalidate

As Al pointed out, d_revalidate should return from an RCU lookup before using d_inode. This was originally introduced by commit 34286d666230 ("fs: rcu-walk aware d_revalidate method").

Reported-by: Al Viro
Signed-off-by: Jaegeuk Kim
Cc: Theodore Ts'o
Cc: stable

Conflicts: fs/ext4/crypto.c

Revert "f2fs: use cryptoapi crc32 functions"

This reverts commit fe125885875e39a693c9152d5eb5fe1d40ed34ea.

f2fs: give RO message when recovering superblock

When one of the superblocks is missing, f2fs recovers it with the valid one. But even if f2fs is mounted as RO, we'd better notify that too.
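A hedged sketch of what such a notification could look like (the message text and the exact check are assumptions; f2fs_msg is the existing log helper in this tree):

	/*
	 * Illustrative only: note the recovery in the log even when the RO
	 * mount prevents writing the recovered superblock back right away.
	 */
	if (sb->s_flags & MS_RDONLY)
		f2fs_msg(sb, KERN_INFO,
			 "superblock recovered in memory, but the mount is "
			 "read-only, so it cannot be written back yet");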
Signed-off-by: Jaegeuk Kim

f2fs: recover superblock at RW remounts

This patch adds an sbi flag, SBI_NEED_SB_WRITE, which indicates that the superblock needs to be recovered when (re)mounting as RW. This is set only when f2fs is mounted as RO.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/super.c

f2fs: give -EINVAL for norecovery and rw mount

Once something to recover is detected, f2fs should stop mounting, given the norecovery and rw mount options.

Signed-off-by: Jaegeuk Kim

f2fs: treat as a normal umount when remounting ro

When the user remounts f2fs as read-only, we can mark the checkpoint as umount.

Signed-off-by: Jaegeuk Kim

f2fs: show current mount status

This patch reports the current mount status in the f2fs status info.

Signed-off-by: Jaegeuk Kim

f2fs: fix to convert inline directory correctly

With the below series of steps, we will lose parts of dirents:
1) mount f2fs with inline_dentry option
2) echo 1 > /sys/fs/f2fs/sdX/dir_level
3) mkdir dir
4) touch 180 files named [1-180] in dir
5) touch 181 in dir
6) echo 3 > /proc/sys/vm/drop_caches
7) ll dir

ls: cannot access 2: No such file or directory
ls: cannot access 4: No such file or directory
ls: cannot access 5: No such file or directory
ls: cannot access 6: No such file or directory
ls: cannot access 8: No such file or directory
ls: cannot access 9: No such file or directory
...
total 360
drwxr-xr-x 2 root root 4096 Feb 19 15:12 ./
drwxr-xr-x 3 root root 4096 Feb 19 15:11 ../
-rw-r--r-- 1 root root 0 Feb 19 15:12 1
-rw-r--r-- 1 root root 0 Feb 19 15:12 10
-rw-r--r-- 1 root root 0 Feb 19 15:12 100
-????????? ? ? ? ? ? 101
-????????? ? ? ? ? ? 102
-????????? ? ? ? ? ? 103
...

The reason is: when doing the inline dir conversion, we didn't consider that a directory has a hierarchical hash structure which can be configured through the sysfs interface 'dir_level'. By default, the dir_level of a directory inode is 0, meaning we have one bucket in the hash table located in the first level, and all dirents are hashed into this bucket; so there is no problem simply duplicating between the inline dentry page and the converted normal dentry page. However, if we configure dir_level with a value N (greater than 0), it expands the bucket number of the first-level hash table by 2^N - 1 and hashes dirents into different buckets according to their hash values. If we still move all dirents to the first bucket, inline dirents end up located incorrectly; the result is that although we can iterate all dirents through ->readdir, we can't stat some of them in ->lookup, which is based on hash table searching.

This patch fixes this issue by rehashing dirents into the correct positions when converting an inline directory.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/dir.c fs/f2fs/f2fs.h

f2fs: add BUG_ON to avoid unnecessary flow

This patch adds BUG_ON instead of a retrying loop. In the case of node pages, we already got this inode page, but unlocked it. Given that we don't truncate any node pages in these operations, the page's mapping should be unchangeable.

Signed-off-by: Jaegeuk Kim

f2fs: fix dropping inmemory pages in a wrong time

When one reader closes its file while another writer is doing atomic writes, f2fs_release_file drops the atomic data, resulting in an empty commit. This patch fixes this wrong commit problem by checking the openness of the file.
Process0                     Process1
- open file
- start atomic write
- write data
                             - read data
                             - close file
                             - f2fs_release_file()
                             - clear atomic data
- commit atomic write

Reported-by: Miao Xie
Signed-off-by: Jaegeuk Kim

f2fs: unset atomic/volatile flag in f2fs_release_file

An atomic/volatile operation should be done as a pair of start and commit ioctls. For example, if a killed process leaves an open-ended atomic operation, we should drop its flag as well as its atomic data. Otherwise, if sqlite initiates another operation which doesn't require atomic writes, it will lose all its data, since f2fs still treats the writes as atomic; nobody will trigger their commit.

Reported-by: Miao Xie
Signed-off-by: Jaegeuk Kim

f2fs: remove redundant condition check

This patch resolves the redundant condition check reported by David.

Reported-by: David Binderman
Signed-off-by: Jaegeuk Kim

f2fs: give -E2BIG for no space in xattr

This patch returns -E2BIG if there is no space to add an xattr entry. This should fix generic/026 in xfstests as well.

Signed-off-by: Jaegeuk Kim

f2fs: don't invalidate atomic page if successful

If we committed the atomic write successfully, we don't need to invalidate the pages.

Signed-off-by: Jaegeuk Kim

f2fs: flush dirty pages before starting atomic writes

If somebody wrote some data before the atomic writes, we should flush them in order to handle the atomic data in the right period.

Signed-off-by: Jaegeuk Kim

f2fs: avoid needless lock for node pages when fsyncing a file

When fsync is called, sync_node_pages finds the proper direct node pages to flush. But it locks unrelated direct node pages together unnecessarily.

Acked-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid writing 0'th page in volatile writes

The first page of volatile writes usually contains a sort of header information which will be used for recovery (e.g., the journal header of sqlite). If this is written without the other journal data, the user needs to handle the stale journal information.

Acked-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: split sync_node_pages with fsync_node_pages

This patch splits the existing sync_node_pages into (f)sync_node_pages. fsync_node_pages is used for f2fs_sync_file only.

Acked-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: report unwritten status in fsync_node_pages

fsync_node_pages should return pass or failure so that the user knows whether fsync completed or not.

Acked-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: set fsync mark only for the last dnode

In order to give atomic writes, we should consider power failure during sync_node_pages in fsync. So, this patch marks the fsync flag only in the last dnode block.

Acked-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: issue cache flush on direct IO

Under the direct IO path with O_(D)SYNC, the proper APPEND or UPDATE flags need to be set so that f2fs_sync_file can make its data safe.

Acked-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/data.c

f2fs: be aware of invalid filename length

The filename length in a dirent may become zero after random junk data injection; once such a dirent is encountered, find_target_dentry or f2fs_add_inline_entries will run into an infinite loop. So let f2fs be aware of that to avoid the endless loop.
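To illustrate, a hedged sketch of the defensive check (the loop shape below is assumed; GET_DENTRY_SLOTS and the dentry layout come from f2fs):

	/*
	 * Hedged sketch, not the verbatim patch: a corrupted zero-length
	 * name would keep bit_pos from advancing, so skip one slot and
	 * continue instead of looping forever.
	 */
	while (bit_pos < max_slots) {
		struct f2fs_dir_entry *de = &dentry_blk->dentry[bit_pos];
		unsigned int name_len = le16_to_cpu(de->name_len);

		if (name_len == 0) {
			bit_pos++;	/* corrupted slot: step over it */
			continue;
		}
		/* ... compare the name against the target ... */
		bit_pos += GET_DENTRY_SLOTS(name_len);
	}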
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: move node pages only in victim section during GC

For foreground GC, we cache node blocks in the victim section and set them dirty, then we call sync_node_pages to flush these node pages; but meanwhile, node pages which are not located in the victim section are flushed together, so more bandwidth and continuous free space are occupied. For this condition, it's better to leave those unrelated node pages in the cache for further write hits, and let CP or the VM flush them afterwards.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to return 0 if err == -ENOENT in f2fs_readdir

Commit 57b62d29ad5b384775974973087d47755a8c6fcc ("f2fs: fix to report error in f2fs_readdir") causes f2fs_readdir to return -ENOENT when get_lock_data_page returns -ENOENT. However, the original logic is to continue when get_lock_data_page returns -ENOENT, but it forgets to reset err to 0. This causes getdents64 to incorrectly return -ENOENT when lastdirent is NULL in getdents64, which leads to a wrong return value for the syscall caller.

Signed-off-by: Yunlong Song
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to clear private data in page

Private data in a page should be removed during ->releasepage or ->invalidatepage; otherwise garbage data would remain in that page.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to clear page private flag

Commit 28bc106b2346 ("f2fs: support revoking atomic written pages") forgot to clear the page private flag correctly; fix it.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: factor out fsync inode entry operations

Factor out fsync inode entry operations into {add,del}_fsync_inode.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce macros for proc entries

This adds macros to be used by multiple proc entries.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/super.c

f2fs: add proc entry to show valid block bitmap

This patch adds a new proc entry to show segment information in more detail.

Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_kmalloc to wrap kmalloc

This patch adds f2fs_kmalloc.

Signed-off-by: Jaegeuk Kim

f2fs: use f2fs_grab_cache_page instead of grab_cache_page

This patch converts grab_cache_page to f2fs_grab_cache_page.

Signed-off-by: Jaegeuk Kim

f2fs: add mount option to select fault injection ratio

This patch adds a mount option to select the fault ratio.

Signed-off-by: Jaegeuk Kim

f2fs: inject kmalloc failure

This patch injects kmalloc failures given a fault injection rate.

Signed-off-by: Jaegeuk Kim

f2fs: inject page allocation failures

This patch adds page allocation failures.

Signed-off-by: Jaegeuk Kim

f2fs: inject ENOSPC failures

This patch injects ENOSPC failures.

Signed-off-by: Jaegeuk Kim

f2fs: revisit error handling flows

This patch fixes a couple of bugs regarding orphan inodes when handling errors. It tries to
- call alloc_nid_done with add_orphan_inode in handle_failed_inode
- let truncate blocks in f2fs_evict_inode
- not make a bad inode due to an i_mode change

Signed-off-by: Jaegeuk Kim

f2fs: fix leak of orphan inode objects

When unmounting the filesystem, we should release all the ino entries.

Signed-off-by: Jaegeuk Kim

f2fs: retry to truncate blocks in -ENOMEM case

This patch modifies the code to retry truncating node blocks in the -ENOMEM case.
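A minimal sketch of the retry idea, assuming a transient -ENOMEM and a brief congestion back-off (not the verbatim patch):

	#include <linux/backing-dev.h>

	/*
	 * Hedged sketch: -ENOMEM is transient here, so back off briefly
	 * and try again instead of giving up on truncating node blocks.
	 */
	int err;
retry:
	err = truncate_blocks(inode, 0, false);
	if (err == -ENOMEM) {
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
		goto retry;
	}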
Signed-off-by: Hou Pengyang
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/inode.c

f2fs: remove unneeded readahead in find_fsync_dnodes

In find_fsync_dnodes, get_tmp_page reads the dnode page synchronously; previously, ra_meta_page did the same work, which is redundant. Remove it.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: remove unneeded memset when updating xattr

Each field in struct f2fs_xattr_entry will be assigned later, so we don't need to memset the struct first.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: reuse get_extent_info

Reuse get_extent_info for readability.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: shrink size of struct seg_entry

Restructure struct seg_entry to eliminate holes in it. After that, on 32-bit machines its size is reduced from 32 bytes to 24 bytes; on 64-bit machines, from 56 bytes to 40 bytes.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: don't worry about inode leak in evict_inode

Even if an inode failed to release its blocks, it should be kept in the orphan inode list, so it will be released later.

Signed-off-by: Jaegeuk Kim

f2fs: remove an obsolete variable

This patch removes an obsolete variable used in add_free_nid.

Signed-off-by: Jaegeuk Kim

f2fs: fix incorrect mapping in ->bmap

Currently, generic_block_bmap is used in f2fs_bmap; its semantics are: when the mapping is found, return the position of the target physical block, otherwise return zero. But previously, when there was no mapping info for the specified logical block, f2fs_bmap would map the target physical block to an uninitialized variable, which is wrong. Fix it.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid panic when truncating to max filesize

The following panic occurs when truncating an inode which has an inline xattr to the max filesize:

[] get_dnode_of_data+0x4e/0x580 [f2fs]
[] ? read_node_page+0x51/0x90 [f2fs]
[] ? get_node_page.part.34+0xb9/0x170 [f2fs]
[] truncate_blocks+0x131/0x3f0 [f2fs]
[] f2fs_truncate+0x73/0x100 [f2fs]
[] f2fs_setattr+0x62/0x2a0 [f2fs]
[] notify_change+0x158/0x300
[] do_truncate+0x6b/0xa0
[] ? __sb_start_write+0x49/0x100
[] do_sys_ftruncate.constprop.12+0x118/0x170
[] SyS_ftruncate+0xe/0x10
[] tracesys+0xe1/0xe6
[] get_node_path+0x210/0x220 [f2fs]
--[ end trace 5fea664dfbcc6625 ]---

The reason is that truncate_blocks tries to truncate all node and data blocks starting from the specified block offset with the value (max filesize / block size); but actually, the valid max block offset is (max filesize / block size) - 1, so f2fs catches such an invalid block offset with a BUG_ON in the truncation path. This patch lets f2fs skip truncating data which exceeds the max filesize.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

fscrypto/f2fs: allow fs-specific key prefix for fs encryption

This patch allows fscrypto to handle a second key prefix given by the filesystem. The main reason is to provide backward compatibility, since previously f2fs used "f2fs:" as a crypto prefix instead of "fscrypt:". Later, ext4 should also provide key_prefix() to give "ext4:". One concern described by Ted is the double-check overhead of the prefixes. On x86, for example, validate_user_key consumes 8 ms after boot-up, and it turns out derive_key_aes() consumed most of that time to load the specific crypto module. After such a cold miss, it shows almost zero latency, which is treated as negligible overhead. Note that request_key() detects a wrong prefix even prior to derive_key_aes().
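Illustrative only (the buffer size and the descriptor formatting below are assumptions), but it shows the retry-with-second-prefix idea:

	#include <linux/key.h>

	/*
	 * Hedged sketch: try the generic "fscrypt:" prefix first, then fall
	 * back to the fs-specific "f2fs:" prefix for keys that were added
	 * before the rename. Type/field names here are assumed.
	 */
	char desc[32];
	struct key *key;

	snprintf(desc, sizeof(desc), "fscrypt:%*phN", 8, master_key_descriptor);
	key = request_key(&key_type_logon, desc, NULL);
	if (IS_ERR(key)) {
		snprintf(desc, sizeof(desc), "f2fs:%*phN", 8, master_key_descriptor);
		key = request_key(&key_type_logon, desc, NULL);
	}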
Cc: Ted Tso
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Jaegeuk Kim

Conflicts: fs/crypto/keyinfo.c

f2fs: fix inode cache leak

When testing f2fs with the inline_dentry option, generic/342 reports:

VFS: Busy inodes after unmount of dm-0. Self-destruct in 5 seconds. Have a nice day...

After rmmod of the f2fs module, the kernel shows the following dmesg:

=============================================================================
BUG f2fs_inode_cache (Tainted: G O ): Objects remaining in f2fs_inode_cache on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: Slab 0xf51ca0e0 objects=22 used=1 fp=0xd1e6fc60 flags=0x40004080
CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
00000086 00000086 d062fe18 c13a83a0 f51ca0e0 d062fe38 d062fea4 c11c7276
c1981040 f51ca0e0 00000016 00000001 d1e6fc60 40004080 656a624f 20737463
616d6572 6e696e69 6e692067 66326620 6e695f73 5f65646f 68636163 6e6f2065
Call Trace:
[] dump_stack+0x5f/0x8f
[] slab_err+0x76/0x80
[] ? __kmem_cache_shutdown+0x100/0x2f0
[] ? __kmem_cache_shutdown+0x100/0x2f0
[] __kmem_cache_shutdown+0x125/0x2f0
[] kmem_cache_destroy+0x158/0x1f0
[] ? mutex_unlock+0xd/0x10
[] exit_f2fs_fs+0x4b/0x5a8 [f2fs]
[] SyS_delete_module+0x16c/0x1d0
[] ? do_fast_syscall_32+0x30/0x1c0
[] ? __this_cpu_preempt_check+0xf/0x20
[] ? trace_hardirqs_on_caller+0xdd/0x210
[] ? trace_hardirqs_off+0xb/0x10
[] do_fast_syscall_32+0xa1/0x1c0
[] sysenter_past_esp+0x45/0x74
INFO: Object 0xd1e6d9e0 @offset=6624
kmem_cache_destroy f2fs_inode_cache: Slab cache still has objects
CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
00000286 00000286 d062fef4 c13a83a0 f174b000 d062ff14 d062ff28 c1198ac7
c197fe18 f3c5b980 d062ff20 000d04f2 d062ff0c d062ff0c d062ff14 d062ff14
f8f20dc0 fffffff5 d062e000 d062ff30 f8f15aa3 d062ff7c c10f596c 73663266
Call Trace:
[] dump_stack+0x5f/0x8f
[] kmem_cache_destroy+0x1e7/0x1f0
[] exit_f2fs_fs+0x4b/0x5a8 [f2fs]
[] SyS_delete_module+0x16c/0x1d0
[] ? do_fast_syscall_32+0x30/0x1c0
[] ? __this_cpu_preempt_check+0xf/0x20
[] ? trace_hardirqs_on_caller+0xdd/0x210
[] ? trace_hardirqs_off+0xb/0x10
[] do_fast_syscall_32+0xa1/0x1c0
[] sysenter_past_esp+0x45/0x74

The reason is: in the recovery flow, we use a delayed iput mechanism for a directory which has a recovered dentry block. It means the inode reference will be held until the last dirty dentry page has been written back. But when we mount f2fs with the inline_dentry option, during recovery a dirent may be recovered only into the dir inode page rather than a dentry page, so there is no chance for us to release the inode reference in ->writepage when writing back the last dentry page. We could call paired iget/iput explicitly for the inline_dentry case, but for the non-inline_dentry case, iput would call writeback_single_inode to write all data pages synchronously; during recovery, however, ->writepages of f2fs skips writing all pages, resulting in lost dirents.

This patch fixes this issue by obsoleting the old mechanism and introducing a new dir_list to hold all directory inodes which have recovered data until recovery finishes.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fallocate data blocks in single locked node page

This patch improves the expand_inode speed in fallocate by allocating as many data blocks as possible in a single locked node page.
On an SSD:

# time fallocate -l 500G $MNT/testfile
Before : 1m 33.410 s
After  :    24.758 s

Reported-by: Stephen Bates
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/file.c

f2fs: read node blocks ahead when truncating blocks

This patch enables reading node blocks in advance when truncating large data blocks.

> time rm $MNT/testfile (500GB) after drop_caches
Before : 9.422 s
After  : 4.821 s

Reported-by: Stephen Bates
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/node.c

f2fs: do not preallocate block unaligned to 4KB

Previously, f2fs_preallocate_blocks() tried to allocate unaligned blocks. In f2fs_write_begin(), however, prepare_write_begin() does not skip its allocation due to (len != 4KB), so it needs to lock the node page twice unexpectedly.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/data.c

f2fs: use mnt_{want,drop}_write_file in ioctl

In ioctl interfaces, mnt_{want,drop}_write_file should be used for:
- getting exclusion against filesystem freezing, which may be used by lvm snapshot;
- telling the filesystem that a write is about to be performed on it, and making sure that the writes are permitted.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: make atomic/volatile operation exclusive

The atomic/volatile ioctl interfaces are exposed to users like other file operation interfaces; they need to be made mutually exclusive to avoid potential conflicts among these operations in concurrent scenarios.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: support in batch multi blocks preallocation

This patch introduces reserve_new_blocks to do preallocation of multiple blocks as a batch operation, so it can avoid lots of redundant operations, resulting in better performance.

In a virtual machine, with a rotational device:

time fallocate -l 32G /mnt/f2fs/file
Before:
real 0m4.584s
user 0m0.000s
sys  0m4.580s
After:
real 0m0.292s
user 0m0.000s
sys  0m0.272s

On x86, with an SSD:

time fallocate -l 500G $MNT/testfile
Before : 24.758 s
After  :  1.604 s

Signed-off-by: Chao Yu
[Jaegeuk Kim: fix bugs and add performance numbers measured in x86.]
Signed-off-by: Jaegeuk Kim

f2fs: support in batch fzero in dnode page

This patch tries to speed up fzero_range by making space preallocation and address removal of blocks in one dnode page a batch operation.

In a virtual machine, with the zram driver:

dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=4096
time xfs_io -f /mnt/f2fs/file -c "fzero 0 4096M"
Before:
real 0m3.276s
user 0m0.008s
sys  0m3.260s
After:
real 0m1.568s
user 0m0.000s
sys  0m1.564s

Signed-off-by: Chao Yu
[Jaegeuk Kim: consider ENOSPC case]
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/file.c

f2fs: show # of orphan inodes

This adds debug information for the # of orphan inodes.

Signed-off-by: Jaegeuk Kim

f2fs: avoid f2fs_bug_on during recovery

We don't need to use f2fs_bug_on() to handle error cases when allocating a block during recovery.

Signed-off-by: Jaegeuk Kim

f2fs: fix deadlock when flush inline data

The below backtrace info was reported by Yunlei He:

Call Trace:
[] schedule+0x35/0x80
[] rwsem_down_read_failed+0xed/0x130
[] call_rwsem_down_read_failed+0x18/0x
[] down_read+0x20/0x30
[] f2fs_evict_inode+0x242/0x3a0 [f2fs]
[] evict+0xc7/0x1a0
[] iput+0x196/0x200
[] __dentry_kill+0x179/0x1e0
[] dput+0x199/0x1f0
[] __fput+0x18b/0x220
[] ____fput+0xe/0x10
[] task_work_run+0x77/0x90
[] exit_to_usermode_loop+0x73/0xa2
[] do_syscall_64+0xfa/0x110
[] entry_SYSCALL64_slow_path+0x25/0x25

Call Trace:
[] schedule+0x35/0x80
[] __wait_on_freeing_inode+0xa3/0xd0
[] ? autoremove_wake_function+0x40/0x4
[] find_inode_fast+0x7d/0xb0
[] ilookup+0x6a/0xd0
[] sync_node_pages+0x210/0x650 [f2fs]
[] ? do_fsync+0x70/0x70
[] block_operations+0x9e/0xf0 [f2fs]
[] ? bio_endio+0x55/0x60
[] write_checkpoint+0x92/0xba0 [f2fs]
[] ? mempool_free_slab+0x17/0x20
[] ? mempool_free+0x2b/0x80
[] ? do_fsync+0x70/0x70
[] f2fs_sync_fs+0x63/0xd0 [f2fs]
[] ? ext4_sync_fs+0xbf/0x190
[] sync_fs_one_sb+0x20/0x30
[] iterate_supers+0xb9/0x110
[] sys_sync+0x55/0x90
[] do_syscall_64+0x69/0x110
[] entry_SYSCALL64_slow_path+0x25/0x25

With the following series of operations, we will set inline_node in the inode page after the inode was unlinked, resulting in the deadlock described below:
1. open file
2. write file
3. unlink file
4. write file
5. close file

Thread A:
- dput
 - iput_final
  - inode->i_state |= I_FREEING
  - evict
   - f2fs_evict_inode
    - f2fs_lock_op (down_read(cp_rwsem))

Thread B:
- f2fs_sync_fs
 - write_checkpoint
  - block_operations
   - f2fs_lock_all (down_write(cp_rwsem))
   - sync_node_pages
    - ilookup
     - find_inode_fast
      - __wait_on_freeing_inode (wait on I_FREEING clear)

Here, we change to set the inline_node flag only for linked inodes to fix this.

Reported-by: Yunlei He
Signed-off-by: Chao Yu
Tested-by: Jaegeuk Kim
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Jaegeuk Kim

f2fs: correct return value type of f2fs_fill_super

Signed-off-by: Sheng Yong
Signed-off-by: Jaegeuk Kim

f2fs: fix i_current_depth during inline dentry conversion

With the steps below, a dentry page becomes inaccessible later. This is because we forget to update i_current_depth in the inode during inline dentry conversion; after that, once we fail somewhere, i_current_depth is left as 0 in a non-inline directory. Then, during ->lookup, the current_depth value makes all dentry pages in the first level invisible. Fix it.

1) mount f2fs with inline_dentry option
2) mkdir dir
3) touch 180 files named [0-179] in dir
4) touch 180 in dir (fail after inline dir conversion)
5) ll dir

ls: cannot access /mnt/f2fs/dir/0: No such file or directory
ls: cannot access /mnt/f2fs/dir/1: No such file or directory
ls: cannot access /mnt/f2fs/dir/2: No such file or directory
ls: cannot access /mnt/f2fs/dir/3: No such file or directory
ls: cannot access /mnt/f2fs/dir/4: No such file or directory
drwxr-xr-x 2 root root 4096 may 13 21:47 ./
drwxr-xr-x 3 root root 4096 may 13 21:46 ../
-????????? ? ? ? ? ? 0
-????????? ? ? ? ? ? 1
-????????? ? ? ? ? ? 10
-????????? ? ? ? ? ? 100
-????????? ? ? ? ? ? 101
-????????? ? ? ? ? ? 102

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/inline.c

f2fs: fix incorrect error path handling in f2fs_move_rehashed_dirents

Fix two bugs in the error path of f2fs_move_rehashed_dirents:
- release the dir's inode page if the kmalloc call fails
- recover i_current_depth if the conversion fails

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: no need inc dirty pages under inode lock

No need to increase dirty pages under the inode lock.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim

f2fs: add fault injection to sysfs

This patch introduces a new struct f2fs_fault_info and a global f2fs_fault to save fault injection status. Fault injection entries are created in /sys/fs/f2fs/fault_injection/ during f2fs module initialization.

Signed-off-by: Sheng Yong
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/super.c

f2fs: manipulate dirty file inodes when DATA_FLUSH is set

It needs to maintain dirty file inodes only if DATA_FLUSH is set. Otherwise, let's avoid the overhead.
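A hedged sketch of the idea (the function name below is made up; test_opt and the DATA_FLUSH mount option exist in this tree):

	/*
	 * Illustrative sketch only: skip tracking dirty regular-file inodes
	 * unless a data flush is requested at checkpoint time.
	 */
	static void track_dirty_file_sketch(struct inode *inode)
	{
		struct f2fs_sb_info *sbi = F2FS_I_SB(inode);

		if (S_ISREG(inode->i_mode) && !test_opt(sbi, DATA_FLUSH))
			return;	/* avoid the list maintenance overhead */
		/* ... link the inode into the global dirty-inode list ... */
	}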
Signed-off-by: Jaegeuk Kim

f2fs: use bio count instead of F2FS_WRITEBACK page count

This can reduce the page counting overhead.

Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for page counters

This patch substitutes percpu_counter for atomic_counter when counting various types of pages.

Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for # of dirty pages in inode

This patch adds a percpu_counter for the # of dirty pages in an inode.

Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for alloc_valid_block_count

This patch uses a percpu count for sbi->alloc_valid_block_count.

Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for total_valid_inode_count

This patch uses percpu_counter to avoid stat_lock.

Signed-off-by: Jaegeuk Kim

f2fs: make exit_f2fs_fs more clear

init_f2fs_fs does:
1) f2fs_build_trace_ios
2) init_inodecache
3) create_node_manager_caches
4) create_segment_manager_caches
5) create_checkpoint_caches
6) create_extent_cache
7) kset_create_and_add
8) kobject_init_and_add
9) register_shrinker
10) register_filesystem
11) f2fs_create_root_stats
12) proc_mkdir

exit_f2fs_fs should do the cleanup in the reverse order to make the code clearer.

Signed-off-by: Tiezhu Yang
Signed-off-by: Jaegeuk Kim

f2fs: avoid ENOSPC fault in the recovery process

This patch avoids an impossible error injection, ENOSPC, during the recovery process.

Signed-off-by: Jaegeuk Kim

f2fs: flush pending bios right away when error occurs

Given errors, this patch flushes pending bios as soon as possible.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/file.c

f2fs: fix to update dirty page count correctly

Once we fail to merge inline data into the inode page while flushing an inline inode, we skip invoking inode_dec_dirty_pages, which makes the dirty page count incorrect, resulting in a panic in ->evict_inode. Fix it.

------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:336!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 10004 Comm: umount Tainted: G O 4.6.0-rc5+ #17
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: f0c33000 ti: c5212000 task.ti: c5212000
EIP: 0060:[] EFLAGS: 00010202 CPU: 3
EIP is at f2fs_evict_inode+0x85/0x490 [f2fs]
EAX: 00000001 EBX: c4529ea0 ECX: 00000001 EDX: 00000000
ESI: c0131000 EDI: f89dd0a0 EBP: c5213e9c ESP: c5213e78
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: b75878c0 CR3: 1a36a700 CR4: 000406f0
Stack:
c4529ea0 c4529ef4 c5213e8c c176d45c c4529ef4 00000000 c4529ea0 c4529fac
f89dd0a0 c5213eb0 c1204a68 c5213ed8 c452a2b4 c6680930 c5213ec0 c1204b64
c6680d44 c6680620 c5213eec c120588d ee84b000 ee84b5c0 c5214000 ee84b5e0
Call Trace:
[] ? _raw_spin_unlock+0x2c/0x50
[] evict+0xa8/0x170
[] dispose_list+0x34/0x50
[] evict_inodes+0x10d/0x130
[] generic_shutdown_super+0x41/0xe0
[] ? unregister_shrinker+0x40/0x50
[] ? unregister_shrinker+0x40/0x50
[] kill_block_super+0x22/0x70
[] kill_f2fs_super+0x1e/0x20 [f2fs]
[] deactivate_locked_super+0x3d/0x70
[] deactivate_super+0x43/0x60
[] cleanup_mnt+0x39/0x80
[] __cleanup_mnt+0x10/0x20
[] task_work_run+0x71/0x90
[] exit_to_usermode_loop+0x72/0x9e
[] do_fast_syscall_32+0x19c/0x1c0
[] sysenter_past_esp+0x45/0x74
EIP: [] f2fs_evict_inode+0x85/0x490 [f2fs] SS:ESP 0068:c5213e78
---[ end trace d30536330b7fdc58 ]---

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: adjust other changes

This patch changes:
- PAGE_CACHE_*
- inode_lock

Signed-off-by: Jaegeuk Kim

Revert "f2fs: no need inc dirty pages under inode lock"

This reverts commit b951a4ec165af4973b2bd9c80fb5845fbd840435.
Conflicts: fs/f2fs/checkpoint.c

f2fs: use inode pointer for {set, clear}_inode_flag

This patch refactors set_inode_flag and clear_inode_flag to take an inode pointer.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/acl.c fs/f2fs/file.c fs/f2fs/namei.c
Conflicts: fs/f2fs/inode.c

f2fs: introduce f2fs_i_size_write with mark_inode_dirty_sync

This patch introduces f2fs_i_size_write() to call mark_inode_dirty_sync() together with i_size_write().

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/super.c

f2fs: introduce f2fs_i_blocks_write with mark_inode_dirty_sync

This patch introduces f2fs_i_blocks_write() to call mark_inode_dirty_sync() when changing inode->i_blocks.

Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_i_links_write with mark_inode_dirty_sync

This patch introduces f2fs_i_links_write() to call mark_inode_dirty_sync() when changing inode->i_links.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/namei.c

f2fs: call mark_inode_dirty_sync for i_field changes

This patch calls mark_inode_dirty_sync() for the following on-disk inode changes:
-> largest
-> ctime/mtime/atime
-> i_current_depth
-> i_xattr_nid
-> i_pino
-> i_advise
-> i_flags
-> i_mode

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/acl.c fs/f2fs/inode.c fs/f2fs/namei.c
Conflicts: fs/f2fs/inode.c

f2fs: flush inode metadata when checkpoint is doing

This patch registers all the inodes which have dirty metadata so they are synced while a checkpoint is being performed.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/inode.c

f2fs: remove syncing inode page in all the cases

This patch removes calls to the following across the whole tree:
- sync_inode_page()
- update_inode_page()
- update_inode()
- f2fs_write_inode()

Instead, checkpoint will flush all the dirty inode metadata before syncing node pages. Note that this is doable, since we call mark_inode_dirty_sync() for every inode field change which needs to be updated in the on-disk inode as well.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/namei.c

f2fs: avoid unnecessary updating inode during fsync

If roll-forward recovery can recover i_size, we don't need to update the inode's metadata during fsync.

Signed-off-by: Jaegeuk Kim

f2fs: detect congestion of flush command issues

If flush commands do not incur any congestion, we don't need to throw them to the dispatching queue, which causes unnecessary latency.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/segment.c

f2fs: set flush_merge by default

This patch sets flush_merge by default.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/super.c

f2fs: remove writepages lock

This patch removes the writepages lock. We can improve multi-threading performance.

tiobench, 32 threads, 4KB write per fsync on SSD
Before: 25.88 MB/s
After : 28.03 MB/s

Signed-off-by: Jaegeuk Kim

f2fs: propagate error given by f2fs_find_entry

If we get ENOMEM or EIO in f2fs_find_entry, we should stop right away. Otherwise, for example, we can get a duplicate directory entry via ->chash and ->clevel.

Signed-off-by: Jaegeuk Kim

f2fs: inject to produce some orphan inodes

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/inode.c

f2fs: do not skip writing data pages

For data pages, let's try to flush as much as possible in the background.

On /dev/pmem0:
1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync
Before : 800 MB/s
After  : 1.1 GB/s
2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048
Before : 1.3 GB/s
After  : 2.2 GB/s

Signed-off-by: Jaegeuk Kim

f2fs: remove two steps to flush dirty data pages

If there is no cold page, we don't need to do a loop to flush dirty data pages.

On /dev/pmem0:
1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync
Before : 1.1 GB/s
After  : 1.2 GB/s
2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048
Before : 2.2 GB/s
After  : 2.3 GB/s

Signed-off-by: Jaegeuk Kim

f2fs: return the errno to the caller to avoid using a wrong page

Commit aaf9607516ed38825268515ef4d773289a44f429 ("f2fs: check node page contents all the time") pointed out that "sometimes it was reported that its contents was missing", so it checks the page's mapping and contents. When "nid != nid_of_node(page)", ERR_PTR(-EIO) will be returned to the caller. However, commit e1c51b9f1df2f9efc2ec11488717e40cd12015f9 ("f2fs: clean up node page updating flow") moved the "nid != nid_of_node(page)" test into "f2fs_bug_on(sbi, nid != nid_of_node(page))"; this returns a wrong page to the caller when F2FS_CHECK_FS is off and the missing-contents case happens. This patch restores checking node page contents all the time, and returns the errno to let the caller know something is wrong and avoid using the page. This patch also moves f2fs_bug_on to its proper location.

Signed-off-by: Yunlong Song
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/node.c

f2fs: return error of f2fs_lookup

Now we can report an error to f2fs_lookup given by f2fs_find_entry.

Suggested-by: He YunLei
Signed-off-by: Jaegeuk Kim

f2fs: handle writepage correctly

Previously, f2fs_write_data_pages() calls __f2fs_writepage(), which calls f2fs_write_data_page(). If f2fs_write_data_page() returns AOP_WRITEPAGE_ACTIVATE, __f2fs_writepage() calls mapping_set_error(). But this should not happen every time, since sometimes f2fs_write_data_page() tries to skip writing pages without error. For example, volatile_write() gives EIO all the time, as Shuoran Liu pointed out.

Reported-by: Shuoran Liu
Signed-off-by: Jaegeuk Kim

f2fs: remove deprecated parameter

Remove a deprecated parameter.

Signed-off-by: Jaegeuk Kim

f2fs: avoid wrong count on dirty inodes

The number should be covered by the spin_lock. Otherwise we can see a wrong count in f2fs_stat.

Signed-off-by: Jaegeuk Kim

f2fs: remove obsolete parameter in f2fs_truncate

We don't need the lock parameter, which is always true.

Signed-off-by: Jaegeuk Kim

f2fs: avoid data race between FI_DIRTY_INODE flag and update_inode

The FI_DIRTY_INODE flag is not covered by the inode page lock, so it can be unset at any time like below.

Thread #1:
- lock_page(ipage)
- update i_fields

Thread #2:
- update i_size/i_blocks/and so on
- set FI_DIRTY_INODE

Thread #1:
- reset FI_DIRTY_INODE
- set_page_dirty(ipage)

In this case, we can lose the latest i_field information.

Signed-off-by: Jaegeuk Kim

f2fs: fix wrong percentage

This should be 1%, 10MB / 1GB.

Signed-off-by: Jaegeuk Kim

f2fs: control not to exceed # of cached nat entries

This is to avoid cache entry management overhead, including the radix tree.

Signed-off-by: Jaegeuk Kim

f2fs: set mapping error for EIO

If EIO occurred, we need to set the mapping error to avoid any further IOs.

Signed-off-by: Jaegeuk Kim

f2fs: avoid reverse IO order for NODE and DATA

There is a data race between allocate_data_block() and f2fs_submit_page_mbio(), which incurs unnecessary reversed bio submission.

Signed-off-by: Jaegeuk Kim

f2fs: drop any block plugging

In f2fs, we don't need to keep block plugging for NODE and DATA writes, since we already merge bios as much as possible.

Signed-off-by: Jaegeuk Kim

f2fs: skip clean segment for gc

If a segment in a section is clean or prefree, we don't need to get its summary and do GC on it.
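A rough sketch of the skip, assuming the do_garbage_collect loop shape (get_valid_blocks is the existing helper; the loop body is illustrative):

	/*
	 * Hedged sketch: a clean or prefree segment has no valid blocks to
	 * migrate, so we can avoid reading its summary block entirely.
	 */
	for (segno = start_segno; segno < end_segno; segno++) {
		if (get_valid_blocks(sbi, segno, 1) == 0)
			continue;	/* nothing to move in this segment */
		/* ... read the summary block and migrate valid blocks ... */
	}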
Signed-off-by: Jaegeuk Kim

f2fs: introduce mode=lfs mount option

This mount option forcefully enables the original log-structured filesystem behavior, so there should be no random writes in the main area. In particular, this supports host-managed SMR devices.

Signed-off-by: Jaegeuk Kim

f2fs: fix deadlock in add_link failure

mkdir:
- init_inode_metadata
 - lock_page(node)
 - make_empty_dir
 - f2fs_init_acl
  - error
 - truncate_inode_pages
  - lock_page(data)

sync_dirty_inode:
- filemap_fdatawrite()
 - do_writepages
  - lock_page(data)
  - write_page(data)
   - lock_page(node)

So, we don't need to truncate data pages in this error case; that will be done by f2fs_evict_inode.

Signed-off-by: Jaegeuk Kim

f2fs: find parent dentry correctly

If the dotdot directory is corrupted, its slot may be occupied by another file. In this case, dentry[1] is not the parent directory. Rename and cross-rename would update the inode in dentry[1] incorrectly. This patch finds the dotdot dentry by name.

Signed-off-by: Sheng Yong
[Jaegeuk Kim: remove wrong bug_on]
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/f2fs.h

f2fs: report error for f2fs_parent_dir

If there is no dentry, we can report its error correctly.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/namei.c

f2fs: call update_inode_page for orphan inodes

Let's store orphan inode pages right away.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/namei.c
Conflicts: fs/f2fs/inode.c

f2fs: detect host-managed SMR by feature flag

If mkfs.f2fs gives a feature flag for host-managed SMR, we can set mode=lfs by default.

Signed-off-by: Jaegeuk Kim

f2fs: produce more nids and reduce readahead nats

The readahead nat pages are more likely to be reclaimed quickly, so it'd be better to gather more free nids in advance. And, let's keep as many free nids as possible.

Signed-off-by: Jaegeuk Kim

f2fs: avoid writing node/metapages during writes

Let's keep more node/meta pages in runtime.

Signed-off-by: Jaegeuk Kim

f2fs: avoid latency-critical readahead of node pages

f2fs_map_blocks is very performance-critical, so let's avoid any latency from reading ahead node pages there.

Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid reading out encrypted data in page cache

For an encrypted inode, if the user overwrites data of the inode, f2fs will read the encrypted data into the page cache and then do the decryption. However, a reader can race with the overwriter, and it will see encrypted data which has not yet been decrypted by the overwriter. Fix it by moving the decryption work to the background and keeping the page non-uptodate until the data is decrypted.

Thread A (overwriter):
- f2fs_file_write_iter
 - __generic_file_write_iter
  - generic_perform_write
   - f2fs_write_begin
    - f2fs_submit_page_bio

Thread B (reader):
- generic_file_read_iter
 - do_generic_file_read
  - lock_page_killable
  - unlock_page
  - copy_page_to_iter
    (hits the encrypted data in the updated page)

Thread A (overwriter):
- lock_page
- fscrypt_decrypt_page

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/data.c

f2fs: fix to detect truncation prior rather than EIO during read

In the procedure of a synchronized read, after sending out the read request, the reader will try to lock the page, waiting for the device to finish the read job and unlock the page. But meanwhile, a truncator can race with the reader; so after the reader gets the lock of the page, it should check the page's mapping to detect whether someone has truncated the page in the meantime. Then the reader has the chance to retry if the truncation was done; otherwise the read can fail due to the previous condition check.
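A hedged sketch of the post-lock check (the repeat label is assumed from the surrounding lookup loop; f2fs_put_page is the existing helper):

	lock_page(page);
	if (unlikely(page->mapping != mapping)) {
		/*
		 * Truncated or reclaimed while we waited for the IO:
		 * retry the lookup rather than failing the read.
		 */
		f2fs_put_page(page, 1);
		goto repeat;
	}
	if (unlikely(!PageUptodate(page)))
		return ERR_PTR(-EIO);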
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to redirty page if fail to gc data page

If we fail to move a data page during foreground GC, we should give another chance to write back that page, which was previously set dirty by the writer.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: add nodiscard mount option

This patch adds the 'nodiscard' mount option.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: Documentation/filesystems/f2fs.txt

f2fs: remove unnecessary goto statement

When base_addr is NULL, there is no need to call kzfree; it should return -ENOMEM directly. Additionally, it is better to initialize the variable 'error' with 0.

Signed-off-by: Tiezhu Yang
Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_set_page_dirty_nobuffer

This patch adds f2fs_set_page_dirty_nobuffer(), copied from __set_page_dirty_nobuffers. When appending 4KB blocks in f2fs on pmem with multiple cores, this improves the overall performance.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/f2fs.h

f2fs: call SetPageUptodate if needed

SetPageUptodate() issues a memory barrier, resulting in performance degradation. Let's avoid that.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/data.c

f2fs: shrink critical region in spin_lock

This patch shrinks the critical region in the spin_lock.

Signed-off-by: Jaegeuk Kim

f2fs: skip to check the block address of node page

If the node page is up-to-date, it should be alive.

Signed-off-by: Jaegeuk Kim

f2fs: fix incorrect f_bfree calculation in ->statfs

As the manual describes, f_bfree indicates the total free blocks in the fs; in f2fs, it includes two parts: visible free blocks and over-provision blocks. This patch corrects the calculation.

fsblkcnt_t f_bfree; /* free blocks in fs */

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid mismatching block range for discard

This patch skips discarding block ranges which are smaller than trim_minlen and cannot be merged with a neighbour.

Signed-off-by: Yunlei He
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid redundant discard during fstrim

With the test steps below, f2fs will issue redundant discards when doing fstrim. The reason is that we issue discards for both the prefree segments and the consecutive freed region the user wants to trim, and parts of the regions they cover overlap. Here, we change to not issue any discards for prefree segments in the trimmed range.

1. mount -t f2fs -o discard /dev/zram0 /mnt/f2fs
2. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/
3. dd if=/dev/zero of=/mnt/f2fs/a bs=2M count=1
4. dd if=/dev/zero of=/mnt/f2fs/b bs=1M count=1
5. sync
6. rm /mnt/f2fs/a /mnt/f2fs/b
7. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/

Before:
<...>-5428 [001] ...1 9511.052125: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x200
<...>-5428 [001] ...1 9511.052787: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300
After:
<...>-6764 [000] ...1 9720.382504: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: move i_size_write in f2fs_write_end

We don't need to do i_size_write under the page lock.

Signed-off-by: Jaegeuk Kim

f2fs: avoid mark_inode_dirty

Let's check the inode's dirtiness before calling mark_inode_dirty.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/inode.c

f2fs: fix ERR_PTR returned by bio

This is to fix the wrong error pointer handling flow reported by Dan.
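As an illustrative shape only (the f2fs_grab_bio helper name here is an assumption drawn from the description), the fix pattern is to propagate the encoded errno rather than treating the error pointer as a valid bio:

	bio = f2fs_grab_bio(inode, blkaddr, 1);	/* may return ERR_PTR(...) */
	if (IS_ERR(bio)) {
		err = PTR_ERR(bio);	/* hand the errno back to the caller */
		goto out;
	}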
Reported-by: Dan Carpenter
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: refactor __exchange_data_block for speed up

This reduces the elapsed time to do xfstests/generic/017.
Before: 715 s
After : 458 s

Signed-off-by: Jaegeuk Kim

f2fs: disable extent_cache for fcollapse/finsert inodes

This reduces the elapsed time to do xfstests/generic/017.
Before: 458 s
After : 390 s

Signed-off-by: Jaegeuk Kim

f2fs: add maximum prefree segments

On 1TB storage, we would need to admit 22841 prefree segments, which can consume too many segments. This patch caps the prefree segments at 8GB max in that case.

Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid data update racing between GC and DIO

Data in a file can be operated on by GC and DIO simultaneously, so we face race cases as below.

For the write case:

Thread A (DIO):
- generic_file_direct_write
 - invalidate_inode_pages2_range
 - f2fs_direct_IO
  - do_blockdev_direct_IO
   - do_direct_IO
    - get_more_blocks

Thread B (GC):
- f2fs_gc
 - do_garbage_collect
  - gc_data_segment
   - move_data_page
    - do_write_data_page
      (migrate data block to new block address)

Thread A (DIO):
- dio_bio_submit
  (update user data to old block address)

For the read case:

Thread A (DIO):
- generic_file_direct_write
 - invalidate_inode_pages2_range
 - f2fs_direct_IO
  - do_blockdev_direct_IO
   - do_direct_IO
    - get_more_blocks

Thread B (GC):
- f2fs_balance_fs
 - f2fs_gc
  - do_garbage_collect
   - gc_data_segment
    - move_data_page
     - do_write_data_page
       (migrate data block to new block address)
 - write_checkpoint
  - do_checkpoint
   - clear_prefree_segments
    - f2fs_issue_discard
      (discard old block address)

Thread A (DIO):
- dio_bio_submit
  (update user buffer from obsolete block address)

In order to fix this, for one file, we should make DIO and GC mutually exclusive.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/data.c

f2fs: use blk_plug in all the possible paths

This patch reverts 19a5f5e2ef37 (f2fs: drop any block plugging), and additionally adds blk_plug in write paths. The main reason is that blk_start_plug can be used to wake up from low-power mode before submitting further bios.

Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/file.c

f2fs: reset default idle interval value

The default idle interval value is 2 mins, but most of the time after the screen shuts down, there are still operations during the 2-min interval, and the GC's sleep time is about 30 to 60 secs, so there is almost no chance for the GC thread to do garbage collection. Change the default idle interval value from 2 mins to 5 secs to fix this.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid memory allocation failure due to a long length

We need to avoid ENOMEM due to an unexpectedly long length.

Signed-off-by: Jaegeuk Kim

f2fs: fix to report error number of f2fs_find_entry

This patch fixes reporting the right error number of f2fs_find_entry to its caller.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts: fs/f2fs/namei.c

f2fs: avoid data race when deciding checkpoint in f2fs_sync_file

When fs utilization is almost full, f2fs_sync_file should do a checkpoint if there is not enough space for roll-forward later (i.e. space_for_roll_forward). Currently we have no lock for sbi->alloc_valid_block_count, resulting in a race condition. In a rare case, we can get -ENOSPC when doing roll-forward, which triggers

	if (is_valid_blkaddr(sbi, dest, META_POR)) {
		if (src == NULL_ADDR) {
			err = reserve_new_block(&dn);
			f2fs_bug_on(sbi, err);
			...
		}
		...
	}

in do_recover_data. So, this patch avoids that situation in advance.
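A hedged sketch of a race-free space check (field names follow the percpu_counter patches above; the exact margin is an assumption, not the verbatim patch):

	#include <linux/percpu_counter.h>

	/*
	 * Illustrative sketch: sum the percpu counter so the decision to
	 * checkpoint sees a consistent allocation count instead of a
	 * possibly stale per-cpu snapshot.
	 */
	static inline bool has_space_for_roll_forward(struct f2fs_sb_info *sbi)
	{
		s64 nalloc = percpu_counter_sum_positive(&sbi->alloc_valid_block_count);

		return (s64)sbi->user_block_count > nalloc + sbi->blocks_per_seg;
	}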
f2fs: handle error case with f2fs_bug_on

It's enough to show BUG or WARN via f2fs_bug_on for the error case. Then, we
don't need to remain in a corrupted filesystem.

Signed-off-by: Jaegeuk Kim

f2fs: get victim segment again after new cp

A previously selected segment may become free after write_checkpoint. If we
do garbage collection on this segment, and new_curseg then happens to reuse
it, it may cause the f2fs_bug_on shown below.

    panic+0x154/0x29c
    do_garbage_collect+0x15c/0xaf4
    f2fs_gc+0x2dc/0x444
    f2fs_balance_fs.part.22+0xcc/0x14c
    f2fs_balance_fs+0x28/0x34
    f2fs_map_blocks+0x5ec/0x790
    f2fs_preallocate_blocks+0xe0/0x100
    f2fs_file_write_iter+0x64/0x11c
    new_sync_write+0xac/0x11c
    vfs_write+0x144/0x1e4
    SyS_write+0x60/0xc0

Here, we may check the sit and ssa type during reset_curseg. So, we check
whether the segment is stale or not, and select a new victim to avoid this.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim

f2fs: adjust other changes

This patch changes:
- d_inode
- file_dentry
- inode_nohighmem
- ...

Signed-off-by: Jaegeuk Kim

fscrypto: no support for v3.4

Encryption is now disabled.

Change-Id: I06a44489285d4a46870660727c560589c8ae8932
Signed-off-by: Jaegeuk Kim
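Looping back to the GC-vs-DIO race described earlier, an abbreviated sketch
of the exclusion idea; the per-inode rwsem and the call names inside it are
hypothetical placeholders, not the exact patch:

    /* direct IO path takes the lock shared ... */
    down_read(&fi->dio_rwsem);          /* hypothetical per-inode rwsem */
    err = do_direct_io(inode, iter);    /* illustrative call */
    up_read(&fi->dio_rwsem);

    /* ... while GC's block migration takes it exclusive */
    down_write(&fi->dio_rwsem);
    err = migrate_data_page(inode, page);   /* illustrative call */
    up_write(&fi->dio_rwsem);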
f2fs: enhance the bit operation for SSR

This patch enhances the existing bit operation when f2fs allocates SSR
blocks.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: refactor f2fs_commit_super

Previously, f2fs_commit_super hacks the bh->blocknr to write the broken
alternate superblock. Instead of that, we should use the correct logic to
retrieve its buffer head, locking it appropriately.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use lock_buffer when changing superblock

When modifying sb contents, we need to lock its buffer.

Signed-off-by: Jaegeuk Kim

f2fs: clean up node page updating flow

If read_node_page returns LOCKED_PAGE, in its caller it's better to:
a) skip the unneeded 'Update' flag and mapping info verification;
b) check the nid value stored in the footer structure of the node page.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/node.c

f2fs: write only the pages in range during defragment

@lend of filemap_write_and_wait_range is supposed to be an "offset in bytes
where the range ends (inclusive)". Subtract 1 to avoid writing an extra
page.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to update variable correctly when skip a unmapped block

map.m_len should be reduced after skipping a block.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: do more integrity verification for superblock

Do more sanity checks for the superblock during ->mount.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: add symbol to avoid any confusion with tools

This patch adds MAX_VOLUME_NAME to sync with f2fs-tools.

Signed-off-by: Jaegeuk Kim

f2fs: rename {add,remove,release}_dirty_inode to {add,remove,release}_ino_entry

remove_dirty_dir_inode will be renamed to remove_dirty_inode as a generic
function in a following patch for removing directory/regular/symlink inodes
from the global dirty list. Here we rename the ino management related
functions for readability, and also to avoid a name conflict.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim
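Two small sketches of the points above, under stated assumptions
(F2FS_SUPER_OFFSET is f2fs's superblock offset within the block; the
shift-based byte offsets in the second fragment are illustrative):

    /* update the superblock image under its buffer lock */
    lock_buffer(bh);
    memcpy(bh->b_data + F2FS_SUPER_OFFSET, super, sizeof(*super));
    set_buffer_dirty(bh);
    unlock_buffer(bh);
    err = sync_dirty_buffer(bh);

    /* @lend is inclusive: pass the last byte of the range, not one past it */
    err = filemap_write_and_wait_range(inode->i_mapping,
            (loff_t)start << PAGE_SHIFT,
            ((loff_t)(start + len) << PAGE_SHIFT) - 1);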
f2fs: introduce dirty list node in inode info

Add a new dirty list node member in the inode info for linking the inode to
the global dirty list in the superblock, instead of the old implementation
which allocated slab cache memory as an entry for the inode. It avoids
memory pressure due to slab cache allocation, and also makes the code
cleaner.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce __remove_dirty_inode

Introduce __remove_dirty_inode to clean up the code in
remove_dirty_dir_inode.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to reset variable correctly

f2fs_map_blocks will set m_flags and m_len to 0, so we don't need to reset
m_flags ourselves, but we do have to reset m_len to the correct value before
using it again.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: backup raw_super in sbi

f2fs uses fields of the f2fs_super_block struct directly in a grabbed
buffer. Once the buffer happens to be destroyed (e.g. through dd), it may
have unpredictable effects on f2fs. This patch fixes this by allocating an
additional buffer to store the data of the super block, rather than using
the grabbed block buffer directly.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: don't grab super block buffer header all the time

We have already got one copy of a valid super block in memory, so do not
grab the buffer header of the super block all the time.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: relocate tracepoint of write_checkpoint

It needs to relocate its location to see exact trace logs.

Signed-off-by: Jaegeuk Kim

f2fs: introduce __f2fs_commit_super

Introduce __f2fs_commit_super to include the duplicated code in
f2fs_commit_super for cleanup.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: record dirty status of regular/symlink inode

Maintain regular/symlink inodes which have dirty pages in the global dirty
list and record their total dirty page count, the same way directory inodes
are handled.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce new option for controlling data flush

Add a new option 'data_flush' to enable data flush functionality.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    Documentation/filesystems/f2fs.txt

f2fs: stat dirty regular/symlink inodes

Add stats for dirty regular and symlink inodes for showing in debugfs.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: support data flush in background

Previously, when finishing a checkpoint, we have persisted all fs meta info
including the meta inode, node inode, and dentry pages of directory inodes,
so, after a sudden power cut, f2fs can recover from the last checkpoint with
a full directory structure. But during the checkpoint, we didn't flush dirty
pages of regular and symlink inodes, so such dirty data still in memory will
be lost at the moment of power off. In order to reduce the chance of losing
data, this patch enables f2fs_balance_fs_bg with the ability of data
flushing. It will try to flush user data before starting a checkpoint, so
user data written after the last checkpoint which may not have been fsynced
could be saved. When we mount with the data_flush option, after every period
of cp_interval (configurable in sysfs: /sys/fs/f2fs/device/cp_interval)
seconds, user data can be flushed to the device once f2fs_balance_fs_bg is
called in a kworker thread or the gc thread.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: optimize the flow of f2fs_map_blocks

Check map->m_len right after it changes to avoid excess calls to update
dnode_of_data.

Signed-off-by: Fan li
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim
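A minimal sketch of the raw_super backup above: copy the on-disk super block
out of the buffer cache so a later overwrite of the buffer (e.g. via dd)
cannot corrupt the in-memory copy. The surrounding error handling is
abbreviated:

    /* duplicate the superblock image instead of pointing into bh->b_data */
    sbi->raw_super = kmemdup(bh->b_data + F2FS_SUPER_OFFSET,
                 sizeof(struct f2fs_super_block), GFP_KERNEL);
    if (!sbi->raw_super)
        return -ENOMEM;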
f2fs: add a tracepoint for sync_dirty_inodes

This patch adds a tracepoint for sync_dirty_inodes.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use atomic variable for total_extent_tree

It would be better to use an atomic variable for total_extent_tree.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: speed up shrinking extent tree entries

If there are no candidates for shrinking slab entries, we don't need to
traverse any trees at all.

Reviewed-by: Chao Yu
[Jaegeuk Kim: fix missing initialization reported by Yunlei He]
Signed-off-by: Jaegeuk Kim

f2fs: check inline_data flag at converting time

We can check the inode's inline_data flag when calling to convert it.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid unnecessary f2fs_gc for dir operations

f2fs_balance_fs doesn't need to cover the f2fs_new_inode or f2fs_find_entry
work.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/namei.c

f2fs: record node block allocation in dnode_of_data

This patch introduces recording node block allocation in dnode_of_data.
This information helps to figure out whether any node block was allocated
during specific file operations.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: reduce covered region of sbi->cp_rwsem in f2fs_map_blocks

Only cover sbi->cp_rwsem over one dnode page's allocation and modification
instead of multiple ones in f2fs_map_blocks. This reduces the covered region
of cp_rwsem, so we can avoid a potentially long delay for a concurrent
checkpointer.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: call f2fs_balance_fs only when node was changed

If a user tries to update or read data, we don't need to call
f2fs_balance_fs, which triggers f2fs_gc and adds unnecessarily long latency.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/file.c

f2fs: report error of do_checkpoint

do_checkpoint and write_checkpoint can fail for reasons like being triggered
on a readonly fs or encountering an IO error of the storage device. So it's
better to report such error info to the user, letting the user be aware of
the checkpoint failure.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: don't convert inline inode when inline_data option is disable

If the inline_data option is disabled, when truncating an inline inode to a
size which does not exceed the maximum inline size, we should not convert
the inline inode to a regular one, to avoid the overhead of the synchronous
conversion.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce prepare_write_begin to clean up

This patch adds prepare_write_begin to clean up f2fs_write_begin. The major
role of this function is to convert any inline_data and allocate or find the
block address.

Signed-off-by: Jaegeuk Kim

f2fs: return early when trying to read null nid

If get_node_page() gets a zero nid, we can return early without getting a
wrong page. For example, get_dnode_of_data() can try to do that.

Signed-off-by: Jaegeuk Kim

f2fs: avoid f2fs_lock_op in f2fs_write_begin

If f2fs_write_begin is to update data, we can bypass calling f2fs_lock_op()
in order to avoid the checkpoint latency in the write syscall.

Signed-off-by: Jaegeuk Kim

f2fs: declare static function

The __f2fs_commit_super is static.

Signed-off-by: Jaegeuk Kim
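A minimal sketch of the null-nid check above, inserted at the top of a
get_node_page()-style helper; the exact placement and error code are
assumptions:

    /* never issue a node page read for the null nid */
    if (!nid)
        return ERR_PTR(-ENOENT);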
f2fs: add missing f2fs_balance_fs in __recover_dot_dentries

__recover_dot_dentries will try to grab free space in storage, so fix this
by adding the missing f2fs_balance_fs here.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: let user being aware of IO error

Sometimes we stay silent when an IO error occurs in a lower layer device, so
the user receives no error return value for some operations even though the
operations did not actually succeed. This should be avoided, so this patch
reports such kinds of errors to the user.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/data.c

f2fs: fix bugs and simplify codes of f2fs_fiemap

Fix bugs:
1. len could be updated incorrectly when start+len is beyond isize.
2. If there is a hole consisting of more than two blocks, it could fail to
   add the FIEMAP_EXTENT_LAST flag for the last extent.
3. If there is an extent beyond isize, when we search extents in a range
   that ends at isize, it will also return the extent beyond isize, which is
   outside the range.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: add a max block check for get_data_block_bmap

This patch adds a max block check for get_data_block_bmap. The Trinity test
program will send a block number as a parameter into ioctl_fibmap, which
will be used in get_node_path(); when the block number is larger than the
f2fs max blocks, it will trigger a kernel bug.

Signed-off-by: Yunlei He
Signed-off-by: Xue Liu
[Jaegeuk Kim: fix missing condition, pointed by Chao Yu]
Signed-off-by: Jaegeuk Kim

f2fs: clean up f2fs_ioc_write_checkpoint

Use f2fs_sync_fs to clean up the code in f2fs_ioc_write_checkpoint.

Signed-off-by: Chao Yu
[Jaegeuk Kim: remove unused err variable]
Signed-off-by: Jaegeuk Kim

f2fs: early check broken symlink length in the encrypted case

If a link is broken, its len is zero, and we don't need to move forward.

Signed-off-by: Jaegeuk Kim

f2fs: use i_size_read to get i_size

We need to use i_size_read() to get inode->i_size.

Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/data.c

f2fs: load largest extent all the time

Otherwise, we can get mismatched largest extent information. One example is:
1. mount f2fs w/ extent_cache
2. make a small extent
3. umount
4. mount f2fs w/o extent_cache
5. update the largest extent
6. umount
7. mount f2fs w/ extent_cache
8. get the old extent made by #2

Signed-off-by: Jaegeuk Kim

f2fs: fix to skip recovering dot dentries in a readonly fs

If the filesystem is readonly, leave a user message instead of recovering
the inline dot inode.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix f2fs_ioc_abort_volatile_write

There are two rules for handling aborting volatile or atomic writes:
1. drop atomic writes
   - we don't need to keep any stale db data.
2. write journal data
   - we should keep the journal data with fsync for db recovery.

Signed-off-by: Jaegeuk Kim

f2fs: remove f2fs_bug_on in terms of max_depth

There is no report of this bug_on case, but if a malicious attacker changed
this field intentionally, we can just reset it to a MAX value.

Signed-off-by: Jaegeuk Kim
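A minimal sketch of the bmap bounds check above; the field/helper name for
the maximum block index is an assumption (the later commit introducing
max_file_blocks in sbi suggests the naming):

    /* reject block indexes beyond the largest offset f2fs supports */
    if (iblock >= sbi->max_file_blocks)    /* assumed field */
        return -EFBIG;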
f2fs: write pending bios when cp_error is set

When testing ioc_shutdown, put_super can hang waiting for pages under
writeback, as follows.

    INFO: task umount:2723 blocked for more than 120 seconds.
    Tainted: G O 4.4.0-rc3+ #8
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    umount D ffff88000859f9d8 0 2723 2110 0x00000000
    ffff88000859f9d8 0000000000000000 0000000000000000 ffffffff81e11540
    ffff880078c225c0 ffff8800085a0000 ffff88007fc17440 7fffffffffffffff
    ffffffff818239f0 ffff88000859fb48 ffff88000859f9f0 ffffffff8182310c
    Call Trace:
    [] ? bit_wait+0x50/0x50
    [] schedule+0x3c/0x90
    [] schedule_timeout+0x2d9/0x430
    [] ? mark_held_locks+0x6f/0xa0
    [] ? ktime_get+0x7d/0x140
    [] ? bit_wait+0x50/0x50
    [] ? kvm_clock_get_cycles+0x25/0x30
    [] ? ktime_get+0xac/0x140
    [] ? bit_wait+0x50/0x50
    [] io_schedule_timeout+0xa4/0x110
    [] bit_wait_io+0x35/0x50
    [] __wait_on_bit+0x5d/0x90
    [] wait_on_page_bit+0xcb/0xf0
    [] ? autoremove_wake_function+0x40/0x40
    [] truncate_inode_pages_range+0x4bc/0x840
    [] truncate_inode_pages_final+0x4d/0x60
    [] f2fs_evict_inode+0x75/0x400 [f2fs]
    [] evict+0xbc/0x190
    [] iput+0x229/0x2c0
    [] f2fs_put_super+0x105/0x1a0 [f2fs]
    [] generic_shutdown_super+0x6a/0xf0
    [] kill_block_super+0x27/0x70
    [] kill_f2fs_super+0x20/0x30 [f2fs]
    [] deactivate_locked_super+0x43/0x70
    [] deactivate_super+0x5c/0x60
    [] cleanup_mnt+0x3f/0x90
    [] __cleanup_mnt+0x12/0x20
    [] task_work_run+0x73/0xa0
    [] exit_to_usermode_loop+0xcc/0xd0
    [] syscall_return_slowpath+0xcc/0xe0
    [] int_ret_from_sys_call+0x25/0x9f

Signed-off-by: Jaegeuk Kim

f2fs: use IPU for fdatasync

This patch fixes the missing IPU condition when fdatasync is called. With
this patch, fdatasync is able to avoid additional node writes for recovery.

Signed-off-by: Jaegeuk Kim

f2fs: monitor zombie_tree count

This patch adds an entry to show the number of zombie extent_trees.

Signed-off-by: Jaegeuk Kim

f2fs: introduce zombie list for fast shrinking extent trees

This patch removes refcount, and instead adds a zombie_list to shrink
directly without a radix tree traversal.

Signed-off-by: Jaegeuk Kim

f2fs crypto: check CONFIG_F2FS_FS_XATTR for encrypted symlink

Add the missed CONFIG_F2FS_FS_XATTR for encrypted symlink inodes in order to
avoid unneeded registration of ->{get,set,remove}xattr.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce max_file_blocks in sbi

Introduce max_file_blocks in sbi to store the max block index of a file in
f2fs; it can be used to avoid unneeded calculation of the max block index at
runtime.

Signed-off-by: Chao Yu
[Jaegeuk Kim: fix overflow of sbi->max_file_blocks]
Signed-off-by: Jaegeuk Kim

f2fs: cover more area with nat_tree_lock

There was a subtle bug in nat cache management which incurs wrong nid
allocation or wrong block addresses when try_to_free_nats is triggered
heavily. This patch enlarges the previous coverage of nat_tree_lock to avoid
the data race.

Signed-off-by: Jaegeuk Kim

Revert "f2fs: check the node block address of newly allocated nid"

The original issue is fixed by:
    f2fs: cover more area with nat_tree_lock

This reverts commit 24928634f81b1592e83b37dcd89ed45c28f12feb.

f2fs: read isize while holding i_mutex in fiemap

Make sure the isize we read doesn't change during the process.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: check node id earily when readaheading node page

Add a node id check in ra_node_page and get_node_page_ra, like
get_node_page.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce __get_node_page to reuse common code

There is duplicated code between get_node_page and get_node_page_ra.
Introduce __get_node_page to contain the common parts of these two, and
implement get_node_page and get_node_page_ra by reusing __get_node_page.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/node.c

f2fs: check the page status filled from disk

After reading a page, we need to check whether there was any error.

Signed-off-by: Jaegeuk Kim

f2fs: avoid unnecessary f2fs_balance_fs calls

Only when a node page is newly dirtied does it need to check whether we need
to do f2fs_gc.

Signed-off-by: Jaegeuk Kim
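A minimal sketch of the read verification above: after a node page read
completes, confirm the page is up to date before using it. The surrounding
function shape is an assumption:

    lock_page(page);
    if (unlikely(!PageUptodate(page))) {
        /* the read from disk failed; don't hand out a bad page */
        unlock_page(page);
        put_page(page);    /* page_cache_release() on older kernels */
        return ERR_PTR(-EIO);
    }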
f2fs: remove redundant calls

This patch removes redundant calls.

Signed-off-by: Jaegeuk Kim

f2fs: clean up f2fs_balance_fs

This patch adds one parameter to clean up all the callers of
f2fs_balance_fs.

Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/namei.c

f2fs: recognize encrypted data in f2fs_fiemap

This patch fixes to teach f2fs_fiemap to recognize encrypted data.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use atomic type for node count in extent tree

1. rename the field in struct extent_tree from count to node_cnt for
   readability.
2. alter to use an atomic type for node_cnt.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: skip releasing nodes in chindless extent tree

If there are no nodes in the extent tree, let's skip the releasing step to
avoid any overhead of grabbing/releasing the extent tree lock.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce time and interval facility

This patch adds time and interval arrays to store some timing variables.

Signed-off-by: Jaegeuk Kim

f2fs: detect idle time depending on user behavior

This patch records the last time the user requested filesystem operations.
This information is used later to detect whether the system is idle or not.

Signed-off-by: Jaegeuk Kim

Conflicts:
    Documentation/ABI/testing/sysfs-fs-f2fs

f2fs: monitor the number of background checkpoint

This patch adds a counter showing the number of background checkpoints.

Signed-off-by: Jaegeuk Kim

f2fs: fix wrong memory condition check

This patch fixes a wrong decision for available_free_memory. The return
value is already set to false, so we should only consider the true condition
below.

Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/node.c

f2fs: should unset atomic flag after successful commit

If there is an error during commit, we should keep the flag in order to
abort it.

Signed-off-by: Jaegeuk Kim

f2fs: relocate is_merged_page

The operations in is_merged_page are related to the inner bio cache, so move
it to data.c.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/segment.c

f2fs: flush dirty nat entries when exceeding threshold

When testing f2fs with xfstest, generic/251 gets stuck for a long time. The
case uses the steps below to obtain freshly released space in the device, in
order to prepare for the following fstrim test.

1. rm -rf /mnt/dir
2. mkdir /mnt/dir/
3. cp -axT `pwd`/ /mnt/dir/
4. goto 1

During the preparation step, all nat entries will be cached in the nat
cache; most of them are dirty entries with an invalid blkaddr, which means
the nodes related to these entries have been truncated, and they could be
reused after the dirty entries have been checkpointed. However, no
checkpoint was triggered, so nid allocators (e.g. mkdir, creat) run into the
long journey of iterating all NAT pages, looking for free nids in
alloc_nid->build_free_nids. Here, in f2fs_balance_fs_bg we give another
chance to do a checkpoint to flush nat entries, so they can be reused in the
free nid cache, when the dirty entry count exceeds 10% of the max count.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: export dirty_nats_ratio in sysfs

This patch exports a new sysfs entry 'dirty_nats_ratio' to control the
threshold of dirty nat entries; if the current ratio exceeds the configured
threshold, a checkpoint will be triggered in f2fs_balance_fs_bg to flush
dirty nats.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    Documentation/ABI/testing/sysfs-fs-f2fs
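A minimal sketch of the idle detection described above: stamp the time of
the last user request and treat the fs as idle once an interval has passed.
The field names are illustrative assumptions:

    static inline void f2fs_update_time_sketch(struct f2fs_sb_info *sbi)
    {
        sbi->last_req_time = jiffies;    /* hypothetical field */
    }

    static inline bool f2fs_is_idle_sketch(struct f2fs_sb_info *sbi)
    {
        /* idle_interval in seconds, also a hypothetical field */
        return time_after(jiffies,
                sbi->last_req_time + sbi->idle_interval * HZ);
    }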
f2fs: correct search area in get_new_segment

get_new_segment starts from the current segment position and tries to search
for a free segment among its right neighbours located in the same section.
But previously our search area was set as [current segment, max segment],
which means we had to search over more bits in the free_segmap bitmap in
some worst cases. So here we correct the search area to [current segment,
last segment in section] to avoid unnecessary searching.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: remove needless condition check

This patch removes a needless condition variable.

Signed-off-by: Jaegeuk Kim

f2fs: use writepages->lock for WB_SYNC_ALL

If there are many writepages calls by multiple threads in the background, we
don't need to serialize to merge all the bios, since it's background work.
In such a case, it's better to run writepages concurrently.

Signed-off-by: Jaegeuk Kim

f2fs: fix to overcome inline_data floods

The scenario is:
1. create lots of node blocks
2. sync
3. write lots of inline_data
-> got a panic due to no free space

In that case, we should flush node blocks when writing inline_data in #3,
and trigger gc as well.

Signed-off-by: Jaegeuk Kim

f2fs: do f2fs_balance_fs when block is allocated

We should consider data block allocation to trigger f2fs_balance_fs.

Signed-off-by: Jaegeuk Kim

f2fs: avoid multiple node page writes due to inline_data

The scenario is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly

So, this patch tries to flush inline_data when flushing node blocks.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: don't need to sync node page at every time

In write_end, we don't need to sync the inode page every time. Instead, we
can expect f2fs_write_inode to update it later.

Signed-off-by: Jaegeuk Kim

f2fs: avoid needless sync_inode_page when reading inline_data

In write_begin, if there is inline_data, f2fs loads it into the 0'th data
page. Since it's the read path, we don't need to sync its inode page.

Signed-off-by: Jaegeuk Kim

f2fs: don't need to call set_page_dirty for io error

If end_io gets an error, we don't need to set the page as dirty, since we
already set f2fs_stop_checkpoint, which will not flush any data. This
resolves the following warning.

    ======================================================
    [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
    4.4.0+ #9 Tainted: G O
    ------------------------------------------------------
    xfs_io/26773 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    (&(&sbi->inode_lock[i])->rlock){+.+...}, at: [] update_dirty_page+0x6f/0xd0 [f2fs]
    and this task is already holding:
    (&(&q->__queue_lock)->rlock){-.-.-.}, at: [] blk_queue_bio+0x422/0x490
    which would create a new lock dependency:
    (&(&q->__queue_lock)->rlock){-.-.-.} -> (&(&sbi->inode_lock[i])->rlock){+.+...}

Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/data.c
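A minimal sketch of the WB_SYNC_ALL serialization above: only synchronous
writers take the per-sb writepages mutex, so background writeback can run
concurrently. The mutex field is shown as an assumption in f2fs style:

    bool locked = false;

    if (wbc->sync_mode == WB_SYNC_ALL) {
        mutex_lock(&sbi->writepages);    /* assumed per-sb mutex */
        locked = true;
    }
    ret = write_cache_pages(mapping, wbc, __f2fs_writepage, mapping);
    if (locked)
        mutex_unlock(&sbi->writepages);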
f2fs: enhance foreground GC

If we configure a section to consist of multiple segments, foreground GC
will do the garbage collection with the following approach:

    for each segment in victim section
        blk_start_plug
        for each valid block in segment
            write out by OPU method
        submit bio cache   <---
        blk_finish_plug    <---

There are two issues:
1) for most of the time, 'submit bio cache' will break the merging in the
   current bio buffer from writes of the next segments, making a smaller bio
   submission.
2) the block plug only covers IO submission in one segment, which reduces
   the opportunity of merging IOs in the plug across multiple segments.

So refactor the code into the structure below to strive for the biggest
opportunity of merging IOs:

    blk_start_plug
    for each segment in victim section
        for each valid block in segment
            write out by OPU method
    submit bio cache
    blk_finish_plug

Test method:
1. mkfs.f2fs -s 8 /dev/sdX
2. touch 32 files
3. write 2M data into each file
4. punch 1.5M data from offset 0 for each file
5. trigger foreground gc through ioctl

Before the patch, there are in total 40 bios submitted.

    f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
    f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65776, size = 122880
    f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66016, size = 122880
    f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66256, size = 122880
    f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66496, size = 32768
    ----repeat for 8 times

After the patch, there are in total 35 bios submitted.

    f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
    ----repeat 34 times
    f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 73696, size = 16384

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: use wait_for_stable_page to avoid contention

In write_begin, if the storage supports stable pages, we don't need to wait
for writeback to update its contents. This patch introduces the use of
wait_for_stable_page instead of wait_on_page_writeback.

Signed-off-by: Jaegeuk Kim

f2fs: delete unnecessary wait for page writeback

There is no need to wait for inline file page writeback since no one uses
it, so this patch deletes the unnecessary wait.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim

f2fs: avoid unnecessary search while finding victim in gc

The variable nsearched in get_victim_by_default() indicates the number of
dirty segments we have already checked. There are 2 problems with the way it
updates:
1. When p.ofs_unit is greater than 1, the victim we find consists of
   multiple segments, possibly more than 1 dirty segment, but nsearched
   always increases by 1.
2. If segments have been found but not been chosen, nsearched won't
   increase. So even if we have checked all dirty segments, nsearched may
   still be less than p.max_search.

All these problems could cause unnecessary searches after all dirty segments
have already been checked.

Signed-off-by: Fan li
Signed-off-by: Jaegeuk Kim

f2fs: use wq_has_sleeper for cp_wait wait_queue

We need to use wq_has_sleeper, which includes smp_mb, to consider cp_wait
concurrency.

Signed-off-by: Jaegeuk Kim

f2fs: reconstruct the code to free an extent_node

There are three steps to free an extent node: 1) list_del_init,
2) __detach_extent_node, 3) kmem_cache_free.

In the path f2fs_destroy_extent_tree a node is freed in the order 1->2->3,
but in the path f2fs_update_extent_tree_range it is 2->1->3. This patch
makes the order 1->2->3 everywhere. This makes sense, since in the next
patch we introduce a victim list in the shrink_extent_tree path, and we can
check whether an extent_node is in the victim list by checking list_empty(),
so it is necessary to put 1) first.

Signed-off-by: Hou Pengyang
Signed-off-by: Jaegeuk Kim

f2fs: move extent_node list operations being coupled with rbtree operation

This patch moves extent_node list operations to be handled together with its
rbtree operations.

Signed-off-by: Jaegeuk Kim
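For the cp_wait wake-up above, a minimal sketch; get_pages() and the cp_wait
field follow the f2fs naming seen elsewhere in this log, and the open-coded
form in the comment is the equivalent for kernels predating wq_has_sleeper():

    /* wq_has_sleeper() carries the smp_mb() paired with the waiter */
    if (!get_pages(sbi, F2FS_WRITEBACK) && wq_has_sleeper(&sbi->cp_wait))
        wake_up(&sbi->cp_wait);

    /*
     * open-coded equivalent:
     *    smp_mb();
     *    if (waitqueue_active(&sbi->cp_wait))
     *        wake_up(&sbi->cp_wait);
     */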
f2fs: don't set cached_en if it will be freed

If en has an empty list pointer, it will be freed soon, so we don't need to
set cached_en with it.

Signed-off-by: Jaegeuk Kim

f2fs: improve shrink performance of extent nodes

In the worst case, we need to scan the whole radix tree and every rb-tree to
free the victimized extent_nodes when shrinking. Pengyang initially
introduced a victim_list to record the victimized extent_nodes, freeing
these extent_nodes by just scanning a list. Later, Chao Yu enhanced the
original patch to improve the memory footprint by removing the victim list.
The policy of lru list shrinking becomes:
1) lock the lru list's lock
2) trylock the extent tree's lock
3) remove the extent node from the lru list
4) unlock the lru list's lock
5) do the shrink
6) repeat 1) to 5)

Signed-off-by: Hou Pengyang
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: give scheduling point in shrinking path

It needs to give a chance to be rescheduled while shrinking slab entries.

Signed-off-by: Jaegeuk Kim

f2fs: introduce lifetime write IO statistics

This patch introduces lifetime write IO statistics exposed through the sysfs
interface. The write IO amount is obtained from the block layer, accumulated
in the file system and stored in the hot node summary of the checkpoint.

Signed-off-by: Shuoran Liu
Signed-off-by: Pengyang Hou
[Jaegeuk Kim: add sysfs documentation]
Signed-off-by: Jaegeuk Kim

f2fs: simplify f2fs_map_blocks

In f2fs_map_blocks, we use duplicated code to handle the first block's
mapping and the following blocks' mapping, which is unnecessary. This patch
simplifies f2fs_map_blocks to avoid using copied code.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: simplify __allocate_data_blocks

This patch uses the existing function f2fs_map_block to simplify the
implementation of __allocate_data_blocks.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: remove unneeded pointer conversion

There are redundant pointer conversions in the following call stack:
- at position a, the inode was converted to f2fs_file_info.
- at position b, f2fs_file_info was converted to an inode again.

    - truncate_blocks(inode,..)
     - fi = F2FS_I(inode)            ---a
     - ADDRS_PER_PAGE(node_page, fi)
      - addrs_per_inode(fi)
       - inode = &fi->vfs_inode      ---b
       - f2fs_has_inline_xattr(inode)
        - fi = F2FS_I(inode)
        - is_inode_flag_set(fi,..)

In order to avoid the unneeded conversion, alter ADDRS_PER_PAGE and
addrs_per_inode to accept a parameter of inode pointer type.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: introduce get_next_page_offset to speed up SEEK_DATA

When seeking data in ->llseek, if we encounter a big hole which covers
several dnode pages, we try to seek data from the index of the page which is
the first page of the next dnode page; at most we can skip searching
(ADDRS_PER_BLOCK - 1) pages. However, it's still not efficient, because if
our indirect/double-indirect pointers are NULL, there are no dnode pages
located in the tree those pointers point to, so it's not necessary to search
the whole region. This patch introduces get_next_page_offset to calculate
the next page offset based on the current searching level and the max
searching level returned from get_dnode_of_data; with this, we can skip
searching the entire area where the indirect or double-indirect node block
does not exist.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim
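For the scheduling point mentioned above, a minimal sketch; the batch
constant is illustrative, and this assumes no spinlocks are held at the
yield point:

    /* yield periodically so a long shrink pass cannot hog the CPU */
    if (!(++freed % 16))    /* batch size is an assumption */
        cond_resched();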
f2fs: speed up handling holes in fiemap

This patch makes f2fs_map_blocks support returning the next potential page
offset which skips the hole region in the indirect tree of the inode, and
uses it to speed up fiemap in handling the big hole case.

Test method:

    xfs_io -f /mnt/f2fs/file -c "pwrite 1099511627776 4096"
    time xfs_io -f /mnt/f2fs/file -c "fiemap -v"

Before:

    time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
    /mnt/f2fs/file:
    EXT: FILE-OFFSET             BLOCK-RANGE   TOTAL       FLAGS
    0:   [0..2147483647]:        hole          2147483648
    1:   [2147483648..2147483655]: 81920..81927  8         0x1
    real 3m3.518s
    user 0m0.000s
    sys 3m3.456s

After:

    time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
    /mnt/f2fs/file:
    EXT: FILE-OFFSET             BLOCK-RANGE   TOTAL       FLAGS
    0:   [0..2147483647]:        hole          2147483648
    1:   [2147483648..2147483655]: 81920..81927  8         0x1
    real 0m0.008s
    user 0m0.000s
    sys 0m0.008s

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix endianness of on-disk summary_footer

Signed-off-by: Sheng Yong
Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: wait on page's writeback in writepages path

Like f2fs_write_cache_pages, let's do this for node and meta pages too.
Especially, for node blocks, we should do this before setting their fsync
and dentry flags.

Signed-off-by: Jaegeuk Kim

f2fs: flush bios to handle cp_error in put_super

Sometimes, if cp_error is set, there remain under-writeback pages, resulting
in a kernel hang in put_super.

Signed-off-by: Jaegeuk Kim

f2fs: fix conflict on page->private usage

This patch fixes a conflict on the page->private value between
f2fs_trace_pid and atomic pages.

Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_submit_merged_bio_cond

f2fs uses a single bio buffer per data type (META/NODE/DATA) for caching
writes located at contiguous block addresses as far as possible. After
submitting, these writes may still be cached in the bio buffer, so we have
to flush cached writes in the bio buffer by calling f2fs_submit_merged_bio.
Unfortunately, in a scenario of high concurrency, the bio buffer could be
flushed by someone else before we submit it, for the reasons below:
a) there is no space in the bio buffer.
b) a request of a different type (SYNC, ASYNC) is added.
c) a discontinuous block address is added.

Under these conditions, f2fs_submit_merged_bio is devastating, because it
can break the subsequent merging of writes in the bio buffer, splitting one
big bio into two smaller ones. This patch introduces
f2fs_submit_merged_bio_cond which can do a conditional submit of the bio
buffer; before submitting, it judges whether:
 - a page in the DATA type bio buffer matches the specified page;
 - a page in the DATA type bio buffer belongs to the specified inode;
 - a page in the NODE type bio buffer belongs to the specified inode.
If there is no eligible page in the bio buffer, we skip the submitting step,
gaining more chances to merge consecutive block IOs in the bio cache.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix missing skip pages info

Fix the missing skipped pages info in the f2fs_writepages trace event.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim

f2fs: move dio preallocation into f2fs_file_write_iter

This patch moves the preallocation code for direct IOs into
f2fs_file_write_iter.

Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/data.c
    fs/f2fs/file.c

f2fs: preallocate blocks for buffered aio writes

This patch preallocates data blocks for buffered aio writes. With this
patch, we can avoid redundant locking and unlocking of node pages given
consecutive aio requests.

[For 3.10]
- Add preallocation for generic_splice_write(sendfile) for xfstests/249, 285

Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/data.c
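For the conditional flush described above, a minimal sketch; the
has_merged_page() helper and its parameters are assumptions, named in the
style of the surrounding text:

    /*
     * Only force out the merged bio buffer when it actually contains
     * a page we must wait on; otherwise keep merging.
     */
    if (has_merged_page(sbi, inode, page, type))    /* assumed helper */
        f2fs_submit_merged_bio(sbi, type, rw);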
f2fs: increase i_size to avoid missing data

When finsert is dirtying pages, we should increase i_size right away.
Otherwise, the moved page can be dropped by the following
filemap_write_and_wait_range before i_size is updated. Especially, this can
be done by

    if ((page->index >= end_index + 1) || !offset)
        goto out;

in f2fs_write_data_page. This should resolve the xfstests/091 failure below,
reported by Dave.

    $ diff -u tests/generic/091.out /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad
    --- tests/generic/091.out 2014-01-20 16:57:33.000000000 +1100
    +++ /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad 2016-02-08 15:21:02.701375087 +1100
    @@ -1,7 +1,18 @@
     QA output created by 091
     fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
    +mapped writes DISABLED
    +skipping insert range behind EOF
    +skipping insert range behind EOF
    +truncating to largest ever: 0x11e00
    +dowrite: write: Invalid argument
    +LOG DUMP (7 total operations):
    +1( 1 mod 256): SKIPPED (no operation)
    +2( 2 mod 256): SKIPPED (no operation)
    +3( 3 mod 256): FALLOC 0x2e0f2 thru 0x3134a (0x3258 bytes) PAST_EOF
    +4( 4 mod 256): SKIPPED (no operation)
    +5( 5 mod 256): SKIPPED (no operation)
    +6( 6 mod 256): TRUNCATE UP from 0x0 to 0x11e00
    +7( 7 mod 256): WRITE 0x73400 thru 0x79fff (0x6c00 bytes) HOLE
    +Log of operations saved to "/mnt/test/junk.fsxops"; replay with --replay-ops
    +Correct content saved for comparison
    +(maybe hexdump "/mnt/test/junk" vs "/mnt/test/junk.fsxgood")

Reported-by: Dave Chinner
Signed-off-by: Jaegeuk Kim

f2fs crypto: replace some BUG_ON()'s with error checks

This patch adopts:
    ext4 crypto: replace some BUG_ON()'s with error checks

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/crypto_key.c

f2fs crypto: fix spelling typo in comment

This patch adopts:
    ext4 crypto: fix spelling typo in comment

Signed-off-by: Laurent Navet
Signed-off-by: Jaegeuk Kim

f2fs crypto: f2fs_page_crypto() doesn't need a encryption context

This patch adopts:
    ext4 crypto: ext4_page_crypto() doesn't need a encryption context

Since ext4_page_crypto() doesn't need an encryption context (at least not
any more), this allows us to simplify a number of function signatures and
also allows us to avoid needing to allocate a context in
ext4_block_write_begin(). It also means we no longer need a separate
ext4_decrypt_one() function.

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

f2fs crypto: check for too-short encrypted file names

This patch adopts:
    ext4 crypto: check for too-short encrypted file names

An encrypted file name should never be shorter than 16 bytes, the AES block
size. The 3.10 crypto layer will oops and crash the kernel if ciphertext
shorter than the block size is passed to it. Fortunately, in modern kernels
the crypto layer will not crash the kernel in this scenario, but
nevertheless, it represents a corrupted directory, and we should detect it
and mark the file system as corrupted so that e2fsck can fix this.

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

f2fs crypto: add missing locking for keyring_key access

This patch adopts:
    ext4 crypto: add missing locking for keyring_key access

Signed-off-by: Theodore Ts'o
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/crypto_key.c
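A minimal sketch of the BUG_ON() replacement and the too-short-name check
above, combined; the variable name and error code are assumptions:

    /* was: BUG_ON(ciphertext_len < 16); report corruption instead */
    if (ciphertext_len < 16)    /* AES block size */
        return -EIO;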
f2fs: use correct errno

This patch fixes a misused error number.

Signed-off-by: Jaegeuk Kim

f2fs: avoid garbage lengths in dentries

This patch fixes to eliminate garbage name lengths in dentries in order to
provide correct answers from readdir. For example, if a valid dentry
consists of:

    bitmap : 1  1 1 1
    len    : 32 0 x 0,

readdir can start with the second bit_pos having len = 0, or it can start
with the third bit_pos having garbage. In both cases, we should avoid trying
to fill dentries. So, this patch not only removes any garbage length, but
also avoids entering the zero length case in readdir.

Signed-off-by: Jaegeuk Kim

f2fs: split drop_inmem_pages from commit_inmem_pages

Split drop_inmem_pages from commit_inmem_pages for code readability, and
prepare for the following modification.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: support revoking atomic written pages

f2fs supports atomic writes with the following semantics:
1. open db file
2. ioctl start atomic write
3. (write db file) * n
4. ioctl commit atomic write
5. close db file

With this flow we can avoid the file becoming corrupted on an abnormal power
cut, because we hold the data of the transaction in referenced pages linked
into the inmem_pages list of the inode, without setting them dirty, so this
data won't be persisted unless we commit it in step 4. But we should still
hold the journal db file in memory using volatile writes, because our
semantics of 'atomic write support' are incomplete: in step 4, we could fail
to submit all the dirty data of the transaction, and once partial dirty data
has been committed to storage, then after a checkpoint and an abnormal power
cut, the db file will be corrupted forever. So this patch tries to improve
the atomic write flow by adding a revoking flow: once an inner error occurs
during committing, it gives another chance to try to revoke the partially
submitted data of the current transaction, which makes the commit operation
more atomic. If we're not lucky and the revoking operation fails as well,
EAGAIN is reported to the user, suggesting to do the recovery with the held
journal file, or to retry the current transaction again.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs crypto: make sure the encryption info is initialized on opendir(2)

This patch syncs f2fs with commit 6bc445e0ff44 ("ext4 crypto: make sure the
encryption info is initialized on opendir(2)") from ext4.

Signed-off-by: Theodore Ts'o
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs crypto: handle unexpected lack of encryption keys

This patch syncs f2fs with commit abdd438b26b4 ("ext4 crypto: handle
unexpected lack of encryption keys") from ext4. Fix up attempts by users to
write to a file when they don't have access to the encryption key.

Signed-off-by: Theodore Ts'o
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs crypto: avoid unneeded memory allocation when {en/de}crypting symlink

This patch adopts the corresponding ext4 code for f2fs; it removes unneeded
memory allocation in the creating/accessing paths of symlinks.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
    fs/f2fs/crypto_fname.c
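From userspace, the atomic-write flow above looks like the sketch below; the
ioctl definitions come from f2fs's header (copied here so the example is
self-contained), and the error handling is abbreviated:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define F2FS_IOCTL_MAGIC              0xf5
    #define F2FS_IOC_START_ATOMIC_WRITE   _IO(F2FS_IOCTL_MAGIC, 1)
    #define F2FS_IOC_COMMIT_ATOMIC_WRITE  _IO(F2FS_IOCTL_MAGIC, 2)

    int db_transaction(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_RDWR);           /* step 1 */
        if (fd < 0)
            return -1;
        ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE);  /* step 2 */
        write(fd, buf, len);                     /* step 3, may repeat */
        /* step 4; EAGAIN would mean the revoke path failed too */
        ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE);
        close(fd);                               /* step 5 */
        return 0;
    }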
f2fs: introduce f2fs_journal struct to wrap journal info

Introduce a new structure f2fs_journal to wrap the journal info in struct
f2fs_summary_block for readability.

    struct f2fs_journal {
        union {
            __le16 n_nats;
            __le16 n_sits;
        };
        union {
            struct nat_journal nat_j;
            struct sit_journal sit_j;
            struct f2fs_extra_info info;
        };
    } __packed;

    struct f2fs_summary_block {
        struct f2fs_summary entries[ENTRIES_IN_SUM];
        struct f2fs_journal journal;
        struct summary_footer footer;
    } __packed;

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: enhance IO path with block plug

Try to use the block plug in more places, as below, to let processes cache
bios as much as possible, in order to reduce the lock overhead of the queue
in the IO scheduler.
1) sync_meta_pages
2) ra_meta_pages
3) f2fs_balance_fs_bg

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: split journal cache from curseg cache

In the curseg cache, f2fs caches two different parts:
 - data of the current summary block, i.e. summary entries and footer info.
 - journal info, i.e. sparse nat/sit entries or io stat info.

With this approach:
1) it may cause higher lock contention when we access or update both parts
   of the cache, since we use the same mutex lock curseg_mutex to protect
   the cache;
2) the current summary block with the last journal info will be written back
   to the device as a normal summary block when flushing, but we treat
   journal info as valid only in the current summary, so most normal summary
   blocks contain junk journal data, which wastes the remaining space of the
   summary block.

So, in order to fix the above issues, we split the curseg cache into two
parts:
a) the current summary block, protected by the original mutex lock
   curseg_mutex;
b) the journal cache, protected by a newly introduced r/w semaphore
   journal_rwsem.

When loading the curseg cache during ->mount, we store summary info and
journal info in different caches; when doing a checkpoint, we combine the
data of the two caches into the current summary block for persisting.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: reorder nat cache lock in cache_nat_entry

When looking up a nat entry in cache_nat_entry, if we fail to hit the nat
cache, we try to load nat entries a) from the journal of the current segment
cache, or b) from NAT pages, for updating. During the process, the write
lock of nat_tree_lock is held to avoid inconsistency between the nid cache
and the nat cache caused by racing among the nat entry shrinker,
checkpointer and nat entry updater. But this can be inefficient when
updating the nat cache, because it serializes accessing the journal cache
and reading NAT pages. Here, we reorder the lock and update flow as below to
enhance access concurrency:

    - get_node_info
     - down_read(nat_tree_lock)
     - lookup nat cache       --- hit -> unlock & return
     - lookup journal cache   --- hit -> unlock & goto update
     - up_read(nat_tree_lock)
    update:
     - down_write(nat_tree_lock)
     - cache_nat_entry
      - lookup nat cache      --- nohit -> update
     - up_write(nat_tree_lock)

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: slightly reorganize read_raw_super_block

read_raw_super_block was introduced to help find the first valid superblock.
Commit da554e48caab ("f2fs: recovering broken superblock during mount")
changed the behaviour to read both of them and check whether the recovery
flag is needed or not. So the comment before this function isn't consistent
with what it actually does. Also, the original code uses two tags around the
error cases, which isn't very readable. So this patch amends the comment and
slightly reorganizes the function.

Signed-off-by: Shawn Lin
Signed-off-by: Jaegeuk Kim
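A minimal sketch of the plugging pattern above: batch bios across a loop so
the block layer can merge them before the unplug. The page-submit call
inside the loop is an illustrative placeholder:

    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);
    for (i = 0; i < nrpages; i++)
        submit_one_page(sbi, pages[i]);    /* illustrative */
    blk_finish_plug(&plug);    /* merged bios hit the queue here */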
f2fs: move sanity checking of cp into get_valid_checkpoint

From the function name of get_valid_checkpoint, it seems to return a valid
cp or NULL for the caller to check. If no valid one is found,
f2fs_fill_super will print an error log. But if get_valid_checkpoint returns
one it deems valid, while it is actually invalid after sanity checking,
another similar error log is printed. That seems strange. Let's keep the
sanity checking inside the procedure of getting a valid cp. Another
improvement we gain from this move is that, even when a large volume is
supported, we check the cp in advance so as to skip the following procedure
if it fails the sanity checking.

Signed-off-by: Shawn Lin
Signed-off-by: Jaegeuk Kim

f2fs: detect error of update_dent_inode in ->rename

We should check and show the correct return value of update_dent_inode in
->rename.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: fix to delete old dirent in converted inline directory in ->rename

When running fstests/generic/068 on f2fs with inline_dentry enabled, the
following oops dmesg is reported:

    ------------[ cut here ]------------
    WARNING: CPU: 5 PID: 11841 at fs/inode.c:273 drop_nlink+0x49/0x50()
    Modules linked in: f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
    CPU: 5 PID: 11841 Comm: fsstress Tainted: G O 4.5.0-rc1 #45
    Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013
    0000000000000111 ffff88009cdf7ae8 ffffffff813e5944 0000000000002e41
    0000000000000000 0000000000000111 0000000000000000 ffff88009cdf7b28
    ffffffff8106a587 ffff88009cdf7b58 ffff8804078fe180 ffff880374a64e00
    Call Trace:
    [] dump_stack+0x48/0x64
    [] warn_slowpath_common+0x97/0xe0
    [] warn_slowpath_null+0x1a/0x20
    [] drop_nlink+0x49/0x50
    [] f2fs_rename2+0xe04/0x10c0 [f2fs]
    [] ? lock_two_nondirectories+0x81/0x90
    [] ? lockref_get+0x1d/0x30
    [] vfs_rename+0x2e0/0x640
    [] ? lookup_dcache+0x3b/0xd0
    [] ? update_fast_ctr+0x21/0x40
    [] ? security_path_rename+0xa2/0xd0
    [] SYSC_renameat2+0x4b6/0x540
    [] ? trace_hardirqs_off+0xd/0x10
    [] ? exit_to_usermode_loop+0x7a/0xd0
    [] ? int_ret_from_sys_call+0x52/0x9f
    [] ? trace_hardirqs_on_caller+0x100/0x1c0
    [] SyS_renameat2+0xe/0x10
    [] SyS_rename+0x1e/0x20
    [] entry_SYSCALL_64_fastpath+0x12/0x6f
    ---[ end trace 2b31e17995404e42 ]---

This is because: in the same inline directory, when we rename one file from
a source name to a target name which does not exist, and the space of the
inline dentry is not enough, inline conversion is triggered; after that, all
data in the inline dentry is moved to a normal dentry page. After attaching
the new entry in the converted dentry page, we still try to remove the old
entry in the original inline dentry; since the old entry has been moved,
this obviously has no effect, and the old entry remains in the converted
dentry page. Now we have two valid dentries pointing to the same inode,
which has an nlink value of 1; deleting them both produces the warning
above. This issue can be reproduced easily with the steps below:
1. mount f2fs with the inline_dentry option
2. mkdir dir
3. touch 180 files named [001-180] in dir
4. rename dir/180 dir/181
5. rm dir/180 dir/181

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: reuse read_inline_data for f2fs_convert_inline_page

f2fs_convert_inline_page reimplements what read_inline_data already does for
copying out the inline data from the inode_page. We can use read_inline_data
instead to simplify the code.

Signed-off-by: Shawn Lin
Signed-off-by: Jaegeuk Kim
f2fs: remain last victim segment number ascending order

This patch avoids keeping an inefficient victim segment number selected by
a victim. For example, if all the dirty segments have the same number of
valid blocks, we can get the victim segments in descending order due to
keeping a wrong last segment number.

Signed-off-by: Jaegeuk Kim

f2fs: fix the wrong stat count of calling gc

With a partition formatted with multiple segments per section, we counted
the gc operations incorrectly. E.g., for a partition with segs_per_sec = 4:

    cat /sys/kernel/debug/f2fs/status
    GC calls: 208 (BG: 7)
    - data segments : 104 (52)
    - node segments : 104 (24)

The GC called count should be (104 (data segs) + 104 (node segs)) / 4 = 52,
rather than 208. Fix it.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: show more info about superblock recovery

This patch changes to show more info in the message log about the recovery
of a corrupted superblock during ->mount, e.g. the index of the corrupted
superblock and the result of the recovery.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: try to flush inode after merging inline data

When flushing node pages, if the current node page is an inline inode page,
we try to merge the inline data from the data page into the inline inode
page, then skip flushing the current node page. This decreases the number of
nodes flushed in batch in this round, which may lead to worse performance.
This patch gives a chance to flush just-merged inline inode pages for
performance.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: trace old block address for CoWed page

This patch enables tracing the old block address of a CoWed page for better
debugging.

    f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f0, oldaddr = 0xfe8ab, newaddr = 0xfee90 rw = WRITE_SYNC, type = NODE
    f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f8, oldaddr = 0xfe8b0, newaddr = 0xfee91 rw = WRITE_SYNC, type = NODE
    f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4fa, oldaddr = 0xfe8ae, newaddr = 0xfee92 rw = WRITE_SYNC, type = NODE
    f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x96, oldaddr = 0xf049b, newaddr = 0x2bbe rw = WRITE, type = DATA
    f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x97, oldaddr = 0xf049c, newaddr = 0x2bbf rw = WRITE, type = DATA
    f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x98, oldaddr = 0xf049d, newaddr = 0x2bc0 rw = WRITE, type = DATA
    f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x47, oldaddr = 0xffffffff, newaddr = 0xf2631 rw = WRITE, type = DATA
    f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x48, oldaddr = 0xffffffff, newaddr = 0xf2632 rw = WRITE, type = DATA
    f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x49, oldaddr = 0xffffffff, newaddr = 0xf2633 rw = WRITE, type = DATA

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid hungtask problem caused by losing wake_up

The D state of wait_on_all_pages_writeback should be woken by
f2fs_write_end_io when all writeback pages have been successfully written to
the device. It's possible that the wake_up comes between get_pages and
io_schedule. In this case the wake_up is lost, and the task stays in the D
state even though all pages have been written back to the device; finally,
the whole system enters the hungtask state.

    if (!get_pages(sbi, F2FS_WRITEBACK))
        break;
            <--------- wake_up
    io_schedule();

Signed-off-by: Yunlei He
Signed-off-by: Biao He
Signed-off-by: Jaegeuk Kim
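One standard way to close the lost-wakeup window above is to register on the
waitqueue before re-checking the condition, so a wake_up between the check
and the sleep cannot be missed; a minimal sketch, reusing the names from the
snippet above:

    DEFINE_WAIT(wait);

    for (;;) {
        /* set task state before checking, pairing with the waker */
        prepare_to_wait(&sbi->cp_wait, &wait, TASK_UNINTERRUPTIBLE);
        if (!get_pages(sbi, F2FS_WRITEBACK))
            break;
        io_schedule();
    }
    finish_wait(&sbi->cp_wait, &wait);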
f2fs: fix incorrect upper bound when iterating inode mapping tree 1. The inode mapping tree can index pages in the range [0, ULONG_MAX]; however, in some places f2fs only searches or iterates pages in the range [0, LONG_MAX], resulting in missed hits in the page cache. 2. filemap_fdatawait_range accepts range parameters in units of bytes, so the maximum range it covers should be [0, LLONG_MAX]; if we use [0, LONG_MAX] as the range for waiting on writeback, a large number of pages will not be covered. This patch corrects the above two issues. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs crypto: fix incorrect positioning for GCing encrypted data page For now, the flow of GCing an encrypted data page is: 1) try to grab a meta page in the meta inode's mapping, with the index being the old block address of that data page 2) load the ciphertext into the meta page 3) allocate a new block address 4) write the meta page to the new block address 5) update the block address pointer in the direct node page. Other readers/writers use f2fs_wait_on_encrypted_page_writeback to check and wait on GCed encrypted data cached in meta pages being written back, in order to avoid inconsistency among the data page cache, the meta page cache and the on-disk data when updating. However, we use the new block address updated in step 5) as the index when looking up the meta page in the inner bio buffer. That is wrong, and we will never find the GCing meta page, since we used the old block address as the index of that page in step 1). This patch fixes the issue by swapping the order of step 1) and step 3), and in step 1) grabbing the page with the index generated in step 3). Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/gc.c

f2fs: introduce f2fs_update_data_blkaddr for cleanup Add a new helper f2fs_update_data_blkaddr to clean up redundant code. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_flush_merged_bios for cleanup Add a new helper f2fs_flush_merged_bios to clean up redundant code. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim
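A plausible shape for such a helper, assuming the existing per-type f2fs_submit_merged_bio interface (a sketch, not necessarily the exact in-tree code):

	void f2fs_flush_merged_bios(struct f2fs_sb_info *sbi)
	{
		/* flush whatever has been merged so far, for every page type */
		f2fs_submit_merged_bio(sbi, DATA, WRITE);
		f2fs_submit_merged_bio(sbi, NODE, WRITE);
		f2fs_submit_merged_bio(sbi, META, WRITE);
	}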
f2fs: fix to avoid deadlock when merging inline data When testing with fsstress, a kworker and a user thread were both blocked: INFO: task kworker/u16:1:16580 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u16:1 D ffff8803f2595390 0 16580 2 0x00000000 Workqueue: writeback bdi_writeback_workfn (flush-251:0) ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440 ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440 ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000 Call Trace: [] schedule+0x29/0x70 [] rwsem_down_read_failed+0xa5/0xf9 [] call_rwsem_down_read_failed+0x14/0x30 [] f2fs_write_data_page+0x31b/0x420 [f2fs] [] __f2fs_writepage+0x1a/0x50 [f2fs] [] f2fs_write_data_pages+0xe0/0x290 [f2fs] [] do_writepages+0x23/0x40 [] __writeback_single_inode+0x4e/0x250 [] writeback_sb_inodes+0x2c1/0x470 [] __writeback_inodes_wb+0x9e/0xd0 [] wb_writeback+0x1fb/0x2d0 [] wb_do_writeback+0x9c/0x220 [] bdi_writeback_workfn+0x72/0x1c0 [] process_one_work+0x1de/0x5b0 [] worker_thread+0x11f/0x3e0 [] kthread+0xde/0xf0 [] ret_from_fork+0x58/0x90 fsstress thread stack: [] sleep_on_page+0xe/0x20 [] __lock_page+0x67/0x70 [] find_lock_page+0x50/0x80 [] find_or_create_page+0x3f/0xb0 [] sync_node_pages+0x259/0x810 [f2fs] [] write_checkpoint+0x1a4/0xce0 [f2fs] [] f2fs_sync_fs+0x7c/0xd0 [f2fs] [] f2fs_sync_file+0x143/0x5f0 [f2fs] [] vfs_fsync_range+0x2b/0x40 [] vfs_fsync+0x1c/0x20 [] do_fsync+0x41/0x70 [] SyS_fdatasync+0x13/0x20 [] system_call_fastpath+0x16/0x1b [] 0xffffffffffffffff The reason for this issue is: CPU0: CPU1: - f2fs_write_data_pages - f2fs_sync_fs - write_checkpoint - block_operations - f2fs_lock_all - down_write(sbi->cp_rwsem) - lock_page(page) - f2fs_write_data_page - sync_node_pages - flush_inline_data - pagecache_get_page(page, GFP_LOCK) - f2fs_lock_op - down_read(sbi->cp_rwsem) This patch changes flush_inline_data to use trylock_page, fixing this ABBA deadlock. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: recovery missing dot dentries in root directory If f2fs was corrupted with missing dot dentries in the root directory, they need to be recovered after fsck.f2fs sets the F2FS_INLINE_DOTS flag in the directory inode, which it does when it detects missing dot dentries. Signed-off-by: Xue Liu Signed-off-by: Yong Sheng Reviewed-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: mutex can't be used by down_write_nest_lock() f2fs_lock_all() calls down_write_nest_lock() to acquire a rw_sem and check a mutex, but down_write_nest_lock() is designed for two rw_sems, according to the comment in include/linux/rwsem.h. Other than f2fs, it is only called in mm/mmap.c, with two rwsems. So it looks like it is used wrongly by f2fs. It also causes the compile warning below on -rt kernels. In file included from fs/f2fs/xattr.c:25:0: fs/f2fs/f2fs.h: In function 'f2fs_lock_all': fs/f2fs/f2fs.h:962:34: warning: passing argument 2 of 'down_write_nest_lock' from incompatible pointer type [-Wincompatible-pointer-types] f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex); ^ fs/f2fs/f2fs.h:27:55: note: in definition of macro 'f2fs_down_write' #define f2fs_down_write(x, y) down_write_nest_lock(x, y) ^ In file included from include/linux/rwsem.h:22:0, from fs/f2fs/xattr.c:21: include/linux/rwsem_rt.h:138:20: note: expected 'struct rw_semaphore *' but argument is of type 'struct mutex *' static inline void down_write_nest_lock(struct rw_semaphore *sem, Signed-off-by: Yang Shi Reviewed-by: Chao Yu Signed-off-by: Jaegeuk Kim
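The corrected locking simply drops the bogus nest-lock annotation; a simplified sketch of what the fixed helper looks like:

	static inline void f2fs_lock_all(struct f2fs_sb_info *sbi)
	{
		/* a plain down_write: no mutex passed as a fake "nest lock" */
		down_write(&sbi->cp_rwsem);
	}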
fs crypto: move per-file encryption from f2fs tree to fs/crypto This patch adds the renamed functions moved from the f2fs crypto files. [Backporting to 3.10] - Removed d_is_negative() in fscrypt_d_revalidate(). 1. definitions for per-file encryption used by ext4 and f2fs. 2. crypto.c for encrypt/decrypt functions a. IO preparation: fscrypt_get_ctx / fscrypt_release_ctx b. before IOs: fscrypt_encrypt_page, fscrypt_decrypt_page, fscrypt_zeroout_range c. after IOs: fscrypt_decrypt_bio_pages, fscrypt_pullback_bio_page, fscrypt_restore_control_page 3. policy.c supporting context management. a. for ioctls: fscrypt_process_policy, fscrypt_get_policy b. for context permission: fscrypt_has_permitted_context, fscrypt_inherit_context 4. keyinfo.c to handle permissions: fscrypt_get_encryption_info, fscrypt_free_encryption_info 5. fname.c to support filename encryption a. general wrapper functions: fscrypt_fname_disk_to_usr, fscrypt_fname_usr_to_disk, fscrypt_setup_filename, fscrypt_free_filename b. specific filename handling functions: fscrypt_fname_alloc_buffer, fscrypt_fname_free_buffer 6. Makefile and Kconfig Cc: Al Viro Signed-off-by: Michael Halcrow Signed-off-by: Ildar Muslukhov Signed-off-by: Uday Savagaonkar Signed-off-by: Theodore Ts'o Signed-off-by: Arnd Bergmann Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/f2fs.h fs/f2fs/inode.c fs/f2fs/super.c include/linux/fs.h

f2fs: define not-set fallocate flags This fixes build bugs in AOSP. Signed-off-by: Jaegeuk Kim

f2fs crypto: sync ext4_lookup and ext4_file_open This patch catches up with the lookup and open policies in ext4. Signed-off-by: Jaegeuk Kim

f2fs: modify the readahead method in ra_node_page() ra_node_page() is used to read ahead one node page. Compared to a regular read, it is faster because it does not wait for IO completion. But if it is called twice for the same block, and the IO request from the first call has not completed before the second call, the second call has to wait until the read is over. Here we use the approach of __do_page_cache_readahead() to solve this problem: it does nothing when someone else has already put the page in the mapping, and the status of the page is assured by whoever put it there. This implementation also avoids altering the page reference count. Signed-off-by: Fan li Signed-off-by: Jaegeuk Kim

f2fs: use cryptoapi crc32 functions The crc function is computed bit by bit. Optimize this by using the cryptoapi crc32 function, which is backed by h/w acceleration. Signed-off-by: Keith Mok Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/f2fs.h
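A rough sketch of a cryptoapi-backed CRC32 along these lines; it assumes a "crc32" shash transform allocated at mount time, elides error handling, and uses SHASH_DESC_ON_STACK, which only exists on reasonably recent kernels:

	#include <crypto/hash.h>

	static u32 f2fs_crc32_shash(struct crypto_shash *tfm, u32 crc,
				    const void *addr, unsigned int len)
	{
		SHASH_DESC_ON_STACK(shash, tfm);
		u32 *ctx = (u32 *)shash_desc_ctx(shash);

		shash->tfm = tfm;
		*ctx = crc;				/* seed with the running crc */
		crypto_shash_update(shash, addr, len);	/* h/w accelerated where available */
		return *ctx;
	}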
f2fs: declare static functions Just to avoid sparse warnings. Signed-off-by: Jaegeuk Kim

f2fs: clean up opened code with f2fs_update_dentry Just clean up open-coded logic with the existing function; no logic change. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid unneeded unlock_new_inode During ->lookup, the I_NEW state of the inode has already been cleared in f2fs_iget, so in the error path we don't need to clear it again. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: add missing argument to f2fs_setxattr stub The f2fs_setxattr() prototype for CONFIG_F2FS_FS_XATTR=n has been wrong for a long time, since 8ae8f1627f39 ("f2fs: support xattr security labels"), but there have never been any callers, so it did not matter. Now the function gets called from f2fs_ioc_keyctl(), which causes a build failure: fs/f2fs/file.c: In function 'f2fs_ioc_keyctl': include/linux/stddef.h:7:14: error: passing argument 6 of 'f2fs_setxattr' makes integer from pointer without a cast [-Werror=int-conversion] #define NULL ((void *)0) ^ fs/f2fs/file.c:1599:27: note: in expansion of macro 'NULL' value, F2FS_KEY_SIZE, NULL, type); ^ In file included from ../fs/f2fs/file.c:29:0: fs/f2fs/xattr.h:129:19: note: expected 'int' but argument is of type 'void *' static inline int f2fs_setxattr(struct inode *inode, int index, ^ fs/f2fs/file.c:1597:9: error: too many arguments to function 'f2fs_setxattr' return f2fs_setxattr(inode, F2FS_XATTR_INDEX_KEY, ^ In file included from ../fs/f2fs/file.c:29:0: fs/f2fs/xattr.h:129:19: note: declared here static inline int f2fs_setxattr(struct inode *inode, int index, This changes the prototype of the empty stub function to match that of the actual implementation. It will not make key management work when F2FS_FS_XATTR is disabled, but it at least gets it to build. Signed-off-by: Arnd Bergmann Signed-off-by: Jaegeuk Kim

f2fs: submit node page write bios when really required If many threads call fsync with data writes, we don't need to flush every bio containing node page writes. f2fs_wait_on_page_writeback will flush its bios when the page is really needed. Signed-off-by: Jaegeuk Kim

f2fs/crypto: fix xts_tweak initialization Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_" to just "FS" for preprocessor symbols). Because of the symbol renaming, it's a bit hard to see it as a file move: use git show -M30 0b81d07790726 to lower the rename detection to just 30% similarity and make git show the files as renamed (the header file won't be shown as a rename even then - since all it contains is symbol definitions, it looks almost completely different). Even with the renames showing as renames, the diffs are not all that easy to read, since so much is just the renames. But Eric Biggers noticed that it's not just all renames: the initialization of the xts_tweak had been broken too, using the inode number rather than the page offset. That's not right - it makes the xts_tweak the same for all pages of each inode. It _might_ make sense to make the xts_tweak contain both the offset _and_ the inode number, but not just the inode number. Reported-by: Eric Biggers Cc: Jaegeuk Kim Signed-off-by: Linus Torvalds
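A sketch of the corrected tweak setup, deriving the XTS tweak from the page offset; the buffer-size constant is assumed from the fscrypt code of that era:

	{
		pgoff_t index = page->index;

		/* tweak = page offset, zero-padded; NOT the inode number */
		BUILD_BUG_ON(FS_XTS_TWEAK_SIZE < sizeof(index));
		memcpy(xts_tweak, &index, sizeof(index));
		memset(xts_tweak + sizeof(index), 0,
		       FS_XTS_TWEAK_SIZE - sizeof(index));
	}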
f2fs: cover large section in sanity check of super This patch fixes the bug where the large section case was not covered when checking the sanity of the superblock. If f2fs detects misalignment, it fixes the superblock during mount time, so it doesn't need to trigger fsck.f2fs further. Reported-by: Matthias Prager Reported-by: David Gnedt Cc: stable 4.5+ Signed-off-by: Jaegeuk Kim

f2fs crypto: fix corrupted symlink in encrypted case In the encrypted symlink case, we should check for a corrupted symname after decrypting it. Otherwise, we can report -ENOENT incorrectly if the encrypted symname starts with '\0'. Cc: stable 4.5+ Signed-off-by: Jaegeuk Kim

f2fs: retrieve IO write stat from the right place In the following patch, "f2fs: split journal cache from curseg cache", the journal cache is split from the curseg cache. So IO write statistics should be retrieved from the journal cache, not from curseg->sum_blk. Otherwise it will read 0, and the stat is lost. Signed-off-by: Shuoran Liu Reviewed-by: Chao Yu Signed-off-by: Jaegeuk Kim

fscrypto: use dget_parent() in fscrypt_d_revalidate() This patch updates fscrypto along with the ext4 crypto change below. Fixes: 3d43bcfef5f0 ("ext4 crypto: use dget_parent() in ext4_d_revalidate()") Cc: Theodore Ts'o Signed-off-by: Jaegeuk Kim Conflicts: fs/crypto/crypto.c

f2fs: use dget_parent and file_dentry in f2fs_file_open This patch syncs with the two ext4 crypto fixes below. In 4.6-rc1, f2fs newly introduced accessing f_path.dentry, which crashes overlayfs. To fix this, we now need to use file_dentry() to access that field. [Backport NOTE] - Over 4.2, it should use file_dentry Fixes: c0a37d487884 ("ext4: use file_dentry()") Fixes: 9dd78d8c9a7b ("ext4: use dget_parent() in ext4_file_open()") Cc: Miklos Szeredi Cc: Theodore Ts'o Signed-off-by: Jaegeuk Kim

fscrypto: don't let data integrity writebacks fail with ENOMEM This patch fixes the issue in the same manner as the ext4 crypto fix. For f2fs, however, we flush the pending IOs and wait for a while to acquire free memory. Fixes: c9af28fdd4492 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: Theodore Ts'o Signed-off-by: Jaegeuk Kim Conflicts: fs/crypto/crypto.c

ext4/fscrypto: avoid RCU lookup in d_revalidate As Al pointed out, d_revalidate should return from RCU lookup before using d_inode. This was originally introduced by commit 34286d666230 ("fs: rcu-walk aware d_revalidate method"). Reported-by: Al Viro Signed-off-by: Jaegeuk Kim Cc: Theodore Ts'o Cc: stable Conflicts: fs/ext4/crypto.c
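A sketch of the pattern this fix applies at the top of the d_revalidate callback:

	static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags)
	{
		/* d_inode is not stable under RCU-walk: punt to ref-walk */
		if (flags & LOOKUP_RCU)
			return -ECHILD;
		/* ... the existing checks then run safely in ref-walk mode ... */
		return 1;
	}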
Revert "f2fs: use cryptoapi crc32 functions" This reverts commit fe125885875e39a693c9152d5eb5fe1d40ed34ea.

f2fs: give RO message when recovering superblock When one of the superblocks is missing, f2fs recovers it with the valid one. But even when f2fs is mounted as RO, we'd better report that too. Signed-off-by: Jaegeuk Kim

f2fs: recover superblock at RW remounts This patch adds an sbi flag, SBI_NEED_SB_WRITE, which indicates that the superblock needs to be recovered when (re)mounting as RW. This is set only when f2fs is mounted as RO. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/super.c

f2fs: give -EINVAL for norecovery and rw mount Once something to recover is detected, f2fs should stop mounting if the norecovery and rw mount options are given. Signed-off-by: Jaegeuk Kim

f2fs: treat as a normal umount when remounting ro When the user remounts f2fs as read-only, we can mark the checkpoint as umount. Signed-off-by: Jaegeuk Kim

f2fs: show current mount status This patch keeps the current mount status in the f2fs status info. Signed-off-by: Jaegeuk Kim

f2fs: fix to convert inline directory correctly With the series of steps below, we lose parts of the dirents: 1) mount f2fs with inline_dentry option 2) echo 1 > /sys/fs/f2fs/sdX/dir_level 3) mkdir dir 4) touch 180 files named [1-180] in dir 5) touch 181 in dir 6) echo 3 > /proc/sys/vm/drop_caches 7) ll dir ls: cannot access 2: No such file or directory ls: cannot access 4: No such file or directory ls: cannot access 5: No such file or directory ls: cannot access 6: No such file or directory ls: cannot access 8: No such file or directory ls: cannot access 9: No such file or directory ... total 360 drwxr-xr-x 2 root root 4096 Feb 19 15:12 ./ drwxr-xr-x 3 root root 4096 Feb 19 15:11 ../ -rw-r--r-- 1 root root 0 Feb 19 15:12 1 -rw-r--r-- 1 root root 0 Feb 19 15:12 10 -rw-r--r-- 1 root root 0 Feb 19 15:12 100 -????????? ? ? ? ? ? 101 -????????? ? ? ? ? ? 102 -????????? ? ? ? ? ? 103 ... The reason is that, when doing the inline dir conversion, we didn't consider that a directory has a hierarchical hash structure which can be configured through the sysfs interface 'dir_level'. By default, the dir_level of a directory inode is 0, meaning there is one bucket in the hash table located in the first level; all dirents are hashed into this bucket, so there is no problem with simply duplicating everything between the inline dentry page and the converted normal dentry page. However, if we configure dir_level with a value N (greater than 0), the bucket count of the first-level hash table is expanded, and dirents are hashed into different buckets according to their hash values. If we still move all dirents into the first bucket, inline dirents end up in the wrong locations; the result is that, although we can iterate all dirents through ->readdir, we can't stat some of them in ->lookup, which is based on hash table searching. This patch fixes this issue by rehashing dirents into the correct positions when converting an inline directory. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/dir.c fs/f2fs/f2fs.h
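For reference, a sketch of f2fs's per-level bucket count, which makes the dir_level effect concrete; this is close to the in-tree dir_buckets() helper of that era, though the exact cutoff is quoted from memory:

	static unsigned int dir_buckets(unsigned int level, int dir_level)
	{
		/* dir_level widens the early levels: level 0 alone gets
		 * 1 << dir_level buckets instead of a single one */
		if (level + dir_level < MAX_DIR_HASH_DEPTH / 2)
			return 1 << (level + dir_level);
		else
			return MAX_DIR_BUCKETS;
	}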
f2fs: add BUG_ON to avoid unnecessary flow This patch adds a BUG_ON instead of a retrying loop. In the case of node pages, we already got this inode page, but unlocked it. Given that we don't truncate any node pages in operations, the page's mapping should be unchangeable. Signed-off-by: Jaegeuk Kim

f2fs: fix dropping inmemory pages in a wrong time When one reader closes its file while another writer is doing atomic writes, f2fs_release_file drops the atomic data, resulting in an empty commit. This patch fixes this wrong-commit problem by checking the openness of the file. Process0 Process1 open file start atomic write write data read data close file f2fs_release_file() clear atomic data commit atomic write Reported-by: Miao Xie Signed-off-by: Jaegeuk Kim

f2fs: unset atomic/volatile flag in f2fs_release_file Atomic/volatile operations should be done in pairs of start and commit ioctls. For example, if a killed process leaves an open-ended atomic operation, we should drop its flag as well as its atomic data. Otherwise, if sqlite initiates another operation which doesn't require atomic writes, it will lose all its data, since f2fs still treats it as atomic writes and nobody will trigger its commit. Reported-by: Miao Xie Signed-off-by: Jaegeuk Kim

f2fs: remove redundant condition check This patch resolves the redundant condition check reported by David. Reported-by: David Binderman Signed-off-by: Jaegeuk Kim

f2fs: give -E2BIG for no space in xattr This patch returns -E2BIG if there is no space to add an xattr entry. This should fix generic/026 in xfstests as well. Signed-off-by: Jaegeuk Kim

f2fs: don't invalidate atomic page if successful If we committed an atomic write successfully, we don't need to invalidate the pages. Signed-off-by: Jaegeuk Kim

f2fs: flush dirty pages before starting atomic writes If somebody wrote data before the atomic writes, we should flush it in order to handle the atomic data in the right period. Signed-off-by: Jaegeuk Kim

f2fs: avoid needless lock for node pages when fsyncing a file When fsync is called, sync_node_pages finds the proper direct node pages to flush. But it unnecessarily locks unrelated direct node pages together. Acked-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: avoid writing 0'th page in volatile writes The first page of volatile writes usually contains a sort of header information which will be used for recovery (e.g., the journal header of sqlite). If this is written without other journal data, the user needs to handle the stale journal information. Acked-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: split sync_node_pages with fsync_node_pages This patch splits the existing sync_node_pages into (f)sync_node_pages. fsync_node_pages is used by f2fs_sync_file only. Acked-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: report unwritten status in fsync_node_pages fsync_node_pages should return pass or failure so that the user can know whether fsync completed or not. Acked-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: set fsync mark only for the last dnode In order to give atomic writes, we should consider power failure during sync_node_pages in fsync. So this patch marks the fsync flag only in the last dnode block. Acked-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: issue cache flush on direct IO Under the direct IO path with O_(D)SYNC, we need to set the proper APPEND or UPDATE flags, so that f2fs_sync_file can make its data safe. Acked-by: Chao Yu Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/data.c

f2fs: be aware of invalid filename length The filename length in a dirent may become zero-sized after random junk data injection; once such a dirent is encountered, find_target_dentry or f2fs_add_inline_entries will run into an infinite loop. So make f2fs aware of this to avoid the endless loop. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: move node pages only in victim section during GC For foreground GC, we cache node blocks in the victim section and set them dirty, then we call sync_node_pages to flush these node pages; but meanwhile, node pages which are not located in the victim section are flushed together, so more bandwidth and continuous free space are occupied. In this case, it's better to leave those unrelated node pages in the cache for further write hits, and let CP or the VM flush them afterwards. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: fix to return 0 if err == -ENOENT in f2fs_readdir Commit 57b62d29ad5b384775974973087d47755a8c6fcc ("f2fs: fix to report error in f2fs_readdir") causes f2fs_readdir to return -ENOENT when get_lock_data_page returns -ENOENT. However, the original logic is to continue when get_lock_data_page returns -ENOENT, and it forgot to reset err to 0. This causes getdents64 to incorrectly return -ENOENT when lastdirent is NULL in getdents64, which leads to a wrong return value for the syscall caller. Signed-off-by: Yunlong Song Reviewed-by: Chao Yu Signed-off-by: Jaegeuk Kim
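A sketch of the corrected loop body in f2fs_readdir per the description above (argument list abbreviated):

	page = get_lock_data_page(inode, n, false);
	if (IS_ERR(page)) {
		err = PTR_ERR(page);
		if (err == -ENOENT) {
			err = 0;	/* a hole in the directory: keep scanning */
			continue;
		}
		goto out;		/* a real error: report it to the caller */
	}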
f2fs: fix to clear private data in page Private data in a page should be removed during ->releasepage or ->invalidatepage; otherwise garbage data would remain in that page. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: fix to clear page private flag Commit 28bc106b2346 ("f2fs: support revoking atomic written pages") forgot to clear the page private flag correctly; fix it. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: factor out fsync inode entry operations Factor out fsync inode entry operations into {add,del}_fsync_inode. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: introduce macros for proc entries This adds macros to be used for multiple proc entries. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/super.c

f2fs: add proc entry to show valid block bitmap This patch adds a new proc entry to show segment information in more detail. Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_kmalloc to wrap kmalloc This patch adds f2fs_kmalloc. Signed-off-by: Jaegeuk Kim

f2fs: use f2fs_grab_cache_page instead of grab_cache_page This patch converts grab_cache_page to f2fs_grab_cache_page. Signed-off-by: Jaegeuk Kim

f2fs: add mount option to select fault injection ratio This patch adds a mount option to select the fault injection ratio. Signed-off-by: Jaegeuk Kim

f2fs: inject kmalloc failure This patch injects kmalloc failures given a fault injection rate. Signed-off-by: Jaegeuk Kim

f2fs: inject page allocation failures This patch adds page allocation failures. Signed-off-by: Jaegeuk Kim

f2fs: inject ENOSPC failures This patch injects ENOSPC failures. Signed-off-by: Jaegeuk Kim
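A minimal sketch of how such rate-based injection can be wired into the f2fs_kmalloc wrapper introduced above (the global fault-state struct is assumed for illustration; not the exact in-tree code):

	struct f2fs_fault_info {
		atomic_t inject_ops;
		unsigned int inject_rate;
	};
	extern struct f2fs_fault_info f2fs_fault;

	static inline bool time_to_inject(void)
	{
		/* fail once every inject_rate operations */
		atomic_inc(&f2fs_fault.inject_ops);
		if (atomic_read(&f2fs_fault.inject_ops) >= f2fs_fault.inject_rate) {
			atomic_set(&f2fs_fault.inject_ops, 0);
			return true;
		}
		return false;
	}

	static inline void *f2fs_kmalloc(size_t size, gfp_t flags)
	{
		if (time_to_inject())
			return NULL;	/* simulated allocation failure */
		return kmalloc(size, flags);
	}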
f2fs: revisit error handling flows This patch fixes a couple of bugs regarding orphan inodes when handling errors. It tries to: - call alloc_nid_done with add_orphan_inode in handle_failed_inode - let f2fs_evict_inode truncate blocks - not make a bad inode due to an i_mode change Signed-off-by: Jaegeuk Kim

f2fs: fix leak of orphan inode objects When unmounting the filesystem, we should release all the ino entries. Signed-off-by: Jaegeuk Kim

f2fs: retry to truncate blocks in -ENOMEM case This patch modifies the code to retry truncating node blocks in the -ENOMEM case. Signed-off-by: Hou Pengyang Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/inode.c

f2fs: remove unneeded readahead in find_fsync_dnodes In find_fsync_dnodes, get_tmp_page reads the dnode page synchronously; previously, ra_meta_page did the same work, which is redundant, so remove it. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: remove unneeded memset when updating xattr Each of the fields in struct f2fs_xattr_entry will be assigned later, so we don't need to memset the struct first. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: reuse get_extent_info Reuse get_extent_info for readability. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: shrink size of struct seg_entry Restructure struct seg_entry to eliminate holes in it; after that, on a 32-bit machine it shrinks from 32 bytes to 24 bytes, and on a 64-bit machine from 56 bytes to 40 bytes. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: don't worry about inode leak in evict_inode Even if an inode fails to release its blocks, it should be kept on the orphan inode list, so it will be released later. Signed-off-by: Jaegeuk Kim

f2fs: remove an obsolete variable This patch removes an obsolete variable used in add_free_nid. Signed-off-by: Jaegeuk Kim

f2fs: fix incorrect mapping in ->bmap Currently, generic_block_bmap is used in f2fs_bmap; its semantics are: when the mapping is found, return the position of the target physical block, otherwise return zero. But previously, when there was no mapping info for the specified logical block, f2fs_bmap would map the target physical block to an uninitialized variable, which is wrong. Fix it. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: avoid panic when truncating to max filesize The following panic occurs when truncating an inode which has an inline xattr to the max filesize. [] get_dnode_of_data+0x4e/0x580 [f2fs] [] ? read_node_page+0x51/0x90 [f2fs] [] ? get_node_page.part.34+0xb9/0x170 [f2fs] [] truncate_blocks+0x131/0x3f0 [f2fs] [] f2fs_truncate+0x73/0x100 [f2fs] [] f2fs_setattr+0x62/0x2a0 [f2fs] [] notify_change+0x158/0x300 [] do_truncate+0x6b/0xa0 [] ? __sb_start_write+0x49/0x100 [] do_sys_ftruncate.constprop.12+0x118/0x170 [] SyS_ftruncate+0xe/0x10 [] tracesys+0xe1/0xe6 [] get_node_path+0x210/0x220 [f2fs] --[ end trace 5fea664dfbcc6625 ]--- The reason is that truncate_blocks tries to truncate all node and data blocks starting from the specified block offset with the value (max filesize / block size); but our valid max block offset is actually (max filesize / block size) - 1, so f2fs detects this invalid block offset with a BUG_ON in the truncation path. This patch lets f2fs skip truncating data which exceeds the max filesize. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

fscrypto/f2fs: allow fs-specific key prefix for fs encryption This patch allows fscrypto to handle a second key prefix given by the filesystem. The main reason is to provide backward compatibility, since previously f2fs used "f2fs:" as its crypto prefix instead of "fscrypt:". Later, ext4 should also provide key_prefix() to give "ext4:". One concern described by Ted is the double-check overhead of the prefixes. On x86, for example, validate_user_key consumes 8 ms after boot-up; it turns out derive_key_aes() consumed most of that time, loading the specific crypto module. After such a cold miss, it shows almost zero latency, which we treat as negligible overhead. Note that request_key() detects a wrong prefix even prior to derive_key_aes(). Cc: Ted Tso Cc: stable@vger.kernel.org # v4.6 Signed-off-by: Jaegeuk Kim Conflicts: fs/crypto/keyinfo.c

f2fs: fix inode cache leak When testing f2fs with the inline_dentry option, generic/342 reports: VFS: Busy inodes after unmount of dm-0. Self-destruct in 5 seconds. Have a nice day... After rmmod of the f2fs module, the kernel shows the following dmesg: ============================================================================= BUG f2fs_inode_cache (Tainted: G O ): Objects remaining in f2fs_inode_cache on __kmem_cache_shutdown() ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Slab 0xf51ca0e0 objects=22 used=1 fp=0xd1e6fc60 flags=0x40004080 CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 00000086 00000086 d062fe18 c13a83a0 f51ca0e0 d062fe38 d062fea4 c11c7276 c1981040 f51ca0e0 00000016 00000001 d1e6fc60 40004080 656a624f 20737463 616d6572 6e696e69 6e692067 66326620 6e695f73 5f65646f 68636163 6e6f2065 Call Trace: [] dump_stack+0x5f/0x8f [] slab_err+0x76/0x80 [] ? __kmem_cache_shutdown+0x100/0x2f0 [] ? __kmem_cache_shutdown+0x100/0x2f0 [] __kmem_cache_shutdown+0x125/0x2f0 [] kmem_cache_destroy+0x158/0x1f0 [] ? mutex_unlock+0xd/0x10 [] exit_f2fs_fs+0x4b/0x5a8 [f2fs] [] SyS_delete_module+0x16c/0x1d0 [] ? do_fast_syscall_32+0x30/0x1c0 [] ? __this_cpu_preempt_check+0xf/0x20 [] ? trace_hardirqs_on_caller+0xdd/0x210 [] ? trace_hardirqs_off+0xb/0x10 [] do_fast_syscall_32+0xa1/0x1c0 [] sysenter_past_esp+0x45/0x74 INFO: Object 0xd1e6d9e0 @offset=6624 kmem_cache_destroy f2fs_inode_cache: Slab cache still has objects CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 00000286 00000286 d062fef4 c13a83a0 f174b000 d062ff14 d062ff28 c1198ac7 c197fe18 f3c5b980 d062ff20 000d04f2 d062ff0c d062ff0c d062ff14 d062ff14 f8f20dc0 fffffff5 d062e000 d062ff30 f8f15aa3 d062ff7c c10f596c 73663266 Call Trace: [] dump_stack+0x5f/0x8f [] kmem_cache_destroy+0x1e7/0x1f0 [] exit_f2fs_fs+0x4b/0x5a8 [f2fs] [] SyS_delete_module+0x16c/0x1d0 [] ?
do_fast_syscall_32+0x30/0x1c0 [] ? __this_cpu_preempt_check+0xf/0x20 [] ? trace_hardirqs_on_caller+0xdd/0x210 [] ? trace_hardirqs_off+0xb/0x10 [] do_fast_syscall_32+0xa1/0x1c0 [] sysenter_past_esp+0x45/0x74 The reason is that, in the recovery flow, we use a delayed iput mechanism for a directory which has a recovered dentry block. That means the reference to the inode is held until the last dirty dentry page has been written back. But when we mount f2fs with the inline_dentry option, during recovery a dirent may be recovered only into the dir inode page rather than a dentry page, so there is no chance for us to release the inode reference in ->writepage when writing back the last dentry page. We could call paired iget/iput explicitly for the inline_dentry case, but for the non-inline_dentry case, iput would call writeback_single_inode to write all data pages synchronously; during recovery, however, ->writepages of f2fs skips writing all pages, resulting in lost dirents. This patch fixes the issue by obsoleting the old mechanism and introducing a new dir_list to hold all directory inodes which have recovered data until recovery finishes. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: fallocate data blocks in single locked node page This patch improves the expand_inode speed in fallocate by allocating as many data blocks as possible in a single locked node page. On an SSD: # time fallocate -l 500G $MNT/testfile Before : 1m 33.410 s After : 24.758 s Reported-by: Stephen Bates Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/file.c

f2fs: read node blocks ahead when truncating blocks This patch enables reading node blocks in advance when truncating large data blocks. > time rm $MNT/testfile (500GB) after drop_caches Before : 9.422 s After : 4.821 s Reported-by: Stephen Bates Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/node.c

f2fs: do not preallocate block unaligned to 4KB Previously, f2fs_preallocate_blocks() tries to allocate unaligned blocks. In f2fs_write_begin(), however, prepare_write_begin() does not skip its allocation due to (len != 4KB), so the node page ends up being locked twice unexpectedly. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/data.c

f2fs: use mnt_{want,drop}_write_file in ioctl In the ioctl interfaces, mnt_{want,drop}_write_file should be used to: - get exclusion against filesystem freezing, which may be used by lvm snapshots; - tell the filesystem that a write is about to be performed on it, and make sure the writes are permitted. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: make atomic/volatile operation exclusive The atomic/volatile ioctl interfaces are exposed to the user like other file operation interfaces; they need to be made mutually exclusive to avoid potential conflicts among these operations in concurrent scenarios. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim
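A sketch of the ioctl pattern the mnt_{want,drop}_write_file change implies (the ioctl name is hypothetical):

	static int f2fs_ioc_example(struct file *filp)
	{
		int ret;

		ret = mnt_want_write_file(filp);	/* blocks while the fs is frozen */
		if (ret)
			return ret;

		/* ... perform the modification ... */

		mnt_drop_write_file(filp);
		return 0;
	}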
f2fs: support in batch multi blocks preallocation This patch introduces reserve_new_blocks to preallocate multiple blocks as a batch operation, so it can avoid lots of redundant operations, resulting in better performance. In a virtual machine, with a rotational device: time fallocate -l 32G /mnt/f2fs/file Before: real 0m4.584s user 0m0.000s sys 0m4.580s After: real 0m0.292s user 0m0.000s sys 0m0.272s On x86, with an SSD: time fallocate -l 500G $MNT/testfile Before : 24.758 s After : 1.604 s Signed-off-by: Chao Yu [Jaegeuk Kim: fix bugs and add performance numbers measured in x86.] Signed-off-by: Jaegeuk Kim

f2fs: support in batch fzero in dnode page This patch tries to speed up fzero_range by making space preallocation and removal of block addresses in one dnode page a batch operation. In a virtual machine, with the zram driver: dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=4096 time xfs_io -f /mnt/f2fs/file -c "fzero 0 4096M" Before: real 0m3.276s user 0m0.008s sys 0m3.260s After: real 0m1.568s user 0m0.000s sys 0m1.564s Signed-off-by: Chao Yu [Jaegeuk Kim: consider ENOSPC case] Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/file.c

f2fs: show # of orphan inodes This adds debug information for the number of orphan inodes. Signed-off-by: Jaegeuk Kim

f2fs: avoid f2fs_bug_on during recovery We don't need to use f2fs_bug_on() to handle any error case when allocating a block during recovery. Signed-off-by: Jaegeuk Kim

f2fs: fix deadlock when flush inline data The backtrace info below was reported by Yunlei He: Call Trace: [] schedule+0x35/0x80 [] rwsem_down_read_failed+0xed/0x130 [] call_rwsem_down_read_failed+0x18/0x [] down_read+0x20/0x30 [] f2fs_evict_inode+0x242/0x3a0 [f2fs] [] evict+0xc7/0x1a0 [] iput+0x196/0x200 [] __dentry_kill+0x179/0x1e0 [] dput+0x199/0x1f0 [] __fput+0x18b/0x220 [] ____fput+0xe/0x10 [] task_work_run+0x77/0x90 [] exit_to_usermode_loop+0x73/0xa2 [] do_syscall_64+0xfa/0x110 [] entry_SYSCALL64_slow_path+0x25/0x25 Call Trace: [] schedule+0x35/0x80 [] __wait_on_freeing_inode+0xa3/0xd0 [] ? autoremove_wake_function+0x40/0x4 [] find_inode_fast+0x7d/0xb0 [] ilookup+0x6a/0xd0 [] sync_node_pages+0x210/0x650 [f2fs] [] ? do_fsync+0x70/0x70 [] block_operations+0x9e/0xf0 [f2fs] [] ? bio_endio+0x55/0x60 [] write_checkpoint+0x92/0xba0 [f2fs] [] ? mempool_free_slab+0x17/0x20 [] ? mempool_free+0x2b/0x80 [] ? do_fsync+0x70/0x70 [] f2fs_sync_fs+0x63/0xd0 [f2fs] [] ? ext4_sync_fs+0xbf/0x190 [] sync_fs_one_sb+0x20/0x30 [] iterate_supers+0xb9/0x110 [] sys_sync+0x55/0x90 [] do_syscall_64+0x69/0x110 [] entry_SYSCALL64_slow_path+0x25/0x25 With the following sequence, we set inline_node in the inode page after the inode was unlinked, resulting in the deadlock described below: 1. open file 2. write file 3. unlink file 4. write file 5. close file Thread A Thread B - dput - iput_final - inode->i_state |= I_FREEING - evict - f2fs_evict_inode - f2fs_sync_fs - write_checkpoint - block_operations - f2fs_lock_all (down_write(cp_rwsem)) - f2fs_lock_op (down_read(cp_rwsem)) - sync_node_pages - ilookup - find_inode_fast - __wait_on_freeing_inode (wait on I_FREEING clear) Here, we change to set the inline_node flag only for linked inodes to fix this. Reported-by: Yunlei He Signed-off-by: Chao Yu Tested-by: Jaegeuk Kim Cc: stable@vger.kernel.org # v4.6 Signed-off-by: Jaegeuk Kim

f2fs: correct return value type of f2fs_fill_super Signed-off-by: Sheng Yong Signed-off-by: Jaegeuk Kim

f2fs: fix i_current_depth during inline dentry conversion With the steps below, we will see the dentry page become inaccessible later. This is because we forget to update i_current_depth in the inode during inline dentry conversion; after that, once we fail somewhere, i_current_depth is left as 0 in a non-inline directory. Then, during ->lookup, the current_depth value makes all dentry pages in the first level invisible. Fix it.
1) mount f2fs with inline_dentry option 2) mkdir dir 3) touch 180 files named [0-179] in dir 4) touch 180 in dir (fail after inline dir conversion) 5) ll dir ls: cannot access /mnt/f2fs/dir/0: No such file or directory ls: cannot access /mnt/f2fs/dir/1: No such file or directory ls: cannot access /mnt/f2fs/dir/2: No such file or directory ls: cannot access /mnt/f2fs/dir/3: No such file or directory ls: cannot access /mnt/f2fs/dir/4: No such file or directory drwxr-xr-x 2 root root 4096 may 13 21:47 ./ drwxr-xr-x 3 root root 4096 may 13 21:46 ../ -????????? ? ? ? ? ? 0 -????????? ? ? ? ? ? 1 -????????? ? ? ? ? ? 10 -????????? ? ? ? ? ? 100 -????????? ? ? ? ? ? 101 -????????? ? ? ? ? ? 102 Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/inline.c

f2fs: fix incorrect error path handling in f2fs_move_rehashed_dirents Fix two bugs in the error path of f2fs_move_rehashed_dirents: - release the dir's inode page if kmalloc fails - recover i_current_depth if the conversion fails Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: no need inc dirty pages under inode lock There is no need to increase dirty pages under the inode lock. Signed-off-by: Yunlei He Signed-off-by: Jaegeuk Kim

f2fs: add fault injection to sysfs This patch introduces a new struct f2fs_fault_info and a global f2fs_fault to save the fault injection status. Fault injection entries are created in /sys/fs/f2fs/fault_injection/ when the f2fs module is initialized. Signed-off-by: Sheng Yong Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/super.c

f2fs: manipulate dirty file inodes when DATA_FLUSH is set We need to maintain dirty file inodes only if DATA_FLUSH is set. Otherwise, let's avoid the overhead. Signed-off-by: Jaegeuk Kim

f2fs: use bio count instead of F2FS_WRITEBACK page count This can reduce page counting overhead. Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for page counters This patch substitutes percpu_counter for atomic counters when counting various types of pages (see the sketch after this group of entries). Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for # of dirty pages in inode This patch adds a percpu_counter for the number of dirty pages in an inode. Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for alloc_valid_block_count This patch uses a percpu_counter for sbi->alloc_valid_block_count. Signed-off-by: Jaegeuk Kim

f2fs: use percpu_counter for total_valid_inode_count This patch uses a percpu_counter to avoid the stat_lock. Signed-off-by: Jaegeuk Kim

f2fs: make exit_f2fs_fs more clear init_f2fs_fs does: 1) f2fs_build_trace_ios 2) init_inodecache 3) create_node_manager_caches 4) create_segment_manager_caches 5) create_checkpoint_caches 6) create_extent_cache 7) kset_create_and_add 8) kobject_init_and_add 9) register_shrinker 10) register_filesystem 11) f2fs_create_root_stats 12) proc_mkdir exit_f2fs_fs should do the cleanup in the reverse order to make the code clearer. Signed-off-by: Tiezhu Yang Signed-off-by: Jaegeuk Kim

f2fs: avoid ENOSPC fault in the recovery process This patch avoids the impossible error injection, ENOSPC, during the recovery process. Signed-off-by: Jaegeuk Kim

f2fs: flush pending bios right away when error occurs Given errors, this patch flushes pending bios as soon as possible. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/file.c
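A sketch of the percpu_counter pattern adopted by the conversions above; the two-argument percpu_counter_init form matches the older kernels this backport targets:

	#include <linux/percpu_counter.h>

	/* at mount time */
	percpu_counter_init(&sbi->nr_pages[F2FS_DIRTY_DENTS], 0);

	/* hot path: a per-cpu increment, no shared lock or global atomic */
	percpu_counter_inc(&sbi->nr_pages[F2FS_DIRTY_DENTS]);

	/* stat path: fold the per-cpu deltas only when reading the total */
	s64 count = percpu_counter_sum_positive(&sbi->nr_pages[F2FS_DIRTY_DENTS]);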
f2fs: fix to update dirty page count correctly Once we fail to merge inline data into the inode page while flushing an inline inode, we skip invoking inode_dec_dirty_pages, which makes the dirty page count incorrect and results in a panic in ->evict_inode. Fix it. ------------[ cut here ]------------ kernel BUG at /home/yuchao/git/devf2fs/inode.c:336! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 3 PID: 10004 Comm: umount Tainted: G O 4.6.0-rc5+ #17 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 task: f0c33000 ti: c5212000 task.ti: c5212000 EIP: 0060:[] EFLAGS: 00010202 CPU: 3 EIP is at f2fs_evict_inode+0x85/0x490 [f2fs] EAX: 00000001 EBX: c4529ea0 ECX: 00000001 EDX: 00000000 ESI: c0131000 EDI: f89dd0a0 EBP: c5213e9c ESP: c5213e78 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 80050033 CR2: b75878c0 CR3: 1a36a700 CR4: 000406f0 Stack: c4529ea0 c4529ef4 c5213e8c c176d45c c4529ef4 00000000 c4529ea0 c4529fac f89dd0a0 c5213eb0 c1204a68 c5213ed8 c452a2b4 c6680930 c5213ec0 c1204b64 c6680d44 c6680620 c5213eec c120588d ee84b000 ee84b5c0 c5214000 ee84b5e0 Call Trace: [] ? _raw_spin_unlock+0x2c/0x50 [] evict+0xa8/0x170 [] dispose_list+0x34/0x50 [] evict_inodes+0x10d/0x130 [] generic_shutdown_super+0x41/0xe0 [] ? unregister_shrinker+0x40/0x50 [] ? unregister_shrinker+0x40/0x50 [] kill_block_super+0x22/0x70 [] kill_f2fs_super+0x1e/0x20 [f2fs] [] deactivate_locked_super+0x3d/0x70 [] deactivate_super+0x43/0x60 [] cleanup_mnt+0x39/0x80 [] __cleanup_mnt+0x10/0x20 [] task_work_run+0x71/0x90 [] exit_to_usermode_loop+0x72/0x9e [] do_fast_syscall_32+0x19c/0x1c0 [] sysenter_past_esp+0x45/0x74 EIP: [] f2fs_evict_inode+0x85/0x490 [f2fs] SS:ESP 0068:c5213e78 ---[ end trace d30536330b7fdc58 ]--- Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: adjust other changes This patch changes: - PAGE_CACHE_* - inode_lock Signed-off-by: Jaegeuk Kim

Revert "f2fs: no need inc dirty pages under inode lock" This reverts commit b951a4ec165af4973b2bd9c80fb5845fbd840435. Conflicts: fs/f2fs/checkpoint.c

f2fs: use inode pointer for {set, clear}_inode_flag This patch refactors set_inode_flag and clear_inode_flag to take an inode pointer. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/acl.c fs/f2fs/file.c fs/f2fs/namei.c Conflicts: fs/f2fs/inode.c

f2fs: introduce f2fs_i_size_write with mark_inode_dirty_sync This patch introduces f2fs_i_size_write() to call mark_inode_dirty_sync() together with i_size_write(); see the sketch below. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/super.c

f2fs: introduce f2fs_i_blocks_write with mark_inode_dirty_sync This patch introduces f2fs_i_blocks_write() to call mark_inode_dirty_sync() when changing inode->i_blocks. Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_i_links_write with mark_inode_dirty_sync This patch introduces f2fs_i_links_write() to call mark_inode_dirty_sync() when changing inode->i_links. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/namei.c

f2fs: call mark_inode_dirty_sync for i_field changes This patch calls mark_inode_dirty_sync() for the following on-disk inode changes: -> largest -> ctime/mtime/atime -> i_current_depth -> i_xattr_nid -> i_pino -> i_advise -> i_flags -> i_mode Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/acl.c fs/f2fs/inode.c fs/f2fs/namei.c Conflicts: fs/f2fs/inode.c

f2fs: flush inode metadata when checkpoint is doing This patch registers all the inodes which have dirty metadata to be synced when a checkpoint is performed. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/inode.c
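A simplified sketch of the f2fs_i_size_write() helper referenced above (the in-tree version may carry extra flag handling):

	static inline void f2fs_i_size_write(struct inode *inode, loff_t i_size)
	{
		i_size_write(inode, i_size);
		/* dirty the inode so checkpoint will flush its metadata */
		mark_inode_dirty_sync(inode);
	}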
f2fs: remove syncing inode page in all the cases This patch reduces calls to the following across the whole tree: - sync_inode_page() - update_inode_page() - update_inode() - f2fs_write_inode() Instead, checkpoint will flush all the dirty inode metadata before syncing node pages. Note that this is doable, since we call mark_inode_dirty_sync() for every inode field change which needs to be updated in the on-disk inode as well. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/namei.c

f2fs: avoid unnecessary updating inode during fsync If roll-forward recovery can recover i_size, we don't need to update the inode's metadata during fsync. Signed-off-by: Jaegeuk Kim

f2fs: detect congestion of flush command issues If flush commands do not incur any congestion, we don't need to throw them to the dispatching queue, which causes unnecessary latency. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/segment.c

f2fs: set flush_merge by default This patch sets flush_merge by default. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/super.c

f2fs: remove writepages lock This patch removes the writepages lock to improve multi-threading performance. tiobench, 32 threads, 4KB write per fsync on SSD Before: 25.88 MB/s After: 28.03 MB/s Signed-off-by: Jaegeuk Kim

f2fs: propagate error given by f2fs_find_entry If we get ENOMEM or EIO in f2fs_find_entry, we should stop right away. Otherwise, for example, we can get a duplicate directory entry via ->chash and ->clevel. Signed-off-by: Jaegeuk Kim

f2fs: inject to produce some orphan inodes Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/inode.c

f2fs: do not skip writing data pages For data pages, let's try to flush as much as possible in the background. On /dev/pmem0: 1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync Before : 800 MB/s After : 1.1 GB/s 2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 Before : 1.3 GB/s After : 2.2 GB/s Signed-off-by: Jaegeuk Kim

f2fs: remove two steps to flush dirty data pages If there is no cold page, we don't need a loop to flush dirty data pages. On /dev/pmem0: 1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync Before : 1.1 GB/s After : 1.2 GB/s 2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 Before : 2.2 GB/s After : 2.3 GB/s Signed-off-by: Jaegeuk Kim

f2fs: return the errno to the caller to avoid using a wrong page Commit aaf9607516ed38825268515ef4d773289a44f429 ("f2fs: check node page contents all the time") pointed out that "sometimes it was reported that its contents was missing", so it checks the page's mapping and contents. When "nid != nid_of_node(page)", ERR_PTR(-EIO) was returned to the caller. However, commit e1c51b9f1df2f9efc2ec11488717e40cd12015f9 ("f2fs: clean up node page updating flow") moved the "nid != nid_of_node(page)" test into "f2fs_bug_on(sbi, nid != nid_of_node(page))"; this returns a wrong page to the caller when F2FS_CHECK_FS is off and the missing-contents case happens. This patch restores checking node page contents all the time, and returns the errno so the caller knows something is wrong and avoids using the page. This patch also moves f2fs_bug_on to its proper location. Signed-off-by: Yunlong Song Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/node.c

f2fs: return error of f2fs_lookup Now we can report to f2fs_lookup an error given by f2fs_find_entry. Suggested-by: He YunLei Signed-off-by: Jaegeuk Kim
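A sketch of how the lookup path can surface that error, assuming f2fs_find_entry reports failures through its res_page argument (close to the later in-tree pattern, but illustrative here):

	de = f2fs_find_entry(dir, &dentry->d_name, &page);
	if (!de) {
		if (IS_ERR(page))			/* e.g. -ENOMEM or -EIO */
			return ERR_CAST(page);		/* propagate, not "not found" */
		return d_splice_alias(NULL, dentry);	/* a genuine negative lookup */
	}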
f2fs: handle writepage correctly Previously, f2fs_write_data_pages() calls __f2fs_writepage(), which calls f2fs_write_data_page(). If f2fs_write_data_page() returns AOP_WRITEPAGE_ACTIVATE, __f2fs_writepage() calls mapping_set_error(). But this should not happen every time, since sometimes f2fs_write_data_page() tries to skip writing pages without any error. For example, volatile_write() gives EIO all the time, as Shuoran Liu pointed out. Reported-by: Shuoran Liu Signed-off-by: Jaegeuk Kim

f2fs: remove deprecated parameter Remove a deprecated parameter. Signed-off-by: Jaegeuk Kim

f2fs: avoid wrong count on dirty inodes The count should be covered by the spin_lock. Otherwise we can see a wrong count in f2fs_stat. Signed-off-by: Jaegeuk Kim

f2fs: remove obsolete parameter in f2fs_truncate We don't need the lock parameter, which is always true. Signed-off-by: Jaegeuk Kim

f2fs: avoid data race between FI_DIRTY_INODE flag and update_inode The FI_DIRTY_INODE flag is not covered by the inode page lock, so it can be unset at any time, as below. Thread #1 Thread #2 - lock_page(ipage) - update i_fields - update i_size/i_blocks/and so on - set FI_DIRTY_INODE - reset FI_DIRTY_INODE - set_page_dirty(ipage) In this case, we can lose the latest i_field information. Signed-off-by: Jaegeuk Kim

f2fs: fix wrong percentage This should be 1%: 10MB / 1GB. Signed-off-by: Jaegeuk Kim

f2fs: control not to exceed # of cached nat entries This is to avoid cache entry management overhead, including the radix tree. Signed-off-by: Jaegeuk Kim

f2fs: set mapping error for EIO If EIO occurs, we need to set the error on the mapping to avoid any further IOs. Signed-off-by: Jaegeuk Kim

f2fs: avoid reverse IO order for NODE and DATA There is a data race between allocate_data_block() and f2fs_submit_page_mbio(), which incurs unnecessary reversed bio submission. Signed-off-by: Jaegeuk Kim

f2fs: drop any block plugging In f2fs, we don't need to keep block plugging for NODE and DATA writes, since we already merge bios as much as possible. Signed-off-by: Jaegeuk Kim

f2fs: skip clean segment for gc If a segment in a section is clean or prefree, we don't need to get its summary and do GC on it. Signed-off-by: Jaegeuk Kim

f2fs: introduce mode=lfs mount option This mount option forcefully enables the original log-structured filesystem behavior, so there should be no random writes to the main area. In particular, this supports host-managed SMR devices. Signed-off-by: Jaegeuk Kim

f2fs: fix deadlock in add_link failure mkdir sync_dirty_inode - init_inode_metadata - lock_page(node) - make_empty_dir - filemap_fdatawrite() - do_writepages - lock_page(data) - write_page(data) - lock_page(node) - f2fs_init_acl - error - truncate_inode_pages - lock_page(data) So we don't need to truncate data pages in this error case; that will be done by f2fs_evict_inode. Signed-off-by: Jaegeuk Kim

f2fs: find parent dentry correctly If the dotdot directory is corrupted, its slot may be occupied by another file. In this case, dentry[1] is not the parent directory. Rename and cross-rename would update the inode in dentry[1] incorrectly. This patch finds the dotdot dentry by name. Signed-off-by: Sheng Yong [Jaegeuk Kim: remove wrong bug_on] Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/f2fs.h

f2fs: report error for f2fs_parent_dir If there is no dentry, we can report the error correctly. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/namei.c

f2fs: call update_inode_page for orphan inodes Let's store orphan inode pages right away. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/namei.c Conflicts: fs/f2fs/inode.c

f2fs: detect host-managed SMR by feature flag If mkfs.f2fs gives a feature flag for host-managed SMR, we can set mode=lfs by default. Signed-off-by: Jaegeuk Kim

f2fs: produce more nids and reduce readahead nats The readahead nat pages are more likely to be reclaimed quickly, so it's better to gather more free nids in advance.
And let's keep as many free nids as possible. Signed-off-by: Jaegeuk Kim

f2fs: avoid writing node/metapages during writes Let's keep more node/meta pages in memory at run time. Signed-off-by: Jaegeuk Kim

f2fs: avoid latency-critical readahead of node pages f2fs_map_blocks is closely tied to performance, so let's avoid any added latency when reading ahead node pages. Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid reading out encrypted data in page cache For an encrypted inode, if the user overwrites data of the inode, f2fs reads the encrypted data into the page cache and then does the decryption. However, a reader can race with the overwriter and see encrypted data which has not yet been decrypted by the overwriter. Fix this by moving the decryption work to the background and keeping the page non-uptodate until the data is decrypted. Thread A Thread B - f2fs_file_write_iter - __generic_file_write_iter - generic_perform_write - f2fs_write_begin - f2fs_submit_page_bio - generic_file_read_iter - do_generic_file_read - lock_page_killable - unlock_page - copy_page_to_iter hit the encrypted data in updated page - lock_page - fscrypt_decrypt_page Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/data.c

f2fs: fix to detect truncation prior rather than EIO during read In the synchronous read procedure, after sending out the read request, the reader tries to lock the page, waiting for the device to finish the read and unlock the page; but meanwhile, a truncator can race with the reader. So after the reader gets the lock on the page, it should check the page's mapping to detect whether someone has truncated the page in the meantime, and retry if so (see the sketch below); otherwise the read can fail due to the earlier condition check. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: fix to redirty page if fail to gc data page If we fail to move a data page during foreground GC, we should give another chance to write back that page, which was previously set dirty by the writer. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: add nodiscard mount option This patch adds the 'nodiscard' mount option. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim Conflicts: Documentation/filesystems/f2fs.txt

f2fs: remove unnecessary goto statement When base_addr is NULL, there is no need to call kzfree; we should return -ENOMEM directly. Additionally, it is better to initialize the variable 'error' to 0. Signed-off-by: Tiezhu Yang Signed-off-by: Jaegeuk Kim

f2fs: introduce f2fs_set_page_dirty_nobuffer This patch adds f2fs_set_page_dirty_nobuffer(), copied from __set_page_dirty_nobuffers. When appending 4KB blocks in f2fs on pmem with multiple cores, this improves overall performance. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/f2fs.h

f2fs: call SetPageUptodate if needed SetPageUptodate() issues a memory barrier, resulting in performance degradation. Let's avoid calling it unnecessarily. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/data.c

f2fs: shrink critical region in spin_lock This patch shrinks the critical region in the spin_lock. Signed-off-by: Jaegeuk Kim

f2fs: skip to check the block address of node page If the node page is up-to-date, it should be alive. Signed-off-by: Jaegeuk Kim
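A sketch of the read-vs-truncate check mentioned above:

	lock_page(page);
	if (unlikely(page->mapping != mapping)) {
		/* the page was truncated while we slept: retry the lookup
		 * instead of failing the read with -EIO */
		f2fs_put_page(page, 1);
		goto repeat;
	}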
f2fs: fix incorrect f_bfree calculation in ->statfs As the manual describes, f_bfree indicates the total free blocks in the fs; in f2fs, it includes two parts: visible free blocks and over-provision blocks. This patch corrects the calculation. fsblkcnt_t f_bfree; /* free blocks in fs */ Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: avoid mismatching block range for discard This patch skips discarding block ranges that are smaller than trim_minlen and cannot be merged with a neighbour. Signed-off-by: Yunlei He Reviewed-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid redundant discard during fstrim With the test steps below, f2fs issues redundant discards when doing fstrim; the reason is that we issue discards for both prefree segments and the consecutive freed region the user wants to trim, and parts of the regions they cover overlap. Here, we change it to not issue any discards for prefree segments in the trimmed range. 1. mount -t f2fs -o discard /dev/zram0 /mnt/f2fs 2. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/ 3. dd if=/dev/zero of=/mnt/f2fs/a bs=2M count=1 4. dd if=/dev/zero of=/mnt/f2fs/b bs=1M count=1 5. sync 6. rm /mnt/f2fs/a /mnt/f2fs/b 7. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/ Before: <...>-5428 [001] ...1 9511.052125: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x200 <...>-5428 [001] ...1 9511.052787: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300 After: <...>-6764 [000] ...1 9720.382504: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300 Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: move i_size_write in f2fs_write_end We don't need to do i_size_write under the page lock. Signed-off-by: Jaegeuk Kim

f2fs: avoid mark_inode_dirty Let's check the inode's dirtiness before calling mark_inode_dirty. Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/inode.c

f2fs: fix ERR_PTR returned by bio This fixes the wrong error pointer handling flow reported by Dan. Reported-by: Dan Carpenter Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim

f2fs: refactor __exchange_data_block for speed up This reduces the elapsed time of xfstests/generic/017. Before: 715 s After: 458 s Signed-off-by: Jaegeuk Kim

f2fs: disable extent_cache for fcollapse/finsert inodes This reduces the elapsed time of xfstests/generic/017. Before: 458 s After: 390 s Signed-off-by: Jaegeuk Kim

f2fs: add maximum prefree segments On 1TB storage, we would otherwise admit 22841 prefree segments, which can consume too many segments. This patch caps prefree segments at a maximum of 8GB in that case. Signed-off-by: Jaegeuk Kim

f2fs: fix to avoid data update racing between GC and DIO Data in a file can be operated on by GC and DIO simultaneously, so we face the race cases below. For the write case: Thread A Thread B - generic_file_direct_write - invalidate_inode_pages2_range - f2fs_direct_IO - do_blockdev_direct_IO - do_direct_IO - get_more_blocks - f2fs_gc - do_garbage_collect - gc_data_segment - move_data_page - do_write_data_page migrate data block to new block address - dio_bio_submit update user data to old block address For the read case: Thread A Thread B - generic_file_direct_write - invalidate_inode_pages2_range - f2fs_direct_IO - do_blockdev_direct_IO - do_direct_IO - get_more_blocks - f2fs_balance_fs - f2fs_gc - do_garbage_collect - gc_data_segment - move_data_page - do_write_data_page migrate data block to new block address - write_checkpoint - do_checkpoint - clear_prefree_segments - f2fs_issue_discard discard old block address - dio_bio_submit update user buffer from obsolete block address In order to fix this, for one file, we should make DIO and GC mutually exclusive. Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim Conflicts: fs/f2fs/data.c
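A conceptual sketch of such exclusion using a per-inode rwsem; the field name dio_rwsem is assumed for illustration, and the actual backport may use a different lock:

	/* DIO path: many DIOs may run concurrently against each other */
	down_read(&F2FS_I(inode)->dio_rwsem);
	err = do_blockdev_direct_IO(...);
	up_read(&F2FS_I(inode)->dio_rwsem);

	/* GC path: block migration excludes all DIO on this inode */
	down_write(&F2FS_I(inode)->dio_rwsem);
	move_data_page(...);
	up_write(&F2FS_I(inode)->dio_rwsem);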
f2fs: use blk_plug in all the possible paths

This patch reverts 19a5f5e2ef37 (f2fs: drop any block plugging), and
additionally adds blk_plug in write paths. The main reason is that
blk_start_plug can be used to wake the device up from low-power mode before
submitting further bios.

Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/file.c

f2fs: reset default idle interval value

The default value of the idle interval is 2 minutes, but most of the time
after the screen shuts down there are still operations during that 2-minute
window, and the GC thread's sleep time is about 30 to 60 seconds, so there is
almost no chance for the GC thread to do garbage collection. Change the
default idle interval from 2 minutes to 5 seconds to fix this.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

f2fs: avoid memory allocation failure due to a long length

We need to avoid ENOMEM due to an unexpectedly long length.

Signed-off-by: Jaegeuk Kim

f2fs: fix to report error number of f2fs_find_entry

This patch fixes f2fs_find_entry to report the right error number to its
caller.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim

Conflicts:
	fs/f2fs/namei.c

f2fs: avoid data race when deciding checkpoint in f2fs_sync_file

When fs utilization is almost full, f2fs_sync_file should do a checkpoint if
there is not enough space for roll-forward later (i.e.,
space_for_roll_forward). Currently we have no lock for
sbi->alloc_valid_block_count, resulting in a race condition. In rare cases,
we can get -ENOSPC when doing roll-forward, which triggers

	if (is_valid_blkaddr(sbi, dest, META_POR)) {
		if (src == NULL_ADDR) {
			err = reserve_new_block(&dn);
			f2fs_bug_on(sbi, err);
			...
		}
		...
	}

in do_recover_data. So, this patch avoids that situation in advance.

Signed-off-by: Jaegeuk Kim

f2fs: handle error case with f2fs_bug_on

It's enough to show BUG or WARN via f2fs_bug_on for the error case. Then, we
don't need to leave the filesystem in a corrupted state.

Signed-off-by: Jaegeuk Kim

f2fs: get victim segment again after new cp

A previously selected segment may become free after write_checkpoint; if we
garbage-collect this segment and new_curseg then happens to reuse it, it may
cause an f2fs_bug_on as below.

	panic+0x154/0x29c
	do_garbage_collect+0x15c/0xaf4
	f2fs_gc+0x2dc/0x444
	f2fs_balance_fs.part.22+0xcc/0x14c
	f2fs_balance_fs+0x28/0x34
	f2fs_map_blocks+0x5ec/0x790
	f2fs_preallocate_blocks+0xe0/0x100
	f2fs_file_write_iter+0x64/0x11c
	new_sync_write+0xac/0x11c
	vfs_write+0x144/0x1e4
	SyS_write+0x60/0xc0

Here, we may hit the SIT and SSA type checks during reset_curseg. So, check
whether the segment is stale or not, and select a new victim to avoid this.

Signed-off-by: Yunlei He
Signed-off-by: Jaegeuk Kim

f2fs: adjust other changes

This patch changes:
 - d_inode
 - file_dentry
 - inode_nohighmem
 - ...

Signed-off-by: Jaegeuk Kim

fscrypto: no support for v3.4

Encryption is now disabled.
Change-Id: I06a44489285d4a46870660727c560589c8ae8932 Signed-off-by: Jaegeuk Kim --- Documentation/ABI/testing/sysfs-fs-f2fs | 32 + Documentation/filesystems/f2fs.txt | 34 +- fs/Makefile | 1 + fs/crypto/Kconfig | 18 + fs/crypto/Makefile | 3 + fs/crypto/crypto.c | 567 ++++++ fs/crypto/fname.c | 427 ++++ fs/crypto/keyinfo.c | 310 +++ fs/crypto/policy.c | 229 +++ fs/f2fs/Kconfig | 10 +- fs/f2fs/Makefile | 1 + fs/f2fs/acl.c | 19 +- fs/f2fs/acl.h | 0 fs/f2fs/checkpoint.c | 511 +++-- fs/f2fs/data.c | 2402 ++++++++++++----------- fs/f2fs/debug.c | 106 +- fs/f2fs/dir.c | 468 +++-- fs/f2fs/extent_cache.c | 749 +++++++ fs/f2fs/f2fs.h | 1107 ++++++++--- fs/f2fs/file.c | 1681 +++++++++++++--- fs/f2fs/gc.c | 428 ++-- fs/f2fs/gc.h | 8 - fs/f2fs/hash.c | 3 +- fs/f2fs/inline.c | 349 ++-- fs/f2fs/inode.c | 166 +- fs/f2fs/namei.c | 438 ++++- fs/f2fs/node.c | 841 +++++--- fs/f2fs/node.h | 70 +- fs/f2fs/recovery.c | 288 ++- fs/f2fs/segment.c | 919 ++++++--- fs/f2fs/segment.h | 94 +- fs/f2fs/shrinker.c | 141 ++ fs/f2fs/super.c | 931 +++++++-- fs/f2fs/trace.c | 12 +- fs/f2fs/trace.h | 4 +- fs/f2fs/xattr.c | 50 +- fs/f2fs/xattr.h | 7 +- include/linux/dcache.h | 2 + include/linux/f2fs_fs.h | 67 +- include/linux/fs.h | 28 + include/linux/fscrypto.h | 435 ++++ include/trace/events/f2fs.h | 173 +- 42 files changed, 10628 insertions(+), 3501 deletions(-) mode change 100644 => 100755 Documentation/ABI/testing/sysfs-fs-f2fs mode change 100644 => 100755 Documentation/filesystems/f2fs.txt create mode 100755 fs/crypto/Kconfig create mode 100755 fs/crypto/Makefile create mode 100755 fs/crypto/crypto.c create mode 100755 fs/crypto/fname.c create mode 100755 fs/crypto/keyinfo.c create mode 100755 fs/crypto/policy.c mode change 100644 => 100755 fs/f2fs/Kconfig mode change 100644 => 100755 fs/f2fs/Makefile mode change 100644 => 100755 fs/f2fs/acl.c mode change 100644 => 100755 fs/f2fs/acl.h mode change 100644 => 100755 fs/f2fs/checkpoint.c mode change 100644 => 100755 fs/f2fs/data.c mode change 100644 => 100755 fs/f2fs/debug.c mode change 100644 => 100755 fs/f2fs/dir.c create mode 100755 fs/f2fs/extent_cache.c mode change 100644 => 100755 fs/f2fs/f2fs.h mode change 100644 => 100755 fs/f2fs/file.c mode change 100644 => 100755 fs/f2fs/gc.c mode change 100644 => 100755 fs/f2fs/gc.h mode change 100644 => 100755 fs/f2fs/hash.c mode change 100644 => 100755 fs/f2fs/inline.c mode change 100644 => 100755 fs/f2fs/inode.c mode change 100644 => 100755 fs/f2fs/namei.c mode change 100644 => 100755 fs/f2fs/node.c mode change 100644 => 100755 fs/f2fs/node.h mode change 100644 => 100755 fs/f2fs/recovery.c mode change 100644 => 100755 fs/f2fs/segment.c mode change 100644 => 100755 fs/f2fs/segment.h create mode 100755 fs/f2fs/shrinker.c mode change 100644 => 100755 fs/f2fs/super.c mode change 100644 => 100755 fs/f2fs/trace.c mode change 100644 => 100755 fs/f2fs/trace.h mode change 100644 => 100755 fs/f2fs/xattr.c mode change 100644 => 100755 fs/f2fs/xattr.h mode change 100644 => 100755 include/linux/f2fs_fs.h create mode 100755 include/linux/fscrypto.h mode change 100644 => 100755 include/trace/events/f2fs.h diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs old mode 100644 new mode 100755 index 2c4cc42006e8..a809f6005f14 --- a/Documentation/ABI/testing/sysfs-fs-f2fs +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -80,3 +80,35 @@ Date: February 2015 Contact: "Jaegeuk Kim" Description: Controls the trimming rate in batch mode. 
+
+What:		/sys/fs/f2fs/<disk>/cp_interval
+Date:		October 2015
+Contact:	"Jaegeuk Kim"
+Description:
+		 Controls the checkpoint timing.
+
+What:		/sys/fs/f2fs/<disk>/idle_interval
+Date:		January 2016
+Contact:	"Jaegeuk Kim"
+Description:
+		 Controls the idle timing.
+
+What:		/sys/fs/f2fs/<disk>/ra_nid_pages
+Date:		October 2015
+Contact:	"Chao Yu"
+Description:
+		 Controls the count of nid pages to be readaheaded.
+
+What:		/sys/fs/f2fs/<disk>/dirty_nats_ratio
+Date:		January 2016
+Contact:	"Chao Yu"
+Description:
+		 Controls dirty nat entries ratio threshold, if current
+		 ratio exceeds configured threshold, checkpoint will
+		 be triggered for flushing dirty nat entries.
+
+What:		/sys/fs/f2fs/<disk>/lifetime_write_kbytes
+Date:		January 2016
+Contact:	"Shuoran Liu"
+Description:
+		 Shows total written kbytes issued to disk.
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
old mode 100644
new mode 100755
index 4da837c20286..b45ebfa34bd4
--- a/Documentation/filesystems/f2fs.txt
+++ b/Documentation/filesystems/f2fs.txt
@@ -102,13 +102,16 @@ background_gc=%s       Turn on/off cleaning operations, namely garbage
                        collection, triggered in background when I/O subsystem is
                        idle. If background_gc=on, it will turn on the garbage
                        collection and if background_gc=off, garbage collection
-                       will be truned off.
+                       will be turned off. If background_gc=sync, it will turn
+                       on synchronous garbage collection running in background.
                        Default value for this option is on. So garbage
                        collection is on by default.
 disable_roll_forward   Disable the roll-forward recovery routine
 norecovery             Disable the roll-forward recovery routine, mounted read-
                        only (i.e., -o ro,disable_roll_forward)
-discard                Issue discard/TRIM commands when a segment is cleaned.
+discard/nodiscard      Enable/disable real-time discard in f2fs, if discard is
+                       enabled, f2fs will issue discard/TRIM commands when a
+                       segment is cleaned.
 no_heap                Disable heap-style segment allocation which finds free
                        segments for data from the beginning of main area, while
                        for node from the end of main area.
@@ -140,6 +143,19 @@ nobarrier              This option can be used if underlying storage guarantees
 fastboot               This option is used when a system wants to reduce mount
                        time as much as possible, even though normal performance
                        can be sacrificed.
+extent_cache           Enable an extent cache based on rb-tree, it can cache
+                       as many as extent which map between contiguous logical
+                       address and physical address per inode, resulting in
+                       increasing the cache hit ratio. Set by default.
+noextent_cache         Disable an extent cache based on rb-tree explicitly, see
+                       the above extent_cache mount option.
+noinline_data          Disable the inline data feature, inline data feature is
+                       enabled by default.
+data_flush             Enable data flushing before checkpoint in order to
+                       persist data of regular and symlink.
+mode=%s                Control block allocation mode which supports "adaptive"
+                       and "lfs". In "lfs" mode, there should be no random
+                       writes towards main area.
 
 ================================================================================
 DEBUGFS ENTRIES
 ================================================================================
@@ -187,9 +203,11 @@ Files in /sys/fs/f2fs/<devname>
 reclaim_segments       This parameter controls the number of prefree
                        segments to be reclaimed. If the number of prefree
-                       segments is larger than this number, f2fs tries to
-                       conduct checkpoint to reclaim the prefree segments
-                       to free segments. By default, 100 segments, 200MB.
+                       segments is larger than the number of segments
+                       in the proportion to the percentage over total
+                       volume size, f2fs tries to conduct checkpoint to
+                       reclaim the prefree segments to free segments.
+                       By default, 5% over total # of segments.
 
 max_small_discards     This parameter controls the number of discard
                        commands that consist small blocks less than 2MB.
@@ -484,9 +502,11 @@ The number of blocks and buckets are determined by,
  # of blocks in level #n = |
                             `- 4, Otherwise
 
-                            ,- 2^n, if n < MAX_DIR_HASH_DEPTH / 2,
+                            ,- 2^(n + dir_level),
+                            | if n + dir_level < MAX_DIR_HASH_DEPTH / 2,
  # of buckets in level #n = |
-                            `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), Otherwise
+                            `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1),
+                            Otherwise
 
 When F2FS finds a file name in a directory, at first a hash value of the file
 name is calculated. Then, F2FS scans the hash table in level #0 to find the
diff --git a/fs/Makefile b/fs/Makefile
index 13f93ad3dd61..df85c4813cd4 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)		+= aio.o
+obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)	+= locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/crypto/Kconfig b/fs/crypto/Kconfig
new file mode 100755
index 000000000000..92348faf9865
--- /dev/null
+++ b/fs/crypto/Kconfig
@@ -0,0 +1,18 @@
+config FS_ENCRYPTION
+	tristate "FS Encryption (Per-file encryption)"
+	depends on BLOCK
+	select CRYPTO
+	select CRYPTO_AES
+	select CRYPTO_CBC
+	select CRYPTO_ECB
+	select CRYPTO_XTS
+	select CRYPTO_CTS
+	select CRYPTO_CTR
+	select CRYPTO_SHA256
+	select KEYS
+	select ENCRYPTED_KEYS
+	help
+	  Enable encryption of files and directories.  This
+	  feature is similar to ecryptfs, but it is more memory
+	  efficient since it avoids caching the encrypted and
+	  decrypted pages in the page cache.
diff --git a/fs/crypto/Makefile b/fs/crypto/Makefile
new file mode 100755
index 000000000000..fec9afe5d84a
--- /dev/null
+++ b/fs/crypto/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_FS_ENCRYPTION)	+= # fscrypto.o
+
+fscrypto-y := # crypto.o fname.o policy.o keyinfo.o
diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
new file mode 100755
index 000000000000..a340a6f6f531
--- /dev/null
+++ b/fs/crypto/crypto.c
@@ -0,0 +1,567 @@
+/*
+ * This contains encryption functions for per-file encryption.
+ *
+ * Copyright (C) 2015, Google, Inc.
+ * Copyright (C) 2015, Motorola Mobility
+ *
+ * Written by Michael Halcrow, 2014.
+ *
+ * Filename encryption additions
+ *	Uday Savagaonkar, 2014
+ * Encryption policy handling additions
+ *	Ildar Muslukhov, 2014
+ * Add fscrypt_pullback_bio_page()
+ *	Jaegeuk Kim, 2015.
+ *
+ * This has not yet undergone a rigorous security audit.
+ *
+ * The usage of AES-XTS should conform to recommendations in NIST
+ * Special Publication 800-38E and IEEE P1619/D16.
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static unsigned int num_prealloc_crypto_pages = 32; +static unsigned int num_prealloc_crypto_ctxs = 128; + +module_param(num_prealloc_crypto_pages, uint, 0444); +MODULE_PARM_DESC(num_prealloc_crypto_pages, + "Number of crypto pages to preallocate"); +module_param(num_prealloc_crypto_ctxs, uint, 0444); +MODULE_PARM_DESC(num_prealloc_crypto_ctxs, + "Number of crypto contexts to preallocate"); + +static mempool_t *fscrypt_bounce_page_pool = NULL; + +static LIST_HEAD(fscrypt_free_ctxs); +static DEFINE_SPINLOCK(fscrypt_ctx_lock); + +static struct workqueue_struct *fscrypt_read_workqueue; +static DEFINE_MUTEX(fscrypt_init_mutex); + +static struct kmem_cache *fscrypt_ctx_cachep; +struct kmem_cache *fscrypt_info_cachep; + +/** + * fscrypt_release_ctx() - Releases an encryption context + * @ctx: The encryption context to release. + * + * If the encryption context was allocated from the pre-allocated pool, returns + * it to that pool. Else, frees it. + * + * If there's a bounce page in the context, this frees that. + */ +void fscrypt_release_ctx(struct fscrypt_ctx *ctx) +{ + unsigned long flags; + + if (ctx->flags & FS_WRITE_PATH_FL && ctx->w.bounce_page) { + mempool_free(ctx->w.bounce_page, fscrypt_bounce_page_pool); + ctx->w.bounce_page = NULL; + } + ctx->w.control_page = NULL; + if (ctx->flags & FS_CTX_REQUIRES_FREE_ENCRYPT_FL) { + kmem_cache_free(fscrypt_ctx_cachep, ctx); + } else { + spin_lock_irqsave(&fscrypt_ctx_lock, flags); + list_add(&ctx->free_list, &fscrypt_free_ctxs); + spin_unlock_irqrestore(&fscrypt_ctx_lock, flags); + } +} +EXPORT_SYMBOL(fscrypt_release_ctx); + +/** + * fscrypt_get_ctx() - Gets an encryption context + * @inode: The inode for which we are doing the crypto + * @gfp_flags: The gfp flag for memory allocation + * + * Allocates and initializes an encryption context. + * + * Return: An allocated and initialized encryption context on success; error + * value or NULL otherwise. + */ +struct fscrypt_ctx *fscrypt_get_ctx(struct inode *inode, gfp_t gfp_flags) +{ + struct fscrypt_ctx *ctx = NULL; + struct fscrypt_info *ci = inode->i_crypt_info; + unsigned long flags; + + if (ci == NULL) + return ERR_PTR(-ENOKEY); + + /* + * We first try getting the ctx from a free list because in + * the common case the ctx will have an allocated and + * initialized crypto tfm, so it's probably a worthwhile + * optimization. For the bounce page, we first try getting it + * from the kernel allocator because that's just about as fast + * as getting it from a list and because a cache of free pages + * should generally be a "last resort" option for a filesystem + * to be able to do its job. 
+ */
+	spin_lock_irqsave(&fscrypt_ctx_lock, flags);
+	ctx = list_first_entry_or_null(&fscrypt_free_ctxs,
+					struct fscrypt_ctx, free_list);
+	if (ctx)
+		list_del(&ctx->free_list);
+	spin_unlock_irqrestore(&fscrypt_ctx_lock, flags);
+	if (!ctx) {
+		ctx = kmem_cache_zalloc(fscrypt_ctx_cachep, gfp_flags);
+		if (!ctx)
+			return ERR_PTR(-ENOMEM);
+		ctx->flags |= FS_CTX_REQUIRES_FREE_ENCRYPT_FL;
+	} else {
+		ctx->flags &= ~FS_CTX_REQUIRES_FREE_ENCRYPT_FL;
+	}
+	ctx->flags &= ~FS_WRITE_PATH_FL;
+	return ctx;
+}
+EXPORT_SYMBOL(fscrypt_get_ctx);
+
+/**
+ * fscrypt_complete() - The completion callback for page encryption
+ * @req: The asynchronous encryption request context
+ * @res: The result of the encryption operation
+ */
+static void fscrypt_complete(struct crypto_async_request *req, int res)
+{
+	struct fscrypt_completion_result *ecr = req->data;
+
+	if (res == -EINPROGRESS)
+		return;
+	ecr->res = res;
+	complete(&ecr->completion);
+}
+
+typedef enum {
+	FS_DECRYPT = 0,
+	FS_ENCRYPT,
+} fscrypt_direction_t;
+
+static int do_page_crypto(struct inode *inode,
+			fscrypt_direction_t rw, pgoff_t index,
+			struct page *src_page, struct page *dest_page,
+			gfp_t gfp_flags)
+{
+	u8 xts_tweak[FS_XTS_TWEAK_SIZE];
+	struct ablkcipher_request *req = NULL;
+	DECLARE_FS_COMPLETION_RESULT(ecr);
+	struct scatterlist dst, src;
+	struct fscrypt_info *ci = inode->i_crypt_info;
+	struct crypto_ablkcipher *tfm = ci->ci_ctfm;
+	int res = 0;
+
+	req = ablkcipher_request_alloc(tfm, gfp_flags);
+	if (!req) {
+		printk_ratelimited(KERN_ERR
+				"%s: crypto_request_alloc() failed\n",
+				__func__);
+		return -ENOMEM;
+	}
+
+	ablkcipher_request_set_callback(
+		req, CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
+		fscrypt_complete, &ecr);
+
+	BUILD_BUG_ON(FS_XTS_TWEAK_SIZE < sizeof(index));
+	memcpy(xts_tweak, &index, sizeof(index));
+	memset(&xts_tweak[sizeof(index)], 0,
+			FS_XTS_TWEAK_SIZE - sizeof(index));
+
+	sg_init_table(&dst, 1);
+	sg_set_page(&dst, dest_page, PAGE_CACHE_SIZE, 0);
+	sg_init_table(&src, 1);
+	sg_set_page(&src, src_page, PAGE_CACHE_SIZE, 0);
+	ablkcipher_request_set_crypt(req, &src, &dst, PAGE_CACHE_SIZE,
+					xts_tweak);
+	if (rw == FS_DECRYPT)
+		res = crypto_ablkcipher_decrypt(req);
+	else
+		res = crypto_ablkcipher_encrypt(req);
+	if (res == -EINPROGRESS || res == -EBUSY) {
+		BUG_ON(req->base.data != &ecr);
+		wait_for_completion(&ecr.completion);
+		res = ecr.res;
+	}
+	ablkcipher_request_free(req);
+	if (res) {
+		printk_ratelimited(KERN_ERR
+			"%s: crypto_ablkcipher_encrypt() returned %d\n",
+			__func__, res);
+		return res;
+	}
+	return 0;
+}
+
+static struct page *alloc_bounce_page(struct fscrypt_ctx *ctx, gfp_t gfp_flags)
+{
+	ctx->w.bounce_page = mempool_alloc(fscrypt_bounce_page_pool, gfp_flags);
+	if (ctx->w.bounce_page == NULL)
+		return ERR_PTR(-ENOMEM);
+	ctx->flags |= FS_WRITE_PATH_FL;
+	return ctx->w.bounce_page;
+}
+
+/**
+ * fscrypt_encrypt_page() - Encrypts a page
+ * @inode: The inode for which the encryption should take place
+ * @plaintext_page: The page to encrypt. Must be locked.
+ * @gfp_flags: The gfp flag for memory allocation
+ *
+ * Allocates a ciphertext page and encrypts plaintext_page into it using the ctx
+ * encryption context.
+ *
+ * Called on the page write path. The caller must call
+ * fscrypt_restore_control_page() on the returned ciphertext page to
+ * release the bounce buffer and the encryption context.
+ *
+ * Return: An allocated page with the encrypted content on success. Else, an
+ * error value or NULL.
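+ *
+ * Editorial note (derived from the function body below): the ciphertext page
+ * is returned locked, with the fscrypt_ctx stashed in its page_private();
+ * fscrypt_restore_control_page() unlocks the page and releases both the
+ * bounce page and the context.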
+ */
+struct page *fscrypt_encrypt_page(struct inode *inode,
+				struct page *plaintext_page, gfp_t gfp_flags)
+{
+	struct fscrypt_ctx *ctx;
+	struct page *ciphertext_page = NULL;
+	int err;
+
+	BUG_ON(!PageLocked(plaintext_page));
+
+	ctx = fscrypt_get_ctx(inode, gfp_flags);
+	if (IS_ERR(ctx))
+		return (struct page *)ctx;
+
+	/* The encryption operation will require a bounce page. */
+	ciphertext_page = alloc_bounce_page(ctx, gfp_flags);
+	if (IS_ERR(ciphertext_page))
+		goto errout;
+
+	ctx->w.control_page = plaintext_page;
+	err = do_page_crypto(inode, FS_ENCRYPT, plaintext_page->index,
+					plaintext_page, ciphertext_page,
+					gfp_flags);
+	if (err) {
+		ciphertext_page = ERR_PTR(err);
+		goto errout;
+	}
+	SetPagePrivate(ciphertext_page);
+	set_page_private(ciphertext_page, (unsigned long)ctx);
+	lock_page(ciphertext_page);
+	return ciphertext_page;
+
+errout:
+	fscrypt_release_ctx(ctx);
+	return ciphertext_page;
+}
+EXPORT_SYMBOL(fscrypt_encrypt_page);
+
+/**
+ * fscrypt_decrypt_page() - Decrypts a page in-place
+ * @page: The page to decrypt. Must be locked.
+ *
+ * Decrypts page in-place using the ctx encryption context.
+ *
+ * Called from the read completion callback.
+ *
+ * Return: Zero on success, non-zero otherwise.
+ */
+int fscrypt_decrypt_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	return do_page_crypto(page->mapping->host,
+			FS_DECRYPT, page->index, page, page, GFP_NOFS);
+}
+EXPORT_SYMBOL(fscrypt_decrypt_page);
+
+int fscrypt_zeroout_range(struct inode *inode, pgoff_t lblk,
+				sector_t pblk, unsigned int len)
+{
+	struct fscrypt_ctx *ctx;
+	struct page *ciphertext_page = NULL;
+	struct bio *bio;
+	int ret, err = 0;
+
+	BUG_ON(inode->i_sb->s_blocksize != PAGE_CACHE_SIZE);
+
+	ctx = fscrypt_get_ctx(inode, GFP_NOFS);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ciphertext_page = alloc_bounce_page(ctx, GFP_NOWAIT);
+	if (IS_ERR(ciphertext_page)) {
+		err = PTR_ERR(ciphertext_page);
+		goto errout;
+	}
+
+	while (len--) {
+		err = do_page_crypto(inode, FS_ENCRYPT, lblk,
+					ZERO_PAGE(0), ciphertext_page,
+					GFP_NOFS);
+		if (err)
+			goto errout;
+
+		bio = bio_alloc(GFP_NOWAIT, 1);
+		if (!bio) {
+			err = -ENOMEM;
+			goto errout;
+		}
+		bio->bi_bdev = inode->i_sb->s_bdev;
+		bio->bi_sector =
+			pblk << (inode->i_sb->s_blocksize_bits - 9);
+		ret = bio_add_page(bio, ciphertext_page,
+					inode->i_sb->s_blocksize, 0);
+		if (ret != inode->i_sb->s_blocksize) {
+			/* should never happen! */
+			WARN_ON(1);
+			bio_put(bio);
+			err = -EIO;
+			goto errout;
+		}
+		err = submit_bio_wait(WRITE, bio);
+		bio_put(bio);
+		if (err)
+			goto errout;
+		lblk++;
+		pblk++;
+	}
+	err = 0;
+errout:
+	fscrypt_release_ctx(ctx);
+	return err;
+}
+EXPORT_SYMBOL(fscrypt_zeroout_range);
+
+/*
+ * Validate dentries for encrypted directories to make sure we aren't
+ * potentially caching stale data after a key has been added or
+ * removed.
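+ *
+ * Editorial note: returning 0 below tells the VFS that the dentry is stale
+ * and must be looked up again, 1 that it is still valid; -ECHILD punts the
+ * check out of RCU-walk mode.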
+ */ +static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags) +{ + struct dentry *dir; + struct fscrypt_info *ci; + int dir_has_key, cached_with_key; + + if (flags & LOOKUP_RCU) + return -ECHILD; + + dir = dget_parent(dentry); + if (!d_inode(dir)->i_sb->s_cop->is_encrypted(d_inode(dir))) { + dput(dir); + return 0; + } + + ci = d_inode(dir)->i_crypt_info; + if (ci && ci->ci_keyring_key && + (ci->ci_keyring_key->flags & ((1 << KEY_FLAG_INVALIDATED) | + (1 << KEY_FLAG_REVOKED) | + (1 << KEY_FLAG_DEAD)))) + ci = NULL; + + /* this should eventually be an flag in d_flags */ + spin_lock(&dentry->d_lock); + cached_with_key = dentry->d_flags & DCACHE_ENCRYPTED_WITH_KEY; + spin_unlock(&dentry->d_lock); + dir_has_key = (ci != NULL); + dput(dir); + + /* + * If the dentry was cached without the key, and it is a + * negative dentry, it might be a valid name. We can't check + * if the key has since been made available due to locking + * reasons, so we fail the validation so ext4_lookup() can do + * this check. + * + * We also fail the validation if the dentry was created with + * the key present, but we no longer have the key, or vice versa. + */ + if (!cached_with_key || + (!cached_with_key && dir_has_key) || + (cached_with_key && !dir_has_key)) + return 0; + return 1; +} + +const struct dentry_operations fscrypt_d_ops = { + .d_revalidate = fscrypt_d_revalidate, +}; +EXPORT_SYMBOL(fscrypt_d_ops); + +/* + * Call fscrypt_decrypt_page on every single page, reusing the encryption + * context. + */ +static void completion_pages(struct work_struct *work) +{ + struct fscrypt_ctx *ctx = + container_of(work, struct fscrypt_ctx, r.work); + struct bio *bio = ctx->r.bio; + struct bio_vec *bv; + int i; + + bio_for_each_segment_all(bv, bio, i) { + struct page *page = bv->bv_page; + int ret = fscrypt_decrypt_page(page); + + if (ret) { + WARN_ON_ONCE(1); + SetPageError(page); + } else { + SetPageUptodate(page); + } + unlock_page(page); + } + fscrypt_release_ctx(ctx); + bio_put(bio); +} + +void fscrypt_decrypt_bio_pages(struct fscrypt_ctx *ctx, struct bio *bio) +{ + INIT_WORK(&ctx->r.work, completion_pages); + ctx->r.bio = bio; + queue_work(fscrypt_read_workqueue, &ctx->r.work); +} +EXPORT_SYMBOL(fscrypt_decrypt_bio_pages); + +void fscrypt_pullback_bio_page(struct page **page, bool restore) +{ + struct fscrypt_ctx *ctx; + struct page *bounce_page; + + /* The bounce data pages are unmapped. */ + if ((*page)->mapping) + return; + + /* The bounce data page is unmapped. */ + bounce_page = *page; + ctx = (struct fscrypt_ctx *)page_private(bounce_page); + + /* restore control page */ + *page = ctx->w.control_page; + + if (restore) + fscrypt_restore_control_page(bounce_page); +} +EXPORT_SYMBOL(fscrypt_pullback_bio_page); + +void fscrypt_restore_control_page(struct page *page) +{ + struct fscrypt_ctx *ctx; + + ctx = (struct fscrypt_ctx *)page_private(page); + set_page_private(page, (unsigned long)NULL); + ClearPagePrivate(page); + unlock_page(page); + fscrypt_release_ctx(ctx); +} +EXPORT_SYMBOL(fscrypt_restore_control_page); + +static void fscrypt_destroy(void) +{ + struct fscrypt_ctx *pos, *n; + + list_for_each_entry_safe(pos, n, &fscrypt_free_ctxs, free_list) + kmem_cache_free(fscrypt_ctx_cachep, pos); + INIT_LIST_HEAD(&fscrypt_free_ctxs); + mempool_destroy(fscrypt_bounce_page_pool); + fscrypt_bounce_page_pool = NULL; +} + +/** + * fscrypt_initialize() - allocate major buffers for fs encryption. 
+ * + * We only call this when we start accessing encrypted files, since it + * results in memory getting allocated that wouldn't otherwise be used. + * + * Return: Zero on success, non-zero otherwise. + */ +int fscrypt_initialize(void) +{ + int i, res = -ENOMEM; + + if (fscrypt_bounce_page_pool) + return 0; + + mutex_lock(&fscrypt_init_mutex); + if (fscrypt_bounce_page_pool) + goto already_initialized; + + for (i = 0; i < num_prealloc_crypto_ctxs; i++) { + struct fscrypt_ctx *ctx; + + ctx = kmem_cache_zalloc(fscrypt_ctx_cachep, GFP_NOFS); + if (!ctx) + goto fail; + list_add(&ctx->free_list, &fscrypt_free_ctxs); + } + + fscrypt_bounce_page_pool = + mempool_create_page_pool(num_prealloc_crypto_pages, 0); + if (!fscrypt_bounce_page_pool) + goto fail; + +already_initialized: + mutex_unlock(&fscrypt_init_mutex); + return 0; +fail: + fscrypt_destroy(); + mutex_unlock(&fscrypt_init_mutex); + return res; +} +EXPORT_SYMBOL(fscrypt_initialize); + +/** + * fscrypt_init() - Set up for fs encryption. + */ +static int __init fscrypt_init(void) +{ + fscrypt_read_workqueue = alloc_workqueue("fscrypt_read_queue", + WQ_HIGHPRI, 0); + if (!fscrypt_read_workqueue) + goto fail; + + fscrypt_ctx_cachep = KMEM_CACHE(fscrypt_ctx, SLAB_RECLAIM_ACCOUNT); + if (!fscrypt_ctx_cachep) + goto fail_free_queue; + + fscrypt_info_cachep = KMEM_CACHE(fscrypt_info, SLAB_RECLAIM_ACCOUNT); + if (!fscrypt_info_cachep) + goto fail_free_ctx; + + return 0; + +fail_free_ctx: + kmem_cache_destroy(fscrypt_ctx_cachep); +fail_free_queue: + destroy_workqueue(fscrypt_read_workqueue); +fail: + return -ENOMEM; +} +module_init(fscrypt_init) + +/** + * fscrypt_exit() - Shutdown the fs encryption system + */ +static void __exit fscrypt_exit(void) +{ + fscrypt_destroy(); + + if (fscrypt_read_workqueue) + destroy_workqueue(fscrypt_read_workqueue); + kmem_cache_destroy(fscrypt_ctx_cachep); + kmem_cache_destroy(fscrypt_info_cachep); +} +module_exit(fscrypt_exit); + +MODULE_LICENSE("GPL"); diff --git a/fs/crypto/fname.c b/fs/crypto/fname.c new file mode 100755 index 000000000000..5e4ddeeba267 --- /dev/null +++ b/fs/crypto/fname.c @@ -0,0 +1,427 @@ +/* + * This contains functions for filename crypto management + * + * Copyright (C) 2015, Google, Inc. + * Copyright (C) 2015, Motorola Mobility + * + * Written by Uday Savagaonkar, 2014. + * Modified by Jaegeuk Kim, 2015. + * + * This has not yet undergone a rigorous security audit. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +static u32 size_round_up(size_t size, size_t blksize) +{ + return ((size + blksize - 1) / blksize) * blksize; +} + +/** + * dir_crypt_complete() - + */ +static void dir_crypt_complete(struct crypto_async_request *req, int res) +{ + struct fscrypt_completion_result *ecr = req->data; + + if (res == -EINPROGRESS) + return; + ecr->res = res; + complete(&ecr->completion); +} + +/** + * fname_encrypt() - + * + * This function encrypts the input filename, and returns the length of the + * ciphertext. Errors are returned as negative numbers. We trust the caller to + * allocate sufficient memory to oname string. 
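+ *
+ * Editorial note (from the body below): the name is zero-padded to a multiple
+ * of the policy padding, 4 << (ci_flags & FS_POLICY_FLAGS_PAD_MASK), with a
+ * minimum of FS_CRYPTO_BLOCK_SIZE, and encrypted with an all-zero IV.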
+ */ +static int fname_encrypt(struct inode *inode, + const struct qstr *iname, struct fscrypt_str *oname) +{ + u32 ciphertext_len; + struct ablkcipher_request *req = NULL; + DECLARE_FS_COMPLETION_RESULT(ecr); + struct fscrypt_info *ci = inode->i_crypt_info; + struct crypto_ablkcipher *tfm = ci->ci_ctfm; + int res = 0; + char iv[FS_CRYPTO_BLOCK_SIZE]; + struct scatterlist src_sg, dst_sg; + int padding = 4 << (ci->ci_flags & FS_POLICY_FLAGS_PAD_MASK); + char *workbuf, buf[32], *alloc_buf = NULL; + unsigned lim; + + lim = inode->i_sb->s_cop->max_namelen(inode); + if (iname->len <= 0 || iname->len > lim) + return -EIO; + + ciphertext_len = (iname->len < FS_CRYPTO_BLOCK_SIZE) ? + FS_CRYPTO_BLOCK_SIZE : iname->len; + ciphertext_len = size_round_up(ciphertext_len, padding); + ciphertext_len = (ciphertext_len > lim) ? lim : ciphertext_len; + + if (ciphertext_len <= sizeof(buf)) { + workbuf = buf; + } else { + alloc_buf = kmalloc(ciphertext_len, GFP_NOFS); + if (!alloc_buf) + return -ENOMEM; + workbuf = alloc_buf; + } + + /* Allocate request */ + req = ablkcipher_request_alloc(tfm, GFP_NOFS); + if (!req) { + printk_ratelimited(KERN_ERR + "%s: crypto_request_alloc() failed\n", __func__); + kfree(alloc_buf); + return -ENOMEM; + } + ablkcipher_request_set_callback(req, + CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP, + dir_crypt_complete, &ecr); + + /* Copy the input */ + memcpy(workbuf, iname->name, iname->len); + if (iname->len < ciphertext_len) + memset(workbuf + iname->len, 0, ciphertext_len - iname->len); + + /* Initialize IV */ + memset(iv, 0, FS_CRYPTO_BLOCK_SIZE); + + /* Create encryption request */ + sg_init_one(&src_sg, workbuf, ciphertext_len); + sg_init_one(&dst_sg, oname->name, ciphertext_len); + ablkcipher_request_set_crypt(req, &src_sg, &dst_sg, ciphertext_len, iv); + res = crypto_ablkcipher_encrypt(req); + if (res == -EINPROGRESS || res == -EBUSY) { + wait_for_completion(&ecr.completion); + res = ecr.res; + } + kfree(alloc_buf); + ablkcipher_request_free(req); + if (res < 0) + printk_ratelimited(KERN_ERR + "%s: Error (error code %d)\n", __func__, res); + + oname->len = ciphertext_len; + return res; +} + +/* + * fname_decrypt() + * This function decrypts the input filename, and returns + * the length of the plaintext. + * Errors are returned as negative numbers. + * We trust the caller to allocate sufficient memory to oname string. 
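+ *
+ * Editorial note: since fname_encrypt() zero-pads the name, the plaintext
+ * length is recovered below with strnlen() after decryption.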
+ */ +static int fname_decrypt(struct inode *inode, + const struct fscrypt_str *iname, + struct fscrypt_str *oname) +{ + struct ablkcipher_request *req = NULL; + DECLARE_FS_COMPLETION_RESULT(ecr); + struct scatterlist src_sg, dst_sg; + struct fscrypt_info *ci = inode->i_crypt_info; + struct crypto_ablkcipher *tfm = ci->ci_ctfm; + int res = 0; + char iv[FS_CRYPTO_BLOCK_SIZE]; + unsigned lim; + + lim = inode->i_sb->s_cop->max_namelen(inode); + if (iname->len <= 0 || iname->len > lim) + return -EIO; + + /* Allocate request */ + req = ablkcipher_request_alloc(tfm, GFP_NOFS); + if (!req) { + printk_ratelimited(KERN_ERR + "%s: crypto_request_alloc() failed\n", __func__); + return -ENOMEM; + } + ablkcipher_request_set_callback(req, + CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP, + dir_crypt_complete, &ecr); + + /* Initialize IV */ + memset(iv, 0, FS_CRYPTO_BLOCK_SIZE); + + /* Create decryption request */ + sg_init_one(&src_sg, iname->name, iname->len); + sg_init_one(&dst_sg, oname->name, oname->len); + ablkcipher_request_set_crypt(req, &src_sg, &dst_sg, iname->len, iv); + res = crypto_ablkcipher_decrypt(req); + if (res == -EINPROGRESS || res == -EBUSY) { + wait_for_completion(&ecr.completion); + res = ecr.res; + } + ablkcipher_request_free(req); + if (res < 0) { + printk_ratelimited(KERN_ERR + "%s: Error (error code %d)\n", __func__, res); + return res; + } + + oname->len = strnlen(oname->name, iname->len); + return oname->len; +} + +static const char *lookup_table = + "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,"; + +/** + * digest_encode() - + * + * Encodes the input digest using characters from the set [a-zA-Z0-9_+]. + * The encoded string is roughly 4/3 times the size of the input string. + */ +static int digest_encode(const char *src, int len, char *dst) +{ + int i = 0, bits = 0, ac = 0; + char *cp = dst; + + while (i < len) { + ac += (((unsigned char) src[i]) << bits); + bits += 8; + do { + *cp++ = lookup_table[ac & 0x3f]; + ac >>= 6; + bits -= 6; + } while (bits >= 6); + i++; + } + if (bits) + *cp++ = lookup_table[ac & 0x3f]; + return cp - dst; +} + +static int digest_decode(const char *src, int len, char *dst) +{ + int i = 0, bits = 0, ac = 0; + const char *p; + char *cp = dst; + + while (i < len) { + p = strchr(lookup_table, src[i]); + if (p == NULL || src[i] == 0) + return -2; + ac += (p - lookup_table) << bits; + bits += 6; + if (bits >= 8) { + *cp++ = ac & 0xff; + ac >>= 8; + bits -= 8; + } + i++; + } + if (ac) + return -1; + return cp - dst; +} + +u32 fscrypt_fname_encrypted_size(struct inode *inode, u32 ilen) +{ + int padding = 32; + struct fscrypt_info *ci = inode->i_crypt_info; + + if (ci) + padding = 4 << (ci->ci_flags & FS_POLICY_FLAGS_PAD_MASK); + if (ilen < FS_CRYPTO_BLOCK_SIZE) + ilen = FS_CRYPTO_BLOCK_SIZE; + return size_round_up(ilen, padding); +} +EXPORT_SYMBOL(fscrypt_fname_encrypted_size); + +/** + * fscrypt_fname_crypto_alloc_obuff() - + * + * Allocates an output buffer that is sufficient for the crypto operation + * specified by the context and the direction. 
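+ *
+ * Editorial note: the header above kept an older name; the function is
+ * exported as fscrypt_fname_alloc_buffer() below. The buffer is sized with
+ * one extra byte for NUL termination and is freed with
+ * fscrypt_fname_free_buffer().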
+ */ +int fscrypt_fname_alloc_buffer(struct inode *inode, + u32 ilen, struct fscrypt_str *crypto_str) +{ + unsigned int olen = fscrypt_fname_encrypted_size(inode, ilen); + + crypto_str->len = olen; + if (olen < FS_FNAME_CRYPTO_DIGEST_SIZE * 2) + olen = FS_FNAME_CRYPTO_DIGEST_SIZE * 2; + /* + * Allocated buffer can hold one more character to null-terminate the + * string + */ + crypto_str->name = kmalloc(olen + 1, GFP_NOFS); + if (!(crypto_str->name)) + return -ENOMEM; + return 0; +} +EXPORT_SYMBOL(fscrypt_fname_alloc_buffer); + +/** + * fscrypt_fname_crypto_free_buffer() - + * + * Frees the buffer allocated for crypto operation. + */ +void fscrypt_fname_free_buffer(struct fscrypt_str *crypto_str) +{ + if (!crypto_str) + return; + kfree(crypto_str->name); + crypto_str->name = NULL; +} +EXPORT_SYMBOL(fscrypt_fname_free_buffer); + +/** + * fscrypt_fname_disk_to_usr() - converts a filename from disk space to user + * space + */ +int fscrypt_fname_disk_to_usr(struct inode *inode, + u32 hash, u32 minor_hash, + const struct fscrypt_str *iname, + struct fscrypt_str *oname) +{ + const struct qstr qname = FSTR_TO_QSTR(iname); + char buf[24]; + int ret; + + if (fscrypt_is_dot_dotdot(&qname)) { + oname->name[0] = '.'; + oname->name[iname->len - 1] = '.'; + oname->len = iname->len; + return oname->len; + } + + if (iname->len < FS_CRYPTO_BLOCK_SIZE) + return -EUCLEAN; + + if (inode->i_crypt_info) + return fname_decrypt(inode, iname, oname); + + if (iname->len <= FS_FNAME_CRYPTO_DIGEST_SIZE) { + ret = digest_encode(iname->name, iname->len, oname->name); + oname->len = ret; + return ret; + } + if (hash) { + memcpy(buf, &hash, 4); + memcpy(buf + 4, &minor_hash, 4); + } else { + memset(buf, 0, 8); + } + memcpy(buf + 8, iname->name + iname->len - 16, 16); + oname->name[0] = '_'; + ret = digest_encode(buf, 24, oname->name + 1); + oname->len = ret + 1; + return ret + 1; +} +EXPORT_SYMBOL(fscrypt_fname_disk_to_usr); + +/** + * fscrypt_fname_usr_to_disk() - converts a filename from user space to disk + * space + */ +int fscrypt_fname_usr_to_disk(struct inode *inode, + const struct qstr *iname, + struct fscrypt_str *oname) +{ + if (fscrypt_is_dot_dotdot(iname)) { + oname->name[0] = '.'; + oname->name[iname->len - 1] = '.'; + oname->len = iname->len; + return oname->len; + } + if (inode->i_crypt_info) + return fname_encrypt(inode, iname, oname); + /* + * Without a proper key, a user is not allowed to modify the filenames + * in a directory. 
Consequently, a user space name cannot be mapped to + * a disk-space name + */ + return -EACCES; +} +EXPORT_SYMBOL(fscrypt_fname_usr_to_disk); + +int fscrypt_setup_filename(struct inode *dir, const struct qstr *iname, + int lookup, struct fscrypt_name *fname) +{ + int ret = 0, bigname = 0; + + memset(fname, 0, sizeof(struct fscrypt_name)); + fname->usr_fname = iname; + + if (!dir->i_sb->s_cop->is_encrypted(dir) || + fscrypt_is_dot_dotdot(iname)) { + fname->disk_name.name = (unsigned char *)iname->name; + fname->disk_name.len = iname->len; + return 0; + } + ret = get_crypt_info(dir); + if (ret && ret != -EOPNOTSUPP) + return ret; + + if (dir->i_crypt_info) { + ret = fscrypt_fname_alloc_buffer(dir, iname->len, + &fname->crypto_buf); + if (ret < 0) + return ret; + ret = fname_encrypt(dir, iname, &fname->crypto_buf); + if (ret < 0) + goto errout; + fname->disk_name.name = fname->crypto_buf.name; + fname->disk_name.len = fname->crypto_buf.len; + return 0; + } + if (!lookup) + return -EACCES; + + /* + * We don't have the key and we are doing a lookup; decode the + * user-supplied name + */ + if (iname->name[0] == '_') + bigname = 1; + if ((bigname && (iname->len != 33)) || (!bigname && (iname->len > 43))) + return -ENOENT; + + fname->crypto_buf.name = kmalloc(32, GFP_KERNEL); + if (fname->crypto_buf.name == NULL) + return -ENOMEM; + + ret = digest_decode(iname->name + bigname, iname->len - bigname, + fname->crypto_buf.name); + if (ret < 0) { + ret = -ENOENT; + goto errout; + } + fname->crypto_buf.len = ret; + if (bigname) { + memcpy(&fname->hash, fname->crypto_buf.name, 4); + memcpy(&fname->minor_hash, fname->crypto_buf.name + 4, 4); + } else { + fname->disk_name.name = fname->crypto_buf.name; + fname->disk_name.len = fname->crypto_buf.len; + } + return 0; + +errout: + fscrypt_fname_free_buffer(&fname->crypto_buf); + return ret; +} +EXPORT_SYMBOL(fscrypt_setup_filename); + +void fscrypt_free_filename(struct fscrypt_name *fname) +{ + kfree(fname->crypto_buf.name); + fname->crypto_buf.name = NULL; + fname->usr_fname = NULL; + fname->disk_name.name = NULL; +} +EXPORT_SYMBOL(fscrypt_free_filename); diff --git a/fs/crypto/keyinfo.c b/fs/crypto/keyinfo.c new file mode 100755 index 000000000000..7be4f4312341 --- /dev/null +++ b/fs/crypto/keyinfo.c @@ -0,0 +1,310 @@ +/* + * key management facility for FS encryption support. + * + * Copyright (C) 2015, Google, Inc. + * + * This contains encryption key functions. + * + * Written by Michael Halcrow, Ildar Muslukhov, and Uday Savagaonkar, 2015. + */ + +#include +#include +#include +#include +#include +#include +#include + +static void derive_crypt_complete(struct crypto_async_request *req, int rc) +{ + struct fscrypt_completion_result *ecr = req->data; + + if (rc == -EINPROGRESS) + return; + + ecr->res = rc; + complete(&ecr->completion); +} + +/** + * derive_key_aes() - Derive a key using AES-128-ECB + * @deriving_key: Encryption key used for derivation. + * @source_key: Source key to which to apply derivation. + * @derived_key: Derived key. + * + * Return: Zero on success; non-zero otherwise. 
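+ *
+ * Editorial note (see validate_user_key() below): @deriving_key is the
+ * per-inode nonce from the fscrypt_context and @source_key is the raw master
+ * key from the user's keyring; the master key is encrypted with AES-128-ECB
+ * keyed by the nonce to produce the per-file key.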
+ */ +static int derive_key_aes(u8 deriving_key[FS_AES_128_ECB_KEY_SIZE], + u8 source_key[FS_AES_256_XTS_KEY_SIZE], + u8 derived_key[FS_AES_256_XTS_KEY_SIZE]) +{ + int res = 0; + struct ablkcipher_request *req = NULL; + DECLARE_FS_COMPLETION_RESULT(ecr); + struct scatterlist src_sg, dst_sg; + struct crypto_ablkcipher *tfm = crypto_alloc_ablkcipher("ecb(aes)", 0, + 0); + + if (IS_ERR(tfm)) { + res = PTR_ERR(tfm); + tfm = NULL; + goto out; + } + crypto_ablkcipher_set_flags(tfm, CRYPTO_TFM_REQ_WEAK_KEY); + req = ablkcipher_request_alloc(tfm, GFP_NOFS); + if (!req) { + res = -ENOMEM; + goto out; + } + ablkcipher_request_set_callback(req, + CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP, + derive_crypt_complete, &ecr); + res = crypto_ablkcipher_setkey(tfm, deriving_key, + FS_AES_128_ECB_KEY_SIZE); + if (res < 0) + goto out; + + sg_init_one(&src_sg, source_key, FS_AES_256_XTS_KEY_SIZE); + sg_init_one(&dst_sg, derived_key, FS_AES_256_XTS_KEY_SIZE); + ablkcipher_request_set_crypt(req, &src_sg, &dst_sg, + FS_AES_256_XTS_KEY_SIZE, NULL); + res = crypto_ablkcipher_encrypt(req); + if (res == -EINPROGRESS || res == -EBUSY) { + wait_for_completion(&ecr.completion); + res = ecr.res; + } +out: + if (req) + ablkcipher_request_free(req); + if (tfm) + crypto_free_ablkcipher(tfm); + return res; +} + +static int validate_user_key(struct fscrypt_info *crypt_info, + struct fscrypt_context *ctx, u8 *raw_key, + u8 *prefix, int prefix_size) +{ + u8 *full_key_descriptor; + struct key *keyring_key; + struct fscrypt_key *master_key; + const struct user_key_payload *ukp; + int full_key_len = prefix_size + (FS_KEY_DESCRIPTOR_SIZE * 2) + 1; + int res; + + full_key_descriptor = kmalloc(full_key_len, GFP_NOFS); + if (!full_key_descriptor) + return -ENOMEM; + + memcpy(full_key_descriptor, prefix, prefix_size); + sprintf(full_key_descriptor + prefix_size, + "%*phN", FS_KEY_DESCRIPTOR_SIZE, + ctx->master_key_descriptor); + full_key_descriptor[full_key_len - 1] = '\0'; + keyring_key = request_key(&key_type_logon, full_key_descriptor, NULL); + kfree(full_key_descriptor); + if (IS_ERR(keyring_key)) + return PTR_ERR(keyring_key); + + if (keyring_key->type != &key_type_logon) { + printk_once(KERN_WARNING + "%s: key type must be logon\n", __func__); + res = -ENOKEY; + goto out; + } + down_read(&keyring_key->sem); + ukp = ((struct user_key_payload *)keyring_key->payload.data); + if (ukp->datalen != sizeof(struct fscrypt_key)) { + res = -EINVAL; + up_read(&keyring_key->sem); + goto out; + } + master_key = (struct fscrypt_key *)ukp->data; + BUILD_BUG_ON(FS_AES_128_ECB_KEY_SIZE != FS_KEY_DERIVATION_NONCE_SIZE); + + if (master_key->size != FS_AES_256_XTS_KEY_SIZE) { + printk_once(KERN_WARNING + "%s: key size incorrect: %d\n", + __func__, master_key->size); + res = -ENOKEY; + up_read(&keyring_key->sem); + goto out; + } + res = derive_key_aes(ctx->nonce, master_key->raw, raw_key); + up_read(&keyring_key->sem); + if (res) + goto out; + + crypt_info->ci_keyring_key = keyring_key; + return 0; +out: + key_put(keyring_key); + return res; +} + +static void put_crypt_info(struct fscrypt_info *ci) +{ + if (!ci) + return; + + if (ci->ci_keyring_key) + key_put(ci->ci_keyring_key); + crypto_free_ablkcipher(ci->ci_ctfm); + kmem_cache_free(fscrypt_info_cachep, ci); +} + +int get_crypt_info(struct inode *inode) +{ + struct fscrypt_info *crypt_info; + struct fscrypt_context ctx; + struct crypto_ablkcipher *ctfm; + const char *cipher_str; + u8 raw_key[FS_MAX_KEY_SIZE]; + u8 mode; + int res; + + res = fscrypt_initialize(); + if (res) + return res; 
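+
+	/* Editorial note: everything below depends on the filesystem exposing
+	 * its on-disk encryption context via s_cop->get_context(). */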
+ + if (!inode->i_sb->s_cop->get_context) + return -EOPNOTSUPP; +retry: + crypt_info = ACCESS_ONCE(inode->i_crypt_info); + if (crypt_info) { + if (!crypt_info->ci_keyring_key || + key_validate(crypt_info->ci_keyring_key) == 0) + return 0; + fscrypt_put_encryption_info(inode, crypt_info); + goto retry; + } + + res = inode->i_sb->s_cop->get_context(inode, &ctx, sizeof(ctx)); + if (res < 0) { + if (!fscrypt_dummy_context_enabled(inode)) + return res; + ctx.contents_encryption_mode = FS_ENCRYPTION_MODE_AES_256_XTS; + ctx.filenames_encryption_mode = FS_ENCRYPTION_MODE_AES_256_CTS; + ctx.flags = 0; + } else if (res != sizeof(ctx)) { + return -EINVAL; + } + res = 0; + + crypt_info = kmem_cache_alloc(fscrypt_info_cachep, GFP_NOFS); + if (!crypt_info) + return -ENOMEM; + + crypt_info->ci_flags = ctx.flags; + crypt_info->ci_data_mode = ctx.contents_encryption_mode; + crypt_info->ci_filename_mode = ctx.filenames_encryption_mode; + crypt_info->ci_ctfm = NULL; + crypt_info->ci_keyring_key = NULL; + memcpy(crypt_info->ci_master_key, ctx.master_key_descriptor, + sizeof(crypt_info->ci_master_key)); + if (S_ISREG(inode->i_mode)) + mode = crypt_info->ci_data_mode; + else if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode)) + mode = crypt_info->ci_filename_mode; + else + BUG(); + + switch (mode) { + case FS_ENCRYPTION_MODE_AES_256_XTS: + cipher_str = "xts(aes)"; + break; + case FS_ENCRYPTION_MODE_AES_256_CTS: + cipher_str = "cts(cbc(aes))"; + break; + default: + printk_once(KERN_WARNING + "%s: unsupported key mode %d (ino %u)\n", + __func__, mode, (unsigned) inode->i_ino); + res = -ENOKEY; + goto out; + } + if (fscrypt_dummy_context_enabled(inode)) { + memset(raw_key, 0x42, FS_AES_256_XTS_KEY_SIZE); + goto got_key; + } + + res = validate_user_key(crypt_info, &ctx, raw_key, + FS_KEY_DESC_PREFIX, FS_KEY_DESC_PREFIX_SIZE); + if (res && inode->i_sb->s_cop->key_prefix) { + u8 *prefix = NULL; + int prefix_size, res2; + + prefix_size = inode->i_sb->s_cop->key_prefix(inode, &prefix); + res2 = validate_user_key(crypt_info, &ctx, raw_key, + prefix, prefix_size); + if (res2) { + if (res2 == -ENOKEY) + res = -ENOKEY; + goto out; + } + } else if (res) { + goto out; + } +got_key: + ctfm = crypto_alloc_ablkcipher(cipher_str, 0, 0); + if (!ctfm || IS_ERR(ctfm)) { + res = ctfm ? 
PTR_ERR(ctfm) : -ENOMEM; + printk(KERN_DEBUG + "%s: error %d (inode %u) allocating crypto tfm\n", + __func__, res, (unsigned) inode->i_ino); + goto out; + } + crypt_info->ci_ctfm = ctfm; + crypto_ablkcipher_clear_flags(ctfm, ~0); + crypto_tfm_set_flags(crypto_ablkcipher_tfm(ctfm), + CRYPTO_TFM_REQ_WEAK_KEY); + res = crypto_ablkcipher_setkey(ctfm, raw_key, fscrypt_key_size(mode)); + if (res) + goto out; + + memzero_explicit(raw_key, sizeof(raw_key)); + if (cmpxchg(&inode->i_crypt_info, NULL, crypt_info) != NULL) { + put_crypt_info(crypt_info); + goto retry; + } + return 0; + +out: + if (res == -ENOKEY) + res = 0; + put_crypt_info(crypt_info); + memzero_explicit(raw_key, sizeof(raw_key)); + return res; +} + +void fscrypt_put_encryption_info(struct inode *inode, struct fscrypt_info *ci) +{ + struct fscrypt_info *prev; + + if (ci == NULL) + ci = ACCESS_ONCE(inode->i_crypt_info); + if (ci == NULL) + return; + + prev = cmpxchg(&inode->i_crypt_info, ci, NULL); + if (prev != ci) + return; + + put_crypt_info(ci); +} +EXPORT_SYMBOL(fscrypt_put_encryption_info); + +int fscrypt_get_encryption_info(struct inode *inode) +{ + struct fscrypt_info *ci = inode->i_crypt_info; + + if (!ci || + (ci->ci_keyring_key && + (ci->ci_keyring_key->flags & ((1 << KEY_FLAG_INVALIDATED) | + (1 << KEY_FLAG_REVOKED) | + (1 << KEY_FLAG_DEAD))))) + return get_crypt_info(inode); + return 0; +} +EXPORT_SYMBOL(fscrypt_get_encryption_info); diff --git a/fs/crypto/policy.c b/fs/crypto/policy.c new file mode 100755 index 000000000000..0f9961eede1e --- /dev/null +++ b/fs/crypto/policy.c @@ -0,0 +1,229 @@ +/* + * Encryption policy functions for per-file encryption support. + * + * Copyright (C) 2015, Google, Inc. + * Copyright (C) 2015, Motorola Mobility. + * + * Written by Michael Halcrow, 2015. + * Modified by Jaegeuk Kim, 2015. 
+ */ + +#include +#include +#include + +static int inode_has_encryption_context(struct inode *inode) +{ + if (!inode->i_sb->s_cop->get_context) + return 0; + return (inode->i_sb->s_cop->get_context(inode, NULL, 0L) > 0); +} + +/* + * check whether the policy is consistent with the encryption context + * for the inode + */ +static int is_encryption_context_consistent_with_policy(struct inode *inode, + const struct fscrypt_policy *policy) +{ + struct fscrypt_context ctx; + int res; + + if (!inode->i_sb->s_cop->get_context) + return 0; + + res = inode->i_sb->s_cop->get_context(inode, &ctx, sizeof(ctx)); + if (res != sizeof(ctx)) + return 0; + + return (memcmp(ctx.master_key_descriptor, policy->master_key_descriptor, + FS_KEY_DESCRIPTOR_SIZE) == 0 && + (ctx.flags == policy->flags) && + (ctx.contents_encryption_mode == + policy->contents_encryption_mode) && + (ctx.filenames_encryption_mode == + policy->filenames_encryption_mode)); +} + +static int create_encryption_context_from_policy(struct inode *inode, + const struct fscrypt_policy *policy) +{ + struct fscrypt_context ctx; + int res; + + if (!inode->i_sb->s_cop->set_context) + return -EOPNOTSUPP; + + if (inode->i_sb->s_cop->prepare_context) { + res = inode->i_sb->s_cop->prepare_context(inode); + if (res) + return res; + } + + ctx.format = FS_ENCRYPTION_CONTEXT_FORMAT_V1; + memcpy(ctx.master_key_descriptor, policy->master_key_descriptor, + FS_KEY_DESCRIPTOR_SIZE); + + if (!fscrypt_valid_contents_enc_mode( + policy->contents_encryption_mode)) { + printk(KERN_WARNING + "%s: Invalid contents encryption mode %d\n", __func__, + policy->contents_encryption_mode); + return -EINVAL; + } + + if (!fscrypt_valid_filenames_enc_mode( + policy->filenames_encryption_mode)) { + printk(KERN_WARNING + "%s: Invalid filenames encryption mode %d\n", __func__, + policy->filenames_encryption_mode); + return -EINVAL; + } + + if (policy->flags & ~FS_POLICY_FLAGS_VALID) + return -EINVAL; + + ctx.contents_encryption_mode = policy->contents_encryption_mode; + ctx.filenames_encryption_mode = policy->filenames_encryption_mode; + ctx.flags = policy->flags; + BUILD_BUG_ON(sizeof(ctx.nonce) != FS_KEY_DERIVATION_NONCE_SIZE); + get_random_bytes(ctx.nonce, FS_KEY_DERIVATION_NONCE_SIZE); + + return inode->i_sb->s_cop->set_context(inode, &ctx, sizeof(ctx), NULL); +} + +int fscrypt_process_policy(struct inode *inode, + const struct fscrypt_policy *policy) +{ + if (policy->version != 0) + return -EINVAL; + + if (!inode_has_encryption_context(inode)) { + if (!inode->i_sb->s_cop->empty_dir) + return -EOPNOTSUPP; + if (!inode->i_sb->s_cop->empty_dir(inode)) + return -ENOTEMPTY; + return create_encryption_context_from_policy(inode, policy); + } + + if (is_encryption_context_consistent_with_policy(inode, policy)) + return 0; + + printk(KERN_WARNING "%s: Policy inconsistent with encryption context\n", + __func__); + return -EINVAL; +} +EXPORT_SYMBOL(fscrypt_process_policy); + +int fscrypt_get_policy(struct inode *inode, struct fscrypt_policy *policy) +{ + struct fscrypt_context ctx; + int res; + + if (!inode->i_sb->s_cop->get_context || + !inode->i_sb->s_cop->is_encrypted(inode)) + return -ENODATA; + + res = inode->i_sb->s_cop->get_context(inode, &ctx, sizeof(ctx)); + if (res != sizeof(ctx)) + return -ENODATA; + if (ctx.format != FS_ENCRYPTION_CONTEXT_FORMAT_V1) + return -EINVAL; + + policy->version = 0; + policy->contents_encryption_mode = ctx.contents_encryption_mode; + policy->filenames_encryption_mode = ctx.filenames_encryption_mode; + policy->flags = ctx.flags; + 
memcpy(&policy->master_key_descriptor, ctx.master_key_descriptor, + FS_KEY_DESCRIPTOR_SIZE); + return 0; +} +EXPORT_SYMBOL(fscrypt_get_policy); + +int fscrypt_has_permitted_context(struct inode *parent, struct inode *child) +{ + struct fscrypt_info *parent_ci, *child_ci; + int res; + + if ((parent == NULL) || (child == NULL)) { + printk(KERN_ERR "parent %p child %p\n", parent, child); + BUG_ON(1); + } + + /* no restrictions if the parent directory is not encrypted */ + if (!parent->i_sb->s_cop->is_encrypted(parent)) + return 1; + /* if the child directory is not encrypted, this is always a problem */ + if (!parent->i_sb->s_cop->is_encrypted(child)) + return 0; + res = fscrypt_get_encryption_info(parent); + if (res) + return 0; + res = fscrypt_get_encryption_info(child); + if (res) + return 0; + parent_ci = parent->i_crypt_info; + child_ci = child->i_crypt_info; + if (!parent_ci && !child_ci) + return 1; + if (!parent_ci || !child_ci) + return 0; + + return (memcmp(parent_ci->ci_master_key, + child_ci->ci_master_key, + FS_KEY_DESCRIPTOR_SIZE) == 0 && + (parent_ci->ci_data_mode == child_ci->ci_data_mode) && + (parent_ci->ci_filename_mode == child_ci->ci_filename_mode) && + (parent_ci->ci_flags == child_ci->ci_flags)); +} +EXPORT_SYMBOL(fscrypt_has_permitted_context); + +/** + * fscrypt_inherit_context() - Sets a child context from its parent + * @parent: Parent inode from which the context is inherited. + * @child: Child inode that inherits the context from @parent. + * @fs_data: private data given by FS. + * @preload: preload child i_crypt_info + * + * Return: Zero on success, non-zero otherwise + */ +int fscrypt_inherit_context(struct inode *parent, struct inode *child, + void *fs_data, bool preload) +{ + struct fscrypt_context ctx; + struct fscrypt_info *ci; + int res; + + if (!parent->i_sb->s_cop->set_context) + return -EOPNOTSUPP; + + res = fscrypt_get_encryption_info(parent); + if (res < 0) + return res; + + ci = parent->i_crypt_info; + if (ci == NULL) + return -ENOKEY; + + ctx.format = FS_ENCRYPTION_CONTEXT_FORMAT_V1; + if (fscrypt_dummy_context_enabled(parent)) { + ctx.contents_encryption_mode = FS_ENCRYPTION_MODE_AES_256_XTS; + ctx.filenames_encryption_mode = FS_ENCRYPTION_MODE_AES_256_CTS; + ctx.flags = 0; + memset(ctx.master_key_descriptor, 0x42, FS_KEY_DESCRIPTOR_SIZE); + res = 0; + } else { + ctx.contents_encryption_mode = ci->ci_data_mode; + ctx.filenames_encryption_mode = ci->ci_filename_mode; + ctx.flags = ci->ci_flags; + memcpy(ctx.master_key_descriptor, ci->ci_master_key, + FS_KEY_DESCRIPTOR_SIZE); + } + get_random_bytes(ctx.nonce, FS_KEY_DERIVATION_NONCE_SIZE); + res = parent->i_sb->s_cop->set_context(child, &ctx, + sizeof(ctx), fs_data); + if (res) + return res; + return preload ? fscrypt_get_encryption_info(child): 0; +} +EXPORT_SYMBOL(fscrypt_inherit_context); diff --git a/fs/f2fs/Kconfig b/fs/f2fs/Kconfig old mode 100644 new mode 100755 index 05f0f663f14c..7ffabc9e148b --- a/fs/f2fs/Kconfig +++ b/fs/f2fs/Kconfig @@ -45,7 +45,7 @@ config F2FS_FS_POSIX_ACL default y help Posix Access Control Lists (ACLs) support permissions for users and - gourps beyond the owner/group/world scheme. + groups beyond the owner/group/world scheme. To learn more about Access Control Lists, visit the POSIX ACLs for Linux website . @@ -81,3 +81,11 @@ config F2FS_IO_TRACE information and block IO patterns in the filesystem level. If unsure, say N. 
+ +config F2FS_FAULT_INJECTION + bool "F2FS fault injection facility" + depends on F2FS_FS + help + Test F2FS to inject faults such as ENOMEM, ENOSPC, and so on. + + If unsure, say N. diff --git a/fs/f2fs/Makefile b/fs/f2fs/Makefile old mode 100644 new mode 100755 index d92397731db8..ca949ea7c02f --- a/fs/f2fs/Makefile +++ b/fs/f2fs/Makefile @@ -2,6 +2,7 @@ obj-$(CONFIG_F2FS_FS) += f2fs.o f2fs-y := dir.o file.o inode.o namei.o hash.o super.o inline.o f2fs-y += checkpoint.o gc.o data.o node.o segment.o recovery.o +f2fs-y += shrinker.o extent_cache.o f2fs-$(CONFIG_F2FS_STAT_FS) += debug.o f2fs-$(CONFIG_F2FS_FS_XATTR) += xattr.o f2fs-$(CONFIG_F2FS_FS_POSIX_ACL) += acl.o diff --git a/fs/f2fs/acl.c b/fs/f2fs/acl.c old mode 100644 new mode 100755 index df1a307f5e9c..f431f9925556 --- a/fs/f2fs/acl.c +++ b/fs/f2fs/acl.c @@ -107,7 +107,7 @@ static void *f2fs_acl_to_disk(const struct posix_acl *acl, size_t *size) struct f2fs_acl_entry *entry; int i; - f2fs_acl = kmalloc(sizeof(struct f2fs_acl_header) + acl->a_count * + f2fs_acl = f2fs_kmalloc(sizeof(struct f2fs_acl_header) + acl->a_count * sizeof(struct f2fs_acl_entry), GFP_NOFS); if (!f2fs_acl) return ERR_PTR(-ENOMEM); @@ -167,7 +167,7 @@ static struct posix_acl *__f2fs_get_acl(struct inode *inode, int type, retval = f2fs_getxattr(inode, name_index, "", NULL, 0, dpage); if (retval > 0) { - value = kmalloc(retval, GFP_F2FS_ZERO); + value = f2fs_kmalloc(retval, GFP_F2FS_ZERO); if (!value) return ERR_PTR(-ENOMEM); retval = f2fs_getxattr(inode, name_index, "", value, @@ -197,7 +197,6 @@ static int f2fs_set_acl(struct inode *inode, int type, struct posix_acl *acl, struct page *ipage) { struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); - struct f2fs_inode_info *fi = F2FS_I(inode); int name_index; void *value = NULL; size_t size = 0; @@ -215,7 +214,7 @@ static int f2fs_set_acl(struct inode *inode, int type, error = posix_acl_equiv_mode(acl, &inode->i_mode); if (error < 0) return error; - set_acl_inode(fi, inode->i_mode); + set_acl_inode(inode, inode->i_mode); if (error == 0) acl = NULL; } @@ -231,10 +230,12 @@ static int f2fs_set_acl(struct inode *inode, int type, return -EINVAL; } + f2fs_mark_inode_dirty_sync(inode); + if (acl) { value = f2fs_acl_to_disk(acl, &size); if (IS_ERR(value)) { - clear_inode_flag(fi, FI_ACL_MODE); + clear_inode_flag(inode, FI_ACL_MODE); return (int)PTR_ERR(value); } } @@ -245,7 +246,7 @@ static int f2fs_set_acl(struct inode *inode, int type, if (!error) set_cached_acl(inode, type, acl); - clear_inode_flag(fi, FI_ACL_MODE); + clear_inode_flag(inode, FI_ACL_MODE); return error; } @@ -279,6 +280,8 @@ int f2fs_init_acl(struct inode *inode, struct inode *dir, struct page *ipage, return error; if (error > 0) error = f2fs_set_acl(inode, ACL_TYPE_ACCESS, acl, ipage); + + f2fs_mark_inode_dirty_sync(inode); cleanup: posix_acl_release(acl); return error; @@ -340,7 +343,7 @@ static int f2fs_xattr_get_acl(struct dentry *dentry, const char *name, if (!test_opt(sbi, POSIX_ACL)) return -EOPNOTSUPP; - acl = f2fs_get_acl(dentry->d_inode, type); + acl = f2fs_get_acl(d_inode(dentry), type); if (IS_ERR(acl)) return PTR_ERR(acl); if (!acl) @@ -355,7 +358,7 @@ static int f2fs_xattr_set_acl(struct dentry *dentry, const char *name, const void *value, size_t size, int flags, int type) { struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb); - struct inode *inode = dentry->d_inode; + struct inode *inode = d_inode(dentry); struct posix_acl *acl = NULL; int error; diff --git a/fs/f2fs/acl.h b/fs/f2fs/acl.h old mode 100644 new mode 100755 diff --git 
a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c old mode 100644 new mode 100755 index 7b20cf064338..3eee62022b9b --- a/fs/f2fs/checkpoint.c +++ b/fs/f2fs/checkpoint.c @@ -26,6 +26,14 @@ static struct kmem_cache *ino_entry_slab; struct kmem_cache *inode_entry_slab; +void f2fs_stop_checkpoint(struct f2fs_sb_info *sbi, bool end_io) +{ + set_ckpt_flags(sbi->ckpt, CP_ERROR_FLAG); + sbi->sb->s_flags |= MS_RDONLY; + if (!end_io) + f2fs_flush_merged_bios(sbi); +} + /* * We guarantee no failure on the returned page. */ @@ -34,30 +42,38 @@ struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) struct address_space *mapping = META_MAPPING(sbi); struct page *page = NULL; repeat: - page = grab_cache_page(mapping, index); + page = f2fs_grab_cache_page(mapping, index, false); if (!page) { cond_resched(); goto repeat; } - f2fs_wait_on_page_writeback(page, META); - SetPageUptodate(page); + f2fs_wait_on_page_writeback(page, META, true); + if (!PageUptodate(page)) + SetPageUptodate(page); return page; } /* * We guarantee no failure on the returned page. */ -struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) +static struct page *__get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index, + bool is_meta) { struct address_space *mapping = META_MAPPING(sbi); struct page *page; struct f2fs_io_info fio = { + .sbi = sbi, .type = META, .rw = READ_SYNC | REQ_META | REQ_PRIO, - .blk_addr = index, + .old_blkaddr = index, + .new_blkaddr = index, + .encrypted_page = NULL, }; + + if (unlikely(!is_meta)) + fio.rw &= ~REQ_META; repeat: - page = grab_cache_page(mapping, index); + page = f2fs_grab_cache_page(mapping, index, false); if (!page) { cond_resched(); goto repeat; @@ -65,21 +81,43 @@ struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) if (PageUptodate(page)) goto out; - if (f2fs_submit_page_bio(sbi, page, &fio)) + fio.page = page; + + if (f2fs_submit_page_bio(&fio)) { + f2fs_put_page(page, 1); goto repeat; + } lock_page(page); if (unlikely(page->mapping != mapping)) { f2fs_put_page(page, 1); goto repeat; } + + /* + * if there is any IO error when accessing device, make our filesystem + * readonly and make sure do not write checkpoint with non-uptodate + * meta page. + */ + if (unlikely(!PageUptodate(page))) + f2fs_stop_checkpoint(sbi, false); out: mark_page_accessed(page); return page; } -static inline bool is_valid_blkaddr(struct f2fs_sb_info *sbi, - block_t blkaddr, int type) +struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) +{ + return __get_meta_page(sbi, index, true); +} + +/* for POR only */ +struct page *get_tmp_page(struct f2fs_sb_info *sbi, pgoff_t index) +{ + return __get_meta_page(sbi, index, false); +} + +bool is_valid_blkaddr(struct f2fs_sb_info *sbi, block_t blkaddr, int type) { switch (type) { case META_NAT: @@ -113,16 +151,23 @@ static inline bool is_valid_blkaddr(struct f2fs_sb_info *sbi, /* * Readahead CP/NAT/SIT/SSA pages */ -int ra_meta_pages(struct f2fs_sb_info *sbi, block_t start, int nrpages, int type) +int ra_meta_pages(struct f2fs_sb_info *sbi, block_t start, int nrpages, + int type, bool sync) { - block_t prev_blk_addr = 0; struct page *page; block_t blkno = start; struct f2fs_io_info fio = { + .sbi = sbi, .type = META, - .rw = READ_SYNC | REQ_META | REQ_PRIO + .rw = sync ? 
(READ_SYNC | REQ_META | REQ_PRIO) : READA, + .encrypted_page = NULL, }; + struct blk_plug plug; + + if (unlikely(type == META_POR)) + fio.rw &= ~REQ_META; + blk_start_plug(&plug); for (; nrpages-- > 0; blkno++) { if (!is_valid_blkaddr(sbi, blkno, type)) @@ -134,27 +179,25 @@ int ra_meta_pages(struct f2fs_sb_info *sbi, block_t start, int nrpages, int type NAT_BLOCK_OFFSET(NM_I(sbi)->max_nid))) blkno = 0; /* get nat block addr */ - fio.blk_addr = current_nat_addr(sbi, + fio.new_blkaddr = current_nat_addr(sbi, blkno * NAT_ENTRY_PER_BLOCK); break; case META_SIT: /* get sit block addr */ - fio.blk_addr = current_sit_addr(sbi, + fio.new_blkaddr = current_sit_addr(sbi, blkno * SIT_ENTRY_PER_BLOCK); - if (blkno != start && prev_blk_addr + 1 != fio.blk_addr) - goto out; - prev_blk_addr = fio.blk_addr; break; case META_SSA: case META_CP: case META_POR: - fio.blk_addr = blkno; + fio.new_blkaddr = blkno; break; default: BUG(); } - page = grab_cache_page(META_MAPPING(sbi), fio.blk_addr); + page = f2fs_grab_cache_page(META_MAPPING(sbi), + fio.new_blkaddr, false); if (!page) continue; if (PageUptodate(page)) { @@ -162,11 +205,14 @@ int ra_meta_pages(struct f2fs_sb_info *sbi, block_t start, int nrpages, int type continue; } - f2fs_submit_page_mbio(sbi, page, &fio); + fio.page = page; + fio.old_blkaddr = fio.new_blkaddr; + f2fs_submit_page_mbio(&fio); f2fs_put_page(page, 0); } out: f2fs_submit_merged_bio(sbi, META, READ); + blk_finish_plug(&plug); return blkno - start; } @@ -176,12 +222,12 @@ void ra_meta_pages_cond(struct f2fs_sb_info *sbi, pgoff_t index) bool readahead = false; page = find_get_page(META_MAPPING(sbi), index); - if (!page || (page && !PageUptodate(page))) + if (!page || !PageUptodate(page)) readahead = true; f2fs_put_page(page, 0); if (readahead) - ra_meta_pages(sbi, index, MAX_BIO_BLOCKS(sbi), META_POR); + ra_meta_pages(sbi, index, MAX_BIO_BLOCKS(sbi), META_POR, true); } static int f2fs_write_meta_page(struct page *page, @@ -198,13 +244,17 @@ static int f2fs_write_meta_page(struct page *page, if (unlikely(f2fs_cp_error(sbi))) goto redirty_out; - f2fs_wait_on_page_writeback(page, META); write_meta_page(sbi, page); dec_page_count(sbi, F2FS_DIRTY_META); - unlock_page(page); if (wbc->for_reclaim) + f2fs_submit_merged_bio_cond(sbi, NULL, page, 0, META, WRITE); + + unlock_page(page); + + if (unlikely(f2fs_cp_error(sbi))) f2fs_submit_merged_bio(sbi, META, WRITE); + return 0; redirty_out: @@ -216,25 +266,29 @@ static int f2fs_write_meta_pages(struct address_space *mapping, struct writeback_control *wbc) { struct f2fs_sb_info *sbi = F2FS_M_SB(mapping); + struct blk_plug plug; long diff, written; - trace_f2fs_writepages(mapping->host, wbc, META); - /* collect a number of dirty meta pages and write together */ if (wbc->for_kupdate || get_pages(sbi, F2FS_DIRTY_META) < nr_pages_to_skip(sbi, META)) goto skip_write; + trace_f2fs_writepages(mapping->host, wbc, META); + /* if mounting is failed, skip writing node pages */ mutex_lock(&sbi->cp_mutex); diff = nr_pages_to_write(sbi, META, wbc); + blk_start_plug(&plug); written = sync_meta_pages(sbi, META, wbc->nr_to_write); + blk_finish_plug(&plug); mutex_unlock(&sbi->cp_mutex); wbc->nr_to_write = max((long)0, wbc->nr_to_write - written - diff); return 0; skip_write: wbc->pages_skipped += get_pages(sbi, F2FS_DIRTY_META); + trace_f2fs_writepages(mapping->host, wbc, META); return 0; } @@ -242,15 +296,18 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, long nr_to_write) { struct address_space *mapping = META_MAPPING(sbi); - pgoff_t index = 
0, end = LONG_MAX; + pgoff_t index = 0, end = ULONG_MAX, prev = ULONG_MAX; struct pagevec pvec; long nwritten = 0; struct writeback_control wbc = { .for_reclaim = 0, }; + struct blk_plug plug; pagevec_init(&pvec, 0); + blk_start_plug(&plug); + while (index <= end) { int i, nr_pages; nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, @@ -262,6 +319,13 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; + if (prev == ULONG_MAX) + prev = page->index - 1; + if (nr_to_write != LONG_MAX && page->index != prev + 1) { + pagevec_release(&pvec); + goto stop; + } + lock_page(page); if (unlikely(page->mapping != mapping)) { @@ -274,6 +338,9 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, goto continue_unlock; } + f2fs_wait_on_page_writeback(page, META, true); + + BUG_ON(PageWriteback(page)); if (!clear_page_dirty_for_io(page)) goto continue_unlock; @@ -282,16 +349,19 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, break; } nwritten++; + prev = page->index; if (unlikely(nwritten >= nr_to_write)) break; } pagevec_release(&pvec); cond_resched(); } - +stop: if (nwritten) f2fs_submit_merged_bio(sbi, type, WRITE); + blk_finish_plug(&plug); + return nwritten; } @@ -299,9 +369,10 @@ static int f2fs_set_meta_page_dirty(struct page *page) { trace_f2fs_set_page_dirty(page, META); - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); if (!PageDirty(page)) { - __set_page_dirty_nobuffers(page); + f2fs_set_page_dirty_nobuffers(page); inc_page_count(F2FS_P_SB(page), F2FS_DIRTY_META); SetPagePrivate(page); f2fs_trace_pid(page); @@ -321,26 +392,18 @@ const struct address_space_operations f2fs_meta_aops = { static void __add_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type) { struct inode_management *im = &sbi->im[type]; - struct ino_entry *e; + struct ino_entry *e, *tmp; + + tmp = f2fs_kmem_cache_alloc(ino_entry_slab, GFP_NOFS); retry: - if (radix_tree_preload(GFP_NOFS)) { - cond_resched(); - goto retry; - } + radix_tree_preload(GFP_NOFS | __GFP_NOFAIL); spin_lock(&im->ino_lock); - e = radix_tree_lookup(&im->ino_root, ino); if (!e) { - e = kmem_cache_alloc(ino_entry_slab, GFP_ATOMIC); - if (!e) { - spin_unlock(&im->ino_lock); - radix_tree_preload_end(); - goto retry; - } + e = tmp; if (radix_tree_insert(&im->ino_root, ino, e)) { spin_unlock(&im->ino_lock); - kmem_cache_free(ino_entry_slab, e); radix_tree_preload_end(); goto retry; } @@ -353,6 +416,9 @@ static void __add_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type) } spin_unlock(&im->ino_lock); radix_tree_preload_end(); + + if (e != tmp) + kmem_cache_free(ino_entry_slab, tmp); } static void __remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type) @@ -373,13 +439,13 @@ static void __remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type) spin_unlock(&im->ino_lock); } -void add_dirty_inode(struct f2fs_sb_info *sbi, nid_t ino, int type) +void add_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type) { /* add new dirty ino entry into list */ __add_ino_entry(sbi, ino, type); } -void remove_dirty_inode(struct f2fs_sb_info *sbi, nid_t ino, int type) +void remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type) { /* remove dirty ino entry from list */ __remove_ino_entry(sbi, ino, type); @@ -397,12 +463,12 @@ bool exist_written_data(struct f2fs_sb_info *sbi, nid_t ino, int mode) return e ? 
true : false; } -void release_dirty_inode(struct f2fs_sb_info *sbi) +void release_ino_entry(struct f2fs_sb_info *sbi, bool all) { struct ino_entry *e, *tmp; int i; - for (i = APPEND_INO; i <= UPDATE_INO; i++) { + for (i = all ? ORPHAN_INO: APPEND_INO; i <= UPDATE_INO; i++) { struct inode_management *im = &sbi->im[i]; spin_lock(&im->ino_lock); @@ -422,6 +488,13 @@ int acquire_orphan_inode(struct f2fs_sb_info *sbi) int err = 0; spin_lock(&im->ino_lock); + +#ifdef CONFIG_F2FS_FAULT_INJECTION + if (time_to_inject(FAULT_ORPHAN)) { + spin_unlock(&im->ino_lock); + return -ENOSPC; + } +#endif if (unlikely(im->ino_num >= sbi->max_orphans)) err = -ENOSPC; else @@ -441,10 +514,11 @@ void release_orphan_inode(struct f2fs_sb_info *sbi) spin_unlock(&im->ino_lock); } -void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) +void add_orphan_inode(struct inode *inode) { /* add new orphan ino entry into list */ - __add_ino_entry(sbi, ino, ORPHAN_INO); + __add_ino_entry(F2FS_I_SB(inode), inode->i_ino, ORPHAN_INO); + update_inode_page(inode); } void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) @@ -453,29 +527,39 @@ void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) __remove_ino_entry(sbi, ino, ORPHAN_INO); } -static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) +static int recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) { - struct inode *inode = f2fs_iget(sbi->sb, ino); - f2fs_bug_on(sbi, IS_ERR(inode)); + struct inode *inode; + + inode = f2fs_iget(sbi->sb, ino); + if (IS_ERR(inode)) { + /* + * there should be a bug that we can't find the entry + * to orphan inode. + */ + f2fs_bug_on(sbi, PTR_ERR(inode) == -ENOENT); + return PTR_ERR(inode); + } + clear_nlink(inode); /* truncate all the data during iput */ iput(inode); + return 0; } -void recover_orphan_inodes(struct f2fs_sb_info *sbi) +int recover_orphan_inodes(struct f2fs_sb_info *sbi) { block_t start_blk, orphan_blocks, i, j; + int err; if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG)) - return; - - set_sbi_flag(sbi, SBI_POR_DOING); + return 0; start_blk = __start_cp_addr(sbi) + 1 + __cp_payload(sbi); orphan_blocks = __start_sum_addr(sbi) - 1 - __cp_payload(sbi); - ra_meta_pages(sbi, start_blk, orphan_blocks, META_CP); + ra_meta_pages(sbi, start_blk, orphan_blocks, META_CP, true); for (i = 0; i < orphan_blocks; i++) { struct page *page = get_meta_page(sbi, start_blk + i); @@ -484,14 +568,17 @@ void recover_orphan_inodes(struct f2fs_sb_info *sbi) orphan_blk = (struct f2fs_orphan_block *)page_address(page); for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) { nid_t ino = le32_to_cpu(orphan_blk->ino[j]); - recover_orphan_inode(sbi, ino); + err = recover_orphan_inode(sbi, ino); + if (err) { + f2fs_put_page(page, 1); + return err; + } } f2fs_put_page(page, 1); } /* clear Orphan Flag */ clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG); - clear_sbi_flag(sbi, SBI_POR_DOING); - return; + return 0; } static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) @@ -499,7 +586,7 @@ static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) struct list_head *head; struct f2fs_orphan_block *orphan_blk = NULL; unsigned int nentries = 0; - unsigned short index; + unsigned short index = 1; unsigned short orphan_blocks; struct page *page = NULL; struct ino_entry *orphan = NULL; @@ -507,22 +594,20 @@ static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) orphan_blocks = GET_ORPHAN_BLOCKS(im->ino_num); - for (index = 0; index < 
orphan_blocks; index++) - grab_meta_page(sbi, start_blk + index); - - index = 1; - spin_lock(&im->ino_lock); + /* + * we don't need to do spin_lock(&im->ino_lock) here, since all the + * orphan inode operations are covered under f2fs_lock_op(). + * And, spin_lock should be avoided due to page operations below. + */ head = &im->ino_list; /* loop for each orphan inode entry and write them in Jornal block */ list_for_each_entry(orphan, head, list) { if (!page) { - page = find_get_page(META_MAPPING(sbi), start_blk++); - f2fs_bug_on(sbi, !page); + page = grab_meta_page(sbi, start_blk++); orphan_blk = (struct f2fs_orphan_block *)page_address(page); memset(orphan_blk, 0, sizeof(*orphan_blk)); - f2fs_put_page(page, 0); } orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino); @@ -551,8 +636,6 @@ static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) set_page_dirty(page); f2fs_put_page(page, 1); } - - spin_unlock(&im->ino_lock); } static struct page *validate_checkpoint(struct f2fs_sb_info *sbi, @@ -650,6 +733,10 @@ int get_valid_checkpoint(struct f2fs_sb_info *sbi) cp_block = (struct f2fs_checkpoint *)page_address(cur_page); memcpy(sbi->ckpt, cp_block, blk_size); + /* Sanity checking of checkpoint */ + if (sanity_check_ckpt(sbi)) + goto fail_no_cp; + if (cp_blks <= 1) goto done; @@ -676,117 +763,94 @@ int get_valid_checkpoint(struct f2fs_sb_info *sbi) return -EINVAL; } -static int __add_dirty_inode(struct inode *inode, struct inode_entry *new) +static void __add_dirty_inode(struct inode *inode, enum inode_type type) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + int flag = (type == DIR_INODE) ? FI_DIRTY_DIR : FI_DIRTY_FILE; - if (is_inode_flag_set(F2FS_I(inode), FI_DIRTY_DIR)) - return -EEXIST; + if (is_inode_flag_set(inode, flag)) + return; - set_inode_flag(F2FS_I(inode), FI_DIRTY_DIR); - F2FS_I(inode)->dirty_dir = new; - list_add_tail(&new->list, &sbi->dir_inode_list); - stat_inc_dirty_dir(sbi); - return 0; + set_inode_flag(inode, flag); + list_add_tail(&F2FS_I(inode)->dirty_list, &sbi->inode_list[type]); + stat_inc_dirty_inode(sbi, type); } -void update_dirty_page(struct inode *inode, struct page *page) +static void __remove_dirty_inode(struct inode *inode, enum inode_type type) { - struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct inode_entry *new; - int ret = 0; + int flag = (type == DIR_INODE) ? FI_DIRTY_DIR : FI_DIRTY_FILE; - if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) + if (get_dirty_pages(inode) || !is_inode_flag_set(inode, flag)) return; - if (!S_ISDIR(inode->i_mode)) { - inode_inc_dirty_pages(inode); - goto out; - } - - new = f2fs_kmem_cache_alloc(inode_entry_slab, GFP_NOFS); - new->inode = inode; - INIT_LIST_HEAD(&new->list); - - spin_lock(&sbi->dir_inode_lock); - ret = __add_dirty_inode(inode, new); - inode_inc_dirty_pages(inode); - spin_unlock(&sbi->dir_inode_lock); - - if (ret) - kmem_cache_free(inode_entry_slab, new); -out: - SetPagePrivate(page); - f2fs_trace_pid(page); + list_del_init(&F2FS_I(inode)->dirty_list); + clear_inode_flag(inode, flag); + stat_dec_dirty_inode(F2FS_I_SB(inode), type); } -void add_dirty_dir_inode(struct inode *inode) +void update_dirty_page(struct inode *inode, struct page *page) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct inode_entry *new = - f2fs_kmem_cache_alloc(inode_entry_slab, GFP_NOFS); - int ret = 0; + enum inode_type type = S_ISDIR(inode->i_mode) ? 
DIR_INODE : FILE_INODE; - new->inode = inode; - INIT_LIST_HEAD(&new->list); + if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode) && + !S_ISLNK(inode->i_mode)) + return; - spin_lock(&sbi->dir_inode_lock); - ret = __add_dirty_inode(inode, new); - spin_unlock(&sbi->dir_inode_lock); + spin_lock(&sbi->inode_lock[type]); + if (type != FILE_INODE || test_opt(sbi, DATA_FLUSH)) + __add_dirty_inode(inode, type); + inode_inc_dirty_pages(inode); + spin_unlock(&sbi->inode_lock[type]); - if (ret) - kmem_cache_free(inode_entry_slab, new); + SetPagePrivate(page); + f2fs_trace_pid(page); } -void remove_dirty_dir_inode(struct inode *inode) +void remove_dirty_inode(struct inode *inode) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct inode_entry *entry; + enum inode_type type = S_ISDIR(inode->i_mode) ? DIR_INODE : FILE_INODE; - if (!S_ISDIR(inode->i_mode)) + if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode) && + !S_ISLNK(inode->i_mode)) return; - spin_lock(&sbi->dir_inode_lock); - if (get_dirty_pages(inode) || - !is_inode_flag_set(F2FS_I(inode), FI_DIRTY_DIR)) { - spin_unlock(&sbi->dir_inode_lock); + if (type == FILE_INODE && !test_opt(sbi, DATA_FLUSH)) return; - } - entry = F2FS_I(inode)->dirty_dir; - list_del(&entry->list); - F2FS_I(inode)->dirty_dir = NULL; - clear_inode_flag(F2FS_I(inode), FI_DIRTY_DIR); - stat_dec_dirty_dir(sbi); - spin_unlock(&sbi->dir_inode_lock); - kmem_cache_free(inode_entry_slab, entry); - - /* Only from the recovery routine */ - if (is_inode_flag_set(F2FS_I(inode), FI_DELAY_IPUT)) { - clear_inode_flag(F2FS_I(inode), FI_DELAY_IPUT); - iput(inode); - } + spin_lock(&sbi->inode_lock[type]); + __remove_dirty_inode(inode, type); + spin_unlock(&sbi->inode_lock[type]); } -void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi) +int sync_dirty_inodes(struct f2fs_sb_info *sbi, enum inode_type type) { struct list_head *head; - struct inode_entry *entry; struct inode *inode; + struct f2fs_inode_info *fi; + bool is_dir = (type == DIR_INODE); + + trace_f2fs_sync_dirty_inodes_enter(sbi->sb, is_dir, + get_pages(sbi, is_dir ? + F2FS_DIRTY_DENTS : F2FS_DIRTY_DATA)); retry: if (unlikely(f2fs_cp_error(sbi))) - return; + return -EIO; - spin_lock(&sbi->dir_inode_lock); + spin_lock(&sbi->inode_lock[type]); - head = &sbi->dir_inode_list; + head = &sbi->inode_list[type]; if (list_empty(head)) { - spin_unlock(&sbi->dir_inode_lock); - return; + spin_unlock(&sbi->inode_lock[type]); + trace_f2fs_sync_dirty_inodes_exit(sbi->sb, is_dir, + get_pages(sbi, is_dir ? 
+ F2FS_DIRTY_DENTS : F2FS_DIRTY_DATA)); + return 0; } - entry = list_entry(head->next, struct inode_entry, list); - inode = igrab(entry->inode); - spin_unlock(&sbi->dir_inode_lock); + fi = list_entry(head->next, struct f2fs_inode_info, dirty_list); + inode = igrab(&fi->vfs_inode); + spin_unlock(&sbi->inode_lock[type]); if (inode) { filemap_fdatawrite(inode->i_mapping); iput(inode); @@ -801,6 +865,34 @@ void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi) goto retry; } +int f2fs_sync_inode_meta(struct f2fs_sb_info *sbi) +{ + struct list_head *head = &sbi->inode_list[DIRTY_META]; + struct inode *inode; + struct f2fs_inode_info *fi; + s64 total = get_pages(sbi, F2FS_DIRTY_IMETA); + + while (total--) { + if (unlikely(f2fs_cp_error(sbi))) + return -EIO; + + spin_lock(&sbi->inode_lock[DIRTY_META]); + if (list_empty(head)) { + spin_unlock(&sbi->inode_lock[DIRTY_META]); + return 0; + } + fi = list_entry(head->next, struct f2fs_inode_info, + gdirty_list); + inode = igrab(&fi->vfs_inode); + spin_unlock(&sbi->inode_lock[DIRTY_META]); + if (inode) { + update_inode_page(inode); + iput(inode); + } + }; + return 0; +} + /* * Freeze all the FS-operations for checkpoint. */ @@ -821,11 +913,17 @@ static int block_operations(struct f2fs_sb_info *sbi) /* write all the dirty dentry pages */ if (get_pages(sbi, F2FS_DIRTY_DENTS)) { f2fs_unlock_all(sbi); - sync_dirty_dir_inodes(sbi); - if (unlikely(f2fs_cp_error(sbi))) { - err = -EIO; + err = sync_dirty_inodes(sbi, DIR_INODE); + if (err) + goto out; + goto retry_flush_dents; + } + + if (get_pages(sbi, F2FS_DIRTY_IMETA)) { + f2fs_unlock_all(sbi); + err = f2fs_sync_inode_meta(sbi); + if (err) goto out; - } goto retry_flush_dents; } @@ -838,10 +936,9 @@ static int block_operations(struct f2fs_sb_info *sbi) if (get_pages(sbi, F2FS_DIRTY_NODES)) { up_write(&sbi->node_write); - sync_node_pages(sbi, 0, &wbc); - if (unlikely(f2fs_cp_error(sbi))) { + err = sync_node_pages(sbi, &wbc); + if (err) { f2fs_unlock_all(sbi); - err = -EIO; goto out; } goto retry_flush_nodes; @@ -854,6 +951,8 @@ static int block_operations(struct f2fs_sb_info *sbi) static void unblock_operations(struct f2fs_sb_info *sbi) { up_write(&sbi->node_write); + + build_free_nids(sbi); f2fs_unlock_all(sbi); } @@ -864,15 +963,15 @@ static void wait_on_all_pages_writeback(struct f2fs_sb_info *sbi) for (;;) { prepare_to_wait(&sbi->cp_wait, &wait, TASK_UNINTERRUPTIBLE); - if (!get_pages(sbi, F2FS_WRITEBACK)) + if (!atomic_read(&sbi->nr_wb_bios)) break; - io_schedule(); + io_schedule_timeout(5*HZ); } finish_wait(&sbi->cp_wait, &wait); } -static void do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) +static int do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) { struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_WARM_NODE); @@ -880,24 +979,28 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) unsigned long orphan_num = sbi->im[ORPHAN_INO].ino_num; nid_t last_nid = nm_i->next_scan_nid; block_t start_blk; - struct page *cp_page; unsigned int data_sum_blocks, orphan_blocks; __u32 crc32 = 0; - void *kaddr; int i; int cp_payload_blks = __cp_payload(sbi); + block_t discard_blk = NEXT_FREE_BLKADDR(sbi, curseg); + bool invalidate = false; + struct super_block *sb = sbi->sb; + struct curseg_info *seg_i = CURSEG_I(sbi, CURSEG_HOT_NODE); + u64 kbytes_written; /* * This avoids to conduct wrong roll-forward operations and uses * metapages, so should be called prior to sync_meta_pages below. 
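 * Note: with this patch the discard is attempted only in non-LFS mode;
 * when the device cannot discard, the block is zeroed through a meta
 * page instead, and that page is invalidated once the checkpoint
 * completes (see the 'invalidate' handling further down).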
*/ - discard_next_dnode(sbi, NEXT_FREE_BLKADDR(sbi, curseg)); + if (!test_opt(sbi, LFS) && discard_next_dnode(sbi, discard_blk)) + invalidate = true; /* Flush all the NAT/SIT pages */ while (get_pages(sbi, F2FS_DIRTY_META)) { sync_meta_pages(sbi, META, LONG_MAX); if (unlikely(f2fs_cp_error(sbi))) - return; + return -EIO; } next_free_nid(sbi, &last_nid); @@ -979,20 +1082,17 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) start_blk = __start_cp_addr(sbi); + /* need to wait for end_io results */ + wait_on_all_pages_writeback(sbi); + if (unlikely(f2fs_cp_error(sbi))) + return -EIO; + /* write out checkpoint buffer at block 0 */ - cp_page = grab_meta_page(sbi, start_blk++); - kaddr = page_address(cp_page); - memcpy(kaddr, ckpt, F2FS_BLKSIZE); - set_page_dirty(cp_page); - f2fs_put_page(cp_page, 1); - - for (i = 1; i < 1 + cp_payload_blks; i++) { - cp_page = grab_meta_page(sbi, start_blk++); - kaddr = page_address(cp_page); - memcpy(kaddr, (char *)ckpt + i * F2FS_BLKSIZE, F2FS_BLKSIZE); - set_page_dirty(cp_page); - f2fs_put_page(cp_page, 1); - } + update_meta_page(sbi, ckpt, start_blk++); + + for (i = 1; i < 1 + cp_payload_blks; i++) + update_meta_page(sbi, (char *)ckpt + i * F2FS_BLKSIZE, + start_blk++); if (orphan_num) { write_orphan_inodes(sbi, start_blk); @@ -1001,30 +1101,34 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) write_data_summaries(sbi, start_blk); start_blk += data_sum_blocks; + + /* Record write statistics in the hot node summary */ + kbytes_written = sbi->kbytes_written; + if (sb->s_bdev->bd_part) + kbytes_written += BD_PART_WRITTEN(sbi); + + seg_i->journal->info.kbytes_written = cpu_to_le64(kbytes_written); + if (__remain_node_summaries(cpc->reason)) { write_node_summaries(sbi, start_blk); start_blk += NR_CURSEG_NODE_TYPE; } /* writeout checkpoint block */ - cp_page = grab_meta_page(sbi, start_blk); - kaddr = page_address(cp_page); - memcpy(kaddr, ckpt, F2FS_BLKSIZE); - set_page_dirty(cp_page); - f2fs_put_page(cp_page, 1); + update_meta_page(sbi, ckpt, start_blk); /* wait for previous submitted node/meta pages writeback */ wait_on_all_pages_writeback(sbi); if (unlikely(f2fs_cp_error(sbi))) - return; + return -EIO; - filemap_fdatawait_range(NODE_MAPPING(sbi), 0, LONG_MAX); - filemap_fdatawait_range(META_MAPPING(sbi), 0, LONG_MAX); + filemap_fdatawait_range(NODE_MAPPING(sbi), 0, LLONG_MAX); + filemap_fdatawait_range(META_MAPPING(sbi), 0, LLONG_MAX); /* update user_block_counts */ sbi->last_valid_block_count = sbi->total_valid_block_count; - sbi->alloc_valid_block_count = 0; + percpu_counter_set(&sbi->alloc_valid_block_count, 0); /* Here, we only have one bio having CP pack */ sync_meta_pages(sbi, META_FLUSH, LONG_MAX); @@ -1032,43 +1136,58 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) /* wait for previous submitted meta pages writeback */ wait_on_all_pages_writeback(sbi); - release_dirty_inode(sbi); + /* + * invalidate meta page which is used temporarily for zeroing out + * block at the end of warm node chain. + */ + if (invalidate) + invalidate_mapping_pages(META_MAPPING(sbi), discard_blk, + discard_blk); + + release_ino_entry(sbi, false); if (unlikely(f2fs_cp_error(sbi))) - return; + return -EIO; - clear_prefree_segments(sbi); + clear_prefree_segments(sbi, cpc); clear_sbi_flag(sbi, SBI_IS_DIRTY); + + return 0; } /* * We guarantee that this checkpoint procedure will not fail. 
*/ -void write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) +int write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) { struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); unsigned long long ckpt_ver; + int err = 0; mutex_lock(&sbi->cp_mutex); if (!is_sbi_flag_set(sbi, SBI_IS_DIRTY) && - (cpc->reason == CP_FASTBOOT || cpc->reason == CP_SYNC)) + (cpc->reason == CP_FASTBOOT || cpc->reason == CP_SYNC || + (cpc->reason == CP_DISCARD && !sbi->discard_blks))) goto out; - if (unlikely(f2fs_cp_error(sbi))) + if (unlikely(f2fs_cp_error(sbi))) { + err = -EIO; goto out; - if (f2fs_readonly(sbi->sb)) + } + if (f2fs_readonly(sbi->sb)) { + err = -EROFS; goto out; + } trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "start block_ops"); - if (block_operations(sbi)) + err = block_operations(sbi); + if (err) goto out; trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "finish block_ops"); - f2fs_submit_merged_bio(sbi, DATA, WRITE); - f2fs_submit_merged_bio(sbi, NODE, WRITE); - f2fs_submit_merged_bio(sbi, META, WRITE); + f2fs_flush_merged_bios(sbi); /* * update checkpoint pack index @@ -1083,7 +1202,7 @@ void write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) flush_sit_entries(sbi, cpc); /* unlock all the fs_lock[] in do_checkpoint() */ - do_checkpoint(sbi, cpc); + err = do_checkpoint(sbi, cpc); unblock_operations(sbi); stat_inc_cp_count(sbi->stat_info); @@ -1091,9 +1210,13 @@ void write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) if (cpc->reason == CP_RECOVERY) f2fs_msg(sbi->sb, KERN_NOTICE, "checkpoint: version = %llx", ckpt_ver); + + /* do checkpoint periodically */ + f2fs_update_time(sbi, CP_TIME); + trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "finish checkpoint"); out: mutex_unlock(&sbi->cp_mutex); - trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "finish checkpoint"); + return err; } void init_ino_entry_info(struct f2fs_sb_info *sbi) diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c old mode 100644 new mode 100755 index aa1078a7669c..ed7f4604e02e --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -12,12 +12,17 @@ #include #include #include +#include #include #include +#include #include #include #include #include +#include +#include +#include #include "f2fs.h" #include "node.h" @@ -25,19 +30,26 @@ #include "trace.h" #include -static struct kmem_cache *extent_tree_slab; -static struct kmem_cache *extent_node_slab; - static void f2fs_read_end_io(struct bio *bio, int err) { struct bio_vec *bvec; int i; + if (f2fs_bio_encrypted(bio)) { + if (err) { + fscrypt_release_ctx(bio->bi_private); + } else { + fscrypt_decrypt_bio_pages(bio->bi_private, bio); + return; + } + } + __bio_for_each_segment(bvec, bio, i, 0) { struct page *page = bvec->bv_page; if (!err) { - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); } else { ClearPageUptodate(page); SetPageError(page); @@ -56,17 +68,16 @@ static void f2fs_write_end_io(struct bio *bio, int err) __bio_for_each_segment(bvec, bio, i, 0) { struct page *page = bvec->bv_page; + fscrypt_pullback_bio_page(&page, true); + if (unlikely(err)) { - set_page_dirty(page); set_bit(AS_EIO, &page->mapping->flags); - f2fs_stop_checkpoint(sbi); + f2fs_stop_checkpoint(sbi, true); } end_page_writeback(page); - dec_page_count(sbi, F2FS_WRITEBACK); } - - if (!get_pages(sbi, F2FS_WRITEBACK) && - !list_empty(&sbi->cp_wait.task_list)) + if (atomic_dec_and_test(&sbi->nr_wb_bios) && + wq_has_sleeper(&sbi->cp_wait)) wake_up(&sbi->cp_wait); bio_put(bio); @@ -80,17 +91,28 @@ static struct bio *__bio_alloc(struct 
f2fs_sb_info *sbi, block_t blk_addr, { struct bio *bio; - /* No failure on bio allocation */ - bio = bio_alloc(GFP_NOIO, npages); + bio = f2fs_bio_alloc(npages); bio->bi_bdev = sbi->sb->s_bdev; bio->bi_sector = SECTOR_FROM_BLOCK(blk_addr); bio->bi_end_io = is_read ? f2fs_read_end_io : f2fs_write_end_io; - bio->bi_private = sbi; + bio->bi_private = is_read ? NULL : sbi; return bio; } +static inline void __submit_bio(struct f2fs_sb_info *sbi, int rw, + struct bio *bio, enum page_type type) +{ + if (!is_read_io(rw)) { + atomic_inc(&sbi->nr_wb_bios); + if (f2fs_sb_mounted_hmsmr(sbi->sb) && + current->plug && (type == DATA || type == NODE)) + blk_finish_plug(current->plug); + } + submit_bio(rw, bio); +} + static void __submit_merged_bio(struct f2fs_bio_info *io) { struct f2fs_io_info *fio = &io->fio; @@ -103,12 +125,58 @@ static void __submit_merged_bio(struct f2fs_bio_info *io) else trace_f2fs_submit_write_bio(io->sbi->sb, fio, io->bio); - submit_bio(fio->rw, io->bio); + __submit_bio(io->sbi, fio->rw, io->bio, fio->type); io->bio = NULL; } -void f2fs_submit_merged_bio(struct f2fs_sb_info *sbi, - enum page_type type, int rw) +static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode, + struct page *page, nid_t ino) +{ + struct bio_vec *bvec; + struct page *target; + int i; + + if (!io->bio) + return false; + + if (!inode && !page && !ino) + return true; + + __bio_for_each_segment(bvec, io->bio, i, 0) { + + if (bvec->bv_page->mapping) + target = bvec->bv_page; + else + target = fscrypt_control_page(bvec->bv_page); + + if (inode && inode == target->mapping->host) + return true; + if (page && page == target) + return true; + if (ino && ino == ino_of_node(target)) + return true; + } + + return false; +} + +static bool has_merged_page(struct f2fs_sb_info *sbi, struct inode *inode, + struct page *page, nid_t ino, + enum page_type type) +{ + enum page_type btype = PAGE_TYPE_OF_BIO(type); + struct f2fs_bio_info *io = &sbi->write_io[btype]; + bool ret; + + down_read(&io->io_rwsem); + ret = __has_merged_page(io, inode, page, ino); + up_read(&io->io_rwsem); + return ret; +} + +static void __f2fs_submit_merged_bio(struct f2fs_sb_info *sbi, + struct inode *inode, struct page *page, + nid_t ino, enum page_type type, int rw) { enum page_type btype = PAGE_TYPE_OF_BIO(type); struct f2fs_bio_info *io; @@ -117,6 +185,9 @@ void f2fs_submit_merged_bio(struct f2fs_sb_info *sbi, down_write(&io->io_rwsem); + if (!__has_merged_page(io, inode, page, ino)) + goto out; + /* change META to META_FLUSH in the checkpoint procedure */ if (type >= META_FLUSH) { io->fio.type = META_FLUSH; @@ -126,72 +197,107 @@ void f2fs_submit_merged_bio(struct f2fs_sb_info *sbi, io->fio.rw = WRITE_FLUSH_FUA | REQ_META | REQ_PRIO; } __submit_merged_bio(io); +out: up_write(&io->io_rwsem); } +void f2fs_submit_merged_bio(struct f2fs_sb_info *sbi, enum page_type type, + int rw) +{ + __f2fs_submit_merged_bio(sbi, NULL, NULL, 0, type, rw); +} + +void f2fs_submit_merged_bio_cond(struct f2fs_sb_info *sbi, + struct inode *inode, struct page *page, + nid_t ino, enum page_type type, int rw) +{ + if (has_merged_page(sbi, inode, page, ino, type)) + __f2fs_submit_merged_bio(sbi, inode, page, ino, type, rw); +} + +void f2fs_flush_merged_bios(struct f2fs_sb_info *sbi) +{ + f2fs_submit_merged_bio(sbi, DATA, WRITE); + f2fs_submit_merged_bio(sbi, NODE, WRITE); + f2fs_submit_merged_bio(sbi, META, WRITE); +} + /* * Fill the locked page with data located in the block address. * Return unlocked page. 
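 * With the fscrypt hooks in place, the bio is built over
 * fio->encrypted_page whenever one is attached, so the ciphertext page
 * rather than the pagecache page is what reaches the device.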
*/ -int f2fs_submit_page_bio(struct f2fs_sb_info *sbi, struct page *page, - struct f2fs_io_info *fio) +int f2fs_submit_page_bio(struct f2fs_io_info *fio) { struct bio *bio; + struct page *page = fio->encrypted_page ? + fio->encrypted_page : fio->page; trace_f2fs_submit_page_bio(page, fio); - f2fs_trace_ios(page, fio, 0); + f2fs_trace_ios(fio, 0); /* Allocate a new bio */ - bio = __bio_alloc(sbi, fio->blk_addr, 1, is_read_io(fio->rw)); + bio = __bio_alloc(fio->sbi, fio->new_blkaddr, 1, is_read_io(fio->rw)); - if (bio_add_page(bio, page, PAGE_CACHE_SIZE, 0) < PAGE_CACHE_SIZE) { + if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) { bio_put(bio); - f2fs_put_page(page, 1); return -EFAULT; } - submit_bio(fio->rw, bio); + __submit_bio(fio->sbi, fio->rw, bio, fio->type); return 0; } -void f2fs_submit_page_mbio(struct f2fs_sb_info *sbi, struct page *page, - struct f2fs_io_info *fio) +void f2fs_submit_page_mbio(struct f2fs_io_info *fio) { + struct f2fs_sb_info *sbi = fio->sbi; enum page_type btype = PAGE_TYPE_OF_BIO(fio->type); struct f2fs_bio_info *io; bool is_read = is_read_io(fio->rw); + struct page *bio_page; io = is_read ? &sbi->read_io : &sbi->write_io[btype]; - verify_block_addr(sbi, fio->blk_addr); + if (fio->old_blkaddr != NEW_ADDR) + verify_block_addr(sbi, fio->old_blkaddr); + verify_block_addr(sbi, fio->new_blkaddr); down_write(&io->io_rwsem); - if (!is_read) - inc_page_count(sbi, F2FS_WRITEBACK); - - if (io->bio && (io->last_block_in_bio != fio->blk_addr - 1 || + if (io->bio && (io->last_block_in_bio != fio->new_blkaddr - 1 || io->fio.rw != fio->rw)) __submit_merged_bio(io); alloc_new: if (io->bio == NULL) { int bio_blocks = MAX_BIO_BLOCKS(sbi); - io->bio = __bio_alloc(sbi, fio->blk_addr, bio_blocks, is_read); + io->bio = __bio_alloc(sbi, fio->new_blkaddr, + bio_blocks, is_read); io->fio = *fio; } - if (bio_add_page(io->bio, page, PAGE_CACHE_SIZE, 0) < - PAGE_CACHE_SIZE) { + bio_page = fio->encrypted_page ? 
fio->encrypted_page : fio->page; + + if (bio_add_page(io->bio, bio_page, PAGE_SIZE, 0) < + PAGE_SIZE) { __submit_merged_bio(io); goto alloc_new; } - io->last_block_in_bio = fio->blk_addr; - f2fs_trace_ios(page, fio, 0); + io->last_block_in_bio = fio->new_blkaddr; + f2fs_trace_ios(fio, 0); up_write(&io->io_rwsem); - trace_f2fs_submit_page_mbio(page, fio); + trace_f2fs_submit_page_mbio(fio->page, fio); +} + +static void __set_data_blkaddr(struct dnode_of_data *dn) +{ + struct f2fs_node *rn = F2FS_NODE(dn->node_page); + __le32 *addr_array; + + /* Get physical address of data block */ + addr_array = blkaddr_in_node(rn); + addr_array[dn->ofs_in_node] = cpu_to_le32(dn->data_blkaddr); } /* @@ -202,39 +308,63 @@ void f2fs_submit_page_mbio(struct f2fs_sb_info *sbi, struct page *page, */ void set_data_blkaddr(struct dnode_of_data *dn) { - struct f2fs_node *rn; - __le32 *addr_array; - struct page *node_page = dn->node_page; - unsigned int ofs_in_node = dn->ofs_in_node; - - f2fs_wait_on_page_writeback(node_page, NODE); - - rn = F2FS_NODE(node_page); + f2fs_wait_on_page_writeback(dn->node_page, NODE, true); + __set_data_blkaddr(dn); + if (set_page_dirty(dn->node_page)) + dn->node_changed = true; +} - /* Get physical address of data block */ - addr_array = blkaddr_in_node(rn); - addr_array[ofs_in_node] = cpu_to_le32(dn->data_blkaddr); - set_page_dirty(node_page); +void f2fs_update_data_blkaddr(struct dnode_of_data *dn, block_t blkaddr) +{ + dn->data_blkaddr = blkaddr; + set_data_blkaddr(dn); + f2fs_update_extent_cache(dn); } -int reserve_new_block(struct dnode_of_data *dn) +/* dn->ofs_in_node will be returned with up-to-date last block pointer */ +int reserve_new_blocks(struct dnode_of_data *dn, blkcnt_t count) { struct f2fs_sb_info *sbi = F2FS_I_SB(dn->inode); - if (unlikely(is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC))) + if (!count) + return 0; + + if (unlikely(is_inode_flag_set(dn->inode, FI_NO_ALLOC))) return -EPERM; - if (unlikely(!inc_valid_block_count(sbi, dn->inode, 1))) + if (unlikely(!inc_valid_block_count(sbi, dn->inode, &count))) return -ENOSPC; - trace_f2fs_reserve_new_block(dn->inode, dn->nid, dn->ofs_in_node); + trace_f2fs_reserve_new_blocks(dn->inode, dn->nid, + dn->ofs_in_node, count); - dn->data_blkaddr = NEW_ADDR; - set_data_blkaddr(dn); - mark_inode_dirty(dn->inode); - sync_inode_page(dn); + f2fs_wait_on_page_writeback(dn->node_page, NODE, true); + + for (; count > 0; dn->ofs_in_node++) { + block_t blkaddr = + datablock_addr(dn->node_page, dn->ofs_in_node); + if (blkaddr == NULL_ADDR) { + dn->data_blkaddr = NEW_ADDR; + __set_data_blkaddr(dn); + count--; + } + } + + if (set_page_dirty(dn->node_page)) + dn->node_changed = true; return 0; } +/* Should keep dn->ofs_in_node unchanged */ +int reserve_new_block(struct dnode_of_data *dn) +{ + unsigned int ofs_in_node = dn->ofs_in_node; + int ret; + + ret = reserve_new_blocks(dn, 1); + dn->ofs_in_node = ofs_in_node; + return ret; +} + int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index) { bool need_put = dn->inode_page ? 
false : true; @@ -251,1092 +381,740 @@ int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index) return err; } -static void f2fs_map_bh(struct super_block *sb, pgoff_t pgofs, - struct extent_info *ei, struct buffer_head *bh_result) +int f2fs_get_block(struct dnode_of_data *dn, pgoff_t index) { - unsigned int blkbits = sb->s_blocksize_bits; - size_t max_size = bh_result->b_size; - size_t mapped_size; - - clear_buffer_new(bh_result); - map_bh(bh_result, sb, ei->blk + pgofs - ei->fofs); - mapped_size = (ei->fofs + ei->len - pgofs) << blkbits; - bh_result->b_size = min(max_size, mapped_size); -} - -static bool lookup_extent_info(struct inode *inode, pgoff_t pgofs, - struct extent_info *ei) -{ - struct f2fs_inode_info *fi = F2FS_I(inode); - pgoff_t start_fofs, end_fofs; - block_t start_blkaddr; + struct extent_info ei; + struct inode *inode = dn->inode; - read_lock(&fi->ext_lock); - if (fi->ext.len == 0) { - read_unlock(&fi->ext_lock); - return false; + if (f2fs_lookup_extent_cache(inode, index, &ei)) { + dn->data_blkaddr = ei.blk + index - ei.fofs; + return 0; } - stat_inc_total_hit(inode->i_sb); - - start_fofs = fi->ext.fofs; - end_fofs = fi->ext.fofs + fi->ext.len - 1; - start_blkaddr = fi->ext.blk; - - if (pgofs >= start_fofs && pgofs <= end_fofs) { - *ei = fi->ext; - stat_inc_read_hit(inode->i_sb); - read_unlock(&fi->ext_lock); - return true; - } - read_unlock(&fi->ext_lock); - return false; + return f2fs_reserve_block(dn, index); } -static bool update_extent_info(struct inode *inode, pgoff_t fofs, - block_t blkaddr) +struct page *get_read_data_page(struct inode *inode, pgoff_t index, + int rw, bool for_write) { - struct f2fs_inode_info *fi = F2FS_I(inode); - pgoff_t start_fofs, end_fofs; - block_t start_blkaddr, end_blkaddr; - int need_update = true; - - write_lock(&fi->ext_lock); - - start_fofs = fi->ext.fofs; - end_fofs = fi->ext.fofs + fi->ext.len - 1; - start_blkaddr = fi->ext.blk; - end_blkaddr = fi->ext.blk + fi->ext.len - 1; - - /* Drop and initialize the matched extent */ - if (fi->ext.len == 1 && fofs == start_fofs) - fi->ext.len = 0; - - /* Initial extent */ - if (fi->ext.len == 0) { - if (blkaddr != NULL_ADDR) { - fi->ext.fofs = fofs; - fi->ext.blk = blkaddr; - fi->ext.len = 1; - } - goto end_update; - } + struct address_space *mapping = inode->i_mapping; + struct dnode_of_data dn; + struct page *page; + struct extent_info ei; + int err; + struct f2fs_io_info fio = { + .sbi = F2FS_I_SB(inode), + .type = DATA, + .rw = rw, + .encrypted_page = NULL, + }; - /* Front merge */ - if (fofs == start_fofs - 1 && blkaddr == start_blkaddr - 1) { - fi->ext.fofs--; - fi->ext.blk--; - fi->ext.len++; - goto end_update; - } + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) + return read_mapping_page(mapping, index, NULL); - /* Back merge */ - if (fofs == end_fofs + 1 && blkaddr == end_blkaddr + 1) { - fi->ext.len++; - goto end_update; - } + page = f2fs_grab_cache_page(mapping, index, for_write); + if (!page) + return ERR_PTR(-ENOMEM); - /* Split the existing extent */ - if (fi->ext.len > 1 && - fofs >= start_fofs && fofs <= end_fofs) { - if ((end_fofs - fofs) < (fi->ext.len >> 1)) { - fi->ext.len = fofs - start_fofs; - } else { - fi->ext.fofs = fofs + 1; - fi->ext.blk = start_blkaddr + fofs - start_fofs + 1; - fi->ext.len -= fofs - start_fofs + 1; - } - } else { - need_update = false; + if (f2fs_lookup_extent_cache(inode, index, &ei)) { + dn.data_blkaddr = ei.blk + index - ei.fofs; + goto got_it; } - /* Finally, if the extent is very fragmented, let's drop the cache. 
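 * (This per-inode single-extent cache, together with the rb-tree based
 * extent tree code removed below, moves out of data.c; the Makefile
 * hunk above starts building extent_cache.o in its place.)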
*/ - if (fi->ext.len < F2FS_MIN_EXTENT_LEN) { - fi->ext.len = 0; - set_inode_flag(fi, FI_NO_EXTENT); - need_update = true; - } -end_update: - write_unlock(&fi->ext_lock); - return need_update; -} + set_new_dnode(&dn, inode, NULL, NULL, 0); + err = get_dnode_of_data(&dn, index, LOOKUP_NODE); + if (err) + goto put_err; + f2fs_put_dnode(&dn); -static struct extent_node *__attach_extent_node(struct f2fs_sb_info *sbi, - struct extent_tree *et, struct extent_info *ei, - struct rb_node *parent, struct rb_node **p) -{ - struct extent_node *en; + if (unlikely(dn.data_blkaddr == NULL_ADDR)) { + err = -ENOENT; + goto put_err; + } +got_it: + if (PageUptodate(page)) { + unlock_page(page); + return page; + } - en = kmem_cache_alloc(extent_node_slab, GFP_ATOMIC); - if (!en) - return NULL; + /* + * A new dentry page is allocated but not able to be written, since its + * new inode page couldn't be allocated due to -ENOSPC. + * In such the case, its blkaddr can be remained as NEW_ADDR. + * see, f2fs_add_link -> get_new_data_page -> init_inode_metadata. + */ + if (dn.data_blkaddr == NEW_ADDR) { + zero_user_segment(page, 0, PAGE_SIZE); + if (!PageUptodate(page)) + SetPageUptodate(page); + unlock_page(page); + return page; + } - en->ei = *ei; - INIT_LIST_HEAD(&en->list); + fio.new_blkaddr = fio.old_blkaddr = dn.data_blkaddr; + fio.page = page; + err = f2fs_submit_page_bio(&fio); + if (err) + goto put_err; + return page; - rb_link_node(&en->rb_node, parent, p); - rb_insert_color(&en->rb_node, &et->root); - et->count++; - atomic_inc(&sbi->total_ext_node); - return en; +put_err: + f2fs_put_page(page, 1); + return ERR_PTR(err); } -static void __detach_extent_node(struct f2fs_sb_info *sbi, - struct extent_tree *et, struct extent_node *en) +struct page *find_data_page(struct inode *inode, pgoff_t index) { - rb_erase(&en->rb_node, &et->root); - et->count--; - atomic_dec(&sbi->total_ext_node); - - if (et->cached_en == en) - et->cached_en = NULL; -} + struct address_space *mapping = inode->i_mapping; + struct page *page; -static struct extent_tree *__find_extent_tree(struct f2fs_sb_info *sbi, - nid_t ino) -{ - struct extent_tree *et; + page = find_get_page(mapping, index); + if (page && PageUptodate(page)) + return page; + f2fs_put_page(page, 0); - down_read(&sbi->extent_tree_lock); - et = radix_tree_lookup(&sbi->extent_tree_root, ino); - if (!et) { - up_read(&sbi->extent_tree_lock); - return NULL; - } - atomic_inc(&et->refcount); - up_read(&sbi->extent_tree_lock); + page = get_read_data_page(inode, index, READ_SYNC, false); + if (IS_ERR(page)) + return page; - return et; -} + if (PageUptodate(page)) + return page; -static struct extent_tree *__grab_extent_tree(struct inode *inode) -{ - struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct extent_tree *et; - nid_t ino = inode->i_ino; - - down_write(&sbi->extent_tree_lock); - et = radix_tree_lookup(&sbi->extent_tree_root, ino); - if (!et) { - et = f2fs_kmem_cache_alloc(extent_tree_slab, GFP_NOFS); - f2fs_radix_tree_insert(&sbi->extent_tree_root, ino, et); - memset(et, 0, sizeof(struct extent_tree)); - et->ino = ino; - et->root = RB_ROOT; - et->cached_en = NULL; - rwlock_init(&et->lock); - atomic_set(&et->refcount, 0); - et->count = 0; - sbi->total_ext_tree++; + wait_on_page_locked(page); + if (unlikely(!PageUptodate(page))) { + f2fs_put_page(page, 0); + return ERR_PTR(-EIO); } - atomic_inc(&et->refcount); - up_write(&sbi->extent_tree_lock); - - return et; + return page; } -static struct extent_node *__lookup_extent_tree(struct extent_tree *et, - unsigned int fofs) +/* + 
* If it tries to access a hole, return an error. + * Because, the callers, functions in dir.c and GC, should be able to know + * whether this page exists or not. + */ +struct page *get_lock_data_page(struct inode *inode, pgoff_t index, + bool for_write) { - struct rb_node *node = et->root.rb_node; - struct extent_node *en; - - if (et->cached_en) { - struct extent_info *cei = &et->cached_en->ei; + struct address_space *mapping = inode->i_mapping; + struct page *page; +repeat: + page = get_read_data_page(inode, index, READ_SYNC, for_write); + if (IS_ERR(page)) + return page; - if (cei->fofs <= fofs && cei->fofs + cei->len > fofs) - return et->cached_en; + /* wait for read completion */ + lock_page(page); + if (unlikely(page->mapping != mapping)) { + f2fs_put_page(page, 1); + goto repeat; } - - while (node) { - en = rb_entry(node, struct extent_node, rb_node); - - if (fofs < en->ei.fofs) { - node = node->rb_left; - } else if (fofs >= en->ei.fofs + en->ei.len) { - node = node->rb_right; - } else { - et->cached_en = en; - return en; - } + if (unlikely(!PageUptodate(page))) { + f2fs_put_page(page, 1); + return ERR_PTR(-EIO); } - return NULL; + return page; } -static struct extent_node *__try_back_merge(struct f2fs_sb_info *sbi, - struct extent_tree *et, struct extent_node *en) +/* + * Caller ensures that this data page is never allocated. + * A new zero-filled data page is allocated in the page cache. + * + * Also, caller should grab and release a rwsem by calling f2fs_lock_op() and + * f2fs_unlock_op(). + * Note that, ipage is set only by make_empty_dir, and if any error occur, + * ipage should be released by this function. + */ +struct page *get_new_data_page(struct inode *inode, + struct page *ipage, pgoff_t index, bool new_i_size) { - struct extent_node *prev; - struct rb_node *node; - - node = rb_prev(&en->rb_node); - if (!node) - return NULL; - - prev = rb_entry(node, struct extent_node, rb_node); - if (__is_back_mergeable(&en->ei, &prev->ei)) { - en->ei.fofs = prev->ei.fofs; - en->ei.blk = prev->ei.blk; - en->ei.len += prev->ei.len; - __detach_extent_node(sbi, et, prev); - return prev; - } - return NULL; -} + struct address_space *mapping = inode->i_mapping; + struct page *page; + struct dnode_of_data dn; + int err; -static struct extent_node *__try_front_merge(struct f2fs_sb_info *sbi, - struct extent_tree *et, struct extent_node *en) -{ - struct extent_node *next; - struct rb_node *node; - - node = rb_next(&en->rb_node); - if (!node) - return NULL; - - next = rb_entry(node, struct extent_node, rb_node); - if (__is_front_mergeable(&en->ei, &next->ei)) { - en->ei.len += next->ei.len; - __detach_extent_node(sbi, et, next); - return next; + page = f2fs_grab_cache_page(mapping, index, true); + if (!page) { + /* + * before exiting, we should make sure ipage will be released + * if any error occur. 
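 * (every error path below therefore drops its own page references
 * before returning rather than leaving the ipage pin to the caller)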
+ */ + f2fs_put_page(ipage, 1); + return ERR_PTR(-ENOMEM); } - return NULL; -} -static struct extent_node *__insert_extent_tree(struct f2fs_sb_info *sbi, - struct extent_tree *et, struct extent_info *ei, - struct extent_node **den) -{ - struct rb_node **p = &et->root.rb_node; - struct rb_node *parent = NULL; - struct extent_node *en; - - while (*p) { - parent = *p; - en = rb_entry(parent, struct extent_node, rb_node); - - if (ei->fofs < en->ei.fofs) { - if (__is_front_mergeable(ei, &en->ei)) { - f2fs_bug_on(sbi, !den); - en->ei.fofs = ei->fofs; - en->ei.blk = ei->blk; - en->ei.len += ei->len; - *den = __try_back_merge(sbi, et, en); - return en; - } - p = &(*p)->rb_left; - } else if (ei->fofs >= en->ei.fofs + en->ei.len) { - if (__is_back_mergeable(ei, &en->ei)) { - f2fs_bug_on(sbi, !den); - en->ei.len += ei->len; - *den = __try_front_merge(sbi, et, en); - return en; - } - p = &(*p)->rb_right; - } else { - f2fs_bug_on(sbi, 1); - } + set_new_dnode(&dn, inode, ipage, NULL, 0); + err = f2fs_reserve_block(&dn, index); + if (err) { + f2fs_put_page(page, 1); + return ERR_PTR(err); } + if (!ipage) + f2fs_put_dnode(&dn); - return __attach_extent_node(sbi, et, ei, parent, p); -} + if (PageUptodate(page)) + goto got_it; -static unsigned int __free_extent_tree(struct f2fs_sb_info *sbi, - struct extent_tree *et, bool free_all) -{ - struct rb_node *node, *next; - struct extent_node *en; - unsigned int count = et->count; - - node = rb_first(&et->root); - while (node) { - next = rb_next(node); - en = rb_entry(node, struct extent_node, rb_node); - - if (free_all) { - spin_lock(&sbi->extent_lock); - if (!list_empty(&en->list)) - list_del_init(&en->list); - spin_unlock(&sbi->extent_lock); - } + if (dn.data_blkaddr == NEW_ADDR) { + zero_user_segment(page, 0, PAGE_SIZE); + if (!PageUptodate(page)) + SetPageUptodate(page); + } else { + f2fs_put_page(page, 1); - if (free_all || list_empty(&en->list)) { - __detach_extent_node(sbi, et, en); - kmem_cache_free(extent_node_slab, en); - } - node = next; + /* if ipage exists, blkaddr should be NEW_ADDR */ + f2fs_bug_on(F2FS_I_SB(inode), ipage); + page = get_lock_data_page(inode, index, true); + if (IS_ERR(page)) + return page; } - - return count - et->count; +got_it: + if (new_i_size && i_size_read(inode) < + ((loff_t)(index + 1) << PAGE_SHIFT)) + f2fs_i_size_write(inode, ((loff_t)(index + 1) << PAGE_SHIFT)); + return page; } -static void f2fs_init_extent_tree(struct inode *inode, - struct f2fs_extent *i_ext) +static int __allocate_data_block(struct dnode_of_data *dn) { - struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct extent_tree *et; - struct extent_node *en; - struct extent_info ei; + struct f2fs_sb_info *sbi = F2FS_I_SB(dn->inode); + struct f2fs_summary sum; + struct node_info ni; + int seg = CURSEG_WARM_DATA; + pgoff_t fofs; + blkcnt_t count = 1; - if (le32_to_cpu(i_ext->len) < F2FS_MIN_EXTENT_LEN) - return; + if (unlikely(is_inode_flag_set(dn->inode, FI_NO_ALLOC))) + return -EPERM; - et = __grab_extent_tree(inode); + dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node); + if (dn->data_blkaddr == NEW_ADDR) + goto alloc; - write_lock(&et->lock); - if (et->count) - goto out; + if (unlikely(!inc_valid_block_count(sbi, dn->inode, &count))) + return -ENOSPC; - set_extent_info(&ei, le32_to_cpu(i_ext->fofs), - le32_to_cpu(i_ext->blk), le32_to_cpu(i_ext->len)); +alloc: + get_node_info(sbi, dn->nid, &ni); + set_summary(&sum, dn->nid, dn->ofs_in_node, ni.version); - en = __insert_extent_tree(sbi, et, &ei, NULL); - if (en) { - et->cached_en = en; + if 
(dn->ofs_in_node == 0 && dn->inode_page == dn->node_page) + seg = CURSEG_DIRECT_IO; - spin_lock(&sbi->extent_lock); - list_add_tail(&en->list, &sbi->extent_list); - spin_unlock(&sbi->extent_lock); - } -out: - write_unlock(&et->lock); - atomic_dec(&et->refcount); + allocate_data_block(sbi, NULL, dn->data_blkaddr, &dn->data_blkaddr, + &sum, seg); + set_data_blkaddr(dn); + + /* update i_size */ + fofs = start_bidx_of_node(ofs_of_node(dn->node_page), dn->inode) + + dn->ofs_in_node; + if (i_size_read(dn->inode) < ((loff_t)(fofs + 1) << PAGE_SHIFT)) + f2fs_i_size_write(dn->inode, + ((loff_t)(fofs + 1) << PAGE_SHIFT)); + return 0; } -static bool f2fs_lookup_extent_tree(struct inode *inode, pgoff_t pgofs, - struct extent_info *ei) +ssize_t f2fs_preallocate_blocks(struct inode *inode, loff_t pos, size_t count, bool dio) { - struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct extent_tree *et; - struct extent_node *en; + struct f2fs_map_blocks map; + ssize_t ret = 0; - trace_f2fs_lookup_extent_tree_start(inode, pgofs); + map.m_lblk = F2FS_BLK_ALIGN(pos); + map.m_len = F2FS_BYTES_TO_BLK(count); + map.m_next_pgofs = NULL; - et = __find_extent_tree(sbi, inode->i_ino); - if (!et) - return false; + if (f2fs_encrypted_inode(inode)) + return 0; - read_lock(&et->lock); - en = __lookup_extent_tree(et, pgofs); - if (en) { - *ei = en->ei; - spin_lock(&sbi->extent_lock); - if (!list_empty(&en->list)) - list_move_tail(&en->list, &sbi->extent_list); - spin_unlock(&sbi->extent_lock); - stat_inc_read_hit(sbi->sb); + if (dio) { + ret = f2fs_convert_inline_inode(inode); + if (ret) + return ret; + return f2fs_map_blocks(inode, &map, 1, F2FS_GET_BLOCK_PRE_DIO); } - stat_inc_total_hit(sbi->sb); - read_unlock(&et->lock); - - trace_f2fs_lookup_extent_tree_end(inode, pgofs, en); - - atomic_dec(&et->refcount); - return en ? true : false; + if (pos + count > MAX_INLINE_DATA) { + ret = f2fs_convert_inline_inode(inode); + if (ret) + return ret; + } + if (!f2fs_has_inline_data(inode)) + return f2fs_map_blocks(inode, &map, 1, F2FS_GET_BLOCK_PRE_AIO); + return ret; } -static void f2fs_update_extent_tree(struct inode *inode, pgoff_t fofs, - block_t blkaddr) +/* + * f2fs_map_blocks() now supported readahead/bmap/rw direct_IO with + * f2fs_map_blocks structure. + * If original data blocks are allocated, then give them to blockdev. + * Otherwise, + * a. preallocate requested block addresses + * b. do not use extent cache for better performance + * c. give the block addresses to blockdev + */ +int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map, + int create, int flag) { + unsigned int maxblocks = map->m_len; + struct dnode_of_data dn; struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct extent_tree *et; - struct extent_node *en = NULL, *en1 = NULL, *en2 = NULL, *en3 = NULL; - struct extent_node *den = NULL; - struct extent_info ei, dei; - unsigned int endofs; - - trace_f2fs_update_extent_tree(inode, fofs, blkaddr); + int mode = create ? ALLOC_NODE : LOOKUP_NODE; + pgoff_t pgofs, end_offset, end; + int err = 0, ofs = 1; + unsigned int ofs_in_node, last_ofs_in_node; + blkcnt_t prealloc; + struct extent_info ei; + bool allocated = false; + block_t blkaddr; - et = __grab_extent_tree(inode); + map->m_len = 0; + map->m_flags = 0; - write_lock(&et->lock); + /* it only supports block size == page size */ + pgofs = (pgoff_t)map->m_lblk; + end = pgofs + maxblocks; - /* 1. 
lookup and remove existing extent info in cache */ - en = __lookup_extent_tree(et, fofs); - if (!en) - goto update_extent; + if (!create && f2fs_lookup_extent_cache(inode, pgofs, &ei)) { + map->m_pblk = ei.blk + pgofs - ei.fofs; + map->m_len = min((pgoff_t)maxblocks, ei.fofs + ei.len - pgofs); + map->m_flags = F2FS_MAP_MAPPED; + goto out; + } - dei = en->ei; - __detach_extent_node(sbi, et, en); +next_dnode: + if (create) + f2fs_lock_op(sbi); - /* 2. if extent can be split more, split and insert the left part */ - if (dei.len > 1) { - /* insert left part of split extent into cache */ - if (fofs - dei.fofs >= F2FS_MIN_EXTENT_LEN) { - set_extent_info(&ei, dei.fofs, dei.blk, - fofs - dei.fofs); - en1 = __insert_extent_tree(sbi, et, &ei, NULL); + /* When reading holes, we need its node page */ + set_new_dnode(&dn, inode, NULL, NULL, 0); + err = get_dnode_of_data(&dn, pgofs, mode); + if (err) { + if (flag == F2FS_GET_BLOCK_BMAP) + map->m_pblk = 0; + if (err == -ENOENT) { + err = 0; + if (map->m_next_pgofs) + *map->m_next_pgofs = + get_next_page_offset(&dn, pgofs); } + goto unlock_out; + } - /* insert right part of split extent into cache */ - endofs = dei.fofs + dei.len - 1; - if (endofs - fofs >= F2FS_MIN_EXTENT_LEN) { - set_extent_info(&ei, fofs + 1, - fofs - dei.fofs + dei.blk, endofs - fofs); - en2 = __insert_extent_tree(sbi, et, &ei, NULL); + prealloc = 0; + ofs_in_node = dn.ofs_in_node; + end_offset = ADDRS_PER_PAGE(dn.node_page, inode); + +next_block: + blkaddr = datablock_addr(dn.node_page, dn.ofs_in_node); + + if (blkaddr == NEW_ADDR || blkaddr == NULL_ADDR) { + if (create) { + if (unlikely(f2fs_cp_error(sbi))) { + err = -EIO; + goto sync_out; + } + if (flag == F2FS_GET_BLOCK_PRE_AIO) { + if (blkaddr == NULL_ADDR) { + prealloc++; + last_ofs_in_node = dn.ofs_in_node; + } + } else { + err = __allocate_data_block(&dn); + if (!err) { + set_inode_flag(inode, FI_APPEND_WRITE); + allocated = true; + } + } + if (err) + goto sync_out; + map->m_flags = F2FS_MAP_NEW; + blkaddr = dn.data_blkaddr; + } else { + if (flag == F2FS_GET_BLOCK_BMAP) { + map->m_pblk = 0; + goto sync_out; + } + if (flag == F2FS_GET_BLOCK_FIEMAP && + blkaddr == NULL_ADDR) { + if (map->m_next_pgofs) + *map->m_next_pgofs = pgofs + 1; + } + if (flag != F2FS_GET_BLOCK_FIEMAP || + blkaddr != NEW_ADDR) + goto sync_out; } } -update_extent: - /* 3. update extent in extent cache */ - if (blkaddr) { - set_extent_info(&ei, fofs, blkaddr, 1); - en3 = __insert_extent_tree(sbi, et, &ei, &den); - } + if (flag == F2FS_GET_BLOCK_PRE_AIO) + goto skip; - /* 4. update in global extent list */ - spin_lock(&sbi->extent_lock); - if (en && !list_empty(&en->list)) - list_del(&en->list); - /* - * en1 and en2 split from en, they will become more and more smaller - * fragments after splitting several times. So if the length is smaller - * than F2FS_MIN_EXTENT_LEN, we will not add them into extent tree. - */ - if (en1) - list_add_tail(&en1->list, &sbi->extent_list); - if (en2) - list_add_tail(&en2->list, &sbi->extent_list); - if (en3) { - if (list_empty(&en3->list)) - list_add_tail(&en3->list, &sbi->extent_list); - else - list_move_tail(&en3->list, &sbi->extent_list); + if (map->m_len == 0) { + /* preallocated unwritten block should be mapped for fiemap. 
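 * (NEW_ADDR marks a block that was reserved but never written, so it
 * is tagged F2FS_MAP_UNWRITTEN to let FIEMAP report it separately from
 * allocated, written blocks)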
*/ + if (blkaddr == NEW_ADDR) + map->m_flags |= F2FS_MAP_UNWRITTEN; + map->m_flags |= F2FS_MAP_MAPPED; + + map->m_pblk = blkaddr; + map->m_len = 1; + } else if ((map->m_pblk != NEW_ADDR && + blkaddr == (map->m_pblk + ofs)) || + (map->m_pblk == NEW_ADDR && blkaddr == NEW_ADDR) || + flag == F2FS_GET_BLOCK_PRE_DIO) { + ofs++; + map->m_len++; + } else { + goto sync_out; } - if (den && !list_empty(&den->list)) - list_del(&den->list); - spin_unlock(&sbi->extent_lock); - - /* 5. release extent node */ - if (en) - kmem_cache_free(extent_node_slab, en); - if (den) - kmem_cache_free(extent_node_slab, den); - - write_unlock(&et->lock); - atomic_dec(&et->refcount); -} -void f2fs_preserve_extent_tree(struct inode *inode) -{ - struct extent_tree *et; - struct extent_info *ext = &F2FS_I(inode)->ext; - bool sync = false; +skip: + dn.ofs_in_node++; + pgofs++; - if (!test_opt(F2FS_I_SB(inode), EXTENT_CACHE)) - return; + /* preallocate blocks in batch for one dnode page */ + if (flag == F2FS_GET_BLOCK_PRE_AIO && + (pgofs == end || dn.ofs_in_node == end_offset)) { + + dn.ofs_in_node = ofs_in_node; + err = reserve_new_blocks(&dn, prealloc); + if (err) + goto sync_out; - et = __find_extent_tree(F2FS_I_SB(inode), inode->i_ino); - if (!et) { - if (ext->len) { - ext->len = 0; - update_inode_page(inode); + map->m_len += dn.ofs_in_node - ofs_in_node; + if (prealloc && dn.ofs_in_node != last_ofs_in_node + 1) { + err = -ENOSPC; + goto sync_out; } - return; + dn.ofs_in_node = end_offset; } - read_lock(&et->lock); - if (et->count) { - struct extent_node *en; - - if (et->cached_en) { - en = et->cached_en; - } else { - struct rb_node *node = rb_first(&et->root); + if (pgofs >= end) + goto sync_out; + else if (dn.ofs_in_node < end_offset) + goto next_block; - if (!node) - node = rb_last(&et->root); - en = rb_entry(node, struct extent_node, rb_node); - } + f2fs_put_dnode(&dn); - if (__is_extent_same(ext, &en->ei)) - goto out; + if (create) { + f2fs_unlock_op(sbi); + f2fs_balance_fs(sbi, allocated); + } + allocated = false; + goto next_dnode; - *ext = en->ei; - sync = true; - } else if (ext->len) { - ext->len = 0; - sync = true; +sync_out: + f2fs_put_dnode(&dn); +unlock_out: + if (create) { + f2fs_unlock_op(sbi); + f2fs_balance_fs(sbi, allocated); } out: - read_unlock(&et->lock); - atomic_dec(&et->refcount); - - if (sync) - update_inode_page(inode); + trace_f2fs_map_blocks(inode, map, err); + return err; } -void f2fs_shrink_extent_tree(struct f2fs_sb_info *sbi, int nr_shrink) +static int __get_data_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh, int create, int flag, + pgoff_t *next_pgofs) { - struct extent_tree *treevec[EXT_TREE_VEC_SIZE]; - struct extent_node *en, *tmp; - unsigned long ino = F2FS_ROOT_INO(sbi); - struct radix_tree_iter iter; - void **slot; - unsigned int found; - unsigned int node_cnt = 0, tree_cnt = 0; - - if (!test_opt(sbi, EXTENT_CACHE)) - return; + struct f2fs_map_blocks map; + int ret; - if (available_free_memory(sbi, EXTENT_CACHE)) - return; + map.m_lblk = iblock; + map.m_len = bh->b_size >> inode->i_blkbits; + map.m_next_pgofs = next_pgofs; - spin_lock(&sbi->extent_lock); - list_for_each_entry_safe(en, tmp, &sbi->extent_list, list) { - if (!nr_shrink--) - break; - list_del_init(&en->list); + ret = f2fs_map_blocks(inode, &map, create, flag); + if (!ret) { + map_bh(bh, inode->i_sb, map.m_pblk); + bh->b_state = (bh->b_state & ~F2FS_MAP_FLAGS) | map.m_flags; + bh->b_size = map.m_len << inode->i_blkbits; } - spin_unlock(&sbi->extent_lock); - - down_read(&sbi->extent_tree_lock); - 
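The new __get_data_block() wrapper above is mostly unit conversion: bh->b_size is expressed in bytes while f2fs_map_blocks() works in blocks, so the request is shifted down by i_blkbits on the way in and back up on the way out. A compile-and-run sketch with plain structs (assuming 4 KiB blocks; the fake mapping stands in for what f2fs_map_blocks() would fill in):

#include <stdio.h>

struct map_blocks { unsigned long long lblk, pblk; unsigned len; };
struct buf_head  { unsigned long long blocknr; unsigned long size; };

/* Sketch of the wrapper's unit conversion, assuming i_blkbits == 12
 * (4 KiB blocks); field names only loosely mirror the kernel types. */
static int get_data_block(struct map_blocks *map, struct buf_head *bh,
			  unsigned long long iblock, unsigned blkbits)
{
	map->lblk = iblock;
	map->len = bh->size >> blkbits;			/* bytes -> blocks */

	/* ... f2fs_map_blocks() would fill map->pblk/map->len here ... */
	map->pblk = 2048 + iblock;			/* fake mapping */

	bh->blocknr = map->pblk;
	bh->size = (unsigned long)map->len << blkbits;	/* blocks -> bytes */
	return 0;
}

int main(void)
{
	struct map_blocks map;
	struct buf_head bh = { .size = 16384 };	/* caller asks for 4 blocks */

	get_data_block(&map, &bh, 7, 12);
	printf("mapped lblk %llu -> pblk %llu, %lu bytes\n",
	       map.lblk, bh.blocknr, bh.size);
	return 0;
}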
while ((found = radix_tree_gang_lookup(&sbi->extent_tree_root, - (void **)treevec, ino, EXT_TREE_VEC_SIZE))) { - unsigned i; - - ino = treevec[found - 1]->ino + 1; - for (i = 0; i < found; i++) { - struct extent_tree *et = treevec[i]; - - atomic_inc(&et->refcount); - write_lock(&et->lock); - node_cnt += __free_extent_tree(sbi, et, false); - write_unlock(&et->lock); - atomic_dec(&et->refcount); - } - } - up_read(&sbi->extent_tree_lock); - - down_write(&sbi->extent_tree_lock); - radix_tree_for_each_slot(slot, &sbi->extent_tree_root, &iter, - F2FS_ROOT_INO(sbi)) { - struct extent_tree *et = (struct extent_tree *)*slot; - - if (!atomic_read(&et->refcount) && !et->count) { - radix_tree_delete(&sbi->extent_tree_root, et->ino); - kmem_cache_free(extent_tree_slab, et); - sbi->total_ext_tree--; - tree_cnt++; - } - } - up_write(&sbi->extent_tree_lock); - - trace_f2fs_shrink_extent_tree(sbi, node_cnt, tree_cnt); + return ret; } -void f2fs_destroy_extent_tree(struct inode *inode) +static int get_data_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create, int flag, + pgoff_t *next_pgofs) { - struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct extent_tree *et; - unsigned int node_cnt = 0; - - if (!test_opt(sbi, EXTENT_CACHE)) - return; - - et = __find_extent_tree(sbi, inode->i_ino); - if (!et) - goto out; - - /* free all extent info belong to this extent tree */ - write_lock(&et->lock); - node_cnt = __free_extent_tree(sbi, et, true); - write_unlock(&et->lock); - - atomic_dec(&et->refcount); - - /* try to find and delete extent tree entry in radix tree */ - down_write(&sbi->extent_tree_lock); - et = radix_tree_lookup(&sbi->extent_tree_root, inode->i_ino); - if (!et) { - up_write(&sbi->extent_tree_lock); - goto out; - } - f2fs_bug_on(sbi, atomic_read(&et->refcount) || et->count); - radix_tree_delete(&sbi->extent_tree_root, inode->i_ino); - kmem_cache_free(extent_tree_slab, et); - sbi->total_ext_tree--; - up_write(&sbi->extent_tree_lock); -out: - trace_f2fs_destroy_extent_tree(inode, node_cnt); - return; + return __get_data_block(inode, iblock, bh_result, create, + flag, next_pgofs); } -void f2fs_init_extent_cache(struct inode *inode, struct f2fs_extent *i_ext) +static int get_data_block_dio(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) { - if (test_opt(F2FS_I_SB(inode), EXTENT_CACHE)) - f2fs_init_extent_tree(inode, i_ext); - - write_lock(&F2FS_I(inode)->ext_lock); - get_extent_info(&F2FS_I(inode)->ext, *i_ext); - write_unlock(&F2FS_I(inode)->ext_lock); + return __get_data_block(inode, iblock, bh_result, create, + F2FS_GET_BLOCK_DIO, NULL); } -static bool f2fs_lookup_extent_cache(struct inode *inode, pgoff_t pgofs, - struct extent_info *ei) +static int get_data_block_bmap(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) { - if (is_inode_flag_set(F2FS_I(inode), FI_NO_EXTENT)) - return false; + /* Block number less than F2FS MAX BLOCKS */ + if (unlikely(iblock >= F2FS_I_SB(inode)->max_file_blocks)) + return -EFBIG; - if (test_opt(F2FS_I_SB(inode), EXTENT_CACHE)) - return f2fs_lookup_extent_tree(inode, pgofs, ei); - - return lookup_extent_info(inode, pgofs, ei); + return __get_data_block(inode, iblock, bh_result, create, + F2FS_GET_BLOCK_BMAP, NULL); } -void f2fs_update_extent_cache(struct dnode_of_data *dn) +static inline sector_t logical_to_blk(struct inode *inode, loff_t offset) { - struct f2fs_inode_info *fi = F2FS_I(dn->inode); - pgoff_t fofs; - - f2fs_bug_on(F2FS_I_SB(dn->inode), dn->data_blkaddr 
== NEW_ADDR); - - if (is_inode_flag_set(fi, FI_NO_EXTENT)) - return; - - fofs = start_bidx_of_node(ofs_of_node(dn->node_page), fi) + - dn->ofs_in_node; - - if (test_opt(F2FS_I_SB(dn->inode), EXTENT_CACHE)) - return f2fs_update_extent_tree(dn->inode, fofs, - dn->data_blkaddr); - - if (update_extent_info(dn->inode, fofs, dn->data_blkaddr)) - sync_inode_page(dn); + return (offset >> inode->i_blkbits); } -struct page *find_data_page(struct inode *inode, pgoff_t index, bool sync) +static inline loff_t blk_to_logical(struct inode *inode, sector_t blk) { - struct address_space *mapping = inode->i_mapping; - struct dnode_of_data dn; - struct page *page; - struct extent_info ei; - int err; - struct f2fs_io_info fio = { - .type = DATA, - .rw = sync ? READ_SYNC : READA, - }; - - /* - * If sync is false, it needs to check its block allocation. - * This is need and triggered by two flows: - * gc and truncate_partial_data_page. - */ - if (!sync) - goto search; - - page = find_get_page(mapping, index); - if (page && PageUptodate(page)) - return page; - f2fs_put_page(page, 0); -search: - if (f2fs_lookup_extent_cache(inode, index, &ei)) { - dn.data_blkaddr = ei.blk + index - ei.fofs; - goto got_it; - } - - set_new_dnode(&dn, inode, NULL, NULL, 0); - err = get_dnode_of_data(&dn, index, LOOKUP_NODE); - if (err) - return ERR_PTR(err); - f2fs_put_dnode(&dn); - - if (dn.data_blkaddr == NULL_ADDR) - return ERR_PTR(-ENOENT); - - /* By fallocate(), there is no cached page, but with NEW_ADDR */ - if (unlikely(dn.data_blkaddr == NEW_ADDR)) - return ERR_PTR(-EINVAL); - -got_it: - page = grab_cache_page(mapping, index); - if (!page) - return ERR_PTR(-ENOMEM); - - if (PageUptodate(page)) { - unlock_page(page); - return page; - } - - fio.blk_addr = dn.data_blkaddr; - err = f2fs_submit_page_bio(F2FS_I_SB(inode), page, &fio); - if (err) - return ERR_PTR(err); - - if (sync) { - wait_on_page_locked(page); - if (unlikely(!PageUptodate(page))) { - f2fs_put_page(page, 0); - return ERR_PTR(-EIO); - } - } - return page; + return (blk << inode->i_blkbits); } -/* - * If it tries to access a hole, return an error. - * Because, the callers, functions in dir.c and GC, should be able to know - * whether this page exists or not. 
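The logical_to_blk()/blk_to_logical() helpers added above are plain shifts by i_blkbits; with f2fs's usual 4 KiB block size that is division or multiplication by 4096. A trivial demonstration:

#include <stdio.h>

/* Same math as the logical_to_blk()/blk_to_logical() helpers above,
 * with i_blkbits fixed at 12 (4 KiB blocks) for the demo. */
static unsigned long long logical_to_blk(unsigned long long offset)
{
	return offset >> 12;
}

static unsigned long long blk_to_logical(unsigned long long blk)
{
	return blk << 12;
}

int main(void)
{
	unsigned long long pos = 10000;	/* byte offset into the file */

	printf("byte %llu sits in block %llu, which starts at byte %llu\n",
	       pos, logical_to_blk(pos),
	       blk_to_logical(logical_to_blk(pos)));
	return 0;
}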
- */ -struct page *get_lock_data_page(struct inode *inode, pgoff_t index) +int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) { - struct address_space *mapping = inode->i_mapping; - struct dnode_of_data dn; - struct page *page; - struct extent_info ei; - int err; - struct f2fs_io_info fio = { - .type = DATA, - .rw = READ_SYNC, - }; -repeat: - page = grab_cache_page(mapping, index); - if (!page) - return ERR_PTR(-ENOMEM); + struct buffer_head map_bh; + sector_t start_blk, last_blk; + pgoff_t next_pgofs; + loff_t isize; + u64 logical = 0, phys = 0, size = 0; + u32 flags = 0; + int ret = 0; + + ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC); + if (ret) + return ret; - if (f2fs_lookup_extent_cache(inode, index, &ei)) { - dn.data_blkaddr = ei.blk + index - ei.fofs; - goto got_it; + if (f2fs_has_inline_data(inode)) { + ret = f2fs_inline_data_fiemap(inode, fieinfo, start, len); + if (ret != -EAGAIN) + return ret; } - set_new_dnode(&dn, inode, NULL, NULL, 0); - err = get_dnode_of_data(&dn, index, LOOKUP_NODE); - if (err) { - f2fs_put_page(page, 1); - return ERR_PTR(err); - } - f2fs_put_dnode(&dn); + inode_lock(inode); - if (unlikely(dn.data_blkaddr == NULL_ADDR)) { - f2fs_put_page(page, 1); - return ERR_PTR(-ENOENT); - } + isize = i_size_read(inode); + if (start >= isize) + goto out; -got_it: - if (PageUptodate(page)) - return page; + if (start + len > isize) + len = isize - start; - /* - * A new dentry page is allocated but not able to be written, since its - * new inode page couldn't be allocated due to -ENOSPC. - * In such the case, its blkaddr can be remained as NEW_ADDR. - * see, f2fs_add_link -> get_new_data_page -> init_inode_metadata. - */ - if (dn.data_blkaddr == NEW_ADDR) { - zero_user_segment(page, 0, PAGE_CACHE_SIZE); - SetPageUptodate(page); - return page; - } + if (logical_to_blk(inode, len) == 0) + len = blk_to_logical(inode, 1); - fio.blk_addr = dn.data_blkaddr; - err = f2fs_submit_page_bio(F2FS_I_SB(inode), page, &fio); - if (err) - return ERR_PTR(err); + start_blk = logical_to_blk(inode, start); + last_blk = logical_to_blk(inode, start + len - 1); - lock_page(page); - if (unlikely(!PageUptodate(page))) { - f2fs_put_page(page, 1); - return ERR_PTR(-EIO); - } - if (unlikely(page->mapping != mapping)) { - f2fs_put_page(page, 1); - goto repeat; - } - return page; -} +next: + memset(&map_bh, 0, sizeof(struct buffer_head)); + map_bh.b_size = len; -/* - * Caller ensures that this data page is never allocated. - * A new zero-filled data page is allocated in the page cache. - * - * Also, caller should grab and release a rwsem by calling f2fs_lock_op() and - * f2fs_unlock_op(). - * Note that, ipage is set only by make_empty_dir. - */ -struct page *get_new_data_page(struct inode *inode, - struct page *ipage, pgoff_t index, bool new_i_size) -{ - struct address_space *mapping = inode->i_mapping; - struct page *page; - struct dnode_of_data dn; - int err; + ret = get_data_block(inode, start_blk, &map_bh, 0, + F2FS_GET_BLOCK_FIEMAP, &next_pgofs); + if (ret) + goto out; - set_new_dnode(&dn, inode, ipage, NULL, 0); - err = f2fs_reserve_block(&dn, index); - if (err) - return ERR_PTR(err); -repeat: - page = grab_cache_page(mapping, index); - if (!page) { - err = -ENOMEM; - goto put_err; + /* HOLE */ + if (!buffer_mapped(&map_bh)) { + start_blk = next_pgofs; + /* Go through holes until passing the EOF */ + if (blk_to_logical(inode, start_blk) < isize) + goto prep_next; + /* Found a hole beyond isize means no more extents.
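For context, the loop assembled here backs the FS_IOC_FIEMAP ioctl, so its hole and extent semantics can be observed directly from user space. A minimal reader (error handling trimmed; the 32-extent buffer is an arbitrary size chosen for the demo):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	unsigned int i, n = 32;		/* arbitrary extent buffer size */
	struct fiemap *fm;
	int fd;

	if (argc < 2)
		return 1;
	fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || !fm)
		return 1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* matches fiemap_check_flags() */
	fm->fm_extent_count = n;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu phys %llu len %llu%s%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ? " [unwritten]" : "",
		       (fe->fe_flags & FIEMAP_EXTENT_LAST) ? " [last]" : "");
	}
	close(fd);
	free(fm);
	return 0;
}

Run against a sparse file, holes show up as gaps between reported extents and the final extent carries FIEMAP_EXTENT_LAST, matching the flag logic in this function.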
+ * Note that the premise is that filesystems don't + * punch holes beyond isize and keep size unchanged. + */ + flags |= FIEMAP_EXTENT_LAST; } - if (PageUptodate(page)) - return page; - - if (dn.data_blkaddr == NEW_ADDR) { - zero_user_segment(page, 0, PAGE_CACHE_SIZE); - SetPageUptodate(page); - } else { - struct f2fs_io_info fio = { - .type = DATA, - .rw = READ_SYNC, - .blk_addr = dn.data_blkaddr, - }; - err = f2fs_submit_page_bio(F2FS_I_SB(inode), page, &fio); - if (err) - goto put_err; + if (size) { + if (f2fs_encrypted_inode(inode)) + flags |= FIEMAP_EXTENT_DATA_ENCRYPTED; - lock_page(page); - if (unlikely(!PageUptodate(page))) { - f2fs_put_page(page, 1); - err = -EIO; - goto put_err; - } - if (unlikely(page->mapping != mapping)) { - f2fs_put_page(page, 1); - goto repeat; - } - } - - if (new_i_size && - i_size_read(inode) < ((index + 1) << PAGE_CACHE_SHIFT)) { - i_size_write(inode, ((index + 1) << PAGE_CACHE_SHIFT)); - /* Only the directory inode sets new_i_size */ - set_inode_flag(F2FS_I(inode), FI_UPDATE_DIR); + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, flags); } - return page; - -put_err: - f2fs_put_dnode(&dn); - return ERR_PTR(err); -} - -static int __allocate_data_block(struct dnode_of_data *dn) -{ - struct f2fs_sb_info *sbi = F2FS_I_SB(dn->inode); - struct f2fs_inode_info *fi = F2FS_I(dn->inode); - struct f2fs_summary sum; - struct node_info ni; - int seg = CURSEG_WARM_DATA; - pgoff_t fofs; - - if (unlikely(is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC))) - return -EPERM; - - dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node); - if (dn->data_blkaddr == NEW_ADDR) - goto alloc; - if (unlikely(!inc_valid_block_count(sbi, dn->inode, 1))) - return -ENOSPC; - -alloc: - get_node_info(sbi, dn->nid, &ni); - set_summary(&sum, dn->nid, dn->ofs_in_node, ni.version); + if (start_blk > last_blk || ret) + goto out; - if (dn->ofs_in_node == 0 && dn->inode_page == dn->node_page) - seg = CURSEG_DIRECT_IO; + logical = blk_to_logical(inode, start_blk); + phys = blk_to_logical(inode, map_bh.b_blocknr); + size = map_bh.b_size; + flags = 0; + if (buffer_unwritten(&map_bh)) + flags = FIEMAP_EXTENT_UNWRITTEN; - allocate_data_block(sbi, NULL, dn->data_blkaddr, &dn->data_blkaddr, - &sum, seg); - - /* direct IO doesn't use extent cache to maximize the performance */ - set_data_blkaddr(dn); + start_blk += logical_to_blk(inode, size); - /* update i_size */ - fofs = start_bidx_of_node(ofs_of_node(dn->node_page), fi) + - dn->ofs_in_node; - if (i_size_read(dn->inode) < ((fofs + 1) << PAGE_CACHE_SHIFT)) - i_size_write(dn->inode, ((fofs + 1) << PAGE_CACHE_SHIFT)); +prep_next: + cond_resched(); + if (fatal_signal_pending(current)) + ret = -EINTR; + else + goto next; +out: + if (ret == 1) + ret = 0; - return 0; + inode_unlock(inode); + return ret; } -static void __allocate_data_blocks(struct inode *inode, loff_t offset, - size_t count) +struct bio *f2fs_grab_bio(struct inode *inode, block_t blkaddr, + unsigned nr_pages) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct dnode_of_data dn; - u64 start = F2FS_BYTES_TO_BLK(offset); - u64 len = F2FS_BYTES_TO_BLK(count); - bool allocated; - u64 end_offset; - - while (len) { - f2fs_balance_fs(sbi); - f2fs_lock_op(sbi); - - /* When reading holes, we need its node page */ - set_new_dnode(&dn, inode, NULL, NULL, 0); - if (get_dnode_of_data(&dn, start, ALLOC_NODE)) - goto out; - - allocated = false; - end_offset = ADDRS_PER_PAGE(dn.node_page, F2FS_I(inode)); - - while (dn.ofs_in_node < end_offset && len) { - block_t blkaddr; - - 
blkaddr = datablock_addr(dn.node_page, dn.ofs_in_node); - if (blkaddr == NULL_ADDR || blkaddr == NEW_ADDR) { - if (__allocate_data_block(&dn)) - goto sync_out; - allocated = true; - } - len--; - start++; - dn.ofs_in_node++; - } - - if (allocated) - sync_inode_page(&dn); - - f2fs_put_dnode(&dn); - f2fs_unlock_op(sbi); - } - return; - -sync_out: - if (allocated) - sync_inode_page(&dn); - f2fs_put_dnode(&dn); -out: - f2fs_unlock_op(sbi); - return; -} - -/* - * get_data_block() now supported readahead/bmap/rw direct_IO with mapped bh. - * If original data blocks are allocated, then give them to blockdev. - * Otherwise, - * a. preallocate requested block addresses - * b. do not use extent cache for better performance - * c. give the block addresses to blockdev - */ -static int __get_data_block(struct inode *inode, sector_t iblock, - struct buffer_head *bh_result, int create, bool fiemap) -{ - unsigned int blkbits = inode->i_sb->s_blocksize_bits; - unsigned maxblocks = bh_result->b_size >> blkbits; - struct dnode_of_data dn; - int mode = create ? ALLOC_NODE : LOOKUP_NODE_RA; - pgoff_t pgofs, end_offset; - int err = 0, ofs = 1; - struct extent_info ei; - bool allocated = false; - - /* Get the page offset from the block offset(iblock) */ - pgofs = (pgoff_t)(iblock >> (PAGE_CACHE_SHIFT - blkbits)); - - if (f2fs_lookup_extent_cache(inode, pgofs, &ei)) { - f2fs_map_bh(inode->i_sb, pgofs, &ei, bh_result); - goto out; - } - - if (create) - f2fs_lock_op(F2FS_I_SB(inode)); - - /* When reading holes, we need its node page */ - set_new_dnode(&dn, inode, NULL, NULL, 0); - err = get_dnode_of_data(&dn, pgofs, mode); - if (err) { - if (err == -ENOENT) - err = 0; - goto unlock_out; - } - if (dn.data_blkaddr == NEW_ADDR && !fiemap) - goto put_out; - - if (dn.data_blkaddr != NULL_ADDR) { - clear_buffer_new(bh_result); - map_bh(bh_result, inode->i_sb, dn.data_blkaddr); - } else if (create) { - err = __allocate_data_block(&dn); - if (err) - goto put_out; - allocated = true; - set_buffer_new(bh_result); - map_bh(bh_result, inode->i_sb, dn.data_blkaddr); - } else { - goto put_out; - } - - end_offset = ADDRS_PER_PAGE(dn.node_page, F2FS_I(inode)); - bh_result->b_size = (((size_t)1) << blkbits); - dn.ofs_in_node++; - pgofs++; - -get_next: - if (dn.ofs_in_node >= end_offset) { - if (allocated) - sync_inode_page(&dn); - allocated = false; - f2fs_put_dnode(&dn); + struct fscrypt_ctx *ctx = NULL; + struct block_device *bdev = sbi->sb->s_bdev; + struct bio *bio; - set_new_dnode(&dn, inode, NULL, NULL, 0); - err = get_dnode_of_data(&dn, pgofs, mode); - if (err) { - if (err == -ENOENT) - err = 0; - goto unlock_out; - } - if (dn.data_blkaddr == NEW_ADDR && !fiemap) - goto put_out; + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) { + ctx = fscrypt_get_ctx(inode, GFP_NOFS); + if (IS_ERR(ctx)) + return ERR_CAST(ctx); - end_offset = ADDRS_PER_PAGE(dn.node_page, F2FS_I(inode)); + /* wait the page to be moved by cleaning */ + f2fs_wait_on_encrypted_page_writeback(sbi, blkaddr); } - if (maxblocks > (bh_result->b_size >> blkbits)) { - block_t blkaddr = datablock_addr(dn.node_page, dn.ofs_in_node); - if (blkaddr == NULL_ADDR && create) { - err = __allocate_data_block(&dn); - if (err) - goto sync_out; - allocated = true; - set_buffer_new(bh_result); - blkaddr = dn.data_blkaddr; - } - /* Give more consecutive addresses for the readahead */ - if (blkaddr == (bh_result->b_blocknr + ofs)) { - ofs++; - dn.ofs_in_node++; - pgofs++; - bh_result->b_size += (((size_t)1) << blkbits); - goto get_next; - } + bio = bio_alloc(GFP_KERNEL, 
min_t(int, nr_pages, BIO_MAX_PAGES)); + if (!bio) { + if (ctx) + fscrypt_release_ctx(ctx); + return ERR_PTR(-ENOMEM); } -sync_out: - if (allocated) - sync_inode_page(&dn); -put_out: - f2fs_put_dnode(&dn); -unlock_out: - if (create) - f2fs_unlock_op(F2FS_I_SB(inode)); -out: - trace_f2fs_get_data_block(inode, iblock, bh_result, err); - return err; -} + bio->bi_bdev = bdev; + bio->bi_sector = SECTOR_FROM_BLOCK(blkaddr); + bio->bi_end_io = f2fs_read_end_io; + bio->bi_private = ctx; -static int get_data_block(struct inode *inode, sector_t iblock, - struct buffer_head *bh_result, int create) -{ - return __get_data_block(inode, iblock, bh_result, create, false); + return bio; } -static int get_data_block_fiemap(struct inode *inode, sector_t iblock, - struct buffer_head *bh_result, int create) +/* + * This function was originally taken from fs/mpage.c, and customized for f2fs. + * The major change comes from block_size == page_size in f2fs by default. + */ +static int f2fs_mpage_readpages(struct address_space *mapping, + struct list_head *pages, struct page *page, + unsigned nr_pages) { - return __get_data_block(inode, iblock, bh_result, create, true); -} + struct bio *bio = NULL; + unsigned page_idx; + sector_t last_block_in_bio = 0; + struct inode *inode = mapping->host; + const unsigned blkbits = inode->i_blkbits; + const unsigned blocksize = 1 << blkbits; + sector_t block_in_file; + sector_t last_block; + sector_t last_block_in_file; + sector_t block_nr; + struct f2fs_map_blocks map; + + map.m_pblk = 0; + map.m_lblk = 0; + map.m_len = 0; + map.m_flags = 0; + map.m_next_pgofs = NULL; + + for (page_idx = 0; nr_pages; page_idx++, nr_pages--) { + + prefetchw(&page->flags); + if (pages) { + page = list_entry(pages->prev, struct page, lru); + list_del(&page->lru); + if (add_to_page_cache_lru(page, mapping, + page->index, GFP_KERNEL)) + goto next_page; + } -int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, - u64 start, u64 len) -{ - return generic_block_fiemap(inode, fieinfo, - start, len, get_data_block_fiemap); + block_in_file = (sector_t)page->index; + last_block = block_in_file + nr_pages; + last_block_in_file = (i_size_read(inode) + blocksize - 1) >> + blkbits; + if (last_block > last_block_in_file) + last_block = last_block_in_file; + + /* + * Map blocks using the previous result first. + */ + if ((map.m_flags & F2FS_MAP_MAPPED) && + block_in_file > map.m_lblk && + block_in_file < (map.m_lblk + map.m_len)) + goto got_it; + + /* + * Then do more f2fs_map_blocks() calls until we are + * done with this page. + */ + map.m_flags = 0; + + if (block_in_file < last_block) { + map.m_lblk = block_in_file; + map.m_len = last_block - block_in_file; + + if (f2fs_map_blocks(inode, &map, 0, + F2FS_GET_BLOCK_READ)) + goto set_error_page; + } +got_it: + if ((map.m_flags & F2FS_MAP_MAPPED)) { + block_nr = map.m_pblk + block_in_file - map.m_lblk; + SetPageMappedToDisk(page); + + if (!PageUptodate(page) && !cleancache_get_page(page)) { + SetPageUptodate(page); + goto confused; + } + } else { + zero_user_segment(page, 0, PAGE_SIZE); + if (!PageUptodate(page)) + SetPageUptodate(page); + unlock_page(page); + goto next_page; + } + + /* + * This page will go to BIO. Do we need to send this + * BIO off first?
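The contiguity check that follows ("bio && last_block_in_bio != block_nr - 1") is the classic mpage batching rule: keep adding pages to the open bio while block numbers stay consecutive, otherwise submit it and start a new one. The same pattern in isolation (a user-space sketch, not the kernel bio API):

#include <stdio.h>

#define NO_BIO	(~0ull)

static unsigned long long bio_start = NO_BIO, bio_len, last_in_bio;

static void submit(void)
{
	if (bio_start != NO_BIO)
		printf("submit bio: blocks %llu..%llu\n",
		       bio_start, bio_start + bio_len - 1);
	bio_start = NO_BIO;
}

/* Add one page's block to the pending bio, flushing on discontiguity,
 * the same rule as "bio && last_block_in_bio != block_nr - 1" below. */
static void add_block(unsigned long long blk)
{
	if (bio_start != NO_BIO && last_in_bio != blk - 1)
		submit();
	if (bio_start == NO_BIO) {
		bio_start = blk;
		bio_len = 0;
	}
	bio_len++;
	last_in_bio = blk;
}

int main(void)
{
	unsigned long long blocks[] = { 8, 9, 10, 42, 43, 7 };
	unsigned i;

	for (i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++)
		add_block(blocks[i]);
	submit();	/* flush whatever is still pending */
	return 0;
}

With the sample input this prints three submissions (8..10, 42..43, 7..7), which is exactly how sequential reads end up as few large bios and random reads as many small ones.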
+ */ + if (bio && (last_block_in_bio != block_nr - 1)) { +submit_and_realloc: + __submit_bio(F2FS_I_SB(inode), READ, bio, DATA); + bio = NULL; + } + if (bio == NULL) { + bio = f2fs_grab_bio(inode, block_nr, nr_pages); + if (IS_ERR(bio)) { + bio = NULL; + goto set_error_page; + } + } + + if (bio_add_page(bio, page, blocksize, 0) < blocksize) + goto submit_and_realloc; + + last_block_in_bio = block_nr; + goto next_page; +set_error_page: + SetPageError(page); + zero_user_segment(page, 0, PAGE_SIZE); + unlock_page(page); + goto next_page; +confused: + if (bio) { + __submit_bio(F2FS_I_SB(inode), READ, bio, DATA); + bio = NULL; + } + unlock_page(page); +next_page: + if (pages) + put_page(page); + } + BUG_ON(pages && !list_empty(pages)); + if (bio) + __submit_bio(F2FS_I_SB(inode), READ, bio, DATA); + return 0; } static int f2fs_read_data_page(struct file *file, struct page *page) @@ -1350,8 +1128,7 @@ static int f2fs_read_data_page(struct file *file, struct page *page) if (f2fs_has_inline_data(inode)) ret = f2fs_read_inline_data(inode, page); if (ret == -EAGAIN) - ret = mpage_readpage(page, get_data_block); - + ret = f2fs_mpage_readpages(page->mapping, NULL, page, 1); return ret; } @@ -1360,16 +1137,20 @@ static int f2fs_read_data_pages(struct file *file, struct list_head *pages, unsigned nr_pages) { struct inode *inode = file->f_mapping->host; + struct page *page = list_entry(pages->prev, struct page, lru); + + trace_f2fs_readpages(inode, page, nr_pages); /* If the file has inline data, skip readpages */ if (f2fs_has_inline_data(inode)) return 0; - return mpage_readpages(mapping, pages, nr_pages, get_data_block); + return f2fs_mpage_readpages(mapping, pages, NULL, nr_pages); } -int do_write_data_page(struct page *page, struct f2fs_io_info *fio) +int do_write_data_page(struct f2fs_io_info *fio) { + struct page *page = fio->page; struct inode *inode = page->mapping->host; struct dnode_of_data dn; int err = 0; @@ -1379,34 +1160,56 @@ int do_write_data_page(struct page *page, struct f2fs_io_info *fio) if (err) return err; - fio->blk_addr = dn.data_blkaddr; + fio->old_blkaddr = dn.data_blkaddr; /* This page is already truncated */ - if (fio->blk_addr == NULL_ADDR) { + if (fio->old_blkaddr == NULL_ADDR) { ClearPageUptodate(page); goto out_writepage; } + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) { + gfp_t gfp_flags = GFP_NOFS; + + /* wait for GCed encrypted page writeback */ + f2fs_wait_on_encrypted_page_writeback(F2FS_I_SB(inode), + fio->old_blkaddr); +retry_encrypt: + fio->encrypted_page = fscrypt_encrypt_page(inode, fio->page, + gfp_flags); + if (IS_ERR(fio->encrypted_page)) { + err = PTR_ERR(fio->encrypted_page); + if (err == -ENOMEM) { + /* flush pending ios and wait for a while */ + f2fs_flush_merged_bios(F2FS_I_SB(inode)); + congestion_wait(BLK_RW_ASYNC, HZ/50); + gfp_flags |= __GFP_NOFAIL; + err = 0; + goto retry_encrypt; + } + goto out_writepage; + } + } + set_page_writeback(page); /* * If current allocation needs SSR, * it had better in-place writes for updated data. 
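The branch that follows picks between in-place update (IPU) and out-of-place update (OPU). Restated as a standalone predicate (illustrative only; the real policy lives in need_inplace_update()):

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the condition below: rewrite in place only when the block
 * already exists on disk (old address is not NEW_ADDR), the page is
 * neither cold nor part of an atomic write, and the policy (e.g. SSR
 * pressure) asks for IPU. */
static bool use_ipu(bool has_old_block, bool cold, bool atomic_page,
		    bool policy_wants_ipu)
{
	return has_old_block && !cold && !atomic_page && policy_wants_ipu;
}

int main(void)
{
	printf("hot rewrite under SSR  -> %s\n",
	       use_ipu(true, false, false, true) ? "IPU" : "OPU");
	printf("atomic-write page      -> %s\n",
	       use_ipu(true, false, true, true) ? "IPU" : "OPU");
	printf("freshly allocated page -> %s\n",
	       use_ipu(false, false, false, true) ? "IPU" : "OPU");
	return 0;
}

The !IS_ATOMIC_WRITTEN_PAGE() clause is the new part in this hunk; it matches the commit above about always committing atomic pages out of place in LFS mode.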
*/ - if (unlikely(fio->blk_addr != NEW_ADDR && + if (unlikely(fio->old_blkaddr != NEW_ADDR && !is_cold_data(page) && + !IS_ATOMIC_WRITTEN_PAGE(page) && need_inplace_update(inode))) { - rewrite_data_page(page, fio); - set_inode_flag(F2FS_I(inode), FI_UPDATE_WRITE); + rewrite_data_page(fio); + set_inode_flag(inode, FI_UPDATE_WRITE); trace_f2fs_do_write_data_page(page, IPU); } else { - write_data_page(page, &dn, fio); - set_data_blkaddr(&dn); - f2fs_update_extent_cache(&dn); + write_data_page(&dn, fio); trace_f2fs_do_write_data_page(page, OPU); - set_inode_flag(F2FS_I(inode), FI_APPEND_WRITE); + set_inode_flag(inode, FI_APPEND_WRITE); if (page->index == 0) - set_inode_flag(F2FS_I(inode), FI_FIRST_BLOCK_WRITTEN); + set_inode_flag(inode, FI_FIRST_BLOCK_WRITTEN); } out_writepage: f2fs_put_dnode(&dn); @@ -1420,13 +1223,17 @@ static int f2fs_write_data_page(struct page *page, struct f2fs_sb_info *sbi = F2FS_I_SB(inode); loff_t i_size = i_size_read(inode); const pgoff_t end_index = ((unsigned long long) i_size) - >> PAGE_CACHE_SHIFT; + >> PAGE_SHIFT; + loff_t psize = (page->index + 1) << PAGE_SHIFT; unsigned offset = 0; bool need_balance_fs = false; int err = 0; struct f2fs_io_info fio = { + .sbi = sbi, .type = DATA, .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE, + .page = page, + .encrypted_page = NULL, }; trace_f2fs_writepage(page, DATA); @@ -1438,34 +1245,34 @@ static int f2fs_write_data_page(struct page *page, * If the offset is out-of-range of file size, * this page does not have to be written to disk. */ - offset = i_size & (PAGE_CACHE_SIZE - 1); + offset = i_size & (PAGE_SIZE - 1); if ((page->index >= end_index + 1) || !offset) goto out; - zero_user_segment(page, offset, PAGE_CACHE_SIZE); + zero_user_segment(page, offset, PAGE_SIZE); write: if (unlikely(is_sbi_flag_set(sbi, SBI_POR_DOING))) goto redirty_out; if (f2fs_is_drop_cache(inode)) goto out; - if (f2fs_is_volatile_file(inode) && !wbc->for_reclaim && - available_free_memory(sbi, BASE_CHECK)) + /* we should not write 0'th page having journal header */ + if (f2fs_is_volatile_file(inode) && (!page->index || + (!wbc->for_reclaim && + available_free_memory(sbi, BASE_CHECK)))) goto redirty_out; - /* Dentry blocks are controlled by checkpoint */ - if (S_ISDIR(inode->i_mode)) { - if (unlikely(f2fs_cp_error(sbi))) - goto redirty_out; - err = do_write_data_page(page, &fio); - goto done; - } - /* we should bypass data pages to proceed with the kworker jobs */ if (unlikely(f2fs_cp_error(sbi))) { - SetPageError(page); + mapping_set_error(page->mapping, -EIO); goto out; } + /* Dentry blocks are controlled by checkpoint */ + if (S_ISDIR(inode->i_mode)) { + err = do_write_data_page(&fio); + goto done; + } + if (!wbc->for_reclaim) need_balance_fs = true; else if (has_not_enough_free_secs(sbi, 0)) @@ -1476,7 +1283,9 @@ static int f2fs_write_data_page(struct page *page, if (f2fs_has_inline_data(inode)) err = f2fs_write_inline_data(inode, page); if (err == -EAGAIN) - err = do_write_data_page(page, &fio); + err = do_write_data_page(&fio); + if (F2FS_I(inode)->last_disk_size < psize) + F2FS_I(inode)->last_disk_size = psize; f2fs_unlock_op(sbi); done: if (err && err != -ENOENT) @@ -1487,24 +1296,140 @@ static int f2fs_write_data_page(struct page *page, inode_dec_dirty_pages(inode); if (err) ClearPageUptodate(page); + + if (wbc->for_reclaim) { + f2fs_submit_merged_bio_cond(sbi, NULL, page, 0, DATA, WRITE); + remove_dirty_inode(inode); + } + unlock_page(page); - if (need_balance_fs) - f2fs_balance_fs(sbi); - if (wbc->for_reclaim) + 
f2fs_balance_fs(sbi, need_balance_fs); + + if (unlikely(f2fs_cp_error(sbi))) f2fs_submit_merged_bio(sbi, DATA, WRITE); + return 0; redirty_out: redirty_page_for_writepage(wbc, page); - return AOP_WRITEPAGE_ACTIVATE; + unlock_page(page); + return err; } -static int __f2fs_writepage(struct page *page, struct writeback_control *wbc, - void *data) +/* + * This function was copied from write_cache_pages from mm/page-writeback.c. + * The major change is making the write step of cold data pages separate from + * that of warm/hot data pages. + */ +static int f2fs_write_cache_pages(struct address_space *mapping, + struct writeback_control *wbc) { - struct address_space *mapping = data; - int ret = mapping->a_ops->writepage(page, wbc); - mapping_set_error(mapping, ret); + int ret = 0; + int done = 0; + struct pagevec pvec; + int nr_pages; + pgoff_t uninitialized_var(writeback_index); + pgoff_t index; + pgoff_t end; /* Inclusive */ + pgoff_t done_index; + int cycled; + int range_whole = 0; + int tag; + + pagevec_init(&pvec, 0); + + if (wbc->range_cyclic) { + writeback_index = mapping->writeback_index; /* prev offset */ + index = writeback_index; + if (index == 0) + cycled = 1; + else + cycled = 0; + end = -1; + } else { + index = wbc->range_start >> PAGE_SHIFT; + end = wbc->range_end >> PAGE_SHIFT; + if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) + range_whole = 1; + cycled = 1; /* ignore range_cyclic tests */ + } + if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) + tag = PAGECACHE_TAG_TOWRITE; + else + tag = PAGECACHE_TAG_DIRTY; +retry: + if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) + tag_pages_for_writeback(mapping, index, end); + done_index = index; + while (!done && (index <= end)) { + int i; + + nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag, + min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1); + if (nr_pages == 0) + break; + + for (i = 0; i < nr_pages; i++) { + struct page *page = pvec.pages[i]; + + if (page->index > end) { + done = 1; + break; + } + + done_index = page->index; + + lock_page(page); + + if (unlikely(page->mapping != mapping)) { +continue_unlock: + unlock_page(page); + continue; + } + + if (!PageDirty(page)) { + /* someone wrote it for us */ + goto continue_unlock; + } + + if (PageWriteback(page)) { + if (wbc->sync_mode != WB_SYNC_NONE) + f2fs_wait_on_page_writeback(page, + DATA, true); + else + goto continue_unlock; + } + + BUG_ON(PageWriteback(page)); + if (!clear_page_dirty_for_io(page)) + goto continue_unlock; + + ret = mapping->a_ops->writepage(page, wbc); + if (unlikely(ret)) { + done_index = page->index + 1; + done = 1; + break; + } + + if (--wbc->nr_to_write <= 0 && + wbc->sync_mode == WB_SYNC_NONE) { + done = 1; + break; + } + } + pagevec_release(&pvec); + cond_resched(); + } + + if (!cycled && !done) { + cycled = 1; + index = 0; + end = writeback_index - 1; + goto retry; + } + if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) + mapping->writeback_index = done_index; + return ret; } @@ -1513,48 +1438,135 @@ static int f2fs_write_data_pages(struct address_space *mapping, { struct inode *inode = mapping->host; struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct blk_plug plug; int ret; - long diff; - - trace_f2fs_writepages(mapping->host, wbc, DATA); /* deal with chardevs and other special file */ if (!mapping->a_ops->writepage) return 0; + /* skip writing if there is no dirty page in this inode */ + if (!get_dirty_pages(inode) && wbc->sync_mode == WB_SYNC_NONE) + return 0; + if (S_ISDIR(inode->i_mode) && wbc->sync_mode ==
WB_SYNC_NONE && get_dirty_pages(inode) < nr_pages_to_skip(sbi, DATA) && available_free_memory(sbi, DIRTY_DENTS)) goto skip_write; + /* skip writing during file defragment */ + if (is_inode_flag_set(inode, FI_DO_DEFRAG)) + goto skip_write; + /* during POR, we don't need to trigger writepage at all. */ if (unlikely(is_sbi_flag_set(sbi, SBI_POR_DOING))) goto skip_write; - diff = nr_pages_to_write(sbi, DATA, wbc); - - ret = write_cache_pages(mapping, wbc, __f2fs_writepage, mapping); + trace_f2fs_writepages(mapping->host, wbc, DATA); + blk_start_plug(&plug); + ret = f2fs_write_cache_pages(mapping, wbc); + blk_finish_plug(&plug); + /* + * if some pages were truncated, we cannot guarantee its mapping->host + * to detect pending bios. + */ f2fs_submit_merged_bio(sbi, DATA, WRITE); - remove_dirty_dir_inode(inode); - - wbc->nr_to_write = max((long)0, wbc->nr_to_write - diff); + remove_dirty_inode(inode); return ret; skip_write: wbc->pages_skipped += get_dirty_pages(inode); + trace_f2fs_writepages(mapping->host, wbc, DATA); return 0; } static void f2fs_write_failed(struct address_space *mapping, loff_t to) { struct inode *inode = mapping->host; + loff_t i_size = i_size_read(inode); + + if (to > i_size) { + truncate_pagecache(inode, 0, i_size); + truncate_blocks(inode, i_size, true); + } +} + +static int prepare_write_begin(struct f2fs_sb_info *sbi, + struct page *page, loff_t pos, unsigned len, + block_t *blk_addr, bool *node_changed) +{ + struct inode *inode = page->mapping->host; + pgoff_t index = page->index; + struct dnode_of_data dn; + struct page *ipage; + bool locked = false; + struct extent_info ei; + int err = 0; + + /* + * we already allocated all the blocks, so we don't need to get + * the block addresses when there is no need to fill the page. + */ + if (!f2fs_has_inline_data(inode) && !f2fs_encrypted_inode(inode) && + len == PAGE_SIZE) + return 0; + + if (f2fs_has_inline_data(inode) || + (pos & PAGE_MASK) >= i_size_read(inode)) { + f2fs_lock_op(sbi); + locked = true; + } +restart: + /* check inline_data */ + ipage = get_node_page(sbi, inode->i_ino); + if (IS_ERR(ipage)) { + err = PTR_ERR(ipage); + goto unlock_out; + } - if (to > inode->i_size) { - truncate_pagecache(inode, 0, inode->i_size); - truncate_blocks(inode, inode->i_size, true); + set_new_dnode(&dn, inode, ipage, ipage, 0); + + if (f2fs_has_inline_data(inode)) { + if (pos + len <= MAX_INLINE_DATA) { + read_inline_data(page, ipage); + set_inode_flag(inode, FI_DATA_EXIST); + if (inode->i_nlink) + set_inline_node(ipage); + } else { + err = f2fs_convert_inline_page(&dn, page); + if (err) + goto out; + if (dn.data_blkaddr == NULL_ADDR) + err = f2fs_get_block(&dn, index); + } + } else if (locked) { + err = f2fs_get_block(&dn, index); + } else { + if (f2fs_lookup_extent_cache(inode, index, &ei)) { + dn.data_blkaddr = ei.blk + index - ei.fofs; + } else { + /* hole case */ + err = get_dnode_of_data(&dn, index, LOOKUP_NODE); + if (err || dn.data_blkaddr == NULL_ADDR) { + f2fs_put_dnode(&dn); + f2fs_lock_op(sbi); + locked = true; + goto restart; + } + } } + + /* convert_inline_page can make node_changed */ + *blk_addr = dn.data_blkaddr; + *node_changed = dn.node_changed; +out: + f2fs_put_dnode(&dn); +unlock_out: + if (locked) + f2fs_unlock_op(sbi); + return err; } static int f2fs_write_begin(struct file *file, struct address_space *mapping, @@ -1563,15 +1575,14 @@ static int f2fs_write_begin(struct file *file, struct address_space *mapping, { struct inode *inode = mapping->host; struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - struct page 
*page, *ipage; - pgoff_t index = ((unsigned long long) pos) >> PAGE_CACHE_SHIFT; - struct dnode_of_data dn; + struct page *page = NULL; + pgoff_t index = ((unsigned long long) pos) >> PAGE_SHIFT; + bool need_balance = false; + block_t blkaddr = NULL_ADDR; int err = 0; trace_f2fs_write_begin(inode, pos, len, flags); - f2fs_balance_fs(sbi); - /* * We should check this at this moment to avoid deadlock on inode page * and #0 page. The locking rule for inline_data conversion should be: @@ -1591,83 +1602,80 @@ static int f2fs_write_begin(struct file *file, struct address_space *mapping, *pagep = page; - f2fs_lock_op(sbi); - - /* check inline_data */ - ipage = get_node_page(sbi, inode->i_ino); - if (IS_ERR(ipage)) { - err = PTR_ERR(ipage); - goto unlock_fail; - } - - set_new_dnode(&dn, inode, ipage, ipage, 0); + err = prepare_write_begin(sbi, page, pos, len, + &blkaddr, &need_balance); + if (err) + goto fail; - if (f2fs_has_inline_data(inode)) { - if (pos + len <= MAX_INLINE_DATA) { - read_inline_data(page, ipage); - set_inode_flag(F2FS_I(inode), FI_DATA_EXIST); - sync_inode_page(&dn); - goto put_next; + if (need_balance && has_not_enough_free_secs(sbi, 0)) { + unlock_page(page); + f2fs_balance_fs(sbi, true); + lock_page(page); + if (page->mapping != mapping) { + /* The page got truncated from under us */ + f2fs_put_page(page, 1); + goto repeat; } - err = f2fs_convert_inline_page(&dn, page); - if (err) - goto put_fail; } - err = f2fs_reserve_block(&dn, index); - if (err) - goto put_fail; -put_next: - f2fs_put_dnode(&dn); - f2fs_unlock_op(sbi); - if ((len == PAGE_CACHE_SIZE) || PageUptodate(page)) - return 0; + f2fs_wait_on_page_writeback(page, DATA, false); - f2fs_wait_on_page_writeback(page, DATA); + /* wait for GCed encrypted page writeback */ + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) + f2fs_wait_on_encrypted_page_writeback(sbi, blkaddr); - if ((pos & PAGE_CACHE_MASK) >= i_size_read(inode)) { - unsigned start = pos & (PAGE_CACHE_SIZE - 1); + if (len == PAGE_SIZE) + goto out_update; + if (PageUptodate(page)) + goto out_clear; + + if ((pos & PAGE_MASK) >= i_size_read(inode)) { + unsigned start = pos & (PAGE_SIZE - 1); unsigned end = start + len; /* Reading beyond i_size is simple: memset to zero */ - zero_user_segments(page, 0, start, end, PAGE_CACHE_SIZE); - goto out; + zero_user_segments(page, 0, start, end, PAGE_SIZE); + goto out_update; } - if (dn.data_blkaddr == NEW_ADDR) { - zero_user_segment(page, 0, PAGE_CACHE_SIZE); + if (blkaddr == NEW_ADDR) { + zero_user_segment(page, 0, PAGE_SIZE); } else { - struct f2fs_io_info fio = { - .type = DATA, - .rw = READ_SYNC, - .blk_addr = dn.data_blkaddr, - }; - err = f2fs_submit_page_bio(sbi, page, &fio); - if (err) + struct bio *bio; + + bio = f2fs_grab_bio(inode, blkaddr, 1); + if (IS_ERR(bio)) { + err = PTR_ERR(bio); goto fail; + } - lock_page(page); - if (unlikely(!PageUptodate(page))) { - f2fs_put_page(page, 1); - err = -EIO; + if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) { + bio_put(bio); + err = -EFAULT; goto fail; } + + __submit_bio(sbi, READ_SYNC, bio, DATA); + + lock_page(page); if (unlikely(page->mapping != mapping)) { f2fs_put_page(page, 1); goto repeat; } + if (unlikely(!PageUptodate(page))) { + err = -EIO; + goto fail; + } } -out: - SetPageUptodate(page); +out_update: + if (!PageUptodate(page)) + SetPageUptodate(page); +out_clear: clear_cold_data(page); return 0; -put_fail: - f2fs_put_dnode(&dn); -unlock_fail: - f2fs_unlock_op(sbi); - f2fs_put_page(page, 1); fail: + f2fs_put_page(page, 1); 
f2fs_write_failed(mapping, pos + len); return err; } @@ -1682,65 +1690,85 @@ static int f2fs_write_end(struct file *file, trace_f2fs_write_end(inode, pos, len, copied); set_page_dirty(page); + f2fs_put_page(page, 1); - if (pos + copied > i_size_read(inode)) { - i_size_write(inode, pos + copied); - mark_inode_dirty(inode); - update_inode_page(inode); - } + if (pos + copied > i_size_read(inode)) + f2fs_i_size_write(inode, pos + copied); - f2fs_put_page(page, 1); + f2fs_update_time(F2FS_I_SB(inode), REQ_TIME); return copied; } -static int check_direct_IO(struct inode *inode, int rw, +static ssize_t check_direct_IO(struct inode *inode, int rw, const struct iovec *iov, loff_t offset, unsigned long nr_segs) { unsigned blocksize_mask = inode->i_sb->s_blocksize - 1; - int i; - - if (rw == READ) - return 0; + int seg, i; + size_t size; + unsigned long addr; + ssize_t retval = -EINVAL; + loff_t end = offset; if (offset & blocksize_mask) return -EINVAL; - for (i = 0; i < nr_segs; i++) - if (iov[i].iov_len & blocksize_mask) - return -EINVAL; + /* Check the memory alignment. Blocks cannot straddle pages */ + for (seg = 0; seg < nr_segs; seg++) { + addr = (unsigned long)iov[seg].iov_base; + size = iov[seg].iov_len; + end += size; + if ((addr & blocksize_mask) || (size & blocksize_mask)) + goto out; - return 0; + /* If this is a write we don't need to check anymore */ + if (rw & WRITE) + continue; + + /* + * Check to make sure we don't have duplicate iov_base's in this + * iovec, if so return EINVAL, otherwise we'll get csum errors + * when reading back. + */ + for (i = seg + 1; i < nr_segs; i++) { + if (iov[seg].iov_base == iov[i].iov_base) + goto out; + } + } + retval = 0; +out: + return retval; } static ssize_t f2fs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) { - struct file *file = iocb->ki_filp; - struct address_space *mapping = file->f_mapping; + struct address_space *mapping = iocb->ki_filp->f_mapping; struct inode *inode = mapping->host; size_t count = iov_length(iov, nr_segs); int err; - /* we don't need to use inline_data strictly */ - if (f2fs_has_inline_data(inode)) { - err = f2fs_convert_inline_inode(inode); - if (err) - return err; - } + err = check_direct_IO(inode, rw, iov, offset, nr_segs); + if (err) + return err; - if (check_direct_IO(inode, rw, iov, offset, nr_segs)) + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) + return 0; + if (test_opt(F2FS_I_SB(inode), LFS)) return 0; trace_f2fs_direct_IO_enter(inode, offset, count, rw); - if (rw & WRITE) - __allocate_data_blocks(inode, offset, count); + down_read(&F2FS_I(inode)->dio_rwsem[rw]); err = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, - get_data_block); - if (err < 0 && (rw & WRITE)) - f2fs_write_failed(mapping, offset + count); + get_data_block_dio); + up_read(&F2FS_I(inode)->dio_rwsem[rw]); + if (rw & WRITE) { + if (err > 0) + set_inode_flag(inode, FI_UPDATE_WRITE); + else if (err < 0) + f2fs_write_failed(mapping, offset + count); + } trace_f2fs_direct_IO_exit(inode, offset, count, rw, err); @@ -1752,7 +1780,7 @@ void f2fs_invalidate_page(struct page *page, unsigned long offset) struct inode *inode = page->mapping->host; struct f2fs_sb_info *sbi = F2FS_I_SB(inode); - if (inode->i_ino >= F2FS_ROOT_INO(sbi) && (offset % PAGE_CACHE_SIZE)) + if (inode->i_ino >= F2FS_ROOT_INO(sbi) && (offset % PAGE_SIZE)) return; if (PageDirty(page)) { @@ -1763,6 +1791,12 @@ void f2fs_invalidate_page(struct page *page, unsigned long offset) else
inode_dec_dirty_pages(inode); } + + /* This is atomic written page, keep Private */ + if (IS_ATOMIC_WRITTEN_PAGE(page)) + return; + + set_page_private(page, 0); ClearPagePrivate(page); } @@ -1772,10 +1806,42 @@ int f2fs_release_page(struct page *page, gfp_t wait) if (PageDirty(page)) return 0; + /* This is atomic written page, keep Private */ + if (IS_ATOMIC_WRITTEN_PAGE(page)) + return 0; + + set_page_private(page, 0); ClearPagePrivate(page); return 1; } +/* + * This was copied from __set_page_dirty_buffers which gives higher performance + * in very high speed storages. (e.g., pmem) + */ +void f2fs_set_page_dirty_nobuffers(struct page *page) +{ + struct address_space *mapping = page->mapping; + unsigned long flags; + + if (unlikely(!mapping)) + return; + + spin_lock(&mapping->private_lock); + SetPageDirty(page); + spin_unlock(&mapping->private_lock); + + spin_lock_irqsave(&mapping->tree_lock, flags); + WARN_ON_ONCE(!PageUptodate(page)); + account_page_dirtied(page, mapping); + radix_tree_tag_set(&mapping->page_tree, + page_index(page), PAGECACHE_TAG_DIRTY); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + return; +} + static int f2fs_set_data_page_dirty(struct page *page) { struct address_space *mapping = page->mapping; @@ -1783,17 +1849,23 @@ static int f2fs_set_data_page_dirty(struct page *page) trace_f2fs_set_page_dirty(page, DATA); - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); if (f2fs_is_atomic_file(inode)) { - register_inmem_page(inode, page); - return 1; + if (!IS_ATOMIC_WRITTEN_PAGE(page)) { + register_inmem_page(inode, page); + return 1; + } + /* + * Previously, this page has been registered, we just + * return here. + */ + return 0; } - mark_inode_dirty(inode); - if (!PageDirty(page)) { - __set_page_dirty_nobuffers(page); + f2fs_set_page_dirty_nobuffers(page); update_dirty_page(inode, page); return 1; } @@ -1804,44 +1876,14 @@ static sector_t f2fs_bmap(struct address_space *mapping, sector_t block) { struct inode *inode = mapping->host; - /* we don't need to use inline_data strictly */ - if (f2fs_has_inline_data(inode)) { - int err = f2fs_convert_inline_inode(inode); - if (err) - return err; - } - return generic_block_bmap(mapping, block, get_data_block); -} - -void init_extent_cache_info(struct f2fs_sb_info *sbi) -{ - INIT_RADIX_TREE(&sbi->extent_tree_root, GFP_NOIO); - init_rwsem(&sbi->extent_tree_lock); - INIT_LIST_HEAD(&sbi->extent_list); - spin_lock_init(&sbi->extent_lock); - sbi->total_ext_tree = 0; - atomic_set(&sbi->total_ext_node, 0); -} + if (f2fs_has_inline_data(inode)) + return 0; -int __init create_extent_cache(void) -{ - extent_tree_slab = f2fs_kmem_cache_create("f2fs_extent_tree", - sizeof(struct extent_tree)); - if (!extent_tree_slab) - return -ENOMEM; - extent_node_slab = f2fs_kmem_cache_create("f2fs_extent_node", - sizeof(struct extent_node)); - if (!extent_node_slab) { - kmem_cache_destroy(extent_tree_slab); - return -ENOMEM; - } - return 0; -} + /* make sure allocating whole blocks */ + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) + filemap_write_and_wait(mapping); -void destroy_extent_cache(void) -{ - kmem_cache_destroy(extent_node_slab); - kmem_cache_destroy(extent_tree_slab); + return generic_block_bmap(mapping, block, get_data_block_bmap); } const struct address_space_operations f2fs_dblock_aops = { diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c old mode 100644 new mode 100755 index f5388f37217e..9197c8d87634 --- a/fs/f2fs/debug.c +++ b/fs/f2fs/debug.c @@ 
-33,24 +33,33 @@ static void update_general_status(struct f2fs_sb_info *sbi) int i; /* validation check of the segment numbers */ - si->hit_ext = sbi->read_hit_ext; - si->total_ext = sbi->total_hit_ext; - si->ext_tree = sbi->total_ext_tree; + si->hit_largest = atomic64_read(&sbi->read_hit_largest); + si->hit_cached = atomic64_read(&sbi->read_hit_cached); + si->hit_rbtree = atomic64_read(&sbi->read_hit_rbtree); + si->hit_total = si->hit_largest + si->hit_cached + si->hit_rbtree; + si->total_ext = atomic64_read(&sbi->total_hit_ext); + si->ext_tree = atomic_read(&sbi->total_ext_tree); + si->zombie_tree = atomic_read(&sbi->total_zombie_tree); si->ext_node = atomic_read(&sbi->total_ext_node); si->ndirty_node = get_pages(sbi, F2FS_DIRTY_NODES); si->ndirty_dent = get_pages(sbi, F2FS_DIRTY_DENTS); - si->ndirty_dirs = sbi->n_dirty_dirs; si->ndirty_meta = get_pages(sbi, F2FS_DIRTY_META); + si->ndirty_data = get_pages(sbi, F2FS_DIRTY_DATA); + si->ndirty_dirs = sbi->ndirty_inode[DIR_INODE]; + si->ndirty_files = sbi->ndirty_inode[FILE_INODE]; + si->ndirty_all = sbi->ndirty_inode[DIRTY_META]; si->inmem_pages = get_pages(sbi, F2FS_INMEM_PAGES); - si->wb_pages = get_pages(sbi, F2FS_WRITEBACK); + si->wb_bios = atomic_read(&sbi->nr_wb_bios); si->total_count = (int)sbi->user_block_count / sbi->blocks_per_seg; si->rsvd_segs = reserved_segments(sbi); si->overp_segs = overprovision_segments(sbi); si->valid_count = valid_user_blocks(sbi); si->valid_node_count = valid_node_count(sbi); si->valid_inode_count = valid_inode_count(sbi); + si->inline_xattr = atomic_read(&sbi->inline_xattr); si->inline_inode = atomic_read(&sbi->inline_inode); si->inline_dir = atomic_read(&sbi->inline_dir); + si->orphans = sbi->im[ORPHAN_INO].ino_num; si->utilization = utilization(sbi); si->free_segs = free_segments(sbi); @@ -94,13 +103,14 @@ static void update_general_status(struct f2fs_sb_info *sbi) static void update_sit_info(struct f2fs_sb_info *sbi) { struct f2fs_stat_info *si = F2FS_STAT(sbi); - unsigned int blks_per_sec, hblks_per_sec, total_vblocks, bimodal, dist; + unsigned long long blks_per_sec, hblks_per_sec, total_vblocks; + unsigned long long bimodal, dist; unsigned int segno, vblocks; int ndirty = 0; bimodal = 0; total_vblocks = 0; - blks_per_sec = sbi->segs_per_sec * (1 << sbi->log_blocks_per_seg); + blks_per_sec = sbi->segs_per_sec * sbi->blocks_per_seg; hblks_per_sec = blks_per_sec / 2; for (segno = 0; segno < MAIN_SEGS(sbi); segno += sbi->segs_per_sec) { vblocks = get_valid_blocks(sbi, segno, sbi->segs_per_sec); @@ -112,10 +122,10 @@ static void update_sit_info(struct f2fs_sb_info *sbi) ndirty++; } } - dist = MAIN_SECS(sbi) * hblks_per_sec * hblks_per_sec / 100; - si->bimodal = bimodal / dist; + dist = div_u64(MAIN_SECS(sbi) * hblks_per_sec * hblks_per_sec, 100); + si->bimodal = div64_u64(bimodal, dist); if (si->dirty_count) - si->avg_vblocks = total_vblocks / ndirty; + si->avg_vblocks = div_u64(total_vblocks, ndirty); else si->avg_vblocks = 0; } @@ -135,6 +145,7 @@ static void update_mem_info(struct f2fs_sb_info *sbi) si->base_mem = sizeof(struct f2fs_sb_info) + sbi->sb->s_blocksize; si->base_mem += 2 * sizeof(struct f2fs_inode_info); si->base_mem += sizeof(*sbi->ckpt); + si->base_mem += sizeof(struct percpu_counter) * NR_COUNT_TYPE; /* build sm */ si->base_mem += sizeof(struct f2fs_sm_info); @@ -143,7 +154,7 @@ static void update_mem_info(struct f2fs_sb_info *sbi) si->base_mem += sizeof(struct sit_info); si->base_mem += MAIN_SEGS(sbi) * sizeof(struct seg_entry); si->base_mem += f2fs_bitmap_size(MAIN_SEGS(sbi)); - 
si->base_mem += 2 * SIT_VBLOCK_MAP_SIZE * MAIN_SEGS(sbi); + si->base_mem += 3 * SIT_VBLOCK_MAP_SIZE * MAIN_SEGS(sbi); si->base_mem += SIT_VBLOCK_MAP_SIZE; if (sbi->segs_per_sec > 1) si->base_mem += MAIN_SECS(sbi) * sizeof(struct sec_entry); @@ -156,7 +167,7 @@ static void update_mem_info(struct f2fs_sb_info *sbi) /* build curseg */ si->base_mem += sizeof(struct curseg_info) * NR_CURSEG_TYPE; - si->base_mem += PAGE_CACHE_SIZE * NR_CURSEG_TYPE; + si->base_mem += PAGE_SIZE * NR_CURSEG_TYPE; /* build dirty segmap */ si->base_mem += sizeof(struct dirty_seglist_info); @@ -184,18 +195,18 @@ static void update_mem_info(struct f2fs_sb_info *sbi) si->cache_mem += NM_I(sbi)->dirty_nat_cnt * sizeof(struct nat_entry_set); si->cache_mem += si->inmem_pages * sizeof(struct inmem_pages); - si->cache_mem += sbi->n_dirty_dirs * sizeof(struct inode_entry); - for (i = 0; i <= UPDATE_INO; i++) + for (i = 0; i <= ORPHAN_INO; i++) si->cache_mem += sbi->im[i].ino_num * sizeof(struct ino_entry); - si->cache_mem += sbi->total_ext_tree * sizeof(struct extent_tree); + si->cache_mem += atomic_read(&sbi->total_ext_tree) * + sizeof(struct extent_tree); si->cache_mem += atomic_read(&sbi->total_ext_node) * sizeof(struct extent_node); si->page_mem = 0; npages = NODE_MAPPING(sbi)->nrpages; - si->page_mem += npages << PAGE_CACHE_SHIFT; + si->page_mem += (unsigned long long)npages << PAGE_SHIFT; npages = META_MAPPING(sbi)->nrpages; - si->page_mem += npages << PAGE_CACHE_SHIFT; + si->page_mem += (unsigned long long)npages << PAGE_SHIFT; } static int stat_show(struct seq_file *s, void *v) @@ -210,8 +221,9 @@ static int stat_show(struct seq_file *s, void *v) update_general_status(si->sbi); - seq_printf(s, "\n=====[ partition info(%s). #%d ]=====\n", - bdevname(si->sbi->sb->s_bdev, devname), i++); + seq_printf(s, "\n=====[ partition info(%s). #%d, %s]=====\n", + bdevname(si->sbi->sb->s_bdev, devname), i++, + f2fs_readonly(si->sbi->sb) ? 
"RO": "RW"); seq_printf(s, "[SB: 1] [CP: 2] [SIT: %d] [NAT: %d] ", si->sit_area_segs, si->nat_area_segs); seq_printf(s, "[SSA: %d] [MAIN: %d", @@ -225,10 +237,14 @@ static int stat_show(struct seq_file *s, void *v) seq_printf(s, "Other: %u)\n - Data: %u\n", si->valid_node_count - si->valid_inode_count, si->valid_count - si->valid_node_count); + seq_printf(s, " - Inline_xattr Inode: %u\n", + si->inline_xattr); seq_printf(s, " - Inline_data Inode: %u\n", si->inline_inode); seq_printf(s, " - Inline_dentry Inode: %u\n", si->inline_dir); + seq_printf(s, " - Orphan Inode: %u\n", + si->orphans); seq_printf(s, "\nMain area: %d segs, %d secs %d zones\n", si->main_area_segs, si->main_area_sections, si->main_area_zones); @@ -262,7 +278,8 @@ static int stat_show(struct seq_file *s, void *v) si->dirty_count); seq_printf(s, " - Prefree: %d\n - Free: %d (%d)\n\n", si->prefree_count, si->free_segs, si->free_secs); - seq_printf(s, "CP calls: %d\n", si->cp_count); + seq_printf(s, "CP calls: %d (BG: %d)\n", + si->cp_count, si->bg_cp_count); seq_printf(s, "GC calls: %d (BG: %d)\n", si->call_count, si->bg_gc); seq_printf(s, " - data segments : %d (%d)\n", @@ -275,18 +292,26 @@ static int stat_show(struct seq_file *s, void *v) si->bg_data_blks); seq_printf(s, " - node blocks : %d (%d)\n", si->node_blks, si->bg_node_blks); - seq_printf(s, "\nExtent Hit Ratio: %d / %d\n", - si->hit_ext, si->total_ext); - seq_printf(s, "\nExtent Tree Count: %d\n", si->ext_tree); - seq_printf(s, "\nExtent Node Count: %d\n", si->ext_node); + seq_puts(s, "\nExtent Cache:\n"); + seq_printf(s, " - Hit Count: L1-1:%llu L1-2:%llu L2:%llu\n", + si->hit_largest, si->hit_cached, + si->hit_rbtree); + seq_printf(s, " - Hit Ratio: %llu%% (%llu / %llu)\n", + !si->total_ext ? 0 : + div64_u64(si->hit_total * 100, si->total_ext), + si->hit_total, si->total_ext); + seq_printf(s, " - Inner Struct Count: tree: %d(%d), node: %d\n", + si->ext_tree, si->zombie_tree, si->ext_node); seq_puts(s, "\nBalancing F2FS Async:\n"); - seq_printf(s, " - inmem: %4d, wb: %4d\n", - si->inmem_pages, si->wb_pages); - seq_printf(s, " - nodes: %4d in %4d\n", + seq_printf(s, " - inmem: %4lld, wb_bios: %4d\n", + si->inmem_pages, si->wb_bios); + seq_printf(s, " - nodes: %4lld in %4d\n", si->ndirty_node, si->node_pages); - seq_printf(s, " - dents: %4d in dirs:%4d\n", - si->ndirty_dent, si->ndirty_dirs); - seq_printf(s, " - meta: %4d in %4d\n", + seq_printf(s, " - dents: %4lld in dirs:%4d (%4d)\n", + si->ndirty_dent, si->ndirty_dirs, si->ndirty_all); + seq_printf(s, " - datas: %4lld in files:%4d\n", + si->ndirty_data, si->ndirty_files); + seq_printf(s, " - meta: %4lld in %4d\n", si->ndirty_meta, si->meta_pages); seq_printf(s, " - NATs: %9d/%9d\n - SITs: %9d/%9d\n", si->dirty_nats, si->nats, si->dirty_sits, si->sits); @@ -320,13 +345,13 @@ static int stat_show(struct seq_file *s, void *v) /* memory footprint */ update_mem_info(si->sbi); - seq_printf(s, "\nMemory: %u KB\n", + seq_printf(s, "\nMemory: %llu KB\n", (si->base_mem + si->cache_mem + si->page_mem) >> 10); - seq_printf(s, " - static: %u KB\n", + seq_printf(s, " - static: %llu KB\n", si->base_mem >> 10); - seq_printf(s, " - cached: %u KB\n", + seq_printf(s, " - cached: %llu KB\n", si->cache_mem >> 10); - seq_printf(s, " - paged : %u KB\n", + seq_printf(s, " - paged : %llu KB\n", si->page_mem >> 10); } mutex_unlock(&f2fs_stat_mutex); @@ -365,6 +390,12 @@ int f2fs_build_stats(struct f2fs_sb_info *sbi) si->sbi = sbi; sbi->stat_info = si; + atomic64_set(&sbi->total_hit_ext, 0); + atomic64_set(&sbi->read_hit_rbtree, 0); + 
atomic64_set(&sbi->read_hit_largest, 0); + atomic64_set(&sbi->read_hit_cached, 0); + + atomic_set(&sbi->inline_xattr, 0); atomic_set(&sbi->inline_inode, 0); atomic_set(&sbi->inline_dir, 0); atomic_set(&sbi->inplace_count, 0); @@ -387,20 +418,23 @@ void f2fs_destroy_stats(struct f2fs_sb_info *sbi) kfree(si); } -void __init f2fs_create_root_stats(void) +int __init f2fs_create_root_stats(void) { struct dentry *file; f2fs_debugfs_root = debugfs_create_dir("f2fs", NULL); if (!f2fs_debugfs_root) - return; + return -ENOMEM; file = debugfs_create_file("status", S_IRUGO, f2fs_debugfs_root, NULL, &stat_fops); if (!file) { debugfs_remove(f2fs_debugfs_root); f2fs_debugfs_root = NULL; + return -ENOMEM; } + + return 0; } void f2fs_destroy_root_stats(void) diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c old mode 100644 new mode 100755 index 5575c90fe881..1b57e143b5a9 --- a/fs/f2fs/dir.c +++ b/fs/f2fs/dir.c @@ -17,8 +17,8 @@ static unsigned long dir_blocks(struct inode *inode) { - return ((unsigned long long) (i_size_read(inode) + PAGE_CACHE_SIZE - 1)) - >> PAGE_CACHE_SHIFT; + return ((unsigned long long) (i_size_read(inode) + PAGE_SIZE - 1)) + >> PAGE_SHIFT; } static unsigned int dir_buckets(unsigned int level, int dir_level) @@ -48,7 +48,6 @@ unsigned char f2fs_filetype_table[F2FS_FT_MAX] = { [F2FS_FT_SYMLINK] = DT_LNK, }; -#define S_SHIFT 12 static unsigned char f2fs_type_by_mode[S_IFMT >> S_SHIFT] = { [S_IFREG >> S_SHIFT] = F2FS_FT_REG_FILE, [S_IFDIR >> S_SHIFT] = F2FS_FT_DIR, @@ -64,6 +63,13 @@ void set_de_type(struct f2fs_dir_entry *de, umode_t mode) de->file_type = f2fs_type_by_mode[(mode & S_IFMT) >> S_SHIFT]; } +unsigned char get_de_type(struct f2fs_dir_entry *de) +{ + if (de->file_type < F2FS_FT_MAX) + return f2fs_filetype_table[de->file_type]; + return DT_UNKNOWN; +} + static unsigned long dir_block_index(unsigned int level, int dir_level, unsigned int idx) { @@ -76,20 +82,10 @@ static unsigned long dir_block_index(unsigned int level, return bidx; } -static bool early_match_name(size_t namelen, f2fs_hash_t namehash, - struct f2fs_dir_entry *de) -{ - if (le16_to_cpu(de->name_len) != namelen) - return false; - - if (de->hash_code != namehash) - return false; - - return true; -} - static struct f2fs_dir_entry *find_in_block(struct page *dentry_page, - struct qstr *name, int *max_slots, + struct fscrypt_name *fname, + f2fs_hash_t namehash, + int *max_slots, struct page **res_page) { struct f2fs_dentry_block *dentry_blk; @@ -98,29 +94,25 @@ static struct f2fs_dir_entry *find_in_block(struct page *dentry_page, dentry_blk = (struct f2fs_dentry_block *)kmap(dentry_page); - make_dentry_ptr(&d, (void *)dentry_blk, 1); - de = find_target_dentry(name, max_slots, &d); - + make_dentry_ptr(NULL, &d, (void *)dentry_blk, 1); + de = find_target_dentry(fname, namehash, max_slots, &d); if (de) *res_page = dentry_page; else kunmap(dentry_page); - /* - * For the most part, it should be a bug when name_len is zero. - * We stop here for figuring out where the bugs has occurred. 
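One detail that makes find_target_dentry() easier to read: a directory entry occupies a variable number of 8-byte name slots, so the bitmap walk advances by GET_DENTRY_SLOTS(name_len) rather than by one. The slot arithmetic, assuming the on-disk F2FS_SLOT_LEN of 8 bytes:

#include <stdio.h>
#include <string.h>

#define F2FS_SLOT_LEN	8	/* bytes of name stored per dentry slot */

/* Same rounding as GET_DENTRY_SLOTS(): slots needed for a name. */
static unsigned get_dentry_slots(unsigned name_len)
{
	return (name_len + F2FS_SLOT_LEN - 1) / F2FS_SLOT_LEN;
}

int main(void)
{
	const char *names[] = { "a", "file.txt", "a_rather_long_file_name" };
	unsigned i, bit_pos = 0;

	for (i = 0; i < 3; i++) {
		unsigned len = (unsigned)strlen(names[i]);
		unsigned slots = get_dentry_slots(len);

		printf("'%s' (%u bytes) -> %u slot(s) at bit_pos %u\n",
		       names[i], len, slots, bit_pos);
		bit_pos += slots;	/* how the bitmap walk advances */
	}
	return 0;
}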
- */ - f2fs_bug_on(F2FS_P_SB(dentry_page), d.max < 0); return de; } -struct f2fs_dir_entry *find_target_dentry(struct qstr *name, int *max_slots, - struct f2fs_dentry_ptr *d) +struct f2fs_dir_entry *find_target_dentry(struct fscrypt_name *fname, + f2fs_hash_t namehash, int *max_slots, + struct f2fs_dentry_ptr *d) { struct f2fs_dir_entry *de; unsigned long bit_pos = 0; - f2fs_hash_t namehash = f2fs_dentry_hash(name); int max_len = 0; + struct fscrypt_str de_name = FSTR_INIT(NULL, 0); + struct fscrypt_str *name = &fname->disk_name; if (max_slots) *max_slots = 0; @@ -132,18 +124,29 @@ struct f2fs_dir_entry *find_target_dentry(struct qstr *name, int *max_slots, } de = &d->dentry[bit_pos]; - if (early_match_name(name->len, namehash, de) && - !memcmp(d->filename[bit_pos], name->name, name->len)) + + if (unlikely(!de->name_len)) { + bit_pos++; + continue; + } + + /* encrypted case */ + de_name.name = d->filename[bit_pos]; + de_name.len = le16_to_cpu(de->name_len); + + /* show encrypted name */ + if (fname->hash) { + if (de->hash_code == fname->hash) + goto found; + } else if (de_name.len == name->len && + de->hash_code == namehash && + !memcmp(de_name.name, name->name, name->len)) goto found; if (max_slots && max_len > *max_slots) *max_slots = max_len; max_len = 0; - /* remain bug on condition */ - if (unlikely(!de->name_len)) - d->max = -1; - bit_pos += GET_DENTRY_SLOTS(le16_to_cpu(de->name_len)); } @@ -155,18 +158,21 @@ struct f2fs_dir_entry *find_target_dentry(struct qstr *name, int *max_slots, } static struct f2fs_dir_entry *find_in_level(struct inode *dir, - unsigned int level, struct qstr *name, - f2fs_hash_t namehash, struct page **res_page) + unsigned int level, + struct fscrypt_name *fname, + struct page **res_page) { - int s = GET_DENTRY_SLOTS(name->len); + struct qstr name = FSTR_TO_QSTR(&fname->disk_name); + int s = GET_DENTRY_SLOTS(name.len); unsigned int nbucket, nblock; unsigned int bidx, end_block; struct page *dentry_page; struct f2fs_dir_entry *de = NULL; bool room = false; int max_slots; + f2fs_hash_t namehash; - f2fs_bug_on(F2FS_I_SB(dir), level > MAX_DIR_HASH_DEPTH); + namehash = f2fs_dentry_hash(&name); nbucket = dir_buckets(level, F2FS_I(dir)->i_dir_level); nblock = bucket_blocks(level); @@ -177,13 +183,19 @@ static struct f2fs_dir_entry *find_in_level(struct inode *dir, for (; bidx < end_block; bidx++) { /* no need to allocate new dentry pages to all the indices */ - dentry_page = find_data_page(dir, bidx, true); + dentry_page = find_data_page(dir, bidx); if (IS_ERR(dentry_page)) { - room = true; - continue; + if (PTR_ERR(dentry_page) == -ENOENT) { + room = true; + continue; + } else { + *res_page = dentry_page; + break; + } } - de = find_in_block(dentry_page, name, &max_slots, res_page); + de = find_in_block(dentry_page, fname, namehash, &max_slots, + res_page); if (de) break; @@ -211,64 +223,66 @@ struct f2fs_dir_entry *f2fs_find_entry(struct inode *dir, { unsigned long npages = dir_blocks(dir); struct f2fs_dir_entry *de = NULL; - f2fs_hash_t name_hash; unsigned int max_depth; unsigned int level; + struct fscrypt_name fname; + int err; - *res_page = NULL; + err = fscrypt_setup_filename(dir, child, 1, &fname); + if (err) { + *res_page = ERR_PTR(err); + return NULL; + } - if (f2fs_has_inline_dentry(dir)) - return find_in_inline_dir(dir, child, res_page); + if (f2fs_has_inline_dentry(dir)) { + *res_page = NULL; + de = find_in_inline_dir(dir, &fname, res_page); + goto out; + } - if (npages == 0) - return NULL; + if (npages == 0) { + *res_page = NULL; + goto out; + } - 
name_hash = f2fs_dentry_hash(child); max_depth = F2FS_I(dir)->i_current_depth; + if (unlikely(max_depth > MAX_DIR_HASH_DEPTH)) { + f2fs_msg(F2FS_I_SB(dir)->sb, KERN_WARNING, + "Corrupted max_depth of %lu: %u", + dir->i_ino, max_depth); + max_depth = MAX_DIR_HASH_DEPTH; + f2fs_i_depth_write(dir, max_depth); + } for (level = 0; level < max_depth; level++) { - de = find_in_level(dir, level, child, name_hash, res_page); - if (de) + *res_page = NULL; + de = find_in_level(dir, level, &fname, res_page); + if (de || IS_ERR(*res_page)) break; } - if (!de && F2FS_I(dir)->chash != name_hash) { - F2FS_I(dir)->chash = name_hash; - F2FS_I(dir)->clevel = level - 1; - } +out: + fscrypt_free_filename(&fname); return de; } struct f2fs_dir_entry *f2fs_parent_dir(struct inode *dir, struct page **p) { - struct page *page; - struct f2fs_dir_entry *de; - struct f2fs_dentry_block *dentry_blk; - - if (f2fs_has_inline_dentry(dir)) - return f2fs_parent_inline_dir(dir, p); + struct qstr dotdot = QSTR_INIT("..", 2); - page = get_lock_data_page(dir, 0); - if (IS_ERR(page)) - return NULL; - - dentry_blk = kmap(page); - de = &dentry_blk->dentry[1]; - *p = page; - unlock_page(page); - return de; + return f2fs_find_entry(dir, &dotdot, p); } -ino_t f2fs_inode_by_name(struct inode *dir, struct qstr *qstr) +ino_t f2fs_inode_by_name(struct inode *dir, struct qstr *qstr, + struct page **page) { ino_t res = 0; struct f2fs_dir_entry *de; - struct page *page; - de = f2fs_find_entry(dir, qstr, &page); + de = f2fs_find_entry(dir, qstr, page); if (de) { res = le32_to_cpu(de->ino); - f2fs_dentry_kunmap(dir, page); - f2fs_put_page(page, 0); + f2fs_dentry_kunmap(dir, *page); + f2fs_put_page(*page, 0); } return res; @@ -279,14 +293,14 @@ void f2fs_set_link(struct inode *dir, struct f2fs_dir_entry *de, { enum page_type type = f2fs_has_inline_dentry(dir) ? NODE : DATA; lock_page(page); - f2fs_wait_on_page_writeback(page, type); + f2fs_wait_on_page_writeback(page, type, true); de->ino = cpu_to_le32(inode->i_ino); set_de_type(de, inode->i_mode); f2fs_dentry_kunmap(dir, page); set_page_dirty(page); - dir->i_mtime = dir->i_ctime = CURRENT_TIME; - mark_inode_dirty(dir); + dir->i_mtime = dir->i_ctime = CURRENT_TIME; + f2fs_mark_inode_dirty_sync(dir); f2fs_put_page(page, 1); } @@ -294,7 +308,7 @@ static void init_dent_inode(const struct qstr *name, struct page *ipage) { struct f2fs_inode *ri; - f2fs_wait_on_page_writeback(ipage, NODE); + f2fs_wait_on_page_writeback(ipage, NODE, true); /* copy name info. 
to this inode page */ ri = F2FS_INODE(ipage); @@ -303,10 +317,14 @@ static void init_dent_inode(const struct qstr *name, struct page *ipage) set_page_dirty(ipage); } -int update_dent_inode(struct inode *inode, const struct qstr *name) +int update_dent_inode(struct inode *inode, struct inode *to, + const struct qstr *name) { struct page *page; + if (file_enc_name(to)) + return 0; + page = get_node_page(F2FS_I_SB(inode), inode->i_ino); if (IS_ERR(page)) return PTR_ERR(page); @@ -320,24 +338,14 @@ int update_dent_inode(struct inode *inode, const struct qstr *name) void do_make_empty_dir(struct inode *inode, struct inode *parent, struct f2fs_dentry_ptr *d) { - struct f2fs_dir_entry *de; - - de = &d->dentry[0]; - de->name_len = cpu_to_le16(1); - de->hash_code = 0; - de->ino = cpu_to_le32(inode->i_ino); - memcpy(d->filename[0], ".", 1); - set_de_type(de, inode->i_mode); + struct qstr dot = QSTR_INIT(".", 1); + struct qstr dotdot = QSTR_INIT("..", 2); - de = &d->dentry[1]; - de->hash_code = 0; - de->name_len = cpu_to_le16(2); - de->ino = cpu_to_le32(parent->i_ino); - memcpy(d->filename[1], "..", 2); - set_de_type(de, parent->i_mode); + /* update dirent of "." */ + f2fs_update_dentry(inode->i_ino, inode->i_mode, d, &dot, 0, 0); - test_and_set_bit_le(0, (void *)d->bitmap); - test_and_set_bit_le(1, (void *)d->bitmap); + /* update dirent of ".." */ + f2fs_update_dentry(parent->i_ino, parent->i_mode, d, &dotdot, 0, 1); } static int make_empty_dir(struct inode *inode, @@ -356,7 +364,7 @@ static int make_empty_dir(struct inode *inode, dentry_blk = kmap_atomic(dentry_page); - make_dentry_ptr(&d, (void *)dentry_blk, 1); + make_dentry_ptr(NULL, &d, (void *)dentry_blk, 1); do_make_empty_dir(inode, parent, &d); kunmap_atomic(dentry_blk); @@ -372,15 +380,20 @@ struct page *init_inode_metadata(struct inode *inode, struct inode *dir, struct page *page; int err; - if (is_inode_flag_set(F2FS_I(inode), FI_NEW_INODE)) { + if (is_inode_flag_set(inode, FI_NEW_INODE)) { page = new_inode_page(inode); if (IS_ERR(page)) return page; if (S_ISDIR(inode->i_mode)) { + /* in order to handle error case */ + get_page(page); err = make_empty_dir(inode, dir, page); - if (err) - goto error; + if (err) { + lock_page(page); + goto put_error; + } + put_page(page); } err = f2fs_init_acl(inode, dir, page, dpage); @@ -390,6 +403,12 @@ struct page *init_inode_metadata(struct inode *inode, struct inode *dir, err = f2fs_init_security(inode, dir, name, page); if (err) goto put_error; + + if (f2fs_encrypted_inode(dir) && f2fs_may_encrypt(inode)) { + err = fscrypt_inherit_context(dir, inode, page, false); + if (err) + goto put_error; + } } else { page = get_node_page(F2FS_I_SB(dir), inode->i_ino); if (IS_ERR(page)) @@ -405,7 +424,7 @@ struct page *init_inode_metadata(struct inode *inode, struct inode *dir, * This file should be checkpointed during fsync. * We lost i_pino from now on. 
*/ - if (is_inode_flag_set(F2FS_I(inode), FI_INC_LINK)) { + if (is_inode_flag_set(inode, FI_INC_LINK)) { file_lost_pino(inode); /* * If link the tmpfile to alias through linkat path, @@ -413,41 +432,33 @@ struct page *init_inode_metadata(struct inode *inode, struct inode *dir, */ if (inode->i_nlink == 0) remove_orphan_inode(F2FS_I_SB(dir), inode->i_ino); - inc_nlink(inode); + f2fs_i_links_write(inode, true); } return page; put_error: + clear_nlink(inode); + update_inode(inode, page); f2fs_put_page(page, 1); -error: - /* once the failed inode becomes a bad inode, i_mode is S_IFREG */ - truncate_inode_pages(&inode->i_data, 0); - truncate_blocks(inode, 0, false); - remove_dirty_dir_inode(inode); - remove_inode_page(inode); return ERR_PTR(err); } void update_parent_metadata(struct inode *dir, struct inode *inode, unsigned int current_depth) { - if (inode && is_inode_flag_set(F2FS_I(inode), FI_NEW_INODE)) { - if (S_ISDIR(inode->i_mode)) { - inc_nlink(dir); - set_inode_flag(F2FS_I(dir), FI_UPDATE_DIR); - } - clear_inode_flag(F2FS_I(inode), FI_NEW_INODE); + if (inode && is_inode_flag_set(inode, FI_NEW_INODE)) { + if (S_ISDIR(inode->i_mode)) + f2fs_i_links_write(dir, true); + clear_inode_flag(inode, FI_NEW_INODE); } dir->i_mtime = dir->i_ctime = CURRENT_TIME; - mark_inode_dirty(dir); + f2fs_mark_inode_dirty_sync(dir); - if (F2FS_I(dir)->i_current_depth != current_depth) { - F2FS_I(dir)->i_current_depth = current_depth; - set_inode_flag(F2FS_I(dir), FI_UPDATE_DIR); - } + if (F2FS_I(dir)->i_current_depth != current_depth) + f2fs_i_depth_write(dir, current_depth); - if (inode && is_inode_flag_set(F2FS_I(inode), FI_INC_LINK)) - clear_inode_flag(F2FS_I(inode), FI_INC_LINK); + if (inode && is_inode_flag_set(inode, FI_INC_LINK)) + clear_inode_flag(inode, FI_INC_LINK); } int room_for_filename(const void *bitmap, int slots, int max_slots) @@ -484,15 +495,15 @@ void f2fs_update_dentry(nid_t ino, umode_t mode, struct f2fs_dentry_ptr *d, memcpy(d->filename[bit_pos], name->name, name->len); de->ino = cpu_to_le32(ino); set_de_type(de, mode); - for (i = 0; i < slots; i++) + for (i = 0; i < slots; i++) { test_and_set_bit_le(bit_pos + i, (void *)d->bitmap); + /* avoid wrong garbage data for readdir */ + if (i) + (de + i)->name_len = 0; + } } -/* - * Caller should grab and release a rwsem by calling f2fs_lock_op() and - * f2fs_unlock_op(). 
- */ -int __f2fs_add_link(struct inode *dir, const struct qstr *name, +int f2fs_add_regular_entry(struct inode *dir, const struct qstr *new_name, struct inode *inode, nid_t ino, umode_t mode) { unsigned int bit_pos; @@ -501,24 +512,16 @@ int __f2fs_add_link(struct inode *dir, const struct qstr *name, unsigned long bidx, block; f2fs_hash_t dentry_hash; unsigned int nbucket, nblock; - size_t namelen = name->len; struct page *dentry_page = NULL; struct f2fs_dentry_block *dentry_blk = NULL; struct f2fs_dentry_ptr d; - int slots = GET_DENTRY_SLOTS(namelen); struct page *page = NULL; - int err = 0; + int slots, err = 0; - if (f2fs_has_inline_dentry(dir)) { - err = f2fs_add_inline_entry(dir, name, inode, ino, mode); - if (!err || err != -EAGAIN) - return err; - else - err = 0; - } - - dentry_hash = f2fs_dentry_hash(name); level = 0; + slots = GET_DENTRY_SLOTS(new_name->len); + dentry_hash = f2fs_dentry_hash(new_name); + current_depth = F2FS_I(dir)->i_current_depth; if (F2FS_I(dir)->chash == dentry_hash) { level = F2FS_I(dir)->clevel; @@ -526,6 +529,10 @@ int __f2fs_add_link(struct inode *dir, const struct qstr *name, } start: +#ifdef CONFIG_F2FS_FAULT_INJECTION + if (time_to_inject(FAULT_DIR_DEPTH)) + return -ENOSPC; +#endif if (unlikely(current_depth == MAX_DIR_HASH_DEPTH)) return -ENOSPC; @@ -558,26 +565,26 @@ int __f2fs_add_link(struct inode *dir, const struct qstr *name, ++level; goto start; add_dentry: - f2fs_wait_on_page_writeback(dentry_page, DATA); + f2fs_wait_on_page_writeback(dentry_page, DATA, true); if (inode) { down_write(&F2FS_I(inode)->i_sem); - page = init_inode_metadata(inode, dir, name, NULL); + page = init_inode_metadata(inode, dir, new_name, NULL); if (IS_ERR(page)) { err = PTR_ERR(page); goto fail; } + if (f2fs_encrypted_inode(dir)) + file_set_enc_name(inode); } - make_dentry_ptr(&d, (void *)dentry_blk, 1); - f2fs_update_dentry(ino, mode, &d, name, dentry_hash, bit_pos); + make_dentry_ptr(NULL, &d, (void *)dentry_blk, 1); + f2fs_update_dentry(ino, mode, &d, new_name, dentry_hash, bit_pos); set_page_dirty(dentry_page); if (inode) { - /* we don't need to mark_inode_dirty now */ - F2FS_I(inode)->i_pino = dir->i_ino; - update_inode(inode, page); + f2fs_i_pino_write(inode, dir->i_ino); f2fs_put_page(page, 1); } @@ -586,12 +593,38 @@ int __f2fs_add_link(struct inode *dir, const struct qstr *name, if (inode) up_write(&F2FS_I(inode)->i_sem); - if (is_inode_flag_set(F2FS_I(dir), FI_UPDATE_DIR)) { - update_inode_page(dir); - clear_inode_flag(F2FS_I(dir), FI_UPDATE_DIR); - } kunmap(dentry_page); f2fs_put_page(dentry_page, 1); + + return err; +} + +/* + * Caller should grab and release a rwsem by calling f2fs_lock_op() and + * f2fs_unlock_op(). 
+ */ +int __f2fs_add_link(struct inode *dir, const struct qstr *name, + struct inode *inode, nid_t ino, umode_t mode) +{ + struct fscrypt_name fname; + struct qstr new_name; + int err; + + err = fscrypt_setup_filename(dir, name, 0, &fname); + if (err) + return err; + + new_name.name = fname_name(&fname); + new_name.len = fname_len(&fname); + + err = -EAGAIN; + if (f2fs_has_inline_dentry(dir)) + err = f2fs_add_inline_entry(dir, &new_name, inode, ino, mode); + if (err == -EAGAIN) + err = f2fs_add_regular_entry(dir, &new_name, inode, ino, mode); + + fscrypt_free_filename(&fname); + f2fs_update_time(F2FS_I_SB(dir), REQ_TIME); return err; } @@ -606,41 +639,34 @@ int f2fs_do_tmpfile(struct inode *inode, struct inode *dir) err = PTR_ERR(page); goto fail; } - /* we don't need to mark_inode_dirty now */ - update_inode(inode, page); f2fs_put_page(page, 1); - clear_inode_flag(F2FS_I(inode), FI_NEW_INODE); + clear_inode_flag(inode, FI_NEW_INODE); fail: up_write(&F2FS_I(inode)->i_sem); + f2fs_update_time(F2FS_I_SB(inode), REQ_TIME); return err; } -void f2fs_drop_nlink(struct inode *dir, struct inode *inode, struct page *page) +void f2fs_drop_nlink(struct inode *dir, struct inode *inode) { struct f2fs_sb_info *sbi = F2FS_I_SB(dir); down_write(&F2FS_I(inode)->i_sem); - if (S_ISDIR(inode->i_mode)) { - drop_nlink(dir); - if (page) - update_inode(dir, page); - else - update_inode_page(dir); - } + if (S_ISDIR(inode->i_mode)) + f2fs_i_links_write(dir, false); inode->i_ctime = CURRENT_TIME; - drop_nlink(inode); + f2fs_i_links_write(inode, false); if (S_ISDIR(inode->i_mode)) { - drop_nlink(inode); - i_size_write(inode, 0); + f2fs_i_links_write(inode, false); + f2fs_i_size_write(inode, 0); } up_write(&F2FS_I(inode)->i_sem); - update_inode_page(inode); if (inode->i_nlink == 0) - add_orphan_inode(sbi, inode->i_ino); + add_orphan_inode(inode); else release_orphan_inode(sbi); } @@ -657,11 +683,13 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page, int slots = GET_DENTRY_SLOTS(le16_to_cpu(dentry->name_len)); int i; + f2fs_update_time(F2FS_I_SB(dir), REQ_TIME); + if (f2fs_has_inline_dentry(dir)) return f2fs_delete_inline_entry(dentry, page, dir, inode); lock_page(page); - f2fs_wait_on_page_writeback(page, DATA); + f2fs_wait_on_page_writeback(page, DATA, true); dentry_blk = page_address(page); bit_pos = dentry - dentry_blk->dentry; @@ -676,12 +704,13 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page, set_page_dirty(page); dir->i_ctime = dir->i_mtime = CURRENT_TIME; + f2fs_mark_inode_dirty_sync(dir); if (inode) - f2fs_drop_nlink(dir, inode, NULL); + f2fs_drop_nlink(dir, inode); - if (bit_pos == NR_DENTRY_IN_BLOCK) { - truncate_hole(dir, page->index, page->index + 1); + if (bit_pos == NR_DENTRY_IN_BLOCK && + !truncate_hole(dir, page->index, page->index + 1)) { clear_page_dirty_for_io(page); ClearPagePrivate(page); ClearPageUptodate(page); @@ -702,7 +731,7 @@ bool f2fs_empty_dir(struct inode *dir) return f2fs_empty_inline_dir(dir); for (bidx = 0; bidx < nblock; bidx++) { - dentry_page = get_lock_data_page(dir, bidx); + dentry_page = get_lock_data_page(dir, bidx, false); if (IS_ERR(dentry_page)) { if (PTR_ERR(dentry_page) == -ENOENT) continue; @@ -729,12 +758,13 @@ bool f2fs_empty_dir(struct inode *dir) } bool f2fs_fill_dentries(struct file *file, void *dirent, filldir_t filldir, - struct f2fs_dentry_ptr *d, unsigned int n, unsigned int bit_pos) + struct f2fs_dentry_ptr *d, unsigned int n, unsigned int bit_pos, + struct fscrypt_str *fstr) { unsigned int start_bit_pos = 
bit_pos; unsigned char d_type; struct f2fs_dir_entry *de = NULL; - unsigned char *types = f2fs_filetype_table; + struct fscrypt_str de_name = FSTR_INIT(NULL, 0); int over; while (bit_pos < d->max) { @@ -744,11 +774,39 @@ bool f2fs_fill_dentries(struct file *file, void *dirent, filldir_t filldir, break; de = &d->dentry[bit_pos]; - if (types && de->file_type < F2FS_FT_MAX) - d_type = types[de->file_type]; - over = filldir(dirent, d->filename[bit_pos], - le16_to_cpu(de->name_len), + if (de->name_len == 0) { + bit_pos++; + continue; + } + + d_type = get_de_type(de); + + de_name.name = d->filename[bit_pos]; + de_name.len = le16_to_cpu(de->name_len); + + if (f2fs_encrypted_inode(d->inode)) { + int save_len = fstr->len; + int ret; + + de_name.name = f2fs_kmalloc(de_name.len, GFP_NOFS); + if (!de_name.name) + return false; + + memcpy(de_name.name, d->filename[bit_pos], de_name.len); + + ret = fscrypt_fname_disk_to_usr(d->inode, + (u32)de->hash_code, 0, + &de_name, fstr); + kfree(de_name.name); + if (ret < 0) + return true; + + de_name = *fstr; + fstr->len = save_len; + } + + over = filldir(dirent, de_name.name, de_name.len, (n * d->max) + bit_pos, le32_to_cpu(de->ino), d_type); if (over) { @@ -771,10 +829,24 @@ static int f2fs_readdir(struct file *file, void *dirent, filldir_t filldir) struct page *dentry_page = NULL; struct file_ra_state *ra = &file->f_ra; struct f2fs_dentry_ptr d; + struct fscrypt_str fstr = FSTR_INIT(NULL, 0); unsigned int n = 0; + int err = 0; - if (f2fs_has_inline_dentry(inode)) - return f2fs_read_inline_dir(file, dirent, filldir); + if (f2fs_encrypted_inode(inode)) { + err = fscrypt_get_encryption_info(inode); + if (err && err != -ENOKEY) + return err; + + err = fscrypt_fname_alloc_buffer(inode, F2FS_NAME_LEN, &fstr); + if (err < 0) + return err; + } + + if (f2fs_has_inline_dentry(inode)) { + err = f2fs_read_inline_dir(file, dirent, filldir, &fstr); + goto out; + } bit_pos = (pos % NR_DENTRY_IN_BLOCK); n = (pos / NR_DENTRY_IN_BLOCK); @@ -785,29 +857,41 @@ static int f2fs_readdir(struct file *file, void *dirent, filldir_t filldir) min(npages - n, (pgoff_t)MAX_DIR_RA_PAGES)); for (; n < npages; n++) { - dentry_page = get_lock_data_page(inode, n); - if (IS_ERR(dentry_page)) - continue; + dentry_page = get_lock_data_page(inode, n, false); + if (IS_ERR(dentry_page)) { + err = PTR_ERR(dentry_page); + if (err == -ENOENT) + continue; + else + goto out; + } dentry_blk = kmap(dentry_page); - make_dentry_ptr(&d, (void *)dentry_blk, 1); + make_dentry_ptr(inode, &d, (void *)dentry_blk, 1); - if (f2fs_fill_dentries(file, dirent, filldir, &d, n, bit_pos)) - goto stop; + if (f2fs_fill_dentries(file, dirent, filldir, &d, n, + bit_pos, &fstr)) { + kunmap(dentry_page); + f2fs_put_page(dentry_page, 1); + break; + } bit_pos = 0; file->f_pos = (n + 1) * NR_DENTRY_IN_BLOCK; kunmap(dentry_page); f2fs_put_page(dentry_page, 1); - dentry_page = NULL; - } -stop: - if (dentry_page && !IS_ERR(dentry_page)) { - kunmap(dentry_page); - f2fs_put_page(dentry_page, 1); } + err = 0; +out: + fscrypt_fname_free_buffer(&fstr); + return err; +} +static int f2fs_dir_open(struct inode *inode, struct file *filp) +{ + if (f2fs_encrypted_inode(inode)) + return fscrypt_get_encryption_info(inode) ? 
-EACCES : 0;
	return 0;
}

@@ -816,5 +900,9 @@ const struct file_operations f2fs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= f2fs_readdir,
 	.fsync		= f2fs_sync_file,
+	.open		= f2fs_dir_open,
 	.unlocked_ioctl	= f2fs_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= f2fs_compat_ioctl,
+#endif
 };
diff --git a/fs/f2fs/extent_cache.c b/fs/f2fs/extent_cache.c
new file mode 100755
index 000000000000..2b06d4fcd954
--- /dev/null
+++ b/fs/f2fs/extent_cache.c
@@ -0,0 +1,749 @@
+/*
+ * f2fs extent cache support
+ *
+ * Copyright (c) 2015 Motorola Mobility
+ * Copyright (c) 2015 Samsung Electronics
+ * Authors: Jaegeuk Kim
+ *          Chao Yu
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include <trace/events/f2fs.h>
+
+static struct kmem_cache *extent_tree_slab;
+static struct kmem_cache *extent_node_slab;
+
+static struct extent_node *__attach_extent_node(struct f2fs_sb_info *sbi,
+				struct extent_tree *et, struct extent_info *ei,
+				struct rb_node *parent, struct rb_node **p)
+{
+	struct extent_node *en;
+
+	en = kmem_cache_alloc(extent_node_slab, GFP_ATOMIC);
+	if (!en)
+		return NULL;
+
+	en->ei = *ei;
+	INIT_LIST_HEAD(&en->list);
+	en->et = et;
+
+	rb_link_node(&en->rb_node, parent, p);
+	rb_insert_color(&en->rb_node, &et->root);
+	atomic_inc(&et->node_cnt);
+	atomic_inc(&sbi->total_ext_node);
+	return en;
+}
+
+static void __detach_extent_node(struct f2fs_sb_info *sbi,
+				struct extent_tree *et, struct extent_node *en)
+{
+	rb_erase(&en->rb_node, &et->root);
+	atomic_dec(&et->node_cnt);
+	atomic_dec(&sbi->total_ext_node);
+
+	if (et->cached_en == en)
+		et->cached_en = NULL;
+	kmem_cache_free(extent_node_slab, en);
+}
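
Two details in the helpers above are easy to miss: __attach_extent_node() allocates with GFP_ATOMIC because its callers insert while holding et->lock, and __detach_extent_node() resets et->cached_en whenever the node being freed is the tree's cached lookup shortcut, so no stale pointer survives the kmem_cache_free().

+
+/*
+ * Flow to release an extent_node:
+ * 1. list_del_init
+ * 2. __detach_extent_node
+ * 3. kmem_cache_free.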
+ */ +static void __release_extent_node(struct f2fs_sb_info *sbi, + struct extent_tree *et, struct extent_node *en) +{ + spin_lock(&sbi->extent_lock); + f2fs_bug_on(sbi, list_empty(&en->list)); + list_del_init(&en->list); + spin_unlock(&sbi->extent_lock); + + __detach_extent_node(sbi, et, en); +} + +static struct extent_tree *__grab_extent_tree(struct inode *inode) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct extent_tree *et; + nid_t ino = inode->i_ino; + + down_write(&sbi->extent_tree_lock); + et = radix_tree_lookup(&sbi->extent_tree_root, ino); + if (!et) { + et = f2fs_kmem_cache_alloc(extent_tree_slab, GFP_NOFS); + f2fs_radix_tree_insert(&sbi->extent_tree_root, ino, et); + memset(et, 0, sizeof(struct extent_tree)); + et->ino = ino; + et->root = RB_ROOT; + et->cached_en = NULL; + rwlock_init(&et->lock); + INIT_LIST_HEAD(&et->list); + atomic_set(&et->node_cnt, 0); + atomic_inc(&sbi->total_ext_tree); + } else { + atomic_dec(&sbi->total_zombie_tree); + list_del_init(&et->list); + } + up_write(&sbi->extent_tree_lock); + + /* never died until evict_inode */ + F2FS_I(inode)->extent_tree = et; + + return et; +} + +static struct extent_node *__lookup_extent_tree(struct f2fs_sb_info *sbi, + struct extent_tree *et, unsigned int fofs) +{ + struct rb_node *node = et->root.rb_node; + struct extent_node *en = et->cached_en; + + if (en) { + struct extent_info *cei = &en->ei; + + if (cei->fofs <= fofs && cei->fofs + cei->len > fofs) { + stat_inc_cached_node_hit(sbi); + return en; + } + } + + while (node) { + en = rb_entry(node, struct extent_node, rb_node); + + if (fofs < en->ei.fofs) { + node = node->rb_left; + } else if (fofs >= en->ei.fofs + en->ei.len) { + node = node->rb_right; + } else { + stat_inc_rbtree_node_hit(sbi); + return en; + } + } + return NULL; +} + +static struct extent_node *__init_extent_tree(struct f2fs_sb_info *sbi, + struct extent_tree *et, struct extent_info *ei) +{ + struct rb_node **p = &et->root.rb_node; + struct extent_node *en; + + en = __attach_extent_node(sbi, et, ei, NULL, p); + if (!en) + return NULL; + + et->largest = en->ei; + et->cached_en = en; + return en; +} + +static unsigned int __free_extent_tree(struct f2fs_sb_info *sbi, + struct extent_tree *et) +{ + struct rb_node *node, *next; + struct extent_node *en; + unsigned int count = atomic_read(&et->node_cnt); + + node = rb_first(&et->root); + while (node) { + next = rb_next(node); + en = rb_entry(node, struct extent_node, rb_node); + __release_extent_node(sbi, et, en); + node = next; + } + + return count - atomic_read(&et->node_cnt); +} + +static void __drop_largest_extent(struct inode *inode, + pgoff_t fofs, unsigned int len) +{ + struct extent_info *largest = &F2FS_I(inode)->extent_tree->largest; + + if (fofs < largest->fofs + largest->len && fofs + len > largest->fofs) { + largest->len = 0; + f2fs_mark_inode_dirty_sync(inode); + } +} + +/* return true, if inode page is changed */ +bool f2fs_init_extent_tree(struct inode *inode, struct f2fs_extent *i_ext) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct extent_tree *et; + struct extent_node *en; + struct extent_info ei; + + if (!f2fs_may_extent_tree(inode)) { + /* drop largest extent */ + if (i_ext && i_ext->len) { + i_ext->len = 0; + return true; + } + return false; + } + + et = __grab_extent_tree(inode); + + if (!i_ext || !i_ext->len) + return false; + + get_extent_info(&ei, i_ext); + + write_lock(&et->lock); + if (atomic_read(&et->node_cnt)) + goto out; + + en = __init_extent_tree(sbi, et, &ei); + if (en) { + 
spin_lock(&sbi->extent_lock);
+		list_add_tail(&en->list, &sbi->extent_list);
+		spin_unlock(&sbi->extent_lock);
+	}
+out:
+	write_unlock(&et->lock);
+	return false;
+}
+
+static bool f2fs_lookup_extent_tree(struct inode *inode, pgoff_t pgofs,
+							struct extent_info *ei)
+{
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	struct extent_tree *et = F2FS_I(inode)->extent_tree;
+	struct extent_node *en;
+	bool ret = false;
+
+	f2fs_bug_on(sbi, !et);
+
+	trace_f2fs_lookup_extent_tree_start(inode, pgofs);
+
+	read_lock(&et->lock);
+
+	if (et->largest.fofs <= pgofs &&
+			et->largest.fofs + et->largest.len > pgofs) {
+		*ei = et->largest;
+		ret = true;
+		stat_inc_largest_node_hit(sbi);
+		goto out;
+	}
+
+	en = __lookup_extent_tree(sbi, et, pgofs);
+	if (en) {
+		*ei = en->ei;
+		spin_lock(&sbi->extent_lock);
+		if (!list_empty(&en->list)) {
+			list_move_tail(&en->list, &sbi->extent_list);
+			et->cached_en = en;
+		}
+		spin_unlock(&sbi->extent_lock);
+		ret = true;
+	}
+out:
+	stat_inc_total_hit(sbi);
+	read_unlock(&et->lock);
+
+	trace_f2fs_lookup_extent_tree_end(inode, pgofs, ei);
+	return ret;
+}
+
+
+/*
+ * lookup extent at @fofs; if hit, return the extent.
+ * if not, return NULL and set:
+ * @prev_ex: extent before fofs
+ * @next_ex: extent after fofs
+ * @insert_p: insert point for new extent at fofs
+ * in order to simplify the insertion after.
+ * tree must stay unchanged between lookup and insertion.
+ */
+static struct extent_node *__lookup_extent_tree_ret(struct extent_tree *et,
+				unsigned int fofs,
+				struct extent_node **prev_ex,
+				struct extent_node **next_ex,
+				struct rb_node ***insert_p,
+				struct rb_node **insert_parent)
+{
+	struct rb_node **pnode = &et->root.rb_node;
+	struct rb_node *parent = NULL, *tmp_node;
+	struct extent_node *en = et->cached_en;
+
+	*insert_p = NULL;
+	*insert_parent = NULL;
+	*prev_ex = NULL;
+	*next_ex = NULL;
+
+	if (RB_EMPTY_ROOT(&et->root))
+		return NULL;
+
+	if (en) {
+		struct extent_info *cei = &en->ei;
+
+		if (cei->fofs <= fofs && cei->fofs + cei->len > fofs)
+			goto lookup_neighbors;
+	}
+
+	while (*pnode) {
+		parent = *pnode;
+		en = rb_entry(*pnode, struct extent_node, rb_node);
+
+		if (fofs < en->ei.fofs)
+			pnode = &(*pnode)->rb_left;
+		else if (fofs >= en->ei.fofs + en->ei.len)
+			pnode = &(*pnode)->rb_right;
+		else
+			goto lookup_neighbors;
+	}
+
+	*insert_p = pnode;
+	*insert_parent = parent;
+
+	en = rb_entry(parent, struct extent_node, rb_node);
+	tmp_node = parent;
+	if (parent && fofs > en->ei.fofs)
+		tmp_node = rb_next(parent);
+	*next_ex = tmp_node ?
+		rb_entry(tmp_node, struct extent_node, rb_node) : NULL;
+
+	tmp_node = parent;
+	if (parent && fofs < en->ei.fofs)
+		tmp_node = rb_prev(parent);
+	*prev_ex = tmp_node ?
+		rb_entry(tmp_node, struct extent_node, rb_node) : NULL;
+	return NULL;
+
+lookup_neighbors:
+	if (fofs == en->ei.fofs) {
+		/* lookup prev node for merging backward later */
+		tmp_node = rb_prev(&en->rb_node);
+		*prev_ex = tmp_node ?
+			rb_entry(tmp_node, struct extent_node, rb_node) : NULL;
+	}
+	if (fofs == en->ei.fofs + en->ei.len - 1) {
+		/* lookup next node for merging frontward later */
+		tmp_node = rb_next(&en->rb_node);
+		*next_ex = tmp_node ?
+ rb_entry(tmp_node, struct extent_node, rb_node) : NULL; + } + return en; +} + +static struct extent_node *__try_merge_extent_node(struct inode *inode, + struct extent_tree *et, struct extent_info *ei, + struct extent_node *prev_ex, + struct extent_node *next_ex) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct extent_node *en = NULL; + + if (prev_ex && __is_back_mergeable(ei, &prev_ex->ei)) { + prev_ex->ei.len += ei->len; + ei = &prev_ex->ei; + en = prev_ex; + } + + if (next_ex && __is_front_mergeable(ei, &next_ex->ei)) { + if (en) + __release_extent_node(sbi, et, prev_ex); + next_ex->ei.fofs = ei->fofs; + next_ex->ei.blk = ei->blk; + next_ex->ei.len += ei->len; + en = next_ex; + } + + if (!en) + return NULL; + + __try_update_largest_extent(inode, et, en); + + spin_lock(&sbi->extent_lock); + if (!list_empty(&en->list)) { + list_move_tail(&en->list, &sbi->extent_list); + et->cached_en = en; + } + spin_unlock(&sbi->extent_lock); + return en; +} + +static struct extent_node *__insert_extent_tree(struct inode *inode, + struct extent_tree *et, struct extent_info *ei, + struct rb_node **insert_p, + struct rb_node *insert_parent) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct rb_node **p = &et->root.rb_node; + struct rb_node *parent = NULL; + struct extent_node *en = NULL; + + if (insert_p && insert_parent) { + parent = insert_parent; + p = insert_p; + goto do_insert; + } + + while (*p) { + parent = *p; + en = rb_entry(parent, struct extent_node, rb_node); + + if (ei->fofs < en->ei.fofs) + p = &(*p)->rb_left; + else if (ei->fofs >= en->ei.fofs + en->ei.len) + p = &(*p)->rb_right; + else + f2fs_bug_on(sbi, 1); + } +do_insert: + en = __attach_extent_node(sbi, et, ei, parent, p); + if (!en) + return NULL; + + __try_update_largest_extent(inode, et, en); + + /* update in global extent list */ + spin_lock(&sbi->extent_lock); + list_add_tail(&en->list, &sbi->extent_list); + et->cached_en = en; + spin_unlock(&sbi->extent_lock); + return en; +} + +static unsigned int f2fs_update_extent_tree_range(struct inode *inode, + pgoff_t fofs, block_t blkaddr, unsigned int len) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct extent_tree *et = F2FS_I(inode)->extent_tree; + struct extent_node *en = NULL, *en1 = NULL; + struct extent_node *prev_en = NULL, *next_en = NULL; + struct extent_info ei, dei, prev; + struct rb_node **insert_p = NULL, *insert_parent = NULL; + unsigned int end = fofs + len; + unsigned int pos = (unsigned int)fofs; + + if (!et) + return false; + + trace_f2fs_update_extent_tree_range(inode, fofs, blkaddr, len); + + write_lock(&et->lock); + + if (is_inode_flag_set(inode, FI_NO_EXTENT)) { + write_unlock(&et->lock); + return false; + } + + prev = et->largest; + dei.len = 0; + + /* + * drop largest extent before lookup, in case it's already + * been shrunk from extent tree + */ + __drop_largest_extent(inode, fofs, len); + + /* 1. lookup first extent node in range [fofs, fofs + len - 1] */ + en = __lookup_extent_tree_ret(et, fofs, &prev_en, &next_en, + &insert_p, &insert_parent); + if (!en) + en = next_en; + + /* 2. 
invalidate all extent nodes in range [fofs, fofs + len - 1] */
+	while (en && en->ei.fofs < end) {
+		unsigned int org_end;
+		int parts = 0;	/* # of parts current extent split into */
+
+		next_en = en1 = NULL;
+
+		dei = en->ei;
+		org_end = dei.fofs + dei.len;
+		f2fs_bug_on(sbi, pos >= org_end);
+
+		if (pos > dei.fofs && pos - dei.fofs >= F2FS_MIN_EXTENT_LEN) {
+			en->ei.len = pos - en->ei.fofs;
+			prev_en = en;
+			parts = 1;
+		}
+
+		if (end < org_end && org_end - end >= F2FS_MIN_EXTENT_LEN) {
+			if (parts) {
+				set_extent_info(&ei, end,
+						end - dei.fofs + dei.blk,
+						org_end - end);
+				en1 = __insert_extent_tree(inode, et, &ei,
+							NULL, NULL);
+				next_en = en1;
+			} else {
+				en->ei.fofs = end;
+				en->ei.blk += end - dei.fofs;
+				en->ei.len -= end - dei.fofs;
+				next_en = en;
+			}
+			parts++;
+		}
+
+		if (!next_en) {
+			struct rb_node *node = rb_next(&en->rb_node);
+
+			next_en = node ?
+				rb_entry(node, struct extent_node, rb_node)
+				: NULL;
+		}
+
+		if (parts)
+			__try_update_largest_extent(inode, et, en);
+		else
+			__release_extent_node(sbi, et, en);
+
+		/*
+		 * if original extent is split into zero or two parts, extent
+		 * tree has been altered by deletion or insertion, therefore
+		 * invalidate pointers with regard to the tree.
+		 */
+		if (parts != 1) {
+			insert_p = NULL;
+			insert_parent = NULL;
+		}
+		en = next_en;
+	}
+
+	/* 3. update extent in extent cache */
+	if (blkaddr) {
+
+		set_extent_info(&ei, fofs, blkaddr, len);
+		if (!__try_merge_extent_node(inode, et, &ei, prev_en, next_en))
+			__insert_extent_tree(inode, et, &ei,
+						insert_p, insert_parent);
+
+		/* give up extent_cache, if split and small updates happen */
+		if (dei.len >= 1 &&
+				prev.len < F2FS_MIN_EXTENT_LEN &&
+				et->largest.len < F2FS_MIN_EXTENT_LEN) {
+			__drop_largest_extent(inode, 0, UINT_MAX);
+			set_inode_flag(inode, FI_NO_EXTENT);
+		}
+	}
+
+	if (is_inode_flag_set(inode, FI_NO_EXTENT))
+		__free_extent_tree(sbi, et);
+
+	write_unlock(&et->lock);
+
+	return !__is_extent_same(&prev, &et->largest);
+}
+
+unsigned int f2fs_shrink_extent_tree(struct f2fs_sb_info *sbi, int nr_shrink)
+{
+	struct extent_tree *et, *next;
+	struct extent_node *en;
+	unsigned int node_cnt = 0, tree_cnt = 0;
+	int remained;
+
+	if (!test_opt(sbi, EXTENT_CACHE))
+		return 0;
+
+	if (!atomic_read(&sbi->total_zombie_tree))
+		goto free_node;
+
+	if (!down_write_trylock(&sbi->extent_tree_lock))
+		goto out;
+
+	/* 1. remove unreferenced extent tree */
+	list_for_each_entry_safe(et, next, &sbi->zombie_list, list) {
+		if (atomic_read(&et->node_cnt)) {
+			write_lock(&et->lock);
+			node_cnt += __free_extent_tree(sbi, et);
+			write_unlock(&et->lock);
+		}
+		f2fs_bug_on(sbi, atomic_read(&et->node_cnt));
+		list_del_init(&et->list);
+		radix_tree_delete(&sbi->extent_tree_root, et->ino);
+		kmem_cache_free(extent_tree_slab, et);
+		atomic_dec(&sbi->total_ext_tree);
+		atomic_dec(&sbi->total_zombie_tree);
+		tree_cnt++;
+
+		if (node_cnt + tree_cnt >= nr_shrink)
+			goto unlock_out;
+		cond_resched();
+	}
+	up_write(&sbi->extent_tree_lock);
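
When pass 1 above has not reclaimed nr_shrink entries from the zombie list, control falls through to a second pass that trims individual extent nodes in LRU order from sbi->extent_list; note that it takes et->lock with write_trylock(), so a tree that is busy being updated is moved to the list tail and skipped rather than waited on.

+
+free_node:
+	/* 2.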
remove LRU extent entries */ + if (!down_write_trylock(&sbi->extent_tree_lock)) + goto out; + + remained = nr_shrink - (node_cnt + tree_cnt); + + spin_lock(&sbi->extent_lock); + for (; remained > 0; remained--) { + if (list_empty(&sbi->extent_list)) + break; + en = list_first_entry(&sbi->extent_list, + struct extent_node, list); + et = en->et; + if (!write_trylock(&et->lock)) { + /* refresh this extent node's position in extent list */ + list_move_tail(&en->list, &sbi->extent_list); + continue; + } + + list_del_init(&en->list); + spin_unlock(&sbi->extent_lock); + + __detach_extent_node(sbi, et, en); + + write_unlock(&et->lock); + node_cnt++; + spin_lock(&sbi->extent_lock); + } + spin_unlock(&sbi->extent_lock); + +unlock_out: + up_write(&sbi->extent_tree_lock); +out: + trace_f2fs_shrink_extent_tree(sbi, node_cnt, tree_cnt); + + return node_cnt + tree_cnt; +} + +unsigned int f2fs_destroy_extent_node(struct inode *inode) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct extent_tree *et = F2FS_I(inode)->extent_tree; + unsigned int node_cnt = 0; + + if (!et || !atomic_read(&et->node_cnt)) + return 0; + + write_lock(&et->lock); + node_cnt = __free_extent_tree(sbi, et); + write_unlock(&et->lock); + + return node_cnt; +} + +void f2fs_drop_extent_tree(struct inode *inode) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct extent_tree *et = F2FS_I(inode)->extent_tree; + + set_inode_flag(inode, FI_NO_EXTENT); + + write_lock(&et->lock); + __free_extent_tree(sbi, et); + __drop_largest_extent(inode, 0, UINT_MAX); + write_unlock(&et->lock); +} + +void f2fs_destroy_extent_tree(struct inode *inode) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct extent_tree *et = F2FS_I(inode)->extent_tree; + unsigned int node_cnt = 0; + + if (!et) + return; + + if (inode->i_nlink && !is_bad_inode(inode) && + atomic_read(&et->node_cnt)) { + down_write(&sbi->extent_tree_lock); + list_add_tail(&et->list, &sbi->zombie_list); + atomic_inc(&sbi->total_zombie_tree); + up_write(&sbi->extent_tree_lock); + return; + } + + /* free all extent info belong to this extent tree */ + node_cnt = f2fs_destroy_extent_node(inode); + + /* delete extent tree entry in radix tree */ + down_write(&sbi->extent_tree_lock); + f2fs_bug_on(sbi, atomic_read(&et->node_cnt)); + radix_tree_delete(&sbi->extent_tree_root, inode->i_ino); + kmem_cache_free(extent_tree_slab, et); + atomic_dec(&sbi->total_ext_tree); + up_write(&sbi->extent_tree_lock); + + F2FS_I(inode)->extent_tree = NULL; + + trace_f2fs_destroy_extent_tree(inode, node_cnt); +} + +bool f2fs_lookup_extent_cache(struct inode *inode, pgoff_t pgofs, + struct extent_info *ei) +{ + if (!f2fs_may_extent_tree(inode)) + return false; + + return f2fs_lookup_extent_tree(inode, pgofs, ei); +} + +void f2fs_update_extent_cache(struct dnode_of_data *dn) +{ + pgoff_t fofs; + block_t blkaddr; + + if (!f2fs_may_extent_tree(dn->inode)) + return; + + if (dn->data_blkaddr == NEW_ADDR) + blkaddr = NULL_ADDR; + else + blkaddr = dn->data_blkaddr; + + fofs = start_bidx_of_node(ofs_of_node(dn->node_page), dn->inode) + + dn->ofs_in_node; + f2fs_update_extent_tree_range(dn->inode, fofs, blkaddr, 1); +} + +void f2fs_update_extent_cache_range(struct dnode_of_data *dn, + pgoff_t fofs, block_t blkaddr, unsigned int len) + +{ + if (!f2fs_may_extent_tree(dn->inode)) + return; + + f2fs_update_extent_tree_range(dn->inode, fofs, blkaddr, len); +} + +void init_extent_cache_info(struct f2fs_sb_info *sbi) +{ + INIT_RADIX_TREE(&sbi->extent_tree_root, GFP_NOIO); + init_rwsem(&sbi->extent_tree_lock); + 
INIT_LIST_HEAD(&sbi->extent_list); + spin_lock_init(&sbi->extent_lock); + atomic_set(&sbi->total_ext_tree, 0); + INIT_LIST_HEAD(&sbi->zombie_list); + atomic_set(&sbi->total_zombie_tree, 0); + atomic_set(&sbi->total_ext_node, 0); +} + +int __init create_extent_cache(void) +{ + extent_tree_slab = f2fs_kmem_cache_create("f2fs_extent_tree", + sizeof(struct extent_tree)); + if (!extent_tree_slab) + return -ENOMEM; + extent_node_slab = f2fs_kmem_cache_create("f2fs_extent_node", + sizeof(struct extent_node)); + if (!extent_node_slab) { + kmem_cache_destroy(extent_tree_slab); + return -ENOMEM; + } + return 0; +} + +void destroy_extent_cache(void) +{ + kmem_cache_destroy(extent_node_slab); + kmem_cache_destroy(extent_tree_slab); +} diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h old mode 100644 new mode 100755 index df94cc338c9c..6ef8c890a57c --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -11,6 +11,10 @@ #ifndef _LINUX_F2FS_H #define _LINUX_F2FS_H +#ifdef CONFIG_F2FS_FS_ENCRYPTION +#undef CONFIG_F2FS_FS_ENCRYPTION +#endif + #include #include #include @@ -19,6 +23,10 @@ #include #include #include +#include +#include +#include +#include #ifdef CONFIG_F2FS_CHECK_FS #define f2fs_bug_on(sbi, condition) BUG_ON(condition) @@ -31,7 +39,60 @@ set_sbi_flag(sbi, SBI_NEED_FSCK); \ } \ } while (0) -#define f2fs_down_write(x, y) down_write(x) +#endif + +#ifdef CONFIG_F2FS_FAULT_INJECTION +enum { + FAULT_KMALLOC, + FAULT_PAGE_ALLOC, + FAULT_ALLOC_NID, + FAULT_ORPHAN, + FAULT_BLOCK, + FAULT_DIR_DEPTH, + FAULT_EVICT_INODE, + FAULT_MAX, +}; + +struct f2fs_fault_info { + atomic_t inject_ops; + unsigned int inject_rate; + unsigned int inject_type; +}; + +extern struct f2fs_fault_info f2fs_fault; +extern char *fault_name[FAULT_MAX]; +#define IS_FAULT_SET(type) (f2fs_fault.inject_type & (1 << (type))) + +static inline bool time_to_inject(int type) +{ + if (!f2fs_fault.inject_rate) + return false; + if (type == FAULT_KMALLOC && !IS_FAULT_SET(type)) + return false; + else if (type == FAULT_PAGE_ALLOC && !IS_FAULT_SET(type)) + return false; + else if (type == FAULT_ALLOC_NID && !IS_FAULT_SET(type)) + return false; + else if (type == FAULT_ORPHAN && !IS_FAULT_SET(type)) + return false; + else if (type == FAULT_BLOCK && !IS_FAULT_SET(type)) + return false; + else if (type == FAULT_DIR_DEPTH && !IS_FAULT_SET(type)) + return false; + else if (type == FAULT_EVICT_INODE && !IS_FAULT_SET(type)) + return false; + + atomic_inc(&f2fs_fault.inject_ops); + if (atomic_read(&f2fs_fault.inject_ops) >= f2fs_fault.inject_rate) { + atomic_set(&f2fs_fault.inject_ops, 0); + printk("%sF2FS-fs : inject %s in %pF\n", + KERN_INFO, + fault_name[type], + __builtin_return_address(0)); + return true; + } + return false; +} #endif /* @@ -52,6 +113,11 @@ #define F2FS_MOUNT_NOBARRIER 0x00000800 #define F2FS_MOUNT_FASTBOOT 0x00001000 #define F2FS_MOUNT_EXTENT_CACHE 0x00002000 +#define F2FS_MOUNT_FORCE_FG_GC 0x00004000 +#define F2FS_MOUNT_DATA_FLUSH 0x00008000 +#define F2FS_MOUNT_FAULT_INJECTION 0x00010000 +#define F2FS_MOUNT_ADAPTIVE 0x00020000 +#define F2FS_MOUNT_LFS 0x00040000 #define clear_opt(sbi, option) (sbi->mount_opt.opt &= ~F2FS_MOUNT_##option) #define set_opt(sbi, option) (sbi->mount_opt.opt |= F2FS_MOUNT_##option) @@ -71,6 +137,16 @@ struct f2fs_mount_info { unsigned int opt; }; +#define F2FS_FEATURE_ENCRYPT 0x0001 +#define F2FS_FEATURE_HMSMR 0x0002 + +#define F2FS_HAS_FEATURE(sb, mask) \ + ((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0) +#define F2FS_SET_FEATURE(sb, mask) \ + F2FS_SB(sb)->raw_super->feature |= cpu_to_le32(mask) 
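
The fault-injection scaffolding introduced earlier in this header deserves a worked example. A minimal sketch of an instrumented call site (hypothetical helper name; it mirrors f2fs_kmalloc(), which dir.c calls elsewhere in this patch):

	static inline void *f2fs_kmalloc_sketch(size_t size, gfp_t flags)
	{
	#ifdef CONFIG_F2FS_FAULT_INJECTION
		if (time_to_inject(FAULT_KMALLOC))
			return NULL;	/* simulate a failed allocation */
	#endif
		return kmalloc(size, flags);
	}

A fault type only fires when the mount-time inject_rate is non-zero and the type's bit is set in inject_type; every candidate call bumps inject_ops, and the fault triggers (and the counter resets) once inject_ops reaches inject_rate.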
+#define F2FS_CLEAR_FEATURE(sb, mask) \ + F2FS_SB(sb)->raw_super->feature &= ~cpu_to_le32(mask) + #define CRCPOLY_LE 0xedb88320 static inline __u32 f2fs_crc32(void *buf, size_t len) @@ -92,6 +168,52 @@ static inline bool f2fs_crc_valid(__u32 blk_crc, void *buf, size_t buf_size) return f2fs_crc32(buf, buf_size) == blk_crc; } +static inline void inode_lock(struct inode *inode) +{ + mutex_lock(&inode->i_mutex); +} + +static inline void inode_unlock(struct inode *inode) +{ + mutex_unlock(&inode->i_mutex); +} + +/** + * wq_has_sleeper - check if there are any waiting processes + * @wq: wait queue head + * + * Returns true if wq has waiting processes + * + * Please refer to the comment for waitqueue_active. + */ +static inline bool wq_has_sleeper(wait_queue_head_t *wq) +{ + /* + * We need to be sure we are in sync with the + * add_wait_queue modifications to the wait queue. + * + * This memory barrier should be paired with one on the + * waiting side. + */ + smp_mb(); + return waitqueue_active(wq); +} + +static inline struct inode *d_inode(const struct dentry *dentry) +{ + return dentry->d_inode; +} + +static inline struct dentry *file_dentry(const struct file *file) +{ + return file->f_path.dentry; +} + +static inline void inode_nohighmem(struct inode *inode) +{ + mapping_set_gfp_mask(inode->i_mapping, GFP_USER); +} + /* * For checkpoint manager */ @@ -111,6 +233,10 @@ enum { #define DEF_BATCHED_TRIM_SECTIONS 32 #define BATCHED_TRIM_SEGMENTS(sbi) \ (SM_I(sbi)->trim_sections * (sbi)->segs_per_sec) +#define BATCHED_TRIM_BLOCKS(sbi) \ + (BATCHED_TRIM_SEGMENTS(sbi) << (sbi)->log_blocks_per_seg) +#define DEF_CP_INTERVAL 60 /* 60 secs */ +#define DEF_IDLE_INTERVAL 5 /* 5 secs */ struct cp_control { int reason; @@ -144,13 +270,7 @@ struct ino_entry { nid_t ino; /* inode number */ }; -/* - * for the list of directory inodes or gc inodes. - * NOTE: there are two slab users for this structure, if we add/modify/delete - * fields in structure for one of slab users, it may affect fields or size of - * other one, in this condition, it's better to split both of slab and related - * data structure. 
- */ +/* for the list of inodes to be GCed */ struct inode_entry { struct list_head list; /* list head */ struct inode *inode; /* vfs inode pointer */ @@ -169,40 +289,39 @@ struct fsync_inode_entry { struct inode *inode; /* vfs inode pointer */ block_t blkaddr; /* block address locating the last fsync */ block_t last_dentry; /* block address locating the last dentry */ - block_t last_inode; /* block address locating the last inode */ }; -#define nats_in_cursum(sum) (le16_to_cpu(sum->n_nats)) -#define sits_in_cursum(sum) (le16_to_cpu(sum->n_sits)) +#define nats_in_cursum(jnl) (le16_to_cpu(jnl->n_nats)) +#define sits_in_cursum(jnl) (le16_to_cpu(jnl->n_sits)) -#define nat_in_journal(sum, i) (sum->nat_j.entries[i].ne) -#define nid_in_journal(sum, i) (sum->nat_j.entries[i].nid) -#define sit_in_journal(sum, i) (sum->sit_j.entries[i].se) -#define segno_in_journal(sum, i) (sum->sit_j.entries[i].segno) +#define nat_in_journal(jnl, i) (jnl->nat_j.entries[i].ne) +#define nid_in_journal(jnl, i) (jnl->nat_j.entries[i].nid) +#define sit_in_journal(jnl, i) (jnl->sit_j.entries[i].se) +#define segno_in_journal(jnl, i) (jnl->sit_j.entries[i].segno) -#define MAX_NAT_JENTRIES(sum) (NAT_JOURNAL_ENTRIES - nats_in_cursum(sum)) -#define MAX_SIT_JENTRIES(sum) (SIT_JOURNAL_ENTRIES - sits_in_cursum(sum)) +#define MAX_NAT_JENTRIES(jnl) (NAT_JOURNAL_ENTRIES - nats_in_cursum(jnl)) +#define MAX_SIT_JENTRIES(jnl) (SIT_JOURNAL_ENTRIES - sits_in_cursum(jnl)) -static inline int update_nats_in_cursum(struct f2fs_summary_block *rs, int i) +static inline int update_nats_in_cursum(struct f2fs_journal *journal, int i) { - int before = nats_in_cursum(rs); - rs->n_nats = cpu_to_le16(before + i); + int before = nats_in_cursum(journal); + journal->n_nats = cpu_to_le16(before + i); return before; } -static inline int update_sits_in_cursum(struct f2fs_summary_block *rs, int i) +static inline int update_sits_in_cursum(struct f2fs_journal *journal, int i) { - int before = sits_in_cursum(rs); - rs->n_sits = cpu_to_le16(before + i); + int before = sits_in_cursum(journal); + journal->n_sits = cpu_to_le16(before + i); return before; } -static inline bool __has_cursum_space(struct f2fs_summary_block *sum, int size, - int type) +static inline bool __has_cursum_space(struct f2fs_journal *journal, + int size, int type) { if (type == NAT_JOURNAL) - return size <= MAX_NAT_JENTRIES(sum); - return size <= MAX_SIT_JENTRIES(sum); + return size <= MAX_NAT_JENTRIES(journal); + return size <= MAX_SIT_JENTRIES(journal); } /* @@ -218,6 +337,13 @@ static inline bool __has_cursum_space(struct f2fs_summary_block *sum, int size, #define F2FS_IOC_START_VOLATILE_WRITE _IO(F2FS_IOCTL_MAGIC, 3) #define F2FS_IOC_RELEASE_VOLATILE_WRITE _IO(F2FS_IOCTL_MAGIC, 4) #define F2FS_IOC_ABORT_VOLATILE_WRITE _IO(F2FS_IOCTL_MAGIC, 5) +#define F2FS_IOC_GARBAGE_COLLECT _IO(F2FS_IOCTL_MAGIC, 6) +#define F2FS_IOC_WRITE_CHECKPOINT _IO(F2FS_IOCTL_MAGIC, 7) +#define F2FS_IOC_DEFRAGMENT _IO(F2FS_IOCTL_MAGIC, 8) + +#define F2FS_IOC_SET_ENCRYPTION_POLICY FS_IOC_SET_ENCRYPTION_POLICY +#define F2FS_IOC_GET_ENCRYPTION_POLICY FS_IOC_GET_ENCRYPTION_POLICY +#define F2FS_IOC_GET_ENCRYPTION_PWSALT FS_IOC_GET_ENCRYPTION_PWSALT /* * should be same as XFS_IOC_GOINGDOWN. 
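
Since the journal helpers above now take a struct f2fs_journal instead of the raw summary block, a sketch of the intended calling pattern may help; the function name is hypothetical and nid/raw_ne stand in for a caller's locals:

	static int f2fs_journal_add_nat_sketch(struct f2fs_journal *journal,
				nid_t nid, struct f2fs_nat_entry raw_ne)
	{
		int idx;

		if (!__has_cursum_space(journal, 1, NAT_JOURNAL))
			return -EAGAIN;	/* full: flush journal to the NAT area first */

		idx = update_nats_in_cursum(journal, 1);	/* returns old count */
		nid_in_journal(journal, idx) = cpu_to_le32(nid);
		nat_in_journal(journal, idx) = raw_ne;
		return 0;
	}

update_nats_in_cursum() returns the count before the increment, which is exactly the first free slot index, so reservation and fill stay race-free under the caller's journal lock.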
@@ -227,29 +353,39 @@ static inline bool __has_cursum_space(struct f2fs_summary_block *sum, int size,
 #define F2FS_GOING_DOWN_FULLSYNC	0x0	/* going down with full sync */
 #define F2FS_GOING_DOWN_METASYNC	0x1	/* going down with metadata */
 #define F2FS_GOING_DOWN_NOSYNC		0x2	/* going down */
+#define F2FS_GOING_DOWN_METAFLUSH	0x3	/* going down with meta flush */
 
 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
  * ioctl commands in 32 bit emulation
  */
-#define F2FS_IOC32_GETFLAGS		FS_IOC32_GETFLAGS
-#define F2FS_IOC32_SETFLAGS		FS_IOC32_SETFLAGS
+#define F2FS_IOC32_GETFLAGS		FS_IOC32_GETFLAGS
+#define F2FS_IOC32_SETFLAGS		FS_IOC32_SETFLAGS
+#define F2FS_IOC32_GETVERSION		FS_IOC32_GETVERSION
 #endif
 
+struct f2fs_defragment {
+	u64 start;
+	u64 len;
+};
+
 /*
  * For INODE and NODE manager
  */
 /* for directory operations */
 struct f2fs_dentry_ptr {
+	struct inode *inode;
 	const void *bitmap;
 	struct f2fs_dir_entry *dentry;
 	__u8 (*filename)[F2FS_SLOT_LEN];
 	int max;
 };
 
-static inline void make_dentry_ptr(struct f2fs_dentry_ptr *d,
-					void *src, int type)
+static inline void make_dentry_ptr(struct inode *inode,
+		struct f2fs_dentry_ptr *d, void *src, int type)
 {
+	d->inode = inode;
+
 	if (type == 1) {
 		struct f2fs_dentry_block *t = (struct f2fs_dentry_block *)src;
 		d->max = NR_DENTRY_IN_BLOCK;
@@ -281,7 +417,7 @@ enum {
 					 */
 };
 
-#define F2FS_LINK_MAX	32000	/* maximum link count per file */
+#define F2FS_LINK_MAX	0xffffffff	/* maximum link count per file */
 
 #define MAX_DIR_RA_PAGES	4	/* maximum ra pages of dir */
 
@@ -304,22 +440,65 @@ struct extent_node {
 	struct rb_node rb_node;		/* rb node located in rb-tree */
 	struct list_head list;		/* node in global extent list of sbi */
 	struct extent_info ei;		/* extent info */
+	struct extent_tree *et;		/* extent tree pointer */
 };
 
 struct extent_tree {
 	nid_t ino;			/* inode number */
 	struct rb_root root;		/* root of extent info rb-tree */
 	struct extent_node *cached_en;	/* recently accessed extent node */
+	struct extent_info largest;	/* largest extent info */
+	struct list_head list;		/* to be used by sbi->zombie_list */
 	rwlock_t lock;			/* protect extent info rb-tree */
-	atomic_t refcount;		/* reference count of rb-tree */
-	unsigned int count;		/* # of extent node in rb-tree*/
+	atomic_t node_cnt;		/* # of extent node in rb-tree*/
+};
+
+/*
+ * This structure is taken from ext4_map_blocks.
+ *
+ * Note that, however, f2fs uses NEW and MAPPED flags for f2fs_map_blocks().
+ */
+#define F2FS_MAP_NEW		(1 << BH_New)
+#define F2FS_MAP_MAPPED		(1 << BH_Mapped)
+#define F2FS_MAP_UNWRITTEN	(1 << BH_Unwritten)
+#define F2FS_MAP_FLAGS		(F2FS_MAP_NEW | F2FS_MAP_MAPPED |\
+				F2FS_MAP_UNWRITTEN)
+
+struct f2fs_map_blocks {
+	block_t m_pblk;
+	block_t m_lblk;
+	unsigned int m_len;
+	unsigned int m_flags;
+	pgoff_t *m_next_pgofs;		/* point next possible non-hole pgofs */
+};
+
+/* for flag in get_data_block */
+#define F2FS_GET_BLOCK_READ		0
+#define F2FS_GET_BLOCK_DIO		1
+#define F2FS_GET_BLOCK_FIEMAP		2
+#define F2FS_GET_BLOCK_BMAP		3
+#define F2FS_GET_BLOCK_PRE_DIO		4
+#define F2FS_GET_BLOCK_PRE_AIO		5
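
struct f2fs_defragment above carries the range for the F2FS_IOC_DEFRAGMENT ioctl defined in the previous hunk. A user-space sketch, assuming the ioctl argument is a pointer to that struct and that both definitions are visible to user space; the path and length are placeholders:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	/* F2FS_IOC_DEFRAGMENT / struct f2fs_defragment as defined above */

	int main(void)
	{
		struct f2fs_defragment df = {
			.start	= 0,		/* byte offset to start at */
			.len	= 16 << 20,	/* rewrite the first 16 MiB */
		};
		int fd = open("/mnt/f2fs/file", O_RDWR);

		if (fd < 0 || ioctl(fd, F2FS_IOC_DEFRAGMENT, &df) < 0)
			perror("F2FS_IOC_DEFRAGMENT");
		return 0;
	}

+
 /*
  * i_advise uses FADVISE_XXX_BIT. We can add additional hints later.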
*/ #define FADVISE_COLD_BIT 0x01 #define FADVISE_LOST_PINO_BIT 0x02 +#define FADVISE_ENCRYPT_BIT 0x04 +#define FADVISE_ENC_NAME_BIT 0x08 + +#define file_is_cold(inode) is_file(inode, FADVISE_COLD_BIT) +#define file_wrong_pino(inode) is_file(inode, FADVISE_LOST_PINO_BIT) +#define file_set_cold(inode) set_file(inode, FADVISE_COLD_BIT) +#define file_lost_pino(inode) set_file(inode, FADVISE_LOST_PINO_BIT) +#define file_clear_cold(inode) clear_file(inode, FADVISE_COLD_BIT) +#define file_got_pino(inode) clear_file(inode, FADVISE_LOST_PINO_BIT) +#define file_is_encrypt(inode) is_file(inode, FADVISE_ENCRYPT_BIT) +#define file_set_encrypt(inode) set_file(inode, FADVISE_ENCRYPT_BIT) +#define file_clear_encrypt(inode) clear_file(inode, FADVISE_ENCRYPT_BIT) +#define file_enc_name(inode) is_file(inode, FADVISE_ENC_NAME_BIT) +#define file_set_enc_name(inode) set_file(inode, FADVISE_ENC_NAME_BIT) #define DEF_DIR_LEVEL 0 @@ -335,26 +514,27 @@ struct f2fs_inode_info { /* Use below internally in f2fs*/ unsigned long flags; /* use to pass per-file flags */ struct rw_semaphore i_sem; /* protect fi info */ - atomic_t dirty_pages; /* # of dirty pages */ + struct percpu_counter dirty_pages; /* # of dirty pages */ f2fs_hash_t chash; /* hash value of given file name */ unsigned int clevel; /* maximum level of given file name */ nid_t i_xattr_nid; /* node id that contains xattrs */ unsigned long long xattr_ver; /* cp version of xattr modification */ - struct extent_info ext; /* in-memory extent cache entry */ - rwlock_t ext_lock; /* rwlock for single extent cache */ - struct inode_entry *dirty_dir; /* the pointer of dirty dir */ + loff_t last_disk_size; /* lastly written file size */ - struct radix_tree_root inmem_root; /* radix tree for inmem pages */ + struct list_head dirty_list; /* dirty list for dirs and files */ + struct list_head gdirty_list; /* linked in global dirty list */ struct list_head inmem_pages; /* inmemory pages managed by f2fs */ struct mutex inmem_lock; /* lock for inmemory pages */ + struct extent_tree *extent_tree; /* cached extent_tree entry */ + struct rw_semaphore dio_rwsem[2];/* avoid racing between dio and gc */ }; static inline void get_extent_info(struct extent_info *ext, - struct f2fs_extent i_ext) + struct f2fs_extent *i_ext) { - ext->fofs = le32_to_cpu(i_ext.fofs); - ext->blk = le32_to_cpu(i_ext.blk); - ext->len = le32_to_cpu(i_ext.len); + ext->fofs = le32_to_cpu(i_ext->fofs); + ext->blk = le32_to_cpu(i_ext->blk); + ext->len = le32_to_cpu(i_ext->len); } static inline void set_raw_extent(struct extent_info *ext, @@ -399,12 +579,24 @@ static inline bool __is_front_mergeable(struct extent_info *cur, return __is_extent_mergeable(cur, front); } +extern void f2fs_mark_inode_dirty_sync(struct inode *); +static inline void __try_update_largest_extent(struct inode *inode, + struct extent_tree *et, struct extent_node *en) +{ + if (en->ei.len > et->largest.len) { + et->largest = en->ei; + f2fs_mark_inode_dirty_sync(inode); + } +} + struct f2fs_nm_info { block_t nat_blkaddr; /* base disk address of NAT */ nid_t max_nid; /* maximum possible node ids */ nid_t available_nids; /* maximum available node ids */ nid_t next_scan_nid; /* the next nid to be scanned */ unsigned int ram_thresh; /* control the memory footprint */ + unsigned int ra_nid_pages; /* # of nid pages to be readaheaded */ + unsigned int dirty_nats_ratio; /* control dirty nats ratio threshold */ /* NAT cache management */ struct radix_tree_root nat_root;/* root of the nat entry cache */ @@ -438,6 +630,9 @@ struct dnode_of_data { nid_t 
nid; /* node id of the direct node block */ unsigned int ofs_in_node; /* data offset in the node page */ bool inode_page_locked; /* inode page is locked or not */ + bool node_changed; /* is node block changed */ + char cur_level; /* level of hole node page */ + char max_level; /* level of current page located */ block_t data_blkaddr; /* block address of the node block */ }; @@ -488,6 +683,7 @@ struct flush_cmd { struct flush_cmd_control { struct task_struct *f2fs_issue_flush; /* flush thread */ wait_queue_head_t flush_wait_queue; /* waiting queue for wake-up */ + atomic_t submit_flush; /* # of issued flushes */ struct llist_head issue_list; /* list for command issue */ struct llist_node *dispatch_list; /* list for command dispatch */ }; @@ -539,11 +735,12 @@ struct f2fs_sm_info { * dirty dentry blocks, dirty node blocks, and dirty meta blocks. */ enum count_type { - F2FS_WRITEBACK, F2FS_DIRTY_DENTS, + F2FS_DIRTY_DATA, F2FS_DIRTY_NODES, F2FS_DIRTY_META, F2FS_INMEM_PAGES, + F2FS_DIRTY_IMETA, NR_COUNT_TYPE, }; @@ -567,17 +764,23 @@ enum page_type { META_FLUSH, INMEM, /* the below types are used by tracepoints only. */ INMEM_DROP, + INMEM_REVOKE, IPU, OPU, }; struct f2fs_io_info { + struct f2fs_sb_info *sbi; /* f2fs_sb_info pointer */ enum page_type type; /* contains DATA/NODE/META/META_FLUSH */ int rw; /* contains R/RS/W/WS with REQ_META/REQ_PRIO */ - block_t blk_addr; /* block address to be written */ + block_t new_blkaddr; /* new block address to be written */ + block_t old_blkaddr; /* old block address before Cow */ + struct page *page; /* page to be written */ + struct page *encrypted_page; /* encrypted page */ }; #define is_read_io(rw) (((rw) & 1) == READ) + struct f2fs_bio_info { struct f2fs_sb_info *sbi; /* f2fs superblock */ struct bio *bio; /* bios to merge */ @@ -586,6 +789,13 @@ struct f2fs_bio_info { struct rw_semaphore io_rwsem; /* blocking op for bio */ }; +enum inode_type { + DIR_INODE, /* for dirty dir inode */ + FILE_INODE, /* for dirty regular/symlink inode */ + DIRTY_META, /* for all dirtied inode metadata */ + NR_INODE_TYPE, +}; + /* for inner inode cache management */ struct inode_management { struct radix_tree_root ino_root; /* ino entry array */ @@ -600,15 +810,30 @@ enum { SBI_IS_CLOSE, /* specify unmounting */ SBI_NEED_FSCK, /* need fsck.f2fs to fix */ SBI_POR_DOING, /* recovery is doing or not */ + SBI_NEED_SB_WRITE, /* need to recover superblock */ }; +enum { + CP_TIME, + REQ_TIME, + MAX_TIME, +}; + +#ifdef CONFIG_F2FS_FS_ENCRYPTION +#define F2FS_KEY_DESC_PREFIX "f2fs:" +#define F2FS_KEY_DESC_PREFIX_SIZE 5 +#endif struct f2fs_sb_info { struct super_block *sb; /* pointer to VFS super block */ struct proc_dir_entry *s_proc; /* proc entry */ - struct buffer_head *raw_super_buf; /* buffer head of raw sb */ struct f2fs_super_block *raw_super; /* raw super block pointer */ + int valid_super_block; /* valid super block no */ int s_flag; /* flags for sbi */ +#ifdef CONFIG_F2FS_FS_ENCRYPTION + u8 key_prefix[F2FS_KEY_DESC_PREFIX_SIZE]; + u8 key_prefix_size; +#endif /* for node-related operations */ struct f2fs_nm_info *nm_info; /* node manager */ struct inode *node_inode; /* cache node blocks */ @@ -619,6 +844,7 @@ struct f2fs_sb_info { /* for bio operations */ struct f2fs_bio_info read_io; /* for read bios */ struct f2fs_bio_info write_io[NR_PAGE_TYPE]; /* for write bios */ + struct mutex wio_mutex[NODE + 1]; /* bio ordering for NODE/DATA */ /* for checkpoint */ struct f2fs_checkpoint *ckpt; /* raw checkpoint pointer */ @@ -627,22 +853,26 @@ struct f2fs_sb_info { struct 
rw_semaphore cp_rwsem; /* blocking FS operations */ struct rw_semaphore node_write; /* locking node writes */ wait_queue_head_t cp_wait; + unsigned long last_time[MAX_TIME]; /* to store time in jiffies */ + long interval_time[MAX_TIME]; /* to store thresholds */ struct inode_management im[MAX_INO_ENTRY]; /* manage inode cache */ /* for orphan inode, use 0'th array */ unsigned int max_orphans; /* max orphan inodes */ - /* for directory inode management */ - struct list_head dir_inode_list; /* dir inode list */ - spinlock_t dir_inode_lock; /* for dir inode list lock */ + /* for inode management */ + struct list_head inode_list[NR_INODE_TYPE]; /* dirty inode list */ + spinlock_t inode_lock[NR_INODE_TYPE]; /* for dirty inode list lock */ /* for extent tree cache */ struct radix_tree_root extent_tree_root;/* cache extent cache entries */ struct rw_semaphore extent_tree_lock; /* locking extent radix tree */ struct list_head extent_list; /* lru list for shrinker */ spinlock_t extent_lock; /* locking extent lru list */ - int total_ext_tree; /* extent tree count */ + atomic_t total_ext_tree; /* extent tree count */ + struct list_head zombie_list; /* extent zombie tree list */ + atomic_t total_zombie_tree; /* extent zombie tree count */ atomic_t total_ext_node; /* extent info count */ /* basic filesystem units */ @@ -659,16 +889,24 @@ struct f2fs_sb_info { unsigned int total_sections; /* total section count */ unsigned int total_node_count; /* total node block count */ unsigned int total_valid_node_count; /* valid node block count */ - unsigned int total_valid_inode_count; /* valid inode count */ + loff_t max_file_blocks; /* max block index of file */ int active_logs; /* # of active logs */ int dir_level; /* directory level */ block_t user_block_count; /* # of user blocks */ block_t total_valid_block_count; /* # of valid blocks */ - block_t alloc_valid_block_count; /* # of allocated blocks */ + block_t discard_blks; /* discard command candidates */ block_t last_valid_block_count; /* for recovery */ u32 s_next_generation; /* for NFS support */ - atomic_t nr_pages[NR_COUNT_TYPE]; /* # of pages, see count_type */ + atomic_t nr_wb_bios; /* # of writeback bios */ + + /* # of pages, see count_type */ + struct percpu_counter nr_pages[NR_COUNT_TYPE]; + /* # of allocated blocks */ + struct percpu_counter alloc_valid_block_count; + + /* valid inode count */ + struct percpu_counter total_valid_inode_count; struct f2fs_mount_info mount_opt; /* mount options */ @@ -689,11 +927,15 @@ struct f2fs_sb_info { unsigned int segment_count[2]; /* # of allocated segments */ unsigned int block_count[2]; /* # of allocated blocks */ atomic_t inplace_count; /* # of inplace update */ - int total_hit_ext, read_hit_ext; /* extent cache hit ratio */ + atomic64_t total_hit_ext; /* # of lookup extent cache */ + atomic64_t read_hit_rbtree; /* # of hit rbtree extent node */ + atomic64_t read_hit_largest; /* # of hit largest extent node */ + atomic64_t read_hit_cached; /* # of hit cached extent node */ + atomic_t inline_xattr; /* # of inline_xattr inodes */ atomic_t inline_inode; /* # of inline_data inodes */ atomic_t inline_dir; /* # of inline_dentry inodes */ int bg_gc; /* background gc calls */ - unsigned int n_dirty_dirs; /* # of dir inodes */ + unsigned int ndirty_inode[NR_INODE_TYPE]; /* # of dirty inodes */ #endif unsigned int last_victim[2]; /* last victim segment # */ spinlock_t stat_lock; /* lock for stat operations */ @@ -701,8 +943,49 @@ struct f2fs_sb_info { /* For sysfs support */ struct kobject s_kobj; struct completion
s_kobj_unregister; + + /* For shrinker support */ + struct list_head s_list; + struct mutex umount_mutex; + unsigned int shrinker_run_no; + + /* For write statistics */ + u64 sectors_written_start; + u64 kbytes_written; }; +/* For write statistics. Suppose sector size is 512 bytes, * and the return value is in kbytes. s is of struct f2fs_sb_info. */ +#define BD_PART_WRITTEN(s) \ +(((u64)part_stat_read(s->sb->s_bdev->bd_part, sectors[1]) - \ + s->sectors_written_start) >> 1) + +static inline void f2fs_update_time(struct f2fs_sb_info *sbi, int type) +{ + sbi->last_time[type] = jiffies; +} + +static inline bool f2fs_time_over(struct f2fs_sb_info *sbi, int type) +{ + struct timespec ts = {sbi->interval_time[type], 0}; + unsigned long interval = timespec_to_jiffies(&ts); + + return time_after(jiffies, sbi->last_time[type] + interval); +} + +static inline bool is_idle(struct f2fs_sb_info *sbi) +{ + struct block_device *bdev = sbi->sb->s_bdev; + struct request_queue *q = bdev_get_queue(bdev); + struct request_list *rl = &q->rq; + + if (rl->count[BLK_RW_SYNC] || rl->count[BLK_RW_ASYNC]) + return 0; + + return f2fs_time_over(sbi, REQ_TIME); +} + /* * Inline functions */ @@ -838,7 +1121,7 @@ static inline void f2fs_unlock_op(struct f2fs_sb_info *sbi) static inline void f2fs_lock_all(struct f2fs_sb_info *sbi) { - f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex); + down_write(&sbi->cp_rwsem); } static inline void f2fs_unlock_all(struct f2fs_sb_info *sbi) @@ -898,22 +1181,37 @@ static inline bool f2fs_has_xattr_block(unsigned int ofs) return ofs == XATTR_NODE_OFFSET; } +static inline void f2fs_i_blocks_write(struct inode *, blkcnt_t, bool); static inline bool inc_valid_block_count(struct f2fs_sb_info *sbi, - struct inode *inode, blkcnt_t count) + struct inode *inode, blkcnt_t *count) { - block_t valid_block_count; + blkcnt_t diff; - spin_lock(&sbi->stat_lock); - valid_block_count = - sbi->total_valid_block_count + (block_t)count; - if (unlikely(valid_block_count > sbi->user_block_count)) { - spin_unlock(&sbi->stat_lock); +#ifdef CONFIG_F2FS_FAULT_INJECTION + if (time_to_inject(FAULT_BLOCK)) return false; +#endif + /* + * let's increase this prior to the actual block count change in order + * for f2fs_sync_file to avoid data races when deciding checkpoint.
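+ * (the percpu counter is raised before stat_lock is taken and lowered again below if the allocation is refused outright, so a reader such as space_for_roll_forward() may briefly over-estimate, but never under-estimate, the allocated block count)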
+ */ + percpu_counter_add(&sbi->alloc_valid_block_count, (*count)); + + spin_lock(&sbi->stat_lock); + sbi->total_valid_block_count += (block_t)(*count); + if (unlikely(sbi->total_valid_block_count > sbi->user_block_count)) { + diff = sbi->total_valid_block_count - sbi->user_block_count; + *count -= diff; + sbi->total_valid_block_count = sbi->user_block_count; + if (!*count) { + spin_unlock(&sbi->stat_lock); + percpu_counter_sub(&sbi->alloc_valid_block_count, diff); + return false; + } } - inode->i_blocks += count; - sbi->total_valid_block_count = valid_block_count; - sbi->alloc_valid_block_count += (block_t)count; spin_unlock(&sbi->stat_lock); + + f2fs_i_blocks_write(inode, *count, true); return true; } @@ -924,56 +1222,57 @@ static inline void dec_valid_block_count(struct f2fs_sb_info *sbi, spin_lock(&sbi->stat_lock); f2fs_bug_on(sbi, sbi->total_valid_block_count < (block_t) count); f2fs_bug_on(sbi, inode->i_blocks < count); - inode->i_blocks -= count; sbi->total_valid_block_count -= (block_t)count; spin_unlock(&sbi->stat_lock); + f2fs_i_blocks_write(inode, count, false); } static inline void inc_page_count(struct f2fs_sb_info *sbi, int count_type) { - atomic_inc(&sbi->nr_pages[count_type]); + percpu_counter_inc(&sbi->nr_pages[count_type]); set_sbi_flag(sbi, SBI_IS_DIRTY); } static inline void inode_inc_dirty_pages(struct inode *inode) { - atomic_inc(&F2FS_I(inode)->dirty_pages); - if (S_ISDIR(inode->i_mode)) - inc_page_count(F2FS_I_SB(inode), F2FS_DIRTY_DENTS); + percpu_counter_inc(&F2FS_I(inode)->dirty_pages); + inc_page_count(F2FS_I_SB(inode), S_ISDIR(inode->i_mode) ? + F2FS_DIRTY_DENTS : F2FS_DIRTY_DATA); } static inline void dec_page_count(struct f2fs_sb_info *sbi, int count_type) { - atomic_dec(&sbi->nr_pages[count_type]); + percpu_counter_dec(&sbi->nr_pages[count_type]); } static inline void inode_dec_dirty_pages(struct inode *inode) { - if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) + if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode) && + !S_ISLNK(inode->i_mode)) return; - atomic_dec(&F2FS_I(inode)->dirty_pages); - - if (S_ISDIR(inode->i_mode)) - dec_page_count(F2FS_I_SB(inode), F2FS_DIRTY_DENTS); + percpu_counter_dec(&F2FS_I(inode)->dirty_pages); + dec_page_count(F2FS_I_SB(inode), S_ISDIR(inode->i_mode) ? 
+ F2FS_DIRTY_DENTS : F2FS_DIRTY_DATA); } -static inline int get_pages(struct f2fs_sb_info *sbi, int count_type) +static inline s64 get_pages(struct f2fs_sb_info *sbi, int count_type) { - return atomic_read(&sbi->nr_pages[count_type]); + return percpu_counter_sum_positive(&sbi->nr_pages[count_type]); } -static inline int get_dirty_pages(struct inode *inode) +static inline s64 get_dirty_pages(struct inode *inode) { - return atomic_read(&F2FS_I(inode)->dirty_pages); + return percpu_counter_sum_positive(&F2FS_I(inode)->dirty_pages); } static inline int get_blocktype_secs(struct f2fs_sb_info *sbi, int block_type) { - unsigned int pages_per_sec = sbi->segs_per_sec * - (1 << sbi->log_blocks_per_seg); - return ((get_pages(sbi, block_type) + pages_per_sec - 1) - >> sbi->log_blocks_per_seg) / sbi->segs_per_sec; + unsigned int pages_per_sec = sbi->segs_per_sec * sbi->blocks_per_seg; + unsigned int segs = (get_pages(sbi, block_type) + pages_per_sec - 1) >> + sbi->log_blocks_per_seg; + + return segs / sbi->segs_per_sec; } static inline block_t valid_user_blocks(struct f2fs_sb_info *sbi) @@ -1060,13 +1359,13 @@ static inline bool inc_valid_node_count(struct f2fs_sb_info *sbi, } if (inode) - inode->i_blocks++; + f2fs_i_blocks_write(inode, 1, true); - sbi->alloc_valid_block_count++; sbi->total_valid_node_count++; sbi->total_valid_block_count++; spin_unlock(&sbi->stat_lock); + percpu_counter_inc(&sbi->alloc_valid_block_count); return true; } @@ -1079,7 +1378,7 @@ static inline void dec_valid_node_count(struct f2fs_sb_info *sbi, f2fs_bug_on(sbi, !sbi->total_valid_node_count); f2fs_bug_on(sbi, !inode->i_blocks); - inode->i_blocks--; + f2fs_i_blocks_write(inode, 1, false); sbi->total_valid_node_count--; sbi->total_valid_block_count--; @@ -1093,23 +1392,43 @@ static inline unsigned int valid_node_count(struct f2fs_sb_info *sbi) static inline void inc_valid_inode_count(struct f2fs_sb_info *sbi) { - spin_lock(&sbi->stat_lock); - f2fs_bug_on(sbi, sbi->total_valid_inode_count == sbi->total_node_count); - sbi->total_valid_inode_count++; - spin_unlock(&sbi->stat_lock); + percpu_counter_inc(&sbi->total_valid_inode_count); } static inline void dec_valid_inode_count(struct f2fs_sb_info *sbi) { - spin_lock(&sbi->stat_lock); - f2fs_bug_on(sbi, !sbi->total_valid_inode_count); - sbi->total_valid_inode_count--; - spin_unlock(&sbi->stat_lock); + percpu_counter_dec(&sbi->total_valid_inode_count); } -static inline unsigned int valid_inode_count(struct f2fs_sb_info *sbi) +static inline s64 valid_inode_count(struct f2fs_sb_info *sbi) { - return sbi->total_valid_inode_count; + return percpu_counter_sum_positive(&sbi->total_valid_inode_count); +} + +static inline struct page *f2fs_grab_cache_page(struct address_space *mapping, + pgoff_t index, bool for_write) +{ +#ifdef CONFIG_F2FS_FAULT_INJECTION + struct page *page = find_lock_page(mapping, index); + if (page) + return page; + + if (time_to_inject(FAULT_PAGE_ALLOC)) + return NULL; +#endif + if (!for_write) + return grab_cache_page(mapping, index); + return grab_cache_page_write_begin(mapping, index, AOP_FLAG_NOFS); +} + +static inline void f2fs_copy_page(struct page *src, struct page *dst) +{ + char *src_kaddr = kmap(src); + char *dst_kaddr = kmap(dst); + + memcpy(dst_kaddr, src_kaddr, PAGE_SIZE); + kunmap(dst); + kunmap(src); } static inline void f2fs_put_page(struct page *page, int unlock) @@ -1121,7 +1440,7 @@ static inline void f2fs_put_page(struct page *page, int unlock) f2fs_bug_on(F2FS_P_SB(page), !PageLocked(page)); unlock_page(page); } - page_cache_release(page); + 
put_page(page); } static inline void f2fs_put_dnode(struct dnode_of_data *dn) @@ -1144,16 +1463,24 @@ static inline void *f2fs_kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) { void *entry; -retry: - entry = kmem_cache_alloc(cachep, flags); - if (!entry) { - cond_resched(); - goto retry; - } + entry = kmem_cache_alloc(cachep, flags); + if (!entry) + entry = kmem_cache_alloc(cachep, flags | __GFP_NOFAIL); return entry; } +static inline struct bio *f2fs_bio_alloc(int npages) +{ + struct bio *bio; + + /* No failure on bio allocation */ + bio = bio_alloc(GFP_NOIO, npages); + if (!bio) + bio = bio_alloc(GFP_NOIO | __GFP_NOFAIL, npages); + return bio; +} + static inline void f2fs_radix_tree_insert(struct radix_tree_root *root, unsigned long index, void *item) { @@ -1193,6 +1520,24 @@ static inline int f2fs_test_bit(unsigned int nr, char *addr) return mask & *addr; } +static inline void f2fs_set_bit(unsigned int nr, char *addr) +{ + int mask; + + addr += (nr >> 3); + mask = 1 << (7 - (nr & 0x07)); + *addr |= mask; +} + +static inline void f2fs_clear_bit(unsigned int nr, char *addr) +{ + int mask; + + addr += (nr >> 3); + mask = 1 << (7 - (nr & 0x07)); + *addr &= ~mask; +} + static inline int f2fs_test_and_set_bit(unsigned int nr, char *addr) { int mask; @@ -1230,12 +1575,12 @@ static inline void f2fs_change_bit(unsigned int nr, char *addr) enum { FI_NEW_INODE, /* indicate newly allocated inode */ FI_DIRTY_INODE, /* indicate inode is dirty or not */ + FI_AUTO_RECOVER, /* indicate inode is recoverable */ FI_DIRTY_DIR, /* indicate directory has dirty pages */ FI_INC_LINK, /* need to increment i_nlink */ FI_ACL_MODE, /* indicate acl mode */ FI_NO_ALLOC, /* should not allocate any blocks */ - FI_UPDATE_DIR, /* should update inode block for consistency */ - FI_DELAY_IPUT, /* used for the recovery */ + FI_FREE_NID, /* free allocated nid */ FI_NO_EXTENT, /* not to use the extent cache */ FI_INLINE_XATTR, /* used for inline xattr */ FI_INLINE_DATA, /* used for inline data */ @@ -1249,71 +1594,152 @@ enum { FI_DROP_CACHE, /* drop dirty page cache */ FI_DATA_EXIST, /* indicate data exists */ FI_INLINE_DOTS, /* indicate inline dot dentries */ + FI_DO_DEFRAG, /* indicate defragment is running */ + FI_DIRTY_FILE, /* indicate regular/symlink has dirty pages */ }; -static inline void set_inode_flag(struct f2fs_inode_info *fi, int flag) +static inline void __mark_inode_dirty_flag(struct inode *inode, + int flag, bool set) +{ + switch (flag) { + case FI_INLINE_XATTR: + case FI_INLINE_DATA: + case FI_INLINE_DENTRY: + if (set) + return; + case FI_DATA_EXIST: + case FI_INLINE_DOTS: + f2fs_mark_inode_dirty_sync(inode); + } +} + +static inline void set_inode_flag(struct inode *inode, int flag) +{ + if (!test_bit(flag, &F2FS_I(inode)->flags)) + set_bit(flag, &F2FS_I(inode)->flags); + __mark_inode_dirty_flag(inode, flag, true); +} + +static inline int is_inode_flag_set(struct inode *inode, int flag) +{ + return test_bit(flag, &F2FS_I(inode)->flags); +} + +static inline void clear_inode_flag(struct inode *inode, int flag) +{ + if (test_bit(flag, &F2FS_I(inode)->flags)) + clear_bit(flag, &F2FS_I(inode)->flags); + __mark_inode_dirty_flag(inode, flag, false); +} + +static inline void set_acl_inode(struct inode *inode, umode_t mode) +{ + F2FS_I(inode)->i_acl_mode = mode; + set_inode_flag(inode, FI_ACL_MODE); + f2fs_mark_inode_dirty_sync(inode); +} + +static inline void f2fs_i_links_write(struct inode *inode, bool inc) +{ + if (inc) + inc_nlink(inode); + else + drop_nlink(inode); + f2fs_mark_inode_dirty_sync(inode);
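+ /* like the other f2fs_i_*_write() helpers below, the in-core field update is paired with f2fs_mark_inode_dirty_sync() so the change is propagated into the on-disk inode */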
+} + +static inline void f2fs_i_blocks_write(struct inode *inode, + blkcnt_t diff, bool add) +{ + bool clean = !is_inode_flag_set(inode, FI_DIRTY_INODE); + bool recover = is_inode_flag_set(inode, FI_AUTO_RECOVER); + + inode->i_blocks = add ? inode->i_blocks + diff : + inode->i_blocks - diff; + f2fs_mark_inode_dirty_sync(inode); + if (clean || recover) + set_inode_flag(inode, FI_AUTO_RECOVER); +} + +static inline void f2fs_i_size_write(struct inode *inode, loff_t i_size) +{ + bool clean = !is_inode_flag_set(inode, FI_DIRTY_INODE); + bool recover = is_inode_flag_set(inode, FI_AUTO_RECOVER); + + if (i_size_read(inode) == i_size) + return; + + i_size_write(inode, i_size); + f2fs_mark_inode_dirty_sync(inode); + if (clean || recover) + set_inode_flag(inode, FI_AUTO_RECOVER); +} + +static inline bool f2fs_skip_inode_update(struct inode *inode) { - if (!test_bit(flag, &fi->flags)) - set_bit(flag, &fi->flags); + if (!is_inode_flag_set(inode, FI_AUTO_RECOVER)) + return false; + return F2FS_I(inode)->last_disk_size == i_size_read(inode); } -static inline int is_inode_flag_set(struct f2fs_inode_info *fi, int flag) +static inline void f2fs_i_depth_write(struct inode *inode, unsigned int depth) { - return test_bit(flag, &fi->flags); + F2FS_I(inode)->i_current_depth = depth; + f2fs_mark_inode_dirty_sync(inode); } -static inline void clear_inode_flag(struct f2fs_inode_info *fi, int flag) +static inline void f2fs_i_xnid_write(struct inode *inode, nid_t xnid) { - if (test_bit(flag, &fi->flags)) - clear_bit(flag, &fi->flags); + F2FS_I(inode)->i_xattr_nid = xnid; + f2fs_mark_inode_dirty_sync(inode); } -static inline void set_acl_inode(struct f2fs_inode_info *fi, umode_t mode) +static inline void f2fs_i_pino_write(struct inode *inode, nid_t pino) { - fi->i_acl_mode = mode; - set_inode_flag(fi, FI_ACL_MODE); + F2FS_I(inode)->i_pino = pino; + f2fs_mark_inode_dirty_sync(inode); } -static inline void get_inline_info(struct f2fs_inode_info *fi, - struct f2fs_inode *ri) +static inline void get_inline_info(struct inode *inode, struct f2fs_inode *ri) { + struct f2fs_inode_info *fi = F2FS_I(inode); + if (ri->i_inline & F2FS_INLINE_XATTR) - set_inode_flag(fi, FI_INLINE_XATTR); + set_bit(FI_INLINE_XATTR, &fi->flags); if (ri->i_inline & F2FS_INLINE_DATA) - set_inode_flag(fi, FI_INLINE_DATA); + set_bit(FI_INLINE_DATA, &fi->flags); if (ri->i_inline & F2FS_INLINE_DENTRY) - set_inode_flag(fi, FI_INLINE_DENTRY); + set_bit(FI_INLINE_DENTRY, &fi->flags); if (ri->i_inline & F2FS_DATA_EXIST) - set_inode_flag(fi, FI_DATA_EXIST); + set_bit(FI_DATA_EXIST, &fi->flags); if (ri->i_inline & F2FS_INLINE_DOTS) - set_inode_flag(fi, FI_INLINE_DOTS); + set_bit(FI_INLINE_DOTS, &fi->flags); } -static inline void set_raw_inline(struct f2fs_inode_info *fi, - struct f2fs_inode *ri) +static inline void set_raw_inline(struct inode *inode, struct f2fs_inode *ri) { ri->i_inline = 0; - if (is_inode_flag_set(fi, FI_INLINE_XATTR)) + if (is_inode_flag_set(inode, FI_INLINE_XATTR)) ri->i_inline |= F2FS_INLINE_XATTR; - if (is_inode_flag_set(fi, FI_INLINE_DATA)) + if (is_inode_flag_set(inode, FI_INLINE_DATA)) ri->i_inline |= F2FS_INLINE_DATA; - if (is_inode_flag_set(fi, FI_INLINE_DENTRY)) + if (is_inode_flag_set(inode, FI_INLINE_DENTRY)) ri->i_inline |= F2FS_INLINE_DENTRY; - if (is_inode_flag_set(fi, FI_DATA_EXIST)) + if (is_inode_flag_set(inode, FI_DATA_EXIST)) ri->i_inline |= F2FS_DATA_EXIST; - if (is_inode_flag_set(fi, FI_INLINE_DOTS)) + if (is_inode_flag_set(inode, FI_INLINE_DOTS)) ri->i_inline |= F2FS_INLINE_DOTS; } static inline int 
f2fs_has_inline_xattr(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_INLINE_XATTR); + return is_inode_flag_set(inode, FI_INLINE_XATTR); } -static inline unsigned int addrs_per_inode(struct f2fs_inode_info *fi) +static inline unsigned int addrs_per_inode(struct inode *inode) { - if (f2fs_has_inline_xattr(&fi->vfs_inode)) + if (f2fs_has_inline_xattr(inode)) return DEF_ADDRS_PER_INODE - F2FS_INLINE_XATTR_ADDRS; return DEF_ADDRS_PER_INODE; } @@ -1335,43 +1761,43 @@ static inline int inline_xattr_size(struct inode *inode) static inline int f2fs_has_inline_data(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_INLINE_DATA); + return is_inode_flag_set(inode, FI_INLINE_DATA); } static inline void f2fs_clear_inline_inode(struct inode *inode) { - clear_inode_flag(F2FS_I(inode), FI_INLINE_DATA); - clear_inode_flag(F2FS_I(inode), FI_DATA_EXIST); + clear_inode_flag(inode, FI_INLINE_DATA); + clear_inode_flag(inode, FI_DATA_EXIST); } static inline int f2fs_exist_data(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_DATA_EXIST); + return is_inode_flag_set(inode, FI_DATA_EXIST); } static inline int f2fs_has_inline_dots(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_INLINE_DOTS); + return is_inode_flag_set(inode, FI_INLINE_DOTS); } static inline bool f2fs_is_atomic_file(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_ATOMIC_FILE); + return is_inode_flag_set(inode, FI_ATOMIC_FILE); } static inline bool f2fs_is_volatile_file(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_VOLATILE_FILE); + return is_inode_flag_set(inode, FI_VOLATILE_FILE); } static inline bool f2fs_is_first_block_written(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_FIRST_BLOCK_WRITTEN); + return is_inode_flag_set(inode, FI_FIRST_BLOCK_WRITTEN); } static inline bool f2fs_is_drop_cache(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_DROP_CACHE); + return is_inode_flag_set(inode, FI_DROP_CACHE); } static inline void *inline_data_addr(struct page *page) @@ -1382,7 +1808,7 @@ static inline void *inline_data_addr(struct page *page) static inline int f2fs_has_inline_dentry(struct inode *inode) { - return is_inode_flag_set(F2FS_I(inode), FI_INLINE_DENTRY); + return is_inode_flag_set(inode, FI_INLINE_DENTRY); } static inline void f2fs_dentry_kunmap(struct inode *dir, struct page *page) @@ -1391,6 +1817,23 @@ static inline void f2fs_dentry_kunmap(struct inode *dir, struct page *page) kunmap(page); } +static inline int is_file(struct inode *inode, int type) +{ + return F2FS_I(inode)->i_advise & type; +} + +static inline void set_file(struct inode *inode, int type) +{ + F2FS_I(inode)->i_advise |= type; + f2fs_mark_inode_dirty_sync(inode); +} + +static inline void clear_file(struct inode *inode, int type) +{ + F2FS_I(inode)->i_advise &= ~type; + f2fs_mark_inode_dirty_sync(inode); +} + static inline int f2fs_readonly(struct super_block *sb) { return sb->s_flags & MS_RDONLY; @@ -1401,26 +1844,79 @@ static inline bool f2fs_cp_error(struct f2fs_sb_info *sbi) return is_set_ckpt_flags(sbi->ckpt, CP_ERROR_FLAG); } -static inline void f2fs_stop_checkpoint(struct f2fs_sb_info *sbi) +static inline struct inode *file_inode(struct file *f) { - set_ckpt_flags(sbi->ckpt, CP_ERROR_FLAG); - sbi->sb->s_flags |= MS_RDONLY; + return f->f_path.dentry->d_inode; } -static inline struct inode *file_inode(struct file *f) +static inline bool is_dot_dotdot(const struct qstr *str) { - return f->f_path.dentry->d_inode; + if 
(str->len == 1 && str->name[0] == '.') + return true; + + if (str->len == 2 && str->name[0] == '.' && str->name[1] == '.') + return true; + + return false; +} + +static inline bool f2fs_may_extent_tree(struct inode *inode) +{ + mode_t mode = inode->i_mode; + + if (!test_opt(F2FS_I_SB(inode), EXTENT_CACHE) || + is_inode_flag_set(inode, FI_NO_EXTENT)) + return false; + + return S_ISREG(mode); +} + +static inline void *f2fs_kmalloc(size_t size, gfp_t flags) +{ +#ifdef CONFIG_F2FS_FAULT_INJECTION + if (time_to_inject(FAULT_KMALLOC)) + return NULL; +#endif + return kmalloc(size, flags); +} + +static inline void *f2fs_kvmalloc(size_t size, gfp_t flags) +{ + void *ret; + + ret = kmalloc(size, flags | __GFP_NOWARN); + if (!ret) + ret = __vmalloc(size, flags, PAGE_KERNEL); + return ret; +} + +static inline void *f2fs_kvzalloc(size_t size, gfp_t flags) +{ + void *ret; + + ret = kzalloc(size, flags | __GFP_NOWARN); + if (!ret) + ret = __vmalloc(size, flags | __GFP_ZERO, PAGE_KERNEL); + return ret; +} + +static inline void f2fs_kvfree(void *ptr) +{ + if (is_vmalloc_addr(ptr)) + vfree(ptr); + else + kfree(ptr); } #define get_inode_mode(i) \ - ((is_inode_flag_set(F2FS_I(i), FI_ACL_MODE)) ? \ + ((is_inode_flag_set(i, FI_ACL_MODE)) ? \ (F2FS_I(i)->i_acl_mode) : ((i)->i_mode)) /* get offset of first page in next direct node */ -#define PGOFS_OF_NEXT_DNODE(pgofs, fi) \ - ((pgofs < ADDRS_PER_INODE(fi)) ? ADDRS_PER_INODE(fi) : \ - (pgofs - ADDRS_PER_INODE(fi) + ADDRS_PER_BLOCK) / \ - ADDRS_PER_BLOCK * ADDRS_PER_BLOCK + ADDRS_PER_INODE(fi)) +#define PGOFS_OF_NEXT_DNODE(pgofs, inode) \ + ((pgofs < ADDRS_PER_INODE(inode)) ? ADDRS_PER_INODE(inode) : \ + (pgofs - ADDRS_PER_INODE(inode) + ADDRS_PER_BLOCK) / \ + ADDRS_PER_BLOCK * ADDRS_PER_BLOCK + ADDRS_PER_INODE(inode)) /* * file.c @@ -1428,7 +1924,7 @@ static inline struct inode *file_inode(struct file *f) int f2fs_sync_file(struct file *, loff_t, loff_t, int); void truncate_data_blocks(struct dnode_of_data *); int truncate_blocks(struct inode *, u64, bool); -void f2fs_truncate(struct inode *); +int f2fs_truncate(struct inode *); int f2fs_getattr(struct vfsmount *, struct dentry *, struct kstat *); int f2fs_setattr(struct dentry *, struct iattr *); int truncate_hole(struct inode *, pgoff_t, pgoff_t); @@ -1442,8 +1938,8 @@ long f2fs_compat_ioctl(struct file *, unsigned int, unsigned long); void f2fs_set_inode_flags(struct inode *); struct inode *f2fs_iget(struct super_block *, unsigned long); int try_to_free_nats(struct f2fs_sb_info *, int); -void update_inode(struct inode *, struct page *); -void update_inode_page(struct inode *); +int update_inode(struct inode *, struct page *); +int update_inode_page(struct inode *); int f2fs_write_inode(struct inode *, struct writeback_control *); void f2fs_evict_inode(struct inode *); void handle_failed_inode(struct inode *); @@ -1458,46 +1954,53 @@ struct dentry *f2fs_get_parent(struct dentry *child); */ extern unsigned char f2fs_filetype_table[F2FS_FT_MAX]; void set_de_type(struct f2fs_dir_entry *, umode_t); -struct f2fs_dir_entry *find_target_dentry(struct qstr *, int *, - struct f2fs_dentry_ptr *); +unsigned char get_de_type(struct f2fs_dir_entry *); +struct f2fs_dir_entry *find_target_dentry(struct fscrypt_name *, + f2fs_hash_t, int *, struct f2fs_dentry_ptr *); bool f2fs_fill_dentries(struct file *, void *, filldir_t, - struct f2fs_dentry_ptr *, unsigned int, unsigned int); + struct f2fs_dentry_ptr *, unsigned int, + unsigned int, struct fscrypt_str *); void do_make_empty_dir(struct inode *, struct inode *, struct 
f2fs_dentry_ptr *); struct page *init_inode_metadata(struct inode *, struct inode *, const struct qstr *, struct page *); void update_parent_metadata(struct inode *, struct inode *, unsigned int); int room_for_filename(const void *, int, int); -void f2fs_drop_nlink(struct inode *, struct inode *, struct page *); +void f2fs_drop_nlink(struct inode *, struct inode *); struct f2fs_dir_entry *f2fs_find_entry(struct inode *, struct qstr *, struct page **); struct f2fs_dir_entry *f2fs_parent_dir(struct inode *, struct page **); -ino_t f2fs_inode_by_name(struct inode *, struct qstr *); +ino_t f2fs_inode_by_name(struct inode *, struct qstr *, struct page **); void f2fs_set_link(struct inode *, struct f2fs_dir_entry *, struct page *, struct inode *); -int update_dent_inode(struct inode *, const struct qstr *); +int update_dent_inode(struct inode *, struct inode *, const struct qstr *); void f2fs_update_dentry(nid_t ino, umode_t mode, struct f2fs_dentry_ptr *, const struct qstr *, f2fs_hash_t , unsigned int); +int f2fs_add_regular_entry(struct inode *, const struct qstr *, + struct inode *, nid_t, umode_t); int __f2fs_add_link(struct inode *, const struct qstr *, struct inode *, nid_t, umode_t); void f2fs_delete_entry(struct f2fs_dir_entry *, struct page *, struct inode *, struct inode *); int f2fs_do_tmpfile(struct inode *, struct inode *); -int f2fs_make_empty(struct inode *, struct inode *); bool f2fs_empty_dir(struct inode *); static inline int f2fs_add_link(struct dentry *dentry, struct inode *inode) { - return __f2fs_add_link(dentry->d_parent->d_inode, &dentry->d_name, + return __f2fs_add_link(d_inode(dentry->d_parent), &dentry->d_name, inode, inode->i_ino, inode->i_mode); } /* * super.c */ +int f2fs_inode_dirtied(struct inode *); +void f2fs_inode_synced(struct inode *); +int f2fs_commit_super(struct f2fs_sb_info *, bool); int f2fs_sync_fs(struct super_block *, int); extern __printf(3, 4) void f2fs_msg(struct super_block *, const char *, const char *, ...); +int sanity_check_ckpt(struct f2fs_sb_info *sbi); /* * hash.c @@ -1511,25 +2014,30 @@ struct dnode_of_data; struct node_info; bool available_free_memory(struct f2fs_sb_info *, int); +int need_dentry_mark(struct f2fs_sb_info *, nid_t); bool is_checkpointed_node(struct f2fs_sb_info *, nid_t); -bool has_fsynced_inode(struct f2fs_sb_info *, nid_t); bool need_inode_block_update(struct f2fs_sb_info *, nid_t); void get_node_info(struct f2fs_sb_info *, nid_t, struct node_info *); +pgoff_t get_next_page_offset(struct dnode_of_data *, pgoff_t); int get_dnode_of_data(struct dnode_of_data *, pgoff_t, int); int truncate_inode_blocks(struct inode *, pgoff_t); int truncate_xattr_node(struct inode *, struct page *); int wait_on_node_pages_writeback(struct f2fs_sb_info *, nid_t); -void remove_inode_page(struct inode *); +int remove_inode_page(struct inode *); struct page *new_inode_page(struct inode *); struct page *new_node_page(struct dnode_of_data *, unsigned int, struct page *); void ra_node_page(struct f2fs_sb_info *, nid_t); struct page *get_node_page(struct f2fs_sb_info *, pgoff_t); struct page *get_node_page_ra(struct page *, int); -void sync_inode_page(struct dnode_of_data *); -int sync_node_pages(struct f2fs_sb_info *, nid_t, struct writeback_control *); +void move_node_page(struct page *, int); +int fsync_node_pages(struct f2fs_sb_info *, struct inode *, + struct writeback_control *, bool); +int sync_node_pages(struct f2fs_sb_info *, struct writeback_control *); +void build_free_nids(struct f2fs_sb_info *); bool alloc_nid(struct f2fs_sb_info *, 
nid_t *); void alloc_nid_done(struct f2fs_sb_info *, nid_t); void alloc_nid_failed(struct f2fs_sb_info *, nid_t); +int try_to_free_nids(struct f2fs_sb_info *, int); void recover_inline_xattr(struct inode *, struct page *); void recover_xattr_data(struct inode *, struct page *, block_t); int recover_inode_page(struct f2fs_sb_info *, struct page *); @@ -1545,36 +2053,39 @@ void destroy_node_manager_caches(void); * segment.c */ void register_inmem_page(struct inode *, struct page *); -void commit_inmem_pages(struct inode *, bool); -void f2fs_balance_fs(struct f2fs_sb_info *); +void drop_inmem_pages(struct inode *); +int commit_inmem_pages(struct inode *); +void f2fs_balance_fs(struct f2fs_sb_info *, bool); void f2fs_balance_fs_bg(struct f2fs_sb_info *); int f2fs_issue_flush(struct f2fs_sb_info *); int create_flush_cmd_control(struct f2fs_sb_info *); void destroy_flush_cmd_control(struct f2fs_sb_info *); void invalidate_blocks(struct f2fs_sb_info *, block_t); +bool is_checkpointed_data(struct f2fs_sb_info *, block_t); void refresh_sit_entry(struct f2fs_sb_info *, block_t, block_t); -void clear_prefree_segments(struct f2fs_sb_info *); +void clear_prefree_segments(struct f2fs_sb_info *, struct cp_control *); void release_discard_addrs(struct f2fs_sb_info *); -void discard_next_dnode(struct f2fs_sb_info *, block_t); +bool discard_next_dnode(struct f2fs_sb_info *, block_t); int npages_for_summary_flush(struct f2fs_sb_info *, bool); void allocate_new_segments(struct f2fs_sb_info *); int f2fs_trim_fs(struct f2fs_sb_info *, struct fstrim_range *); struct page *get_sum_page(struct f2fs_sb_info *, unsigned int); +void update_meta_page(struct f2fs_sb_info *, void *, block_t); void write_meta_page(struct f2fs_sb_info *, struct page *); -void write_node_page(struct f2fs_sb_info *, struct page *, - unsigned int, struct f2fs_io_info *); -void write_data_page(struct page *, struct dnode_of_data *, - struct f2fs_io_info *); -void rewrite_data_page(struct page *, struct f2fs_io_info *); -void recover_data_page(struct f2fs_sb_info *, struct page *, - struct f2fs_summary *, block_t, block_t); +void write_node_page(unsigned int, struct f2fs_io_info *); +void write_data_page(struct dnode_of_data *, struct f2fs_io_info *); +void rewrite_data_page(struct f2fs_io_info *); +void __f2fs_replace_block(struct f2fs_sb_info *, struct f2fs_summary *, + block_t, block_t, bool, bool); +void f2fs_replace_block(struct f2fs_sb_info *, struct dnode_of_data *, + block_t, block_t, unsigned char, bool, bool); void allocate_data_block(struct f2fs_sb_info *, struct page *, block_t, block_t *, struct f2fs_summary *, int); -void f2fs_wait_on_page_writeback(struct page *, enum page_type); +void f2fs_wait_on_page_writeback(struct page *, enum page_type, bool); +void f2fs_wait_on_encrypted_page_writeback(struct f2fs_sb_info *, block_t); void write_data_summaries(struct f2fs_sb_info *, block_t); void write_node_summaries(struct f2fs_sb_info *, block_t); -int lookup_journal_in_cursum(struct f2fs_summary_block *, - int, unsigned int, int); +int lookup_journal_in_cursum(struct f2fs_journal *, int, unsigned int, int); void flush_sit_entries(struct f2fs_sb_info *, struct cp_control *); int build_segment_manager(struct f2fs_sb_info *); void destroy_segment_manager(struct f2fs_sb_info *); @@ -1584,26 +2095,29 @@ void destroy_segment_manager_caches(void); /* * checkpoint.c */ +void f2fs_stop_checkpoint(struct f2fs_sb_info *, bool); struct page *grab_meta_page(struct f2fs_sb_info *, pgoff_t); struct page *get_meta_page(struct f2fs_sb_info *, 
pgoff_t); -int ra_meta_pages(struct f2fs_sb_info *, block_t, int, int); +struct page *get_tmp_page(struct f2fs_sb_info *, pgoff_t); +bool is_valid_blkaddr(struct f2fs_sb_info *, block_t, int); +int ra_meta_pages(struct f2fs_sb_info *, block_t, int, int, bool); void ra_meta_pages_cond(struct f2fs_sb_info *, pgoff_t); long sync_meta_pages(struct f2fs_sb_info *, enum page_type, long); -void add_dirty_inode(struct f2fs_sb_info *, nid_t, int type); -void remove_dirty_inode(struct f2fs_sb_info *, nid_t, int type); -void release_dirty_inode(struct f2fs_sb_info *); +void add_ino_entry(struct f2fs_sb_info *, nid_t, int type); +void remove_ino_entry(struct f2fs_sb_info *, nid_t, int type); +void release_ino_entry(struct f2fs_sb_info *, bool); bool exist_written_data(struct f2fs_sb_info *, nid_t, int); +int f2fs_sync_inode_meta(struct f2fs_sb_info *); int acquire_orphan_inode(struct f2fs_sb_info *); void release_orphan_inode(struct f2fs_sb_info *); -void add_orphan_inode(struct f2fs_sb_info *, nid_t); +void add_orphan_inode(struct inode *); void remove_orphan_inode(struct f2fs_sb_info *, nid_t); -void recover_orphan_inodes(struct f2fs_sb_info *); +int recover_orphan_inodes(struct f2fs_sb_info *); int get_valid_checkpoint(struct f2fs_sb_info *); void update_dirty_page(struct inode *, struct page *); -void add_dirty_dir_inode(struct inode *); -void remove_dirty_dir_inode(struct inode *); -void sync_dirty_dir_inodes(struct f2fs_sb_info *); -void write_checkpoint(struct f2fs_sb_info *, struct cp_control *); +void remove_dirty_inode(struct inode *); +int sync_dirty_inodes(struct f2fs_sb_info *, enum inode_type); +int write_checkpoint(struct f2fs_sb_info *, struct cp_control *); void init_ino_entry_info(struct f2fs_sb_info *); int __init create_checkpoint_caches(void); void destroy_checkpoint_caches(void); @@ -1612,26 +2126,26 @@ void destroy_checkpoint_caches(void); * data.c */ void f2fs_submit_merged_bio(struct f2fs_sb_info *, enum page_type, int); -int f2fs_submit_page_bio(struct f2fs_sb_info *, struct page *, - struct f2fs_io_info *); -void f2fs_submit_page_mbio(struct f2fs_sb_info *, struct page *, - struct f2fs_io_info *); +void f2fs_submit_merged_bio_cond(struct f2fs_sb_info *, struct inode *, + struct page *, nid_t, enum page_type, int); +void f2fs_flush_merged_bios(struct f2fs_sb_info *); +int f2fs_submit_page_bio(struct f2fs_io_info *); +void f2fs_submit_page_mbio(struct f2fs_io_info *); void set_data_blkaddr(struct dnode_of_data *); +void f2fs_update_data_blkaddr(struct dnode_of_data *, block_t); +int reserve_new_blocks(struct dnode_of_data *, blkcnt_t); int reserve_new_block(struct dnode_of_data *); +int f2fs_get_block(struct dnode_of_data *, pgoff_t); +ssize_t f2fs_preallocate_blocks(struct inode *, loff_t, size_t, bool); int f2fs_reserve_block(struct dnode_of_data *, pgoff_t); -void f2fs_shrink_extent_tree(struct f2fs_sb_info *, int); -void f2fs_destroy_extent_tree(struct inode *); -void f2fs_init_extent_cache(struct inode *, struct f2fs_extent *); -void f2fs_update_extent_cache(struct dnode_of_data *); -void f2fs_preserve_extent_tree(struct inode *); -struct page *find_data_page(struct inode *, pgoff_t, bool); -struct page *get_lock_data_page(struct inode *, pgoff_t); +struct page *get_read_data_page(struct inode *, pgoff_t, int, bool); +struct page *find_data_page(struct inode *, pgoff_t); +struct page *get_lock_data_page(struct inode *, pgoff_t, bool); struct page *get_new_data_page(struct inode *, struct page *, pgoff_t, bool); -int do_write_data_page(struct page *, struct f2fs_io_info 
*); +int do_write_data_page(struct f2fs_io_info *); +int f2fs_map_blocks(struct inode *, struct f2fs_map_blocks *, int, int); int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *, u64, u64); -void init_extent_cache_info(struct f2fs_sb_info *); -int __init create_extent_cache(void); -void destroy_extent_cache(void); +void f2fs_set_page_dirty_nobuffers(struct page *); void f2fs_invalidate_page(struct page *, unsigned long); int f2fs_release_page(struct page *, gfp_t); @@ -1640,14 +2154,14 @@ int f2fs_release_page(struct page *, gfp_t); */ int start_gc_thread(struct f2fs_sb_info *); void stop_gc_thread(struct f2fs_sb_info *); -block_t start_bidx_of_node(unsigned int, struct f2fs_inode_info *); -int f2fs_gc(struct f2fs_sb_info *); +block_t start_bidx_of_node(unsigned int, struct inode *); +int f2fs_gc(struct f2fs_sb_info *, bool); void build_gc_manager(struct f2fs_sb_info *); /* * recovery.c */ -int recover_fsync_data(struct f2fs_sb_info *); +int recover_fsync_data(struct f2fs_sb_info *, bool); bool space_for_roll_forward(struct f2fs_sb_info *); /* @@ -1659,17 +2173,21 @@ struct f2fs_stat_info { struct f2fs_sb_info *sbi; int all_area_segs, sit_area_segs, nat_area_segs, ssa_area_segs; int main_area_segs, main_area_sections, main_area_zones; - int hit_ext, total_ext, ext_tree, ext_node; - int ndirty_node, ndirty_dent, ndirty_dirs, ndirty_meta; + unsigned long long hit_largest, hit_cached, hit_rbtree; + unsigned long long hit_total, total_ext; + int ext_tree, zombie_tree, ext_node; + s64 ndirty_node, ndirty_dent, ndirty_meta, ndirty_data, inmem_pages; + unsigned int ndirty_dirs, ndirty_files, ndirty_all; int nats, dirty_nats, sits, dirty_sits, fnids; int total_count, utilization; - int bg_gc, inline_inode, inline_dir, inmem_pages, wb_pages; + int bg_gc, wb_bios; + int inline_xattr, inline_inode, inline_dir, orphans; unsigned int valid_count, valid_node_count, valid_inode_count; unsigned int bimodal, avg_vblocks; int util_free, util_valid, util_invalid; int rsvd_segs, overp_segs; int dirty_count, node_pages, meta_pages; - int prefree_count, call_count, cp_count; + int prefree_count, call_count, cp_count, bg_cp_count; int tot_segs, node_segs, data_segs, free_segs, free_secs; int bg_node_segs, bg_data_segs; int tot_blks, data_blks, node_blks; @@ -1681,7 +2199,7 @@ struct f2fs_stat_info { unsigned int segment_count[2]; unsigned int block_count[2]; unsigned int inplace_count; - unsigned base_mem, cache_mem, page_mem; + unsigned long long base_mem, cache_mem, page_mem; }; static inline struct f2fs_stat_info *F2FS_STAT(struct f2fs_sb_info *sbi) @@ -1690,12 +2208,25 @@ static inline struct f2fs_stat_info *F2FS_STAT(struct f2fs_sb_info *sbi) } #define stat_inc_cp_count(si) ((si)->cp_count++) +#define stat_inc_bg_cp_count(si) ((si)->bg_cp_count++) #define stat_inc_call_count(si) ((si)->call_count++) #define stat_inc_bggc_count(sbi) ((sbi)->bg_gc++) -#define stat_inc_dirty_dir(sbi) ((sbi)->n_dirty_dirs++) -#define stat_dec_dirty_dir(sbi) ((sbi)->n_dirty_dirs--) -#define stat_inc_total_hit(sb) ((F2FS_SB(sb))->total_hit_ext++) -#define stat_inc_read_hit(sb) ((F2FS_SB(sb))->read_hit_ext++) +#define stat_inc_dirty_inode(sbi, type) ((sbi)->ndirty_inode[type]++) +#define stat_dec_dirty_inode(sbi, type) ((sbi)->ndirty_inode[type]--) +#define stat_inc_total_hit(sbi) (atomic64_inc(&(sbi)->total_hit_ext)) +#define stat_inc_rbtree_node_hit(sbi) (atomic64_inc(&(sbi)->read_hit_rbtree)) +#define stat_inc_largest_node_hit(sbi) (atomic64_inc(&(sbi)->read_hit_largest)) +#define stat_inc_cached_node_hit(sbi) 
(atomic64_inc(&(sbi)->read_hit_cached)) +#define stat_inc_inline_xattr(inode) \ + do { \ + if (f2fs_has_inline_xattr(inode)) \ + (atomic_inc(&F2FS_I_SB(inode)->inline_xattr)); \ + } while (0) +#define stat_dec_inline_xattr(inode) \ + do { \ + if (f2fs_has_inline_xattr(inode)) \ + (atomic_dec(&F2FS_I_SB(inode)->inline_xattr)); \ + } while (0) #define stat_inc_inline_inode(inode) \ do { \ if (f2fs_has_inline_data(inode)) \ @@ -1756,16 +2287,21 @@ static inline struct f2fs_stat_info *F2FS_STAT(struct f2fs_sb_info *sbi) int f2fs_build_stats(struct f2fs_sb_info *); void f2fs_destroy_stats(struct f2fs_sb_info *); -void __init f2fs_create_root_stats(void); +int __init f2fs_create_root_stats(void); void f2fs_destroy_root_stats(void); #else #define stat_inc_cp_count(si) +#define stat_inc_bg_cp_count(si) #define stat_inc_call_count(si) #define stat_inc_bggc_count(si) -#define stat_inc_dirty_dir(sbi) -#define stat_dec_dirty_dir(sbi) +#define stat_inc_dirty_inode(sbi, type) +#define stat_dec_dirty_inode(sbi, type) #define stat_inc_total_hit(sb) -#define stat_inc_read_hit(sb) +#define stat_inc_rbtree_node_hit(sb) +#define stat_inc_largest_node_hit(sbi) +#define stat_inc_cached_node_hit(sbi) +#define stat_inc_inline_xattr(inode) +#define stat_dec_inline_xattr(inode) #define stat_inc_inline_inode(inode) #define stat_dec_inline_inode(inode) #define stat_inc_inline_dir(inode) @@ -1780,7 +2316,7 @@ void f2fs_destroy_root_stats(void); static inline int f2fs_build_stats(struct f2fs_sb_info *sbi) { return 0; } static inline void f2fs_destroy_stats(struct f2fs_sb_info *sbi) { } -static inline void __init f2fs_create_root_stats(void) { } +static inline int __init f2fs_create_root_stats(void) { return 0; } static inline void f2fs_destroy_root_stats(void) { } #endif @@ -1792,13 +2328,15 @@ extern const struct address_space_operations f2fs_node_aops; extern const struct address_space_operations f2fs_meta_aops; extern const struct inode_operations f2fs_dir_inode_operations; extern const struct inode_operations f2fs_symlink_inode_operations; +extern const struct inode_operations f2fs_encrypted_symlink_inode_operations; extern const struct inode_operations f2fs_special_inode_operations; extern struct kmem_cache *inode_entry_slab; /* * inline.c */ -bool f2fs_may_inline(struct inode *); +bool f2fs_may_inline_data(struct inode *); +bool f2fs_may_inline_dentry(struct inode *); void read_inline_data(struct page *, struct page *); bool truncate_inline_inode(struct page *, u64); int f2fs_read_inline_data(struct inode *, struct page *); @@ -1806,14 +2344,121 @@ int f2fs_convert_inline_page(struct dnode_of_data *, struct page *); int f2fs_convert_inline_inode(struct inode *); int f2fs_write_inline_data(struct inode *, struct page *); bool recover_inline_data(struct inode *, struct page *); -struct f2fs_dir_entry *find_in_inline_dir(struct inode *, struct qstr *, - struct page **); -struct f2fs_dir_entry *f2fs_parent_inline_dir(struct inode *, struct page **); +struct f2fs_dir_entry *find_in_inline_dir(struct inode *, + struct fscrypt_name *, struct page **); int make_empty_inline_dir(struct inode *inode, struct inode *, struct page *); int f2fs_add_inline_entry(struct inode *, const struct qstr *, struct inode *, nid_t, umode_t); void f2fs_delete_inline_entry(struct f2fs_dir_entry *, struct page *, struct inode *, struct inode *); bool f2fs_empty_inline_dir(struct inode *); -int f2fs_read_inline_dir(struct file *, void *, filldir_t); +int f2fs_read_inline_dir(struct file *, void *, filldir_t, + struct fscrypt_str *); +int 
f2fs_inline_data_fiemap(struct inode *, + struct fiemap_extent_info *, __u64, __u64); + +/* + * shrinker.c + */ +int f2fs_shrink_count(struct shrinker *, struct shrink_control *); +int f2fs_shrink_scan(struct shrinker *, struct shrink_control *); +void f2fs_join_shrinker(struct f2fs_sb_info *); +void f2fs_leave_shrinker(struct f2fs_sb_info *); + +/* + * extent_cache.c + */ +unsigned int f2fs_shrink_extent_tree(struct f2fs_sb_info *, int); +bool f2fs_init_extent_tree(struct inode *, struct f2fs_extent *); +void f2fs_drop_extent_tree(struct inode *); +unsigned int f2fs_destroy_extent_node(struct inode *); +void f2fs_destroy_extent_tree(struct inode *); +bool f2fs_lookup_extent_cache(struct inode *, pgoff_t, struct extent_info *); +void f2fs_update_extent_cache(struct dnode_of_data *); +void f2fs_update_extent_cache_range(struct dnode_of_data *dn, + pgoff_t, block_t, unsigned int); +void init_extent_cache_info(struct f2fs_sb_info *); +int __init create_extent_cache(void); +void destroy_extent_cache(void); + +/* + * crypto support + */ +static inline bool f2fs_encrypted_inode(struct inode *inode) +{ + return file_is_encrypt(inode); +} + +static inline void f2fs_set_encrypted_inode(struct inode *inode) +{ +#ifdef CONFIG_F2FS_FS_ENCRYPTION + file_set_encrypt(inode); +#endif +} + +static inline bool f2fs_bio_encrypted(struct bio *bio) +{ + return bio->bi_private != NULL; +} + +static inline int f2fs_sb_has_crypto(struct super_block *sb) +{ + return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_ENCRYPT); +} + +static inline int f2fs_sb_mounted_hmsmr(struct super_block *sb) +{ + return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_HMSMR); +} + +static inline void set_opt_mode(struct f2fs_sb_info *sbi, unsigned int mt) +{ + clear_opt(sbi, ADAPTIVE); + clear_opt(sbi, LFS); + + switch (mt) { + case F2FS_MOUNT_ADAPTIVE: + set_opt(sbi, ADAPTIVE); + break; + case F2FS_MOUNT_LFS: + set_opt(sbi, LFS); + break; + } +} + +static inline bool f2fs_may_encrypt(struct inode *inode) +{ +#ifdef CONFIG_F2FS_FS_ENCRYPTION + mode_t mode = inode->i_mode; + + return (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)); +#else + return 0; +#endif +} + +#ifndef CONFIG_F2FS_FS_ENCRYPTION +#define fscrypt_set_d_op(i) +#define fscrypt_get_ctx fscrypt_notsupp_get_ctx +#define fscrypt_release_ctx fscrypt_notsupp_release_ctx +#define fscrypt_encrypt_page fscrypt_notsupp_encrypt_page +#define fscrypt_decrypt_page fscrypt_notsupp_decrypt_page +#define fscrypt_decrypt_bio_pages fscrypt_notsupp_decrypt_bio_pages +#define fscrypt_pullback_bio_page fscrypt_notsupp_pullback_bio_page +#define fscrypt_restore_control_page fscrypt_notsupp_restore_control_page +#define fscrypt_zeroout_range fscrypt_notsupp_zeroout_range +#define fscrypt_process_policy fscrypt_notsupp_process_policy +#define fscrypt_get_policy fscrypt_notsupp_get_policy +#define fscrypt_has_permitted_context fscrypt_notsupp_has_permitted_context +#define fscrypt_inherit_context fscrypt_notsupp_inherit_context +#define fscrypt_get_encryption_info fscrypt_notsupp_get_encryption_info +#define fscrypt_put_encryption_info fscrypt_notsupp_put_encryption_info +#define fscrypt_setup_filename fscrypt_notsupp_setup_filename +#define fscrypt_free_filename fscrypt_notsupp_free_filename +#define fscrypt_fname_encrypted_size fscrypt_notsupp_fname_encrypted_size +#define fscrypt_fname_alloc_buffer fscrypt_notsupp_fname_alloc_buffer +#define fscrypt_fname_free_buffer fscrypt_notsupp_fname_free_buffer +#define fscrypt_fname_disk_to_usr fscrypt_notsupp_fname_disk_to_usr +#define fscrypt_fname_usr_to_disk 
fscrypt_notsupp_fname_usr_to_disk +#endif #endif diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c old mode 100644 new mode 100755 index 5d8f0ec77fd7..8ec1e70fd0e3 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -20,12 +20,15 @@ #include #include #include +#include +#include #include "f2fs.h" #include "node.h" #include "segment.h" #include "xattr.h" #include "acl.h" +#include "gc.h" #include "trace.h" #include @@ -38,8 +41,6 @@ static int f2fs_vm_page_mkwrite(struct vm_area_struct *vma, struct dnode_of_data dn; int err; - f2fs_balance_fs(sbi); - vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE); f2fs_bug_on(sbi, f2fs_has_inline_data(inode)); @@ -55,6 +56,8 @@ static int f2fs_vm_page_mkwrite(struct vm_area_struct *vma, f2fs_put_dnode(&dn); f2fs_unlock_op(sbi); + f2fs_balance_fs(sbi, dn.node_changed); + file_update_time(vma->vm_file); lock_page(page); if (unlikely(page->mapping != inode->i_mapping || @@ -72,19 +75,29 @@ static int f2fs_vm_page_mkwrite(struct vm_area_struct *vma, goto mapped; /* page is wholly or partially inside EOF */ - if (((page->index + 1) << PAGE_CACHE_SHIFT) > i_size_read(inode)) { + if (((loff_t)(page->index + 1) << PAGE_SHIFT) > + i_size_read(inode)) { unsigned offset; - offset = i_size_read(inode) & ~PAGE_CACHE_MASK; - zero_user_segment(page, offset, PAGE_CACHE_SIZE); + offset = i_size_read(inode) & ~PAGE_MASK; + zero_user_segment(page, offset, PAGE_SIZE); } set_page_dirty(page); - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); trace_f2fs_vm_page_mkwrite(page, DATA); mapped: /* fill the page */ - f2fs_wait_on_page_writeback(page, DATA); + f2fs_wait_on_page_writeback(page, DATA, false); + + /* wait for GCed encrypted page writeback */ + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) + f2fs_wait_on_encrypted_page_writeback(sbi, dn.data_blkaddr); + + /* if gced page is attached, don't write to cold segment */ + clear_cold_data(page); out: + f2fs_update_time(sbi, REQ_TIME); return block_page_mkwrite_return(err); } @@ -103,7 +116,7 @@ static int get_parent_ino(struct inode *inode, nid_t *pino) if (!dentry) return 0; - if (update_dent_inode(inode, &dentry->d_name)) { + if (update_dent_inode(inode, inode, &dentry->d_name)) { dput(dentry); return 0; } @@ -120,6 +133,8 @@ static inline bool need_do_checkpoint(struct inode *inode) if (!S_ISREG(inode->i_mode) || inode->i_nlink != 1) need_cp = true; + else if (file_enc_name(inode) && need_dentry_mark(sbi, inode->i_ino)) + need_cp = true; else if (file_wrong_pino(inode)) need_cp = true; else if (!space_for_roll_forward(sbi)) @@ -156,21 +171,16 @@ static void try_to_fix_pino(struct inode *inode) fi->xattr_ver = 0; if (file_wrong_pino(inode) && inode->i_nlink == 1 && get_parent_ino(inode, &pino)) { - fi->i_pino = pino; + f2fs_i_pino_write(inode, pino); file_got_pino(inode); - up_write(&fi->i_sem); - - mark_inode_dirty_sync(inode); - f2fs_write_inode(inode, NULL); - } else { - up_write(&fi->i_sem); } + up_write(&fi->i_sem); } -int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) +static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end, + int datasync, bool atomic) { struct inode *inode = file->f_mapping->host; - struct f2fs_inode_info *fi = F2FS_I(inode); struct f2fs_sb_info *sbi = F2FS_I_SB(inode); nid_t ino = inode->i_ino; int ret = 0; @@ -187,10 +197,10 @@ int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) trace_f2fs_sync_file_enter(inode); /* if fdatasync is triggered, let's do in-place-update */ - if (get_dirty_pages(inode) <= 
SM_I(sbi)->min_fsync_blocks) - set_inode_flag(fi, FI_NEED_IPU); + if (datasync || get_dirty_pages(inode) <= SM_I(sbi)->min_fsync_blocks) + set_inode_flag(inode, FI_NEED_IPU); ret = filemap_write_and_wait_range(inode->i_mapping, start, end); - clear_inode_flag(fi, FI_NEED_IPU); + clear_inode_flag(inode, FI_NEED_IPU); if (ret) { trace_f2fs_sync_file_exit(inode, need_cp, datasync, ret); @@ -198,37 +208,34 @@ int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) } /* if the inode is dirty, let's recover all the time */ - if (!datasync && is_inode_flag_set(fi, FI_DIRTY_INODE)) { - update_inode_page(inode); + if (!datasync && !f2fs_skip_inode_update(inode)) { + f2fs_write_inode(inode, NULL); goto go_write; } /* * if there is no written data, don't waste time to write recovery info. */ - if (!is_inode_flag_set(fi, FI_APPEND_WRITE) && + if (!is_inode_flag_set(inode, FI_APPEND_WRITE) && !exist_written_data(sbi, ino, APPEND_INO)) { /* it may call write_inode just prior to fsync */ if (need_inode_page_update(sbi, ino)) goto go_write; - if (is_inode_flag_set(fi, FI_UPDATE_WRITE) || + if (is_inode_flag_set(inode, FI_UPDATE_WRITE) || exist_written_data(sbi, ino, UPDATE_INO)) goto flush_out; goto out; } go_write: - /* guarantee free sections for fsync */ - f2fs_balance_fs(sbi); - /* * Both of fdatasync() and fsync() are able to be recovered from * sudden-power-off. */ - down_read(&fi->i_sem); + down_read(&F2FS_I(inode)->i_sem); need_cp = need_do_checkpoint(inode); - up_read(&fi->i_sem); + up_read(&F2FS_I(inode)->i_sem); if (need_cp) { /* all the dirty node pages should be flushed for POR */ @@ -239,19 +246,23 @@ int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * will be used only for fsynced inodes after checkpoint. 
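 * (roll-forward recovery re-links a fsynced file through its stored pino, so fixing the pino here is what lets later fsyncs of this file skip a full checkpoint)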
*/ try_to_fix_pino(inode); - clear_inode_flag(fi, FI_APPEND_WRITE); - clear_inode_flag(fi, FI_UPDATE_WRITE); + clear_inode_flag(inode, FI_APPEND_WRITE); + clear_inode_flag(inode, FI_UPDATE_WRITE); goto out; } sync_nodes: - sync_node_pages(sbi, ino, &wbc); + ret = fsync_node_pages(sbi, inode, &wbc, atomic); + if (ret) + goto out; /* if cp_error was enabled, we should avoid infinite loop */ - if (unlikely(f2fs_cp_error(sbi))) + if (unlikely(f2fs_cp_error(sbi))) { + ret = -EIO; goto out; + } if (need_inode_block_update(sbi, ino)) { - mark_inode_dirty_sync(inode); + f2fs_mark_inode_dirty_sync(inode); f2fs_write_inode(inode, NULL); goto sync_nodes; } @@ -261,18 +272,24 @@ int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) goto out; /* once recovery info is written, don't need to track this */ - remove_dirty_inode(sbi, ino, APPEND_INO); - clear_inode_flag(fi, FI_APPEND_WRITE); + remove_ino_entry(sbi, ino, APPEND_INO); + clear_inode_flag(inode, FI_APPEND_WRITE); flush_out: - remove_dirty_inode(sbi, ino, UPDATE_INO); - clear_inode_flag(fi, FI_UPDATE_WRITE); + remove_ino_entry(sbi, ino, UPDATE_INO); + clear_inode_flag(inode, FI_UPDATE_WRITE); ret = f2fs_issue_flush(sbi); + f2fs_update_time(sbi, REQ_TIME); out: trace_f2fs_sync_file_exit(inode, need_cp, datasync, ret); - f2fs_trace_ios(NULL, NULL, 1); + f2fs_trace_ios(NULL, 1); return ret; } +int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) +{ + return f2fs_do_sync_file(file, start, end, datasync, false); +} + static pgoff_t __get_first_dirty_index(struct address_space *mapping, pgoff_t pgofs, int whence) { @@ -286,7 +303,7 @@ static pgoff_t __get_first_dirty_index(struct address_space *mapping, pagevec_init(&pvec, 0); nr_pages = pagevec_lookup_tag(&pvec, mapping, &pgofs, PAGECACHE_TAG_DIRTY, 1); - pgofs = nr_pages ? pvec.pages[0]->index : LONG_MAX; + pgofs = nr_pages ?
pvec.pages[0]->index : ULONG_MAX; pagevec_release(&pvec); return pgofs; } @@ -337,7 +354,7 @@ static loff_t f2fs_seek_block(struct file *file, loff_t offset, int whence) loff_t isize; int err = 0; - mutex_lock(&inode->i_mutex); + inode_lock(inode); isize = i_size_read(inode); if (offset >= isize) @@ -350,32 +367,31 @@ static loff_t f2fs_seek_block(struct file *file, loff_t offset, int whence) goto found; } - pgofs = (pgoff_t)(offset >> PAGE_CACHE_SHIFT); + pgofs = (pgoff_t)(offset >> PAGE_SHIFT); dirty = __get_first_dirty_index(inode->i_mapping, pgofs, whence); - for (; data_ofs < isize; data_ofs = pgofs << PAGE_CACHE_SHIFT) { + for (; data_ofs < isize; data_ofs = (loff_t)pgofs << PAGE_SHIFT) { set_new_dnode(&dn, inode, NULL, NULL, 0); - err = get_dnode_of_data(&dn, pgofs, LOOKUP_NODE_RA); + err = get_dnode_of_data(&dn, pgofs, LOOKUP_NODE); if (err && err != -ENOENT) { goto fail; } else if (err == -ENOENT) { /* direct node does not exist */ if (whence == SEEK_DATA) { - pgofs = PGOFS_OF_NEXT_DNODE(pgofs, - F2FS_I(inode)); + pgofs = get_next_page_offset(&dn, pgofs); continue; } else { goto found; } } - end_offset = ADDRS_PER_PAGE(dn.node_page, F2FS_I(inode)); + end_offset = ADDRS_PER_PAGE(dn.node_page, inode); /* find data/hole in dnode block */ for (; dn.ofs_in_node < end_offset; dn.ofs_in_node++, pgofs++, - data_ofs = (loff_t)pgofs << PAGE_CACHE_SHIFT) { + data_ofs = (loff_t)pgofs << PAGE_SHIFT) { block_t blkaddr; blkaddr = datablock_addr(dn.node_page, dn.ofs_in_node); @@ -392,10 +408,10 @@ static loff_t f2fs_seek_block(struct file *file, loff_t offset, int whence) found: if (whence == SEEK_HOLE && data_ofs > isize) data_ofs = isize; - mutex_unlock(&inode->i_mutex); + inode_unlock(inode); return vfs_setpos(file, data_ofs, maxbytes); fail: - mutex_unlock(&inode->i_mutex); + inode_unlock(inode); return -ENXIO; } @@ -423,24 +439,53 @@ static loff_t f2fs_llseek(struct file *file, loff_t offset, int whence) static int f2fs_file_mmap(struct file *file, struct vm_area_struct *vma) { struct inode *inode = file_inode(file); + int err; - /* we don't need to use inline_data strictly */ - if (f2fs_has_inline_data(inode)) { - int err = f2fs_convert_inline_inode(inode); + if (f2fs_encrypted_inode(inode)) { + err = fscrypt_get_encryption_info(inode); if (err) - return err; + return 0; + if (!f2fs_encrypted_inode(inode)) + return -ENOKEY; } + /* we don't need to use inline_data strictly */ + err = f2fs_convert_inline_inode(inode); + if (err) + return err; + file_accessed(file); vma->vm_ops = &f2fs_file_vm_ops; return 0; } +static int f2fs_file_open(struct inode *inode, struct file *filp) +{ + int ret = generic_file_open(inode, filp); + struct dentry *dir; + + if (!ret && f2fs_encrypted_inode(inode)) { + ret = fscrypt_get_encryption_info(inode); + if (ret) + return -EACCES; + if (!fscrypt_has_encryption_key(inode)) + return -ENOKEY; + } + dir = dget_parent(file_dentry(filp)); + if (f2fs_encrypted_inode(d_inode(dir)) && + !fscrypt_has_permitted_context(d_inode(dir), inode)) { + dput(dir); + return -EPERM; + } + dput(dir); + return ret; +} + int truncate_data_blocks_range(struct dnode_of_data *dn, int count) { - int nr_free = 0, ofs = dn->ofs_in_node; struct f2fs_sb_info *sbi = F2FS_I_SB(dn->inode); struct f2fs_node *raw_node; + int nr_free = 0, ofs = dn->ofs_in_node, len = count; __le32 *addr; raw_node = F2FS_NODE(dn->node_page); @@ -453,20 +498,26 @@ int truncate_data_blocks_range(struct dnode_of_data *dn, int count) dn->data_blkaddr = NULL_ADDR; set_data_blkaddr(dn); - f2fs_update_extent_cache(dn);
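+ /* the per-block extent cache update is dropped here; once nr_free is known, the whole [fofs, fofs + len) range is invalidated at once via f2fs_update_extent_cache_range() below */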
@@ -453,20 +498,26 @@ int truncate_data_blocks_range(struct dnode_of_data *dn, int count)
 		dn->data_blkaddr = NULL_ADDR;
 		set_data_blkaddr(dn);
-		f2fs_update_extent_cache(dn);
 		invalidate_blocks(sbi, blkaddr);
 		if (dn->ofs_in_node == 0 && IS_INODE(dn->node_page))
-			clear_inode_flag(F2FS_I(dn->inode),
-						FI_FIRST_BLOCK_WRITTEN);
+			clear_inode_flag(dn->inode, FI_FIRST_BLOCK_WRITTEN);
 		nr_free++;
 	}
+	if (nr_free) {
+		pgoff_t fofs;
+		/*
+		 * once we invalidate valid blkaddr in range [ofs, ofs + count],
+		 * we will invalidate all blkaddr in the whole range.
+		 */
+		fofs = start_bidx_of_node(ofs_of_node(dn->node_page),
+							dn->inode) + ofs;
+		f2fs_update_extent_cache_range(dn, fofs, 0, len);
 		dec_valid_block_count(sbi, dn->inode, nr_free);
-		set_page_dirty(dn->node_page);
-		sync_inode_page(dn);
 	}
 	dn->ofs_in_node = ofs;
+	f2fs_update_time(sbi, REQ_TIME);
 	trace_f2fs_truncate_data_blocks_range(dn->inode, dn->nid,
 					dn->ofs_in_node, nr_free);
 	return nr_free;
@@ -478,28 +529,33 @@ void truncate_data_blocks(struct dnode_of_data *dn)
 }
 static int truncate_partial_data_page(struct inode *inode, u64 from,
-								bool force)
+								bool cache_only)
 {
-	unsigned offset = from & (PAGE_CACHE_SIZE - 1);
+	unsigned offset = from & (PAGE_SIZE - 1);
+	pgoff_t index = from >> PAGE_SHIFT;
+	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
-	if (!offset && !force)
+	if (!offset && !cache_only)
 		return 0;
-	page = find_data_page(inode, from >> PAGE_CACHE_SHIFT, force);
-	if (IS_ERR(page))
+	if (cache_only) {
+		page = f2fs_grab_cache_page(mapping, index, false);
+		if (page && PageUptodate(page))
+			goto truncate_out;
+		f2fs_put_page(page, 1);
 		return 0;
+	}
-	lock_page(page);
-	if (unlikely(!PageUptodate(page) ||
-			page->mapping != inode->i_mapping))
-		goto out;
-
-	f2fs_wait_on_page_writeback(page, DATA);
-	zero_user(page, offset, PAGE_CACHE_SIZE - offset);
-	if (!force)
+	page = get_lock_data_page(inode, index, true);
+	if (IS_ERR(page))
+		return 0;
+truncate_out:
+	f2fs_wait_on_page_writeback(page, DATA, true);
+	zero_user(page, offset, PAGE_SIZE - offset);
+	if (!cache_only || !f2fs_encrypted_inode(inode) ||
+			!S_ISREG(inode->i_mode))
 		set_page_dirty(page);
-out:
 	f2fs_put_page(page, 1);
 	return 0;
 }
@@ -518,6 +574,9 @@ int truncate_blocks(struct inode *inode, u64 from, bool lock)
 	free_from = (pgoff_t)F2FS_BYTES_TO_BLK(from + blocksize - 1);
+	if (free_from >= sbi->max_file_blocks)
+		goto free_partial;
+
 	if (lock)
 		f2fs_lock_op(sbi);
@@ -536,14 +595,14 @@ int truncate_blocks(struct inode *inode, u64 from, bool lock)
 	}
 	set_new_dnode(&dn, inode, ipage, NULL, 0);
-	err = get_dnode_of_data(&dn, free_from, LOOKUP_NODE);
+	err = get_dnode_of_data(&dn, free_from, LOOKUP_NODE_RA);
 	if (err) {
 		if (err == -ENOENT)
 			goto free_next;
 		goto out;
 	}
-	count = ADDRS_PER_PAGE(dn.node_page, F2FS_I(inode));
+	count = ADDRS_PER_PAGE(dn.node_page, inode);
 	count -= dn.ofs_in_node;
 	f2fs_bug_on(sbi, count < 0);
@@ -559,7 +618,7 @@ int truncate_blocks(struct inode *inode, u64 from, bool lock)
 out:
 	if (lock)
 		f2fs_unlock_op(sbi);
-
+free_partial:
 	/* lastly zero out the first data page */
 	if (!err)
 		err = truncate_partial_data_page(inode, from, truncate_page);
@@ -568,30 +627,36 @@ int truncate_blocks(struct inode *inode, u64 from, bool lock)
 	return err;
 }
-void f2fs_truncate(struct inode *inode)
+int f2fs_truncate(struct inode *inode)
 {
+	int err;
+
 	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
 				S_ISLNK(inode->i_mode)))
-		return;
+		return 0;
 	trace_f2fs_truncate(inode);
 	/* we should check inline_data size */
-	if (f2fs_has_inline_data(inode) && !f2fs_may_inline(inode)) {
-		if (f2fs_convert_inline_inode(inode))
-			return;
+	if (!f2fs_may_inline_data(inode)) {
+		err = f2fs_convert_inline_inode(inode);
+		if (err)
+			return err;
 	}
-	if (!truncate_blocks(inode, i_size_read(inode), true)) {
-		inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		mark_inode_dirty(inode);
-	}
+	err = truncate_blocks(inode, i_size_read(inode), true);
+	if (err)
+		return err;
+
+	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	f2fs_mark_inode_dirty_sync(inode);
+	return 0;
 }
 int f2fs_getattr(struct vfsmount *mnt,
 			struct dentry *dentry, struct kstat *stat)
 {
-	struct inode *inode = dentry->d_inode;
+	struct inode *inode = d_inode(dentry);
 	generic_fillattr(inode, stat);
 	stat->blocks <<= 3;
 	return 0;
@@ -600,7 +665,6 @@ int f2fs_getattr(struct vfsmount *mnt,
 #ifdef CONFIG_F2FS_FS_POSIX_ACL
 static void __setattr_copy(struct inode *inode, const struct iattr *attr)
 {
-	struct f2fs_inode_info *fi = F2FS_I(inode);
 	unsigned int ia_valid = attr->ia_valid;
 	if (ia_valid & ATTR_UID)
@@ -621,7 +685,7 @@ static void __setattr_copy(struct inode *inode, const struct iattr *attr)
 		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
 			mode &= ~S_ISGID;
-		set_acl_inode(fi, mode);
+		set_acl_inode(inode, mode);
 	}
 }
 #else
@@ -630,8 +694,7 @@ static void __setattr_copy(struct inode *inode, const struct iattr *attr)
 int f2fs_setattr(struct dentry *dentry, struct iattr *attr)
 {
-	struct inode *inode = dentry->d_inode;
-	struct f2fs_inode_info *fi = F2FS_I(inode);
+	struct inode *inode = d_inode(dentry);
 	int err;
 	err = inode_change_ok(inode, attr);
@@ -639,16 +702,30 @@ int f2fs_setattr(struct dentry *dentry, struct iattr *attr)
 		return err;
 	if (attr->ia_valid & ATTR_SIZE) {
-		if (attr->ia_size != i_size_read(inode)) {
+		if (f2fs_encrypted_inode(inode) &&
+				fscrypt_get_encryption_info(inode))
+			return -EACCES;
+
+		if (attr->ia_size <= i_size_read(inode)) {
 			truncate_setsize(inode, attr->ia_size);
-			f2fs_truncate(inode);
-			f2fs_balance_fs(F2FS_I_SB(inode));
+			err = f2fs_truncate(inode);
+			if (err)
+				return err;
+			f2fs_balance_fs(F2FS_I_SB(inode), true);
 		} else {
 			/*
-			 * giving a chance to truncate blocks past EOF which
-			 * are fallocated with FALLOC_FL_KEEP_SIZE.
+			 * do not trim all blocks after i_size if target size is
+			 * larger than i_size.
 			 */
-			f2fs_truncate(inode);
+			truncate_setsize(inode, attr->ia_size);
+
+			/* should convert inline inode here */
+			if (!f2fs_may_inline_data(inode)) {
+				err = f2fs_convert_inline_inode(inode);
+				if (err)
+					return err;
+			}
+			inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		}
 	}
@@ -656,13 +733,13 @@ int f2fs_setattr(struct dentry *dentry, struct iattr *attr)
 	if (attr->ia_valid & ATTR_MODE) {
 		err = f2fs_acl_chmod(inode);
-		if (err || is_inode_flag_set(fi, FI_ACL_MODE)) {
-			inode->i_mode = fi->i_acl_mode;
-			clear_inode_flag(fi, FI_ACL_MODE);
+		if (err || is_inode_flag_set(inode, FI_ACL_MODE)) {
+			inode->i_mode = F2FS_I(inode)->i_acl_mode;
+			clear_inode_flag(inode, FI_ACL_MODE);
 		}
 	}
-	mark_inode_dirty(inode);
+	f2fs_mark_inode_dirty_sync(inode);
 	return err;
 }
@@ -679,48 +756,58 @@ const struct inode_operations f2fs_file_inode_operations = {
 	.fiemap		= f2fs_fiemap,
 };
-static void fill_zero(struct inode *inode, pgoff_t index,
+static int fill_zero(struct inode *inode, pgoff_t index,
 					loff_t start, loff_t len)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
 	struct page *page;
 	if (!len)
-		return;
+		return 0;
-	f2fs_balance_fs(sbi);
+	f2fs_balance_fs(sbi, true);
 	f2fs_lock_op(sbi);
 	page = get_new_data_page(inode, NULL, index, false);
 	f2fs_unlock_op(sbi);
-	if (!IS_ERR(page)) {
-		f2fs_wait_on_page_writeback(page, DATA);
-		zero_user(page, start, len);
-		set_page_dirty(page);
-		f2fs_put_page(page, 1);
-	}
+	if (IS_ERR(page))
+		return PTR_ERR(page);
+
+	f2fs_wait_on_page_writeback(page, DATA, true);
+	zero_user(page, start, len);
+	set_page_dirty(page);
+	f2fs_put_page(page, 1);
+	return 0;
 }
 int truncate_hole(struct inode *inode, pgoff_t pg_start, pgoff_t pg_end)
 {
-	pgoff_t index;
 	int err;
-	for (index = pg_start; index < pg_end; index++) {
+	while (pg_start < pg_end) {
 		struct dnode_of_data dn;
+		pgoff_t end_offset, count;
 		set_new_dnode(&dn, inode, NULL, NULL, 0);
-		err = get_dnode_of_data(&dn, index, LOOKUP_NODE);
+		err = get_dnode_of_data(&dn, pg_start, LOOKUP_NODE);
 		if (err) {
-			if (err == -ENOENT)
+			if (err == -ENOENT) {
+				pg_start++;
 				continue;
+			}
 			return err;
 		}
-		if (dn.data_blkaddr != NULL_ADDR)
-			truncate_data_blocks_range(&dn, 1);
+		end_offset = ADDRS_PER_PAGE(dn.node_page, inode);
+		count = min(end_offset - dn.ofs_in_node, pg_end - pg_start);
+
+		f2fs_bug_on(F2FS_I_SB(inode), count == 0 || count > end_offset);
+
+		truncate_data_blocks_range(&dn, count);
 		f2fs_put_dnode(&dn);
+
+		pg_start += count;
 	}
 	return 0;
 }
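truncate_hole() now frees up to a whole dnode's worth of block addresses per get_dnode_of_data() call instead of one block at a time. The hole-punching path it feeds (reworked in the next hunk) is reachable from userspace via fallocate(2); a small demonstration using only standard interfaces (illustrative, not part of the patch):

/* Illustrative sketch: punch a hole and watch the block count drop.
 * FALLOC_FL_PUNCH_HOLE must be paired with FALLOC_FL_KEEP_SIZE. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	int fd = open("victim", O_RDWR | O_CREAT, 0644);
	char buf[4096] = { 1 };

	if (fd < 0)
		return 1;
	for (int i = 0; i < 16; i++)	/* 16 blocks of data */
		(void)write(fd, buf, sizeof(buf));
	fsync(fd);
	fstat(fd, &st);
	printf("before: %lld 512B-blocks\n", (long long)st.st_blocks);

	/* free blocks 4..11 while keeping i_size unchanged */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4 * 4096, 8 * 4096))
		perror("fallocate");
	fsync(fd);
	fstat(fd, &st);
	printf("after:  %lld 512B-blocks\n", (long long)st.st_blocks);
	return close(fd);
}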
@@ -729,46 +816,45 @@ static int punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	pgoff_t pg_start, pg_end;
 	loff_t off_start, off_end;
-	int ret = 0;
-
-	if (!S_ISREG(inode->i_mode))
-		return -EOPNOTSUPP;
+	int ret;
-	/* skip punching hole beyond i_size */
-	if (offset >= inode->i_size)
+	ret = f2fs_convert_inline_inode(inode);
+	if (ret)
 		return ret;
-	if (f2fs_has_inline_data(inode)) {
-		ret = f2fs_convert_inline_inode(inode);
-		if (ret)
-			return ret;
-	}
-
-	pg_start = ((unsigned long long) offset) >> PAGE_CACHE_SHIFT;
-	pg_end = ((unsigned long long) offset + len) >> PAGE_CACHE_SHIFT;
+	pg_start = ((unsigned long long) offset) >> PAGE_SHIFT;
+	pg_end = ((unsigned long long) offset + len) >> PAGE_SHIFT;
-	off_start = offset & (PAGE_CACHE_SIZE - 1);
-	off_end = (offset + len) & (PAGE_CACHE_SIZE - 1);
+	off_start = offset & (PAGE_SIZE - 1);
+	off_end = (offset + len) & (PAGE_SIZE - 1);
 	if (pg_start == pg_end) {
-		fill_zero(inode, pg_start, off_start,
+		ret = fill_zero(inode, pg_start, off_start,
 						off_end - off_start);
+		if (ret)
+			return ret;
 	} else {
-		if (off_start)
-			fill_zero(inode, pg_start++, off_start,
-					PAGE_CACHE_SIZE - off_start);
-		if (off_end)
-			fill_zero(inode, pg_end, 0, off_end);
+		if (off_start) {
+			ret = fill_zero(inode, pg_start++, off_start,
+						PAGE_SIZE - off_start);
+			if (ret)
+				return ret;
+		}
+		if (off_end) {
+			ret = fill_zero(inode, pg_end, 0, off_end);
+			if (ret)
+				return ret;
+		}
 		if (pg_start < pg_end) {
 			struct address_space *mapping = inode->i_mapping;
 			loff_t blk_start, blk_end;
 			struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
-			f2fs_balance_fs(sbi);
+			f2fs_balance_fs(sbi, true);
-			blk_start = pg_start << PAGE_CACHE_SHIFT;
-			blk_end = pg_end << PAGE_CACHE_SHIFT;
+			blk_start = (loff_t)pg_start << PAGE_SHIFT;
+			blk_end = (loff_t)pg_end << PAGE_SHIFT;
 			truncate_inode_pages_range(mapping, blk_start,
 					blk_end - 1);
@@ -781,255 +867,794 @@ static int punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	return ret;
 }
-static int expand_inode_data(struct inode *inode, loff_t offset,
-					loff_t len, int mode)
+static int __read_out_blkaddrs(struct inode *inode, block_t *blkaddr,
+				int *do_replace, pgoff_t off, pgoff_t len)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
-	pgoff_t index, pg_start, pg_end;
-	loff_t new_size = i_size_read(inode);
-	loff_t off_start, off_end;
-	int ret = 0;
-
-	f2fs_balance_fs(sbi);
+	struct dnode_of_data dn;
+	int ret, done, i;
-	ret = inode_newsize_ok(inode, (len + offset));
-	if (ret)
+next_dnode:
+	set_new_dnode(&dn, inode, NULL, NULL, 0);
+	ret = get_dnode_of_data(&dn, off, LOOKUP_NODE_RA);
+	if (ret && ret != -ENOENT) {
 		return ret;
-
-	if (f2fs_has_inline_data(inode)) {
-		ret = f2fs_convert_inline_inode(inode);
-		if (ret)
-			return ret;
+	} else if (ret == -ENOENT) {
+		if (dn.max_level == 0)
+			return -ENOENT;
+		done = min((pgoff_t)ADDRS_PER_BLOCK - dn.ofs_in_node, len);
+		blkaddr += done;
+		do_replace += done;
+		goto next;
 	}
-	pg_start = ((unsigned long long) offset) >> PAGE_CACHE_SHIFT;
-	pg_end = ((unsigned long long) offset + len) >> PAGE_CACHE_SHIFT;
+	done = min((pgoff_t)ADDRS_PER_PAGE(dn.node_page, inode) -
+							dn.ofs_in_node, len);
+	for (i = 0; i < done; i++, blkaddr++, do_replace++, dn.ofs_in_node++) {
+		*blkaddr = datablock_addr(dn.node_page, dn.ofs_in_node);
+		if (!is_checkpointed_data(sbi, *blkaddr)) {
-	off_start = offset & (PAGE_CACHE_SIZE - 1);
-	off_end = (offset + len) & (PAGE_CACHE_SIZE - 1);
+			if (test_opt(sbi, LFS)) {
+				f2fs_put_dnode(&dn);
+				return -ENOTSUPP;
+			}
-	f2fs_lock_op(sbi);
+			/* do not invalidate this block address */
+			f2fs_update_data_blkaddr(&dn, NULL_ADDR);
+			*do_replace = 1;
+		}
+	}
+	f2fs_put_dnode(&dn);
+next:
+	len -= done;
+	off += done;
+	if (len)
+		goto next_dnode;
+	return 0;
+}
-	for (index = pg_start; index <= pg_end; index++) {
-		struct dnode_of_data dn;
+static int __roll_back_blkaddrs(struct inode *inode, block_t *blkaddr,
+				int *do_replace, pgoff_t off, int len)
+{
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	struct dnode_of_data dn;
+	int ret, i;
-		if (index == pg_end && !off_end)
-			goto noalloc;
+	for (i = 0; i < len; i++, do_replace++, blkaddr++) {
+		if (*do_replace == 0)
+			continue;
 		set_new_dnode(&dn, inode, NULL, NULL, 0);
-		ret = f2fs_reserve_block(&dn, index);
-		if (ret)
-			break;
-noalloc:
-		if (pg_start == pg_end)
-			new_size = offset + len;
-		else if (index == pg_start && off_start)
-			new_size = (index + 1) << PAGE_CACHE_SHIFT;
-		else if (index == pg_end)
-			new_size = (index << PAGE_CACHE_SHIFT) + off_end;
-		else
-			new_size += PAGE_CACHE_SIZE;
+		ret = get_dnode_of_data(&dn, off + i, LOOKUP_NODE_RA);
+		if (ret) {
+			dec_valid_block_count(sbi, inode, 1);
+			invalidate_blocks(sbi, *blkaddr);
+		} else {
+			f2fs_update_data_blkaddr(&dn, *blkaddr);
+		}
+		f2fs_put_dnode(&dn);
 	}
+	return 0;
+}
-	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
-		i_size_read(inode) < new_size) {
-		i_size_write(inode, new_size);
-		mark_inode_dirty(inode);
-		update_inode_page(inode);
-	}
-	f2fs_unlock_op(sbi);
+static int __clone_blkaddrs(struct inode *src_inode, struct inode *dst_inode,
+			block_t *blkaddr, int *do_replace,
+			pgoff_t src, pgoff_t dst, pgoff_t len, bool full)
+{
+	struct f2fs_sb_info *sbi = F2FS_I_SB(src_inode);
+	pgoff_t i = 0;
+	int ret;
-	return ret;
+	while (i < len) {
+		if (blkaddr[i] == NULL_ADDR && !full) {
+			i++;
+			continue;
+		}
+
+		if (do_replace[i] || blkaddr[i] == NULL_ADDR) {
+			struct dnode_of_data dn;
+			struct node_info ni;
+			size_t new_size;
+			pgoff_t ilen;
+
+			set_new_dnode(&dn, dst_inode, NULL, NULL, 0);
+			ret = get_dnode_of_data(&dn, dst + i, ALLOC_NODE);
+			if (ret)
+				return ret;
+
+			get_node_info(sbi, dn.nid, &ni);
+			ilen = min((pgoff_t)
+				ADDRS_PER_PAGE(dn.node_page, dst_inode) -
+						dn.ofs_in_node, len - i);
+			do {
+				dn.data_blkaddr = datablock_addr(dn.node_page,
+								dn.ofs_in_node);
+				truncate_data_blocks_range(&dn, 1);
+
+				if (do_replace[i]) {
+					f2fs_i_blocks_write(src_inode,
+								1, false);
+					f2fs_i_blocks_write(dst_inode,
+								1, true);
+					f2fs_replace_block(sbi, &dn, dn.data_blkaddr,
+					blkaddr[i], ni.version, true, false);
+
+					do_replace[i] = 0;
+				}
+				dn.ofs_in_node++;
+				i++;
+				new_size = (dst + i) << PAGE_SHIFT;
+				if (dst_inode->i_size < new_size)
+					f2fs_i_size_write(dst_inode, new_size);
+			} while ((do_replace[i] || blkaddr[i] == NULL_ADDR) && --ilen);
+
+			f2fs_put_dnode(&dn);
+		} else {
+			struct page *psrc, *pdst;
+
+			psrc = get_lock_data_page(src_inode, src + i, true);
+			if (IS_ERR(psrc))
+				return PTR_ERR(psrc);
+			pdst = get_new_data_page(dst_inode, NULL, dst + i,
+								true);
+			if (IS_ERR(pdst)) {
+				f2fs_put_page(psrc, 1);
+				return PTR_ERR(pdst);
+			}
+			f2fs_copy_page(psrc, pdst);
+			set_page_dirty(pdst);
+			f2fs_put_page(pdst, 1);
+			f2fs_put_page(psrc, 1);
+
+			ret = truncate_hole(src_inode, src + i, src + i + 1);
+			if (ret)
+				return ret;
+			i++;
+		}
+	}
+	return 0;
 }
-static long f2fs_fallocate(struct file *file, int mode,
-				loff_t offset, loff_t len)
+static int __exchange_data_block(struct inode *src_inode,
+			struct inode *dst_inode, pgoff_t src, pgoff_t dst,
+			pgoff_t len, bool full)
 {
-	struct inode *inode = file_inode(file);
-	long ret;
+	block_t *src_blkaddr;
+	int *do_replace;
+	pgoff_t olen;
+	int ret;
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
-		return -EOPNOTSUPP;
+	while (len) {
+		olen = min((pgoff_t)4 * ADDRS_PER_BLOCK, len);
-	mutex_lock(&inode->i_mutex);
+		src_blkaddr = f2fs_kvzalloc(sizeof(block_t) * olen, GFP_KERNEL);
+		if (!src_blkaddr)
+			return -ENOMEM;
-	if (mode & FALLOC_FL_PUNCH_HOLE)
-		ret = punch_hole(inode, offset, len);
-	else
-		ret = expand_inode_data(inode, offset, len, mode);
+		do_replace = f2fs_kvzalloc(sizeof(int) * olen, GFP_KERNEL);
+		if (!do_replace) {
+			f2fs_kvfree(src_blkaddr);
+			return -ENOMEM;
+		}
-	if (!ret) {
-		inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		mark_inode_dirty(inode);
-	}
+		ret = __read_out_blkaddrs(src_inode, src_blkaddr,
+					do_replace, src, olen);
+		if (ret)
+			goto roll_back;
-	mutex_unlock(&inode->i_mutex);
+		ret = __clone_blkaddrs(src_inode, dst_inode, src_blkaddr,
+					do_replace, src, dst, olen, full);
+		if (ret)
+			goto roll_back;
-	trace_f2fs_fallocate(inode, mode, offset, len, ret);
-	return ret;
-}
+		src += olen;
+		dst += olen;
+		len -= olen;
-static int f2fs_release_file(struct inode *inode, struct file *filp)
-{
-	/* some remained atomic pages should discarded */
-	if (f2fs_is_atomic_file(inode))
-		commit_inmem_pages(inode, true);
-	if (f2fs_is_volatile_file(inode)) {
-		set_inode_flag(F2FS_I(inode), FI_DROP_CACHE);
-		filemap_fdatawrite(inode->i_mapping);
-		clear_inode_flag(F2FS_I(inode), FI_DROP_CACHE);
+		f2fs_kvfree(src_blkaddr);
+		f2fs_kvfree(do_replace);
 	}
 	return 0;
-}
-
-#define F2FS_REG_FLMASK		(~(FS_DIRSYNC_FL | FS_TOPDIR_FL))
-#define F2FS_OTHER_FLMASK	(FS_NODUMP_FL | FS_NOATIME_FL)
-
-static inline __u32 f2fs_mask_flags(umode_t mode, __u32 flags)
-{
-	if (S_ISDIR(mode))
-		return flags;
-	else if (S_ISREG(mode))
-		return flags & F2FS_REG_FLMASK;
-	else
-		return flags & F2FS_OTHER_FLMASK;
+roll_back:
+	__roll_back_blkaddrs(src_inode, src_blkaddr, do_replace, src, len);
+	f2fs_kvfree(src_blkaddr);
+	f2fs_kvfree(do_replace);
+	return ret;
 }
-static int f2fs_ioc_getflags(struct file *filp, unsigned long arg)
+static int f2fs_do_collapse(struct inode *inode, pgoff_t start, pgoff_t end)
 {
-	struct inode *inode = file_inode(filp);
-	struct f2fs_inode_info *fi = F2FS_I(inode);
-	unsigned int flags = fi->i_flags & FS_FL_USER_VISIBLE;
-	return put_user(flags, (int __user *)arg);
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	pgoff_t nrpages = (i_size_read(inode) + PAGE_SIZE - 1) / PAGE_SIZE;
+	int ret;
+
+	f2fs_balance_fs(sbi, true);
+	f2fs_lock_op(sbi);
+
+	f2fs_drop_extent_tree(inode);
+
+	ret = __exchange_data_block(inode, inode, end, start, nrpages - end, true);
+	f2fs_unlock_op(sbi);
+	return ret;
 }
-static int f2fs_ioc_setflags(struct file *filp, unsigned long arg)
+static int f2fs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 {
-	struct inode *inode = file_inode(filp);
-	struct f2fs_inode_info *fi = F2FS_I(inode);
-	unsigned int flags = fi->i_flags & FS_FL_USER_VISIBLE;
-	unsigned int oldflags;
+	pgoff_t pg_start, pg_end;
+	loff_t new_size;
 	int ret;
-	ret = mnt_want_write_file(filp);
+	if (offset + len >= i_size_read(inode))
+		return -EINVAL;
+
+	/* collapse range should be aligned to block size of f2fs */
+	if (offset & (F2FS_BLKSIZE - 1) || len & (F2FS_BLKSIZE - 1))
+		return -EINVAL;
+
+	ret = f2fs_convert_inline_inode(inode);
 	if (ret)
 		return ret;
-	if (!inode_owner_or_capable(inode)) {
-		ret = -EACCES;
-		goto out;
-	}
+	pg_start = offset >> PAGE_SHIFT;
+	pg_end = (offset + len) >> PAGE_SHIFT;
-	if (get_user(flags, (int __user *)arg)) {
-		ret = -EFAULT;
-		goto out;
-	}
+	/* write out all dirty pages from offset */
+	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	if (ret)
+		return ret;
-	flags = f2fs_mask_flags(inode->i_mode, flags);
+	truncate_pagecache(inode, 0, offset);
-	mutex_lock(&inode->i_mutex);
+	ret = f2fs_do_collapse(inode, pg_start, pg_end);
+	if (ret)
+		return ret;
-	oldflags = fi->i_flags;
+	/* write out all moved pages, if possible */
+	filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	truncate_pagecache(inode, 0, offset);
-	if ((flags ^ oldflags) & (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
-		if (!capable(CAP_LINUX_IMMUTABLE)) {
-			mutex_unlock(&inode->i_mutex);
-			ret = -EPERM;
-			goto out;
-		}
-	}
+	new_size = i_size_read(inode) - len;
+	truncate_pagecache(inode, 0, new_size);
-	flags = flags & FS_FL_USER_MODIFIABLE;
-	flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
-	fi->i_flags = flags;
-	mutex_unlock(&inode->i_mutex);
+	ret = truncate_blocks(inode, new_size, true);
+	if (!ret)
+		f2fs_i_size_write(inode, new_size);
-	f2fs_set_inode_flags(inode);
-	inode->i_ctime = CURRENT_TIME;
-	mark_inode_dirty(inode);
-out:
-	mnt_drop_write_file(filp);
 	return ret;
 }
-static int f2fs_ioc_getversion(struct file *filp, unsigned long arg)
-{
-	struct inode *inode = file_inode(filp);
-
-	return put_user(inode->i_generation, (int __user *)arg);
-}
-
-static int f2fs_ioc_start_atomic_write(struct file *filp)
+static int f2fs_do_zero_range(struct dnode_of_data *dn, pgoff_t start,
+								pgoff_t end)
 {
-	struct inode *inode = file_inode(filp);
+	struct f2fs_sb_info *sbi = F2FS_I_SB(dn->inode);
+	pgoff_t index = start;
+	unsigned int ofs_in_node = dn->ofs_in_node;
+	blkcnt_t count = 0;
+	int ret;
-	if (!inode_owner_or_capable(inode))
-		return -EACCES;
+	for (; index < end; index++, dn->ofs_in_node++) {
+		if (datablock_addr(dn->node_page, dn->ofs_in_node) == NULL_ADDR)
+			count++;
+	}
-	f2fs_balance_fs(F2FS_I_SB(inode));
+	dn->ofs_in_node = ofs_in_node;
+	ret = reserve_new_blocks(dn, count);
+	if (ret)
+		return ret;
-	if (f2fs_is_atomic_file(inode))
-		return 0;
+	dn->ofs_in_node = ofs_in_node;
+	for (index = start; index < end; index++, dn->ofs_in_node++) {
+		dn->data_blkaddr =
+			datablock_addr(dn->node_page, dn->ofs_in_node);
+		/*
+		 * reserve_new_blocks will not guarantee entire block
+		 * allocation.
+		 */
+		if (dn->data_blkaddr == NULL_ADDR) {
+			ret = -ENOSPC;
+			break;
+		}
+		if (dn->data_blkaddr != NEW_ADDR) {
+			invalidate_blocks(sbi, dn->data_blkaddr);
+			dn->data_blkaddr = NEW_ADDR;
+			set_data_blkaddr(dn);
+		}
+	}
-	set_inode_flag(F2FS_I(inode), FI_ATOMIC_FILE);
+	f2fs_update_extent_cache_range(dn, start, 0, index - start);
-	return f2fs_convert_inline_inode(inode);
+	return ret;
 }
-static int f2fs_ioc_commit_atomic_write(struct file *filp)
+static int f2fs_zero_range(struct inode *inode, loff_t offset, loff_t len,
+								int mode)
 {
-	struct inode *inode = file_inode(filp);
-	int ret;
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t index, pg_start, pg_end;
+	loff_t new_size = i_size_read(inode);
+	loff_t off_start, off_end;
+	int ret = 0;
+
+	ret = inode_newsize_ok(inode, (len + offset));
+	if (ret)
+		return ret;
+
+	ret = f2fs_convert_inline_inode(inode);
+	if (ret)
+		return ret;
+
+	ret = filemap_write_and_wait_range(mapping, offset, offset + len - 1);
+	if (ret)
+		return ret;
+
+	truncate_pagecache_range(inode, offset, offset + len - 1);
+
+	pg_start = ((unsigned long long) offset) >> PAGE_SHIFT;
+	pg_end = ((unsigned long long) offset + len) >> PAGE_SHIFT;
+
+	off_start = offset & (PAGE_SIZE - 1);
+	off_end = (offset + len) & (PAGE_SIZE - 1);
+
+	if (pg_start == pg_end) {
+		ret = fill_zero(inode, pg_start, off_start,
+						off_end - off_start);
+		if (ret)
+			return ret;
+
+		if (offset + len > new_size)
+			new_size = offset + len;
+		new_size = max_t(loff_t, new_size, offset + len);
+	} else {
+		if (off_start) {
+			ret = fill_zero(inode, pg_start++, off_start,
+						PAGE_SIZE - off_start);
+			if (ret)
+				return ret;
+
+			new_size = max_t(loff_t, new_size,
+					(loff_t)pg_start << PAGE_SHIFT);
+		}
+
+		for (index = pg_start; index < pg_end;) {
+			struct dnode_of_data dn;
+			unsigned int end_offset;
+			pgoff_t end;
+
+			f2fs_lock_op(sbi);
+
+			set_new_dnode(&dn, inode, NULL, NULL, 0);
+			ret = get_dnode_of_data(&dn, index, ALLOC_NODE);
+			if (ret) {
+				f2fs_unlock_op(sbi);
+				goto out;
+			}
+
+			end_offset = ADDRS_PER_PAGE(dn.node_page, inode);
+			end = min(pg_end, end_offset - dn.ofs_in_node + index);
+
+			ret = f2fs_do_zero_range(&dn, index, end);
+			f2fs_put_dnode(&dn);
+			f2fs_unlock_op(sbi);
+			if (ret)
+				goto out;
+
+			index = end;
+			new_size = max_t(loff_t, new_size,
+					(loff_t)index << PAGE_SHIFT);
+		}
+
+		if (off_end) {
+			ret = fill_zero(inode, pg_end, 0, off_end);
+			if (ret)
+				goto out;
+
+			new_size = max_t(loff_t, new_size, offset + len);
+		}
+	}
+
+out:
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && i_size_read(inode) < new_size)
+		f2fs_i_size_write(inode, new_size);
+
+	return ret;
+}
+
+static int f2fs_insert_range(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	pgoff_t nr, pg_start, pg_end, delta, idx;
+	loff_t new_size;
+	int ret = 0;
+
+	new_size = i_size_read(inode) + len;
+	if (new_size > inode->i_sb->s_maxbytes)
+		return -EFBIG;
+
+	if (offset >= i_size_read(inode))
+		return -EINVAL;
+
+	/* insert range should be aligned to block size of f2fs */
+	if (offset & (F2FS_BLKSIZE - 1) || len & (F2FS_BLKSIZE - 1))
+		return -EINVAL;
+
+	ret = f2fs_convert_inline_inode(inode);
+	if (ret)
+		return ret;
+
+	f2fs_balance_fs(sbi, true);
+
+	ret = truncate_blocks(inode, i_size_read(inode), true);
+	if (ret)
+		return ret;
+
+	/* write out all dirty pages from offset */
+	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	if (ret)
+		return ret;
+
+	truncate_pagecache(inode, 0, offset);
+
+	pg_start = offset >> PAGE_SHIFT;
+	pg_end = (offset + len) >> PAGE_SHIFT;
+	delta = pg_end - pg_start;
+	idx = (i_size_read(inode) + PAGE_SIZE - 1) / PAGE_SIZE;
+
+	while (!ret && idx > pg_start) {
+		nr = idx - pg_start;
+		if (nr > delta)
+			nr = delta;
+		idx -= nr;
+
+		f2fs_lock_op(sbi);
+		f2fs_drop_extent_tree(inode);
+
+		ret = __exchange_data_block(inode, inode, idx,
+					idx + delta, nr, false);
+		f2fs_unlock_op(sbi);
+	}
+
+	/* write out all moved pages, if possible */
+	filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	truncate_pagecache(inode, 0, offset);
+
+	if (!ret)
+		f2fs_i_size_write(inode, new_size);
+	return ret;
+}
+
+static int expand_inode_data(struct inode *inode, loff_t offset,
+					loff_t len, int mode)
+{
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	struct f2fs_map_blocks map = { .m_next_pgofs = NULL };
+	pgoff_t pg_end;
+	loff_t new_size = i_size_read(inode);
+	loff_t off_end;
+	int ret;
+
+	ret = inode_newsize_ok(inode, (len + offset));
+	if (ret)
+		return ret;
+
+	ret = f2fs_convert_inline_inode(inode);
+	if (ret)
+		return ret;
+
+	f2fs_balance_fs(sbi, true);
+
+	pg_end = ((unsigned long long)offset + len) >> PAGE_SHIFT;
+	off_end = (offset + len) & (PAGE_SIZE - 1);
+
+	map.m_lblk = ((unsigned long long)offset) >> PAGE_SHIFT;
+	map.m_len = pg_end - map.m_lblk;
+	if (off_end)
+		map.m_len++;
+
+	ret = f2fs_map_blocks(inode, &map, 1, F2FS_GET_BLOCK_PRE_AIO);
+	if (ret) {
+		pgoff_t last_off;
+
+		if (!map.m_len)
+			return ret;
+
+		last_off = map.m_lblk + map.m_len - 1;
+
+		/* update new size to the failed position */
+		new_size = (last_off == pg_end) ? offset + len :
+					(loff_t)(last_off + 1) << PAGE_SHIFT;
+	} else {
+		new_size = ((loff_t)pg_end << PAGE_SHIFT) + off_end;
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && i_size_read(inode) < new_size)
+		f2fs_i_size_write(inode, new_size);
+
+	return ret;
+}
+
+#ifndef FALLOC_FL_COLLAPSE_RANGE
+#define FALLOC_FL_COLLAPSE_RANGE	0X08
+#endif
+#ifndef FALLOC_FL_ZERO_RANGE
+#define FALLOC_FL_ZERO_RANGE		0X10
+#endif
+#ifndef FALLOC_FL_INSERT_RANGE
+#define FALLOC_FL_INSERT_RANGE		0X20
+#endif
+
+static long f2fs_fallocate(struct file *file, int mode,
+				loff_t offset, loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	long ret = 0;
+
+	/* f2fs only supports ->fallocate for regular files */
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	if (f2fs_encrypted_inode(inode) &&
+		(mode & (FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE)))
+		return -EOPNOTSUPP;
+
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+			FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
+			FALLOC_FL_INSERT_RANGE))
+		return -EOPNOTSUPP;
+
+	inode_lock(inode);
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (offset >= inode->i_size)
+			goto out;
+
+		ret = punch_hole(inode, offset, len);
+	} else if (mode & FALLOC_FL_COLLAPSE_RANGE) {
+		ret = f2fs_collapse_range(inode, offset, len);
+	} else if (mode & FALLOC_FL_ZERO_RANGE) {
+		ret = f2fs_zero_range(inode, offset, len, mode);
+	} else if (mode & FALLOC_FL_INSERT_RANGE) {
+		ret = f2fs_insert_range(inode, offset, len);
+	} else {
+		ret = expand_inode_data(inode, offset, len, mode);
+	}
+
+	if (!ret) {
+		inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		f2fs_mark_inode_dirty_sync(inode);
+		f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
+	}
+
+out:
+	inode_unlock(inode);
+
+	trace_f2fs_fallocate(inode, mode, offset, len, ret);
+	return ret;
+}
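f2fs_fallocate() above now dispatches five modes; collapse and insert require F2FS_BLKSIZE-aligned (4KiB) offset and length and are refused on encrypted inodes. An illustrative userspace sketch of the newly wired modes, using the standard fallocate(2) flags that match the fallback defines above (not part of the patch):

/* Illustrative sketch of the newly supported fallocate modes.
 * COLLAPSE_RANGE/INSERT_RANGE need 4KiB-aligned offset and len,
 * as enforced in f2fs_collapse_range()/f2fs_insert_range(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("file.dat", O_RDWR);
	const off_t blk = 4096;

	if (fd < 0)
		return 1;
	/* cut blocks [1, 3) out of the file, shrinking i_size */
	if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 1 * blk, 2 * blk))
		perror("collapse");
	/* shift everything from block 1 right by two blocks */
	if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 1 * blk, 2 * blk))
		perror("insert");
	/* zero blocks [1, 3) in place without changing i_size */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
		      1 * blk, 2 * blk))
		perror("zero");
	return close(fd);
}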
+ */ + if (!(filp->f_mode & FMODE_WRITE) || + atomic_read(&inode->i_writecount) != 1) + return 0; + + /* some remained atomic pages should discarded */ + if (f2fs_is_atomic_file(inode)) + drop_inmem_pages(inode); + if (f2fs_is_volatile_file(inode)) { + clear_inode_flag(inode, FI_VOLATILE_FILE); + set_inode_flag(inode, FI_DROP_CACHE); + filemap_fdatawrite(inode->i_mapping); + clear_inode_flag(inode, FI_DROP_CACHE); + } + return 0; +} + +#define F2FS_REG_FLMASK (~(FS_DIRSYNC_FL | FS_TOPDIR_FL)) +#define F2FS_OTHER_FLMASK (FS_NODUMP_FL | FS_NOATIME_FL) + +static inline __u32 f2fs_mask_flags(umode_t mode, __u32 flags) +{ + if (S_ISDIR(mode)) + return flags; + else if (S_ISREG(mode)) + return flags & F2FS_REG_FLMASK; + else + return flags & F2FS_OTHER_FLMASK; +} + +static int f2fs_ioc_getflags(struct file *filp, unsigned long arg) +{ + struct inode *inode = file_inode(filp); + struct f2fs_inode_info *fi = F2FS_I(inode); + unsigned int flags = fi->i_flags & FS_FL_USER_VISIBLE; + return put_user(flags, (int __user *)arg); +} + +static int f2fs_ioc_setflags(struct file *filp, unsigned long arg) +{ + struct inode *inode = file_inode(filp); + struct f2fs_inode_info *fi = F2FS_I(inode); + unsigned int flags = fi->i_flags & FS_FL_USER_VISIBLE; + unsigned int oldflags; + int ret; if (!inode_owner_or_capable(inode)) return -EACCES; - if (f2fs_is_volatile_file(inode)) - return 0; + if (get_user(flags, (int __user *)arg)) + return -EFAULT; ret = mnt_want_write_file(filp); if (ret) return ret; + flags = f2fs_mask_flags(inode->i_mode, flags); + + inode_lock(inode); + + oldflags = fi->i_flags; + + if ((flags ^ oldflags) & (FS_APPEND_FL | FS_IMMUTABLE_FL)) { + if (!capable(CAP_LINUX_IMMUTABLE)) { + inode_unlock(inode); + ret = -EPERM; + goto out; + } + } + + flags = flags & FS_FL_USER_MODIFIABLE; + flags |= oldflags & ~FS_FL_USER_MODIFIABLE; + fi->i_flags = flags; + inode_unlock(inode); + + inode->i_ctime = CURRENT_TIME; + f2fs_set_inode_flags(inode); +out: + mnt_drop_write_file(filp); + return ret; +} + +static int f2fs_ioc_getversion(struct file *filp, unsigned long arg) +{ + struct inode *inode = file_inode(filp); + + return put_user(inode->i_generation, (int __user *)arg); +} + +static int f2fs_ioc_start_atomic_write(struct file *filp) +{ + struct inode *inode = file_inode(filp); + int ret; + + if (!inode_owner_or_capable(inode)) + return -EACCES; + + ret = mnt_want_write_file(filp); + if (ret) + return ret; + + inode_lock(inode); + if (f2fs_is_atomic_file(inode)) - commit_inmem_pages(inode, false); + goto out; + + ret = f2fs_convert_inline_inode(inode); + if (ret) + goto out; + + set_inode_flag(inode, FI_ATOMIC_FILE); + f2fs_update_time(F2FS_I_SB(inode), REQ_TIME); + + if (!get_dirty_pages(inode)) + goto out; + + f2fs_msg(F2FS_I_SB(inode)->sb, KERN_WARNING, + "Unexpected flush for atomic writes: ino=%lu, npages=%lld", + inode->i_ino, get_dirty_pages(inode)); + ret = filemap_write_and_wait_range(inode->i_mapping, 0, LLONG_MAX); + if (ret) + clear_inode_flag(inode, FI_ATOMIC_FILE); +out: + inode_unlock(inode); + mnt_drop_write_file(filp); + return ret; +} + +static int f2fs_ioc_commit_atomic_write(struct file *filp) +{ + struct inode *inode = file_inode(filp); + int ret; - ret = f2fs_sync_file(filp, 0, LONG_MAX, 0); + if (!inode_owner_or_capable(inode)) + return -EACCES; + + ret = mnt_want_write_file(filp); + if (ret) + return ret; + + inode_lock(inode); + + if (f2fs_is_volatile_file(inode)) + goto err_out; + + if (f2fs_is_atomic_file(inode)) { + clear_inode_flag(inode, FI_ATOMIC_FILE); + ret = 
+		ret = commit_inmem_pages(inode);
+		if (ret) {
+			set_inode_flag(inode, FI_ATOMIC_FILE);
+			goto err_out;
+		}
+	}
+
+	ret = f2fs_do_sync_file(filp, 0, LLONG_MAX, 0, true);
+err_out:
+	inode_unlock(inode);
 	mnt_drop_write_file(filp);
-	clear_inode_flag(F2FS_I(inode), FI_ATOMIC_FILE);
 	return ret;
 }
 static int f2fs_ioc_start_volatile_write(struct file *filp)
 {
 	struct inode *inode = file_inode(filp);
+	int ret;
 	if (!inode_owner_or_capable(inode))
 		return -EACCES;
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	inode_lock(inode);
+
 	if (f2fs_is_volatile_file(inode))
-		return 0;
+		goto out;
-	set_inode_flag(F2FS_I(inode), FI_VOLATILE_FILE);
+	ret = f2fs_convert_inline_inode(inode);
+	if (ret)
+		goto out;
-	return f2fs_convert_inline_inode(inode);
+	set_inode_flag(inode, FI_VOLATILE_FILE);
+	f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
+out:
+	inode_unlock(inode);
+	mnt_drop_write_file(filp);
+	return ret;
 }
 static int f2fs_ioc_release_volatile_write(struct file *filp)
 {
 	struct inode *inode = file_inode(filp);
+	int ret;
 	if (!inode_owner_or_capable(inode))
 		return -EACCES;
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	inode_lock(inode);
+
 	if (!f2fs_is_volatile_file(inode))
-		return 0;
+		goto out;
-	if (!f2fs_is_first_block_written(inode))
-		return truncate_partial_data_page(inode, 0, true);
+	if (!f2fs_is_first_block_written(inode)) {
+		ret = truncate_partial_data_page(inode, 0, true);
+		goto out;
+	}
-	punch_hole(inode, 0, F2FS_BLKSIZE);
-	return 0;
+	ret = punch_hole(inode, 0, F2FS_BLKSIZE);
+out:
+	inode_unlock(inode);
+	mnt_drop_write_file(filp);
+	return ret;
 }
 static int f2fs_ioc_abort_volatile_write(struct file *filp)
@@ -1044,19 +1669,19 @@ static int f2fs_ioc_abort_volatile_write(struct file *filp)
 	if (ret)
 		return ret;
-	f2fs_balance_fs(F2FS_I_SB(inode));
-
-	if (f2fs_is_atomic_file(inode)) {
-		commit_inmem_pages(inode, false);
-		clear_inode_flag(F2FS_I(inode), FI_ATOMIC_FILE);
-	}
+	inode_lock(inode);
+
+	if (f2fs_is_atomic_file(inode))
+		drop_inmem_pages(inode);
 	if (f2fs_is_volatile_file(inode)) {
-		clear_inode_flag(F2FS_I(inode), FI_VOLATILE_FILE);
-		filemap_fdatawrite(inode->i_mapping);
-		set_inode_flag(F2FS_I(inode), FI_VOLATILE_FILE);
+		clear_inode_flag(inode, FI_VOLATILE_FILE);
+		ret = f2fs_do_sync_file(filp, 0, LLONG_MAX, 0, true);
 	}
+
+	inode_unlock(inode);
+	mnt_drop_write_file(filp);
+	f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
 	return ret;
 }
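The atomic-write pair above gives applications (notably SQLite journals) an all-or-nothing window: START redirects writes to in-memory pages, COMMIT writes them out and then runs f2fs_do_sync_file(..., atomic = true). A hedged usage sketch; the ioctl request codes below are mirrored from fs/f2fs/f2fs.h of this era and are an assumption to verify against your tree:

/* Illustrative sketch of the atomic-write ioctls. The request codes
 * are copied from fs/f2fs/f2fs.h of this era and are an assumption
 * here -- check your kernel headers before relying on them. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define F2FS_IOCTL_MAGIC		0xf5
#define F2FS_IOC_START_ATOMIC_WRITE	_IO(F2FS_IOCTL_MAGIC, 1)
#define F2FS_IOC_COMMIT_ATOMIC_WRITE	_IO(F2FS_IOCTL_MAGIC, 2)

int main(void)
{
	int fd = open("db-journal", O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return 1;
	if (ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE))
		perror("start");
	/* writes below stay in memory until the commit ioctl */
	(void)write(fd, "all-or-nothing", 14);
	if (ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE))
		perror("commit");	/* commits pages, then fsyncs */
	return close(fd);
}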
@@ -1066,6 +1691,7 @@ static int f2fs_ioc_shutdown(struct file *filp, unsigned long arg)
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
 	struct super_block *sb = sbi->sb;
 	__u32 in;
+	int ret;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -1073,26 +1699,38 @@ static int f2fs_ioc_shutdown(struct file *filp, unsigned long arg)
 	if (get_user(in, (__u32 __user *)arg))
 		return -EFAULT;
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
 	switch (in) {
 	case F2FS_GOING_DOWN_FULLSYNC:
 		sb = freeze_bdev(sb->s_bdev);
 		if (sb && !IS_ERR(sb)) {
-			f2fs_stop_checkpoint(sbi);
+			f2fs_stop_checkpoint(sbi, false);
 			thaw_bdev(sb->s_bdev, sb);
 		}
 		break;
 	case F2FS_GOING_DOWN_METASYNC:
 		/* do checkpoint only */
 		f2fs_sync_fs(sb, 1);
-		f2fs_stop_checkpoint(sbi);
+		f2fs_stop_checkpoint(sbi, false);
 		break;
 	case F2FS_GOING_DOWN_NOSYNC:
-		f2fs_stop_checkpoint(sbi);
+		f2fs_stop_checkpoint(sbi, false);
+		break;
+	case F2FS_GOING_DOWN_METAFLUSH:
+		sync_meta_pages(sbi, META, LONG_MAX);
+		f2fs_stop_checkpoint(sbi, false);
 		break;
 	default:
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
-	return 0;
+	f2fs_update_time(sbi, REQ_TIME);
+out:
+	mnt_drop_write_file(filp);
+	return ret;
 }
 static int f2fs_ioc_fitrim(struct file *filp, unsigned long arg)
@@ -1113,18 +1751,350 @@ static int f2fs_ioc_fitrim(struct file *filp, unsigned long arg)
 				sizeof(range)))
 		return -EFAULT;
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
 	range.minlen = max((unsigned int)range.minlen,
 				q->limits.discard_granularity);
 	ret = f2fs_trim_fs(F2FS_SB(sb), &range);
+	mnt_drop_write_file(filp);
 	if (ret < 0)
 		return ret;
 	if (copy_to_user((struct fstrim_range __user *)arg, &range,
 				sizeof(range)))
 		return -EFAULT;
+	f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
 	return 0;
 }
+
+static bool uuid_is_nonzero(__u8 u[16])
+{
+	int i;
+
+	for (i = 0; i < 16; i++)
+		if (u[i])
+			return true;
+	return false;
+}
+
+static int f2fs_ioc_set_encryption_policy(struct file *filp, unsigned long arg)
+{
+	struct fscrypt_policy policy;
+	struct inode *inode = file_inode(filp);
+	int ret;
+
+	if (copy_from_user(&policy, (struct fscrypt_policy __user *)arg,
+							sizeof(policy)))
+		return -EFAULT;
+
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
+	ret = fscrypt_process_policy(inode, &policy);
+
+	mnt_drop_write_file(filp);
+	return ret;
+}
+
+static int f2fs_ioc_get_encryption_policy(struct file *filp, unsigned long arg)
+{
+	struct fscrypt_policy policy;
+	struct inode *inode = file_inode(filp);
+	int err;
+
+	err = fscrypt_get_policy(inode, &policy);
+	if (err)
+		return err;
+
+	if (copy_to_user((struct fscrypt_policy __user *)arg, &policy, sizeof(policy)))
+		return -EFAULT;
+	return 0;
+}
+
+static int f2fs_ioc_get_encryption_pwsalt(struct file *filp, unsigned long arg)
+{
+	struct inode *inode = file_inode(filp);
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	int err;
+
+	if (!f2fs_sb_has_crypto(inode->i_sb))
+		return -EOPNOTSUPP;
+
+	if (uuid_is_nonzero(sbi->raw_super->encrypt_pw_salt))
+		goto got_it;
+
+	err = mnt_want_write_file(filp);
+	if (err)
+		return err;
+
+	/* update superblock with uuid */
+	generate_random_uuid(sbi->raw_super->encrypt_pw_salt);
+
+	err = f2fs_commit_super(sbi, false);
+	if (err) {
+		/* undo new data */
+		memset(sbi->raw_super->encrypt_pw_salt, 0, 16);
+		mnt_drop_write_file(filp);
+		return err;
+	}
+	mnt_drop_write_file(filp);
+got_it:
+	if (copy_to_user((__u8 __user *)arg, sbi->raw_super->encrypt_pw_salt,
+									16))
+		return -EFAULT;
+	return 0;
+}
+
+static int f2fs_ioc_gc(struct file *filp, unsigned long arg)
+{
+	struct inode *inode = file_inode(filp);
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	__u32 sync;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (get_user(sync, (__u32 __user *)arg))
+		return -EFAULT;
+
+	if (f2fs_readonly(sbi->sb))
+		return -EROFS;
+
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	if (!sync) {
+		if (!mutex_trylock(&sbi->gc_mutex)) {
+			ret = -EBUSY;
+			goto out;
+		}
+	} else {
+		mutex_lock(&sbi->gc_mutex);
+	}
+
+	ret = f2fs_gc(sbi, sync);
+out:
+	mnt_drop_write_file(filp);
+	return ret;
+}
+
+static int f2fs_ioc_write_checkpoint(struct file *filp, unsigned long arg)
+{
+	struct inode *inode = file_inode(filp);
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (f2fs_readonly(sbi->sb))
+		return -EROFS;
+
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	ret = f2fs_sync_fs(sbi->sb, 1);
+
+	mnt_drop_write_file(filp);
+	return ret;
+}
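The admin-side ioctls above (garbage collection and on-demand checkpoint) require CAP_SYS_ADMIN and a writable mount. A hedged usage sketch; as before, the request codes mirror fs/f2fs/f2fs.h of this era and are an assumption to verify locally:

/* Illustrative sketch of the admin ioctls added above. Needs
 * CAP_SYS_ADMIN; request codes are assumptions from f2fs.h. */
#include <fcntl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define F2FS_IOCTL_MAGIC		0xf5
#define F2FS_IOC_GARBAGE_COLLECT	_IOW(F2FS_IOCTL_MAGIC, 6, __u32)
#define F2FS_IOC_WRITE_CHECKPOINT	_IO(F2FS_IOCTL_MAGIC, 7)

int main(void)
{
	int fd = open("/mnt/f2fs/anyfile", O_RDONLY);
	__u32 sync = 1;			/* 1: block until FG_GC finishes */

	if (fd < 0)
		return 1;
	if (ioctl(fd, F2FS_IOC_GARBAGE_COLLECT, &sync))
		perror("gc");		/* -EBUSY if bg GC holds gc_mutex */
	if (ioctl(fd, F2FS_IOC_WRITE_CHECKPOINT))
		perror("checkpoint");	/* runs f2fs_sync_fs(sb, 1) */
	return close(fd);
}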
+static int f2fs_defragment_range(struct f2fs_sb_info *sbi,
+					struct file *filp,
+					struct f2fs_defragment *range)
+{
+	struct inode *inode = file_inode(filp);
+	struct f2fs_map_blocks map = { .m_next_pgofs = NULL };
+	struct extent_info ei;
+	pgoff_t pg_start, pg_end;
+	unsigned int blk_per_seg = sbi->blocks_per_seg;
+	unsigned int total = 0, sec_num;
+	unsigned int pages_per_sec = sbi->segs_per_sec * blk_per_seg;
+	block_t blk_end = 0;
+	bool fragmented = false;
+	int err;
+
+	/* if in-place-update policy is enabled, don't waste time here */
+	if (need_inplace_update(inode))
+		return -EINVAL;
+
+	pg_start = range->start >> PAGE_SHIFT;
+	pg_end = (range->start + range->len) >> PAGE_SHIFT;
+
+	f2fs_balance_fs(sbi, true);
+
+	inode_lock(inode);
+
+	/* writeback all dirty pages in the range */
+	err = filemap_write_and_wait_range(inode->i_mapping, range->start,
+						range->start + range->len - 1);
+	if (err)
+		goto out;
+
+	/*
+	 * lookup mapping info in extent cache, skip defragmenting if physical
+	 * block addresses are continuous.
+	 */
+	if (f2fs_lookup_extent_cache(inode, pg_start, &ei)) {
+		if (ei.fofs + ei.len >= pg_end)
+			goto out;
+	}
+
+	map.m_lblk = pg_start;
+
+	/*
+	 * lookup mapping info in dnode page cache, skip defragmenting if all
+	 * physical block addresses are continuous even if there are hole(s)
+	 * in logical blocks.
+	 */
+	while (map.m_lblk < pg_end) {
+		map.m_len = pg_end - map.m_lblk;
+		err = f2fs_map_blocks(inode, &map, 0, F2FS_GET_BLOCK_READ);
+		if (err)
+			goto out;
+
+		if (!(map.m_flags & F2FS_MAP_FLAGS)) {
+			map.m_lblk++;
+			continue;
+		}
+
+		if (blk_end && blk_end != map.m_pblk) {
+			fragmented = true;
+			break;
+		}
+		blk_end = map.m_pblk + map.m_len;
+
+		map.m_lblk += map.m_len;
+	}
+
+	if (!fragmented)
+		goto out;
+
+	map.m_lblk = pg_start;
+	map.m_len = pg_end - pg_start;
+
+	sec_num = (map.m_len + pages_per_sec - 1) / pages_per_sec;
+
+	/*
+	 * make sure there are enough free sections for LFS allocation, this can
+	 * avoid defragment running in SSR mode when free sections are allocated
+	 * intensively
+	 */
+	if (has_not_enough_free_secs(sbi, sec_num)) {
+		err = -EAGAIN;
+		goto out;
+	}
+
+	while (map.m_lblk < pg_end) {
+		pgoff_t idx;
+		int cnt = 0;
+
+do_map:
+		map.m_len = pg_end - map.m_lblk;
+		err = f2fs_map_blocks(inode, &map, 0, F2FS_GET_BLOCK_READ);
+		if (err)
+			goto clear_out;
+
+		if (!(map.m_flags & F2FS_MAP_FLAGS)) {
+			map.m_lblk++;
+			continue;
+		}
+
+		set_inode_flag(inode, FI_DO_DEFRAG);
+
+		idx = map.m_lblk;
+		while (idx < map.m_lblk + map.m_len && cnt < blk_per_seg) {
+			struct page *page;
+
+			page = get_lock_data_page(inode, idx, true);
+			if (IS_ERR(page)) {
+				err = PTR_ERR(page);
+				goto clear_out;
+			}
+
+			set_page_dirty(page);
+			f2fs_put_page(page, 1);
+
+			idx++;
+			cnt++;
+			total++;
+		}
+
+		map.m_lblk = idx;
+
+		if (idx < pg_end && cnt < blk_per_seg)
+			goto do_map;
+
+		clear_inode_flag(inode, FI_DO_DEFRAG);
+
+		err = filemap_fdatawrite(inode->i_mapping);
+		if (err)
+			goto out;
+	}
clear_out:
+	clear_inode_flag(inode, FI_DO_DEFRAG);
out:
+	inode_unlock(inode);
+	if (!err)
+		range->len = (u64)total << PAGE_SHIFT;
+	return err;
+}
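f2fs_defragment_range() above only rewrites ranges whose physical addresses are discontiguous, redirties and writes them back, and reports how much it touched through range->len. A hedged sketch of invoking the new F2FS_IOC_DEFRAGMENT; struct f2fs_defragment and the request code mirror f2fs.h of this era (an assumption; verify before use):

/* Illustrative sketch of F2FS_IOC_DEFRAGMENT. start/len must be
 * 4KiB-aligned, as checked in f2fs_ioc_defragment() below. */
#include <fcntl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define F2FS_IOCTL_MAGIC	0xf5

struct f2fs_defragment {
	__u64 start;
	__u64 len;
};
#define F2FS_IOC_DEFRAGMENT	_IOWR(F2FS_IOCTL_MAGIC, 8, struct f2fs_defragment)

int main(void)
{
	struct f2fs_defragment df = { .start = 0, .len = 64 << 20 };
	int fd = open("fragmented.dat", O_RDWR);

	if (fd < 0)
		return 1;
	if (ioctl(fd, F2FS_IOC_DEFRAGMENT, &df))
		perror("defrag");	/* -EAGAIN when free sections are short */
	else	/* kernel writes back how much it actually redirtied */
		printf("defragmented %llu bytes\n", (unsigned long long)df.len);
	return close(fd);
}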
+static int f2fs_ioc_defragment(struct file *filp, unsigned long arg)
+{
+	struct inode *inode = file_inode(filp);
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	struct f2fs_defragment range;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	err = mnt_want_write_file(filp);
+	if (err)
+		return err;
+
+	if (f2fs_readonly(sbi->sb)) {
+		err = -EROFS;
+		goto out;
+	}
+
+	if (copy_from_user(&range, (struct f2fs_defragment __user *)arg,
+							sizeof(range))) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* verify alignment of offset & size */
+	if (range.start & (F2FS_BLKSIZE - 1) ||
+		range.len & (F2FS_BLKSIZE - 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = f2fs_defragment_range(sbi, filp, &range);
+	f2fs_update_time(sbi, REQ_TIME);
+	if (err < 0)
+		goto out;
+
+	if (copy_to_user((struct f2fs_defragment __user *)arg, &range,
+							sizeof(range)))
+		err = -EFAULT;
+out:
+	mnt_drop_write_file(filp);
+	return err;
+}
+
 long f2fs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	switch (cmd) {
@@ -1148,11 +2118,65 @@ long f2fs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		return f2fs_ioc_shutdown(filp, arg);
 	case FITRIM:
 		return f2fs_ioc_fitrim(filp, arg);
+	case F2FS_IOC_SET_ENCRYPTION_POLICY:
+		return f2fs_ioc_set_encryption_policy(filp, arg);
+	case F2FS_IOC_GET_ENCRYPTION_POLICY:
+		return f2fs_ioc_get_encryption_policy(filp, arg);
+	case F2FS_IOC_GET_ENCRYPTION_PWSALT:
+		return f2fs_ioc_get_encryption_pwsalt(filp, arg);
+	case F2FS_IOC_GARBAGE_COLLECT:
+		return f2fs_ioc_gc(filp, arg);
+	case F2FS_IOC_WRITE_CHECKPOINT:
+		return f2fs_ioc_write_checkpoint(filp, arg);
+	case F2FS_IOC_DEFRAGMENT:
+		return f2fs_ioc_defragment(filp, arg);
 	default:
 		return -ENOTTY;
 	}
 }
+static ssize_t f2fs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long nr_segs, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	size_t count;
+	struct blk_plug plug;
+	ssize_t ret;
+
+	if (f2fs_encrypted_inode(inode) &&
+				!fscrypt_has_encryption_key(inode) &&
+				fscrypt_get_encryption_info(inode))
+		return -EACCES;
+
+	ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+	if (ret)
+		return ret;
+
+	inode_lock(inode);
+	ret = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
+	if (!ret) {
+		ret = f2fs_preallocate_blocks(inode, pos, count,
+					iocb->ki_filp->f_flags & O_DIRECT);
+		if (!ret) {
+			blk_start_plug(&plug);
+			ret = __generic_file_aio_write(iocb, iov, nr_segs,
+							&iocb->ki_pos);
+			blk_finish_plug(&plug);
+		}
+	}
+	inode_unlock(inode);
+
+	if (ret > 0 || ret == -EIOCBQUEUED) {
+		ssize_t err;
+
+		err = generic_write_sync(file, iocb->ki_pos - ret, ret);
+		if (err < 0 && ret > 0)
+			ret = err;
+	}
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 long f2fs_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
@@ -1163,6 +2187,22 @@ long f2fs_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case F2FS_IOC32_SETFLAGS:
 		cmd = F2FS_IOC_SETFLAGS;
 		break;
+	case F2FS_IOC32_GETVERSION:
+		cmd = F2FS_IOC_GETVERSION;
+		break;
+	case F2FS_IOC_START_ATOMIC_WRITE:
+	case F2FS_IOC_COMMIT_ATOMIC_WRITE:
+	case F2FS_IOC_START_VOLATILE_WRITE:
+	case F2FS_IOC_RELEASE_VOLATILE_WRITE:
+	case F2FS_IOC_ABORT_VOLATILE_WRITE:
+	case F2FS_IOC_SHUTDOWN:
+	case F2FS_IOC_SET_ENCRYPTION_POLICY:
+	case F2FS_IOC_GET_ENCRYPTION_PWSALT:
+	case F2FS_IOC_GET_ENCRYPTION_POLICY:
+	case F2FS_IOC_GARBAGE_COLLECT:
+	case F2FS_IOC_WRITE_CHECKPOINT:
+	case F2FS_IOC_DEFRAGMENT:
+		break;
 	default:
 		return -ENOIOCTLCMD;
 	}
@@ -1170,13 +2210,30 @@ long f2fs_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 }
 #endif
+static ssize_t f2fs_file_splice_write(struct pipe_inode_info *pipe,
+					struct file *out,
+					loff_t *ppos, size_t len, unsigned int flags)
+{
+	struct address_space *mapping = out->f_mapping;
+	struct inode *inode = mapping->host;
+	int ret;
+
+	ret = generic_write_checks(out, ppos, &len, S_ISBLK(inode->i_mode));
+	if (ret)
+		return ret;
+	ret = f2fs_preallocate_blocks(inode, *ppos,
+						len, false);
+	if (ret)
+		return ret;
+	return generic_file_splice_write(pipe, out, ppos, len, flags);
+}
+
 const struct file_operations f2fs_file_operations = {
 	.llseek		= f2fs_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
-	.aio_write	= generic_file_aio_write,
-	.open		= generic_file_open,
+	.aio_write	= f2fs_file_aio_write,
+	.open		= f2fs_file_open,
 	.release	= f2fs_release_file,
 	.mmap		= f2fs_file_mmap,
 	.fsync		= f2fs_sync_file,
@@ -1186,5 +2243,5 @@ const struct file_operations f2fs_file_operations = {
 	.compat_ioctl	= f2fs_compat_ioctl,
 #endif
 	.splice_read	= generic_file_splice_read,
-	.splice_write	= generic_file_splice_write,
+	.splice_write	= f2fs_file_splice_write,
 };
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
old mode 100644
new mode 100755
index 733a3651c034..2062e7673db8
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -16,7 +16,6 @@
 #include
 #include
 #include
-#include
 #include "f2fs.h"
 #include "node.h"
@@ -78,9 +77,12 @@ static int gc_thread_func(void *data)
 		stat_inc_bggc_count(sbi);
 		/* if return value is not zero, no victim was selected */
-		if (f2fs_gc(sbi))
+		if (f2fs_gc(sbi, test_opt(sbi, FORCE_FG_GC)))
 			wait_ms = gc_th->no_gc_sleep_time;
+		trace_f2fs_background_gc(sbi->sb, wait_ms,
+				prefree_segments(sbi), free_segments(sbi));
+
 		/* balancing f2fs's metadata periodically */
 		f2fs_balance_fs_bg(sbi);
@@ -94,7 +96,7 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
 	dev_t dev = sbi->sb->s_bdev->bd_dev;
 	int err = 0;
-	gc_th = kmalloc(sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
+	gc_th = f2fs_kmalloc(sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
 	if (!gc_th) {
 		err = -ENOMEM;
 		goto out;
@@ -170,9 +172,9 @@ static unsigned int get_max_cost(struct f2fs_sb_info *sbi,
 {
 	/* SSR allocates in a segment unit */
 	if (p->alloc_mode == SSR)
-		return 1 << sbi->log_blocks_per_seg;
+		return sbi->blocks_per_seg;
 	if (p->gc_mode == GC_GREEDY)
-		return (1 << sbi->log_blocks_per_seg) * p->ofs_unit;
+		return sbi->blocks_per_seg * p->ofs_unit;
 	else if (p->gc_mode == GC_CB)
 		return UINT_MAX;
 	else /* No other gc_mode */
@@ -243,6 +245,18 @@ static inline unsigned int get_gc_cost(struct f2fs_sb_info *sbi,
 	return get_cb_cost(sbi, segno);
 }
+static unsigned int count_bits(const unsigned long *addr,
+				unsigned int offset, unsigned int len)
+{
+	unsigned int end = offset + len, sum = 0;
+
+	while (offset < end) {
+		if (test_bit(offset++, addr))
+			++sum;
+	}
+	return sum;
+}
+
 /*
  * This function is called from two paths.
  * One is garbage collection and the other is SSR segment selection.
@@ -256,8 +270,9 @@ static int get_victim_by_default(struct f2fs_sb_info *sbi,
 {
 	struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
 	struct victim_sel_policy p;
-	unsigned int secno, max_cost;
-	int nsearched = 0;
+	unsigned int secno, max_cost, last_victim;
+	unsigned int last_segment = MAIN_SEGS(sbi);
+	unsigned int nsearched = 0;
 	mutex_lock(&dirty_i->seglist_lock);
@@ -267,6 +282,10 @@ static int get_victim_by_default(struct f2fs_sb_info *sbi,
 	p.min_segno = NULL_SEGNO;
 	p.min_cost = max_cost = get_max_cost(sbi, &p);
+	if (p.max_search == 0)
+		goto out;
+
+	last_victim = sbi->last_victim[p.gc_mode];
 	if (p.alloc_mode == LFS && gc_type == FG_GC) {
 		p.min_segno = check_bg_victims(sbi);
 		if (p.min_segno != NULL_SEGNO)
@@ -277,9 +296,10 @@ static int get_victim_by_default(struct f2fs_sb_info *sbi,
 		unsigned long cost;
 		unsigned int segno;
-		segno = find_next_bit(p.dirty_segmap, MAIN_SEGS(sbi), p.offset);
-		if (segno >= MAIN_SEGS(sbi)) {
+		segno = find_next_bit(p.dirty_segmap, last_segment, p.offset);
+		if (segno >= last_segment) {
 			if (sbi->last_victim[p.gc_mode]) {
+				last_segment = sbi->last_victim[p.gc_mode];
 				sbi->last_victim[p.gc_mode] = 0;
 				p.offset = 0;
 				continue;
@@ -288,27 +308,35 @@ static int get_victim_by_default(struct f2fs_sb_info *sbi,
 		}
 		p.offset = segno + p.ofs_unit;
-		if (p.ofs_unit > 1)
+		if (p.ofs_unit > 1) {
 			p.offset -= segno % p.ofs_unit;
+			nsearched += count_bits(p.dirty_segmap,
+						p.offset - p.ofs_unit,
+						p.ofs_unit);
+		} else {
+			nsearched++;
+		}
+
 		secno = GET_SECNO(sbi, segno);
 		if (sec_usage_check(sbi, secno))
-			continue;
+			goto next;
 		if (gc_type == BG_GC && test_bit(secno, dirty_i->victim_secmap))
-			continue;
+			goto next;
 		cost = get_gc_cost(sbi, segno, &p);
 		if (p.min_cost > cost) {
 			p.min_segno = segno;
 			p.min_cost = cost;
-		} else if (unlikely(cost == max_cost)) {
-			continue;
 		}
-
-		if (nsearched++ >= p.max_search) {
-			sbi->last_victim[p.gc_mode] = segno;
+next:
+		if (nsearched >= p.max_search) {
+			if (!sbi->last_victim[p.gc_mode] && segno <= last_victim)
+				sbi->last_victim[p.gc_mode] = last_victim + 1;
+			else
+				sbi->last_victim[p.gc_mode] = segno + 1;
 			break;
 		}
 	}
@@ -327,6 +355,7 @@ static int get_victim_by_default(struct f2fs_sb_info *sbi,
 			sbi->cur_victim_sec,
 			prefree_segments(sbi), free_segments(sbi));
 	}
+out:
 	mutex_unlock(&dirty_i->seglist_lock);
 	return (p.min_segno == NULL_SEGNO) ? 0 : 1;
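get_victim_by_default() above now advances nsearched by the number of dirty candidates covered by each ofs_unit step (via count_bits) and remembers last_victim so the scan can wrap around. The per-segment cost it minimizes is the valid-block count under GC_GREEDY and a cost-benefit value under GC_CB. A simplified, hedged sketch of the cost-benefit shape; the in-kernel get_cb_cost() follows the same idea with fixed-point mtime arithmetic, so treat the exact formula here as an approximation:

/* Simplified sketch of the cost-benefit victim metric; lower is a
 * better victim. Not the literal kernel code. */
#include <limits.h>
#include <stdio.h>

static unsigned int cb_cost(unsigned int valid, unsigned int per_seg,
			unsigned long long mtime,
			unsigned long long min_mt, unsigned long long max_mt)
{
	unsigned int u = valid * 100 / per_seg;		/* utilization, % */
	unsigned int age = (max_mt == min_mt) ? 0 :
		100 - (unsigned int)(100 * (mtime - min_mt) / (max_mt - min_mt));

	/* benefit: reclaimed space weighted by age; cost: copying valid blocks */
	return UINT_MAX - (100 * (100 - u) * age) / (100 + u);
}

int main(void)
{
	/* an old, mostly-empty segment beats a hot, mostly-full one */
	printf("%u\n", cb_cost(64, 512, 10, 0, 100));	/* cold, sparse: low cost */
	printf("%u\n", cb_cost(480, 512, 95, 0, 100));	/* hot, dense: high cost */
	return 0;
}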
@@ -396,14 +425,18 @@ static void gc_node_segment(struct f2fs_sb_info *sbi,
 {
 	bool initial = true;
 	struct f2fs_summary *entry;
+	block_t start_addr;
 	int off;
+	start_addr = START_BLOCK(sbi, segno);
+
next_step:
 	entry = sum;
 	for (off = 0; off < sbi->blocks_per_seg; off++, entry++) {
 		nid_t nid = le32_to_cpu(entry->nid);
 		struct page *node_page;
+		struct node_info ni;
 		/* stop BG_GC if there is not enough free sections. */
 		if (gc_type == BG_GC && has_not_enough_free_secs(sbi, 0))
@@ -426,15 +459,13 @@ static void gc_node_segment(struct f2fs_sb_info *sbi,
 			continue;
 		}
-		/* set page dirty and write it */
-		if (gc_type == FG_GC) {
-			f2fs_wait_on_page_writeback(node_page, NODE);
-			set_page_dirty(node_page);
-		} else {
-			if (!PageWriteback(node_page))
-				set_page_dirty(node_page);
+		get_node_info(sbi, nid, &ni);
+		if (ni.blk_addr != start_addr + off) {
+			f2fs_put_page(node_page, 1);
+			continue;
 		}
-		f2fs_put_page(node_page, 1);
+
+		move_node_page(node_page, gc_type);
 		stat_inc_node_blk_count(sbi, 1, gc_type);
 	}
@@ -442,22 +473,6 @@ static void gc_node_segment(struct f2fs_sb_info *sbi,
 		initial = false;
 		goto next_step;
 	}
-
-	if (gc_type == FG_GC) {
-		struct writeback_control wbc = {
-			.sync_mode = WB_SYNC_ALL,
-			.nr_to_write = LONG_MAX,
-			.for_reclaim = 0,
-		};
-		sync_node_pages(sbi, 0, &wbc);
-
-		/*
-		 * In the case of FG_GC, it'd be better to reclaim this victim
-		 * completely.
-		 */
-		if (get_valid_blocks(sbi, segno, 1) != 0)
-			goto next_step;
-	}
 }
 /*
@@ -467,7 +482,7 @@ static void gc_node_segment(struct f2fs_sb_info *sbi,
  * as indirect or double indirect node blocks, are given, it must be a caller's
  * bug.
  */
-block_t start_bidx_of_node(unsigned int node_ofs, struct f2fs_inode_info *fi)
+block_t start_bidx_of_node(unsigned int node_ofs, struct inode *inode)
 {
 	unsigned int indirect_blks = 2 * NIDS_PER_BLOCK + 4;
 	unsigned int bidx;
@@ -484,10 +499,10 @@ block_t start_bidx_of_node(unsigned int node_ofs, struct f2fs_inode_info *fi)
 		int dec = (node_ofs - indirect_blks - 3) / (NIDS_PER_BLOCK + 1);
 		bidx = node_ofs - 5 - dec;
 	}
-	return bidx * ADDRS_PER_BLOCK + ADDRS_PER_INODE(fi);
+	return bidx * ADDRS_PER_BLOCK + ADDRS_PER_INODE(inode);
 }
-static int check_dnode(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
+static bool is_alive(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
 		struct node_info *dni, block_t blkaddr, unsigned int *nofs)
 {
 	struct page *node_page;
@@ -500,13 +515,13 @@ static int check_dnode(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
 	node_page = get_node_page(sbi, nid);
 	if (IS_ERR(node_page))
-		return 0;
+		return false;
 	get_node_info(sbi, nid, dni);
 	if (sum->version != dni->version) {
 		f2fs_put_page(node_page, 1);
-		return 0;
+		return false;
 	}
 	*nofs = ofs_of_node(node_page);
@@ -514,16 +529,116 @@ static int check_dnode(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
 	f2fs_put_page(node_page, 1);
 	if (source_blkaddr != blkaddr)
-		return 0;
-	return 1;
+		return false;
+	return true;
 }
-static void move_data_page(struct inode *inode, struct page *page, int gc_type)
+static void move_encrypted_block(struct inode *inode, block_t bidx)
 {
 	struct f2fs_io_info fio = {
+		.sbi = F2FS_I_SB(inode),
 		.type = DATA,
-		.rw = WRITE_SYNC,
+		.rw = READ_SYNC,
+		.encrypted_page = NULL,
 	};
+	struct dnode_of_data dn;
+	struct f2fs_summary sum;
+	struct node_info ni;
+	struct page *page;
+	block_t newaddr;
+	int err;
+
+	/* do not read out */
+	page = f2fs_grab_cache_page(inode->i_mapping, bidx, false);
+	if (!page)
+		return;
+
+	set_new_dnode(&dn, inode, NULL, NULL, 0);
+	err = get_dnode_of_data(&dn, bidx, LOOKUP_NODE);
+	if (err)
+		goto out;
+
+	if (unlikely(dn.data_blkaddr == NULL_ADDR)) {
+		ClearPageUptodate(page);
+		goto put_out;
+	}
+
+	/*
+	 * don't cache encrypted data into meta inode until previous dirty
+	 * data were writebacked to avoid racing between GC and flush.
+	 */
+	f2fs_wait_on_page_writeback(page, DATA, true);
+
+	get_node_info(fio.sbi, dn.nid, &ni);
+	set_summary(&sum, dn.nid, dn.ofs_in_node, ni.version);
+
+	/* read page */
+	fio.page = page;
+	fio.new_blkaddr = fio.old_blkaddr = dn.data_blkaddr;
+
+	allocate_data_block(fio.sbi, NULL, fio.old_blkaddr, &newaddr,
+							&sum, CURSEG_COLD_DATA);
+
+	fio.encrypted_page = f2fs_grab_cache_page(META_MAPPING(fio.sbi),
+							newaddr, true);
+	if (!fio.encrypted_page) {
+		err = -ENOMEM;
+		goto recover_block;
+	}
+
+	err = f2fs_submit_page_bio(&fio);
+	if (err)
+		goto put_page_out;
+
+	/* write page */
+	lock_page(fio.encrypted_page);
+
+	if (unlikely(fio.encrypted_page->mapping != META_MAPPING(fio.sbi))) {
+		err = -EIO;
+		goto put_page_out;
+	}
+	if (unlikely(!PageUptodate(fio.encrypted_page))) {
+		err = -EIO;
+		goto put_page_out;
+	}
+
+	set_page_dirty(fio.encrypted_page);
+	f2fs_wait_on_page_writeback(fio.encrypted_page, DATA, true);
+	if (clear_page_dirty_for_io(fio.encrypted_page))
+		dec_page_count(fio.sbi, F2FS_DIRTY_META);
+
+	set_page_writeback(fio.encrypted_page);
+
+	/* allocate block address */
+	f2fs_wait_on_page_writeback(dn.node_page, NODE, true);
+
+	fio.rw = WRITE_SYNC;
+	fio.new_blkaddr = newaddr;
+	f2fs_submit_page_mbio(&fio);
+
+	f2fs_update_data_blkaddr(&dn, newaddr);
+	set_inode_flag(inode, FI_APPEND_WRITE);
+	if (page->index == 0)
+		set_inode_flag(inode, FI_FIRST_BLOCK_WRITTEN);
+put_page_out:
+	f2fs_put_page(fio.encrypted_page, 1);
+recover_block:
+	if (err)
+		__f2fs_replace_block(fio.sbi, &sum, newaddr, fio.old_blkaddr,
+								true, true);
+put_out:
+	f2fs_put_dnode(&dn);
+out:
+	f2fs_put_page(page, 1);
+}
+
+static void move_data_page(struct inode *inode, block_t bidx, int gc_type)
+{
+	struct page *page;
+
+	page = get_lock_data_page(inode, bidx, true);
+	if (IS_ERR(page))
+		return;
 	if (gc_type == BG_GC) {
 		if (PageWriteback(page))
@@ -531,12 +646,30 @@ static void move_data_page(struct inode *inode, struct page *page, int gc_type)
 		set_page_dirty(page);
 		set_cold_data(page);
 	} else {
-		f2fs_wait_on_page_writeback(page, DATA);
+		struct f2fs_io_info fio = {
+			.sbi = F2FS_I_SB(inode),
+			.type = DATA,
+			.rw = WRITE_SYNC,
+			.page = page,
+			.encrypted_page = NULL,
+		};
+		bool is_dirty = PageDirty(page);
+		int err;
+retry:
+		set_page_dirty(page);
+		f2fs_wait_on_page_writeback(page, DATA, true);
 		if (clear_page_dirty_for_io(page))
 			inode_dec_dirty_pages(inode);
+
 		set_cold_data(page);
-		do_write_data_page(page, &fio);
+
+		err = do_write_data_page(&fio);
+		if (err == -ENOMEM && is_dirty) {
+			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			goto retry;
+		}
+
 		clear_cold_data(page);
 	}
out:
@@ -584,7 +717,7 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
 	}
 	/* Get an inode by ino with checking validity */
-	if (check_dnode(sbi, entry, &dni, start_addr + off, &nofs) == 0)
+	if (!is_alive(sbi, entry, &dni, start_addr + off, &nofs))
 		continue;
 	if (phase == 1) {
@@ -599,10 +732,16 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
 		if (IS_ERR(inode) || is_bad_inode(inode))
 			continue;
-		start_bidx = start_bidx_of_node(nofs, F2FS_I(inode));
+		/* if encrypted inode, let's go phase 3 */
+		if (f2fs_encrypted_inode(inode) &&
+				S_ISREG(inode->i_mode)) {
+			add_gc_inode(gc_list, inode);
+			continue;
+		}
-		data_page = find_data_page(inode,
-				start_bidx + ofs_in_node, false);
+		start_bidx = start_bidx_of_node(nofs, inode);
+		data_page = get_read_data_page(inode,
+				start_bidx + ofs_in_node, READA, true);
 		if (IS_ERR(data_page)) {
 			iput(inode);
 			continue;
@@ static void gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum, /* phase 3 */ inode = find_gc_inode(gc_list, dni.ino); if (inode) { - start_bidx = start_bidx_of_node(nofs, F2FS_I(inode)); - data_page = get_lock_data_page(inode, - start_bidx + ofs_in_node); - if (IS_ERR(data_page)) - continue; - move_data_page(inode, data_page, gc_type); + struct f2fs_inode_info *fi = F2FS_I(inode); + bool locked = false; + + if (S_ISREG(inode->i_mode)) { + if (!down_write_trylock(&fi->dio_rwsem[READ])) + continue; + if (!down_write_trylock( + &fi->dio_rwsem[WRITE])) { + up_write(&fi->dio_rwsem[READ]); + continue; + } + locked = true; + } + + start_bidx = start_bidx_of_node(nofs, inode) + + ofs_in_node; + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) + move_encrypted_block(inode, start_bidx); + else + move_data_page(inode, start_bidx, gc_type); + + if (locked) { + up_write(&fi->dio_rwsem[WRITE]); + up_write(&fi->dio_rwsem[READ]); + } + stat_inc_data_blk_count(sbi, 1, gc_type); } } if (++phase < 4) goto next_step; - - if (gc_type == FG_GC) { - f2fs_submit_merged_bio(sbi, DATA, WRITE); - - /* - * In the case of FG_GC, it'd be better to reclaim this victim - * completely. - */ - if (get_valid_blocks(sbi, segno, 1) != 0) { - phase = 2; - goto next_step; - } - } } static int __get_victim(struct f2fs_sb_info *sbi, unsigned int *victim, @@ -656,42 +802,88 @@ static int __get_victim(struct f2fs_sb_info *sbi, unsigned int *victim, return ret; } -static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno, +static int do_garbage_collect(struct f2fs_sb_info *sbi, + unsigned int start_segno, struct gc_inode_list *gc_list, int gc_type) { struct page *sum_page; struct f2fs_summary_block *sum; struct blk_plug plug; + unsigned int segno = start_segno; + unsigned int end_segno = start_segno + sbi->segs_per_sec; + int seg_freed = 0; + unsigned char type = IS_DATASEG(get_seg_entry(sbi, segno)->type) ? 
+ SUM_TYPE_DATA : SUM_TYPE_NODE; - /* read segment summary of victim */ - sum_page = get_sum_page(sbi, segno); + /* readahead multiple SSA blocks that have contiguous addresses */ + if (sbi->segs_per_sec > 1) + ra_meta_pages(sbi, GET_SUM_BLOCK(sbi, segno), + sbi->segs_per_sec, META_SSA, true); + + /* reference all summary pages */ + while (segno < end_segno) { + sum_page = get_sum_page(sbi, segno++); + unlock_page(sum_page); + } blk_start_plug(&plug); - sum = page_address(sum_page); + for (segno = start_segno; segno < end_segno; segno++) { - switch (GET_SUM_TYPE((&sum->footer))) { - case SUM_TYPE_NODE: - gc_node_segment(sbi, sum->entries, segno, gc_type); - break; - case SUM_TYPE_DATA: - gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type); - break; + if (get_valid_blocks(sbi, segno, 1) == 0) + continue; + + /* find segment summary of victim */ + sum_page = find_get_page(META_MAPPING(sbi), + GET_SUM_BLOCK(sbi, segno)); + f2fs_bug_on(sbi, !PageUptodate(sum_page)); + f2fs_put_page(sum_page, 0); + + sum = page_address(sum_page); + f2fs_bug_on(sbi, type != GET_SUM_TYPE((&sum->footer))); + + /* + * this is to avoid deadlock: + * - lock_page(sum_page) - f2fs_replace_block + * - check_valid_map() - mutex_lock(sentry_lock) + * - mutex_lock(sentry_lock) - change_curseg() + * - lock_page(sum_page) + */ + + if (type == SUM_TYPE_NODE) + gc_node_segment(sbi, sum->entries, segno, gc_type); + else + gc_data_segment(sbi, sum->entries, gc_list, segno, + gc_type); + + stat_inc_seg_count(sbi, type, gc_type); + + f2fs_put_page(sum_page, 0); } + + if (gc_type == FG_GC) + f2fs_submit_merged_bio(sbi, + (type == SUM_TYPE_NODE) ? NODE : DATA, WRITE); + blk_finish_plug(&plug); - stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)), gc_type); + if (gc_type == FG_GC) { + while (start_segno < end_segno) + if (get_valid_blocks(sbi, start_segno++, 1) == 0) + seg_freed++; + } + stat_inc_call_count(sbi->stat_info); - f2fs_put_page(sum_page, 1); + return seg_freed; } -int f2fs_gc(struct f2fs_sb_info *sbi) +int f2fs_gc(struct f2fs_sb_info *sbi, bool sync) { - unsigned int segno, i; - int gc_type = BG_GC; - int nfree = 0; - int ret = -1; + unsigned int segno; + int gc_type = sync ? FG_GC : BG_GC; + int sec_freed = 0, seg_freed; + int ret = -EINVAL; struct cp_control cpc; struct gc_inode_list gc_list = { .ilist = LIST_HEAD_INIT(gc_list.ilist), @@ -700,43 +892,57 @@ int f2fs_gc(struct f2fs_sb_info *sbi) cpc.reason = __get_cp_reason(sbi); gc_more: + segno = NULL_SEGNO; + if (unlikely(!(sbi->sb->s_flags & MS_ACTIVE))) goto stop; - if (unlikely(f2fs_cp_error(sbi))) + if (unlikely(f2fs_cp_error(sbi))) { + ret = -EIO; goto stop; + } - if (gc_type == BG_GC && has_not_enough_free_secs(sbi, nfree)) { + if (gc_type == BG_GC && has_not_enough_free_secs(sbi, sec_freed)) { gc_type = FG_GC; - write_checkpoint(sbi, &cpc); + /* + * If there is no victim and no prefree segment but still not + * enough free sections, we should flush dent/node blocks and do + * garbage collections.
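For illustration (not part of the patch): with the new do_garbage_collect() above, garbage collection now works on a whole section at a time, and f2fs_gc() only credits a freed section when every one of its segs_per_sec segments ends up with zero valid blocks. A minimal userspace sketch of that accounting, with stand-in names and values:

#include <stdio.h>

#define SEGS_PER_SEC 4	/* stand-in for sbi->segs_per_sec */

/* stand-in for get_valid_blocks(): valid blocks left per segment */
static const int valid_blocks[SEGS_PER_SEC] = { 0, 0, 3, 0 };

static int do_garbage_collect_model(void)
{
	int segno, seg_freed = 0;

	/* after GC, count the segments that came out empty */
	for (segno = 0; segno < SEGS_PER_SEC; segno++)
		if (valid_blocks[segno] == 0)
			seg_freed++;
	return seg_freed;
}

int main(void)
{
	int seg_freed = do_garbage_collect_model();
	int sec_freed = (seg_freed == SEGS_PER_SEC);

	printf("seg_freed=%d sec_freed=%d\n", seg_freed, sec_freed);
	return 0;
}

Here one segment still holds three valid blocks, so seg_freed is 3 and the section is not counted as freed, which is exactly why FG_GC loops back to gc_more until enough whole sections are reclaimed.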
+ */ + if (__get_victim(sbi, &segno, gc_type) || + prefree_segments(sbi)) { + write_checkpoint(sbi, &cpc); + segno = NULL_SEGNO; + } else if (has_not_enough_free_secs(sbi, 0)) { + write_checkpoint(sbi, &cpc); + } } - if (!__get_victim(sbi, &segno, gc_type)) + if (segno == NULL_SEGNO && !__get_victim(sbi, &segno, gc_type)) goto stop; ret = 0; - /* readahead multi ssa blocks those have contiguous address */ - if (sbi->segs_per_sec > 1) - ra_meta_pages(sbi, GET_SUM_BLOCK(sbi, segno), sbi->segs_per_sec, - META_SSA); + seg_freed = do_garbage_collect(sbi, segno, &gc_list, gc_type); - for (i = 0; i < sbi->segs_per_sec; i++) - do_garbage_collect(sbi, segno + i, &gc_list, gc_type); + if (gc_type == FG_GC && seg_freed == sbi->segs_per_sec) + sec_freed++; - if (gc_type == FG_GC) { + if (gc_type == FG_GC) sbi->cur_victim_sec = NULL_SEGNO; - nfree++; - WARN_ON(get_valid_blocks(sbi, segno, sbi->segs_per_sec)); - } - if (has_not_enough_free_secs(sbi, nfree)) - goto gc_more; + if (!sync) { + if (has_not_enough_free_secs(sbi, sec_freed)) + goto gc_more; - if (gc_type == FG_GC) - write_checkpoint(sbi, &cpc); + if (gc_type == FG_GC) + write_checkpoint(sbi, &cpc); + } stop: mutex_unlock(&sbi->gc_mutex); put_gc_inode(&gc_list); + + if (sync) + ret = sec_freed ? 0 : -EAGAIN; return ret; } diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h old mode 100644 new mode 100755 index 9091e0c9ded6..a993967dcdb9 --- a/fs/f2fs/gc.h +++ b/fs/f2fs/gc.h @@ -100,11 +100,3 @@ static inline bool has_enough_invalid_blocks(struct f2fs_sb_info *sbi) return true; return false; } - -static inline int is_idle(struct f2fs_sb_info *sbi) -{ - struct block_device *bdev = sbi->sb->s_bdev; - struct request_queue *q = bdev_get_queue(bdev); - struct request_list *rl = &q->rq; - return !(rl->count[BLK_RW_SYNC]) && !(rl->count[BLK_RW_ASYNC]); -} diff --git a/fs/f2fs/hash.c b/fs/f2fs/hash.c old mode 100644 new mode 100755 index a844fcfb9a8d..71b7206c431e --- a/fs/f2fs/hash.c +++ b/fs/f2fs/hash.c @@ -79,8 +79,7 @@ f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info) const unsigned char *name = name_info->name; size_t len = name_info->len; - if ((len <= 2) && (name[0] == '.') && - (name[1] == '.' 
|| name[1] == '\0')) + if (is_dot_dotdot(name_info)) return 0; /* Initialize the default seed for the hash checksum functions */ diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c old mode 100644 new mode 100755 index 7885c71e5054..c698bd16b5a3 --- a/fs/f2fs/inline.c +++ b/fs/f2fs/inline.c @@ -12,12 +12,10 @@ #include #include "f2fs.h" +#include "node.h" -bool f2fs_may_inline(struct inode *inode) +bool f2fs_may_inline_data(struct inode *inode) { - if (!test_opt(F2FS_I_SB(inode), INLINE_DATA)) - return false; - if (f2fs_is_atomic_file(inode)) return false; @@ -27,6 +25,20 @@ bool f2fs_may_inline(struct inode *inode) if (i_size_read(inode) > MAX_INLINE_DATA) return false; + if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) + return false; + + return true; +} + +bool f2fs_may_inline_dentry(struct inode *inode) +{ + if (!test_opt(F2FS_I_SB(inode), INLINE_DENTRY)) + return false; + + if (!S_ISDIR(inode->i_mode)) + return false; + return true; } @@ -39,7 +51,7 @@ void read_inline_data(struct page *page, struct page *ipage) f2fs_bug_on(F2FS_P_SB(page), page->index); - zero_user_segment(page, MAX_INLINE_DATA, PAGE_CACHE_SIZE); + zero_user_segment(page, MAX_INLINE_DATA, PAGE_SIZE); /* Copy the whole inline data block */ src_addr = inline_data_addr(ipage); @@ -47,7 +59,8 @@ void read_inline_data(struct page *page, struct page *ipage) memcpy(dst_addr, src_addr, MAX_INLINE_DATA); flush_dcache_page(page); kunmap_atomic(dst_addr); - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); } bool truncate_inline_inode(struct page *ipage, u64 from) @@ -59,9 +72,9 @@ bool truncate_inline_inode(struct page *ipage, u64 from) addr = inline_data_addr(ipage); - f2fs_wait_on_page_writeback(ipage, NODE); + f2fs_wait_on_page_writeback(ipage, NODE, true); memset(addr + from, 0, MAX_INLINE_DATA - from); - + set_page_dirty(ipage); return true; } @@ -81,11 +94,12 @@ int f2fs_read_inline_data(struct inode *inode, struct page *page) } if (page->index) - zero_user_segment(page, 0, PAGE_CACHE_SIZE); + zero_user_segment(page, 0, PAGE_SIZE); else read_inline_data(page, ipage); - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); f2fs_put_page(ipage, 1); unlock_page(page); return 0; @@ -93,15 +107,15 @@ int f2fs_read_inline_data(struct inode *inode, struct page *page) int f2fs_convert_inline_page(struct dnode_of_data *dn, struct page *page) { - void *src_addr, *dst_addr; struct f2fs_io_info fio = { + .sbi = F2FS_I_SB(dn->inode), .type = DATA, .rw = WRITE_SYNC | REQ_PRIO, + .page = page, + .encrypted_page = NULL, }; int dirty, err; - f2fs_bug_on(F2FS_I_SB(dn->inode), page->index); - if (!f2fs_exist_data(dn->inode)) goto clear_out; @@ -109,43 +123,31 @@ int f2fs_convert_inline_page(struct dnode_of_data *dn, struct page *page) if (err) return err; - f2fs_wait_on_page_writeback(page, DATA); - - if (PageUptodate(page)) - goto no_update; + f2fs_bug_on(F2FS_P_SB(page), PageWriteback(page)); - zero_user_segment(page, MAX_INLINE_DATA, PAGE_CACHE_SIZE); + read_inline_data(page, dn->inode_page); + set_page_dirty(page); - /* Copy the whole inline data block */ - src_addr = inline_data_addr(dn->inode_page); - dst_addr = kmap_atomic(page); - memcpy(dst_addr, src_addr, MAX_INLINE_DATA); - flush_dcache_page(page); - kunmap_atomic(dst_addr); - SetPageUptodate(page); -no_update: /* clear dirty state */ dirty = clear_page_dirty_for_io(page); /* write data page to try to make data consistent */ set_page_writeback(page); - fio.blk_addr = dn->data_blkaddr; - write_data_page(page, dn, &fio); - 
set_data_blkaddr(dn); - f2fs_update_extent_cache(dn); - f2fs_wait_on_page_writeback(page, DATA); + fio.old_blkaddr = dn->data_blkaddr; + write_data_page(dn, &fio); + f2fs_wait_on_page_writeback(page, DATA, true); if (dirty) inode_dec_dirty_pages(dn->inode); /* this converted inline_data should be recovered. */ - set_inode_flag(F2FS_I(dn->inode), FI_APPEND_WRITE); + set_inode_flag(dn->inode, FI_APPEND_WRITE); /* clear inline data and flag after data writeback */ truncate_inline_inode(dn->inode_page, 0); + clear_inline_node(dn->inode_page); clear_out: stat_dec_inline_inode(dn->inode); f2fs_clear_inline_inode(dn->inode); - sync_inode_page(dn); f2fs_put_dnode(dn); return 0; } @@ -157,7 +159,10 @@ int f2fs_convert_inline_inode(struct inode *inode) struct page *ipage, *page; int err = 0; - page = grab_cache_page(inode->i_mapping, 0); + if (!f2fs_has_inline_data(inode)) + return 0; + + page = f2fs_grab_cache_page(inode->i_mapping, 0, false); if (!page) return -ENOMEM; @@ -179,6 +184,9 @@ int f2fs_convert_inline_inode(struct inode *inode) f2fs_unlock_op(sbi); f2fs_put_page(page, 1); + + f2fs_balance_fs(sbi, dn.node_changed); + return err; } @@ -200,16 +208,17 @@ int f2fs_write_inline_data(struct inode *inode, struct page *page) f2fs_bug_on(F2FS_I_SB(inode), page->index); - f2fs_wait_on_page_writeback(dn.inode_page, NODE); + f2fs_wait_on_page_writeback(dn.inode_page, NODE, true); src_addr = kmap_atomic(page); dst_addr = inline_data_addr(dn.inode_page); memcpy(dst_addr, src_addr, MAX_INLINE_DATA); kunmap_atomic(src_addr); + set_page_dirty(dn.inode_page); - set_inode_flag(F2FS_I(inode), FI_APPEND_WRITE); - set_inode_flag(F2FS_I(inode), FI_DATA_EXIST); + set_inode_flag(inode, FI_APPEND_WRITE); + set_inode_flag(inode, FI_DATA_EXIST); - sync_inode_page(&dn); + clear_inline_node(dn.inode_page); f2fs_put_dnode(&dn); return 0; } @@ -238,16 +247,16 @@ bool recover_inline_data(struct inode *inode, struct page *npage) ipage = get_node_page(sbi, inode->i_ino); f2fs_bug_on(sbi, IS_ERR(ipage)); - f2fs_wait_on_page_writeback(ipage, NODE); + f2fs_wait_on_page_writeback(ipage, NODE, true); src_addr = inline_data_addr(npage); dst_addr = inline_data_addr(ipage); memcpy(dst_addr, src_addr, MAX_INLINE_DATA); - set_inode_flag(F2FS_I(inode), FI_INLINE_DATA); - set_inode_flag(F2FS_I(inode), FI_DATA_EXIST); + set_inode_flag(inode, FI_INLINE_DATA); + set_inode_flag(inode, FI_DATA_EXIST); - update_inode(inode, ipage); + set_page_dirty(ipage); f2fs_put_page(ipage, 1); return true; } @@ -255,65 +264,47 @@ bool recover_inline_data(struct inode *inode, struct page *npage) if (f2fs_has_inline_data(inode)) { ipage = get_node_page(sbi, inode->i_ino); f2fs_bug_on(sbi, IS_ERR(ipage)); - truncate_inline_inode(ipage, 0); + if (!truncate_inline_inode(ipage, 0)) + return false; f2fs_clear_inline_inode(inode); - update_inode(inode, ipage); f2fs_put_page(ipage, 1); } else if (ri && (ri->i_inline & F2FS_INLINE_DATA)) { - truncate_blocks(inode, 0, false); + if (truncate_blocks(inode, 0, false)) + return false; goto process_inline; } return false; } struct f2fs_dir_entry *find_in_inline_dir(struct inode *dir, - struct qstr *name, struct page **res_page) + struct fscrypt_name *fname, struct page **res_page) { struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb); struct f2fs_inline_dentry *inline_dentry; + struct qstr name = FSTR_TO_QSTR(&fname->disk_name); struct f2fs_dir_entry *de; struct f2fs_dentry_ptr d; struct page *ipage; + f2fs_hash_t namehash; ipage = get_node_page(sbi, dir->i_ino); - if (IS_ERR(ipage)) + if (IS_ERR(ipage)) { + *res_page = 
ipage; return NULL; + } - inline_dentry = inline_data_addr(ipage); + namehash = f2fs_dentry_hash(&name); - make_dentry_ptr(&d, (void *)inline_dentry, 2); - de = find_target_dentry(name, NULL, &d); + inline_dentry = inline_data_addr(ipage); + make_dentry_ptr(NULL, &d, (void *)inline_dentry, 2); + de = find_target_dentry(fname, namehash, NULL, &d); unlock_page(ipage); if (de) *res_page = ipage; else f2fs_put_page(ipage, 0); - /* - * For the most part, it should be a bug when name_len is zero. - * We stop here for figuring out where the bugs has occurred. - */ - f2fs_bug_on(sbi, d.max < 0); - return de; -} - -struct f2fs_dir_entry *f2fs_parent_inline_dir(struct inode *dir, - struct page **p) -{ - struct f2fs_sb_info *sbi = F2FS_I_SB(dir); - struct page *ipage; - struct f2fs_dir_entry *de; - struct f2fs_inline_dentry *dentry_blk; - - ipage = get_node_page(sbi, dir->i_ino); - if (IS_ERR(ipage)) - return NULL; - - dentry_blk = inline_data_addr(ipage); - de = &dentry_blk->dentry[1]; - *p = ipage; - unlock_page(ipage); return de; } @@ -325,20 +316,22 @@ int make_empty_inline_dir(struct inode *inode, struct inode *parent, dentry_blk = inline_data_addr(ipage); - make_dentry_ptr(&d, (void *)dentry_blk, 2); + make_dentry_ptr(NULL, &d, (void *)dentry_blk, 2); do_make_empty_dir(inode, parent, &d); set_page_dirty(ipage); /* update i_size to MAX_INLINE_DATA */ - if (i_size_read(inode) < MAX_INLINE_DATA) { - i_size_write(inode, MAX_INLINE_DATA); - set_inode_flag(F2FS_I(inode), FI_UPDATE_DIR); - } + if (i_size_read(inode) < MAX_INLINE_DATA) + f2fs_i_size_write(inode, MAX_INLINE_DATA); return 0; } -static int f2fs_convert_inline_dir(struct inode *dir, struct page *ipage, +/* + * NOTE: ipage is grabbed by the caller, but if any error occurs, we should + * release ipage in this function. + */ +static int f2fs_move_inline_dirents(struct inode *dir, struct page *ipage, struct f2fs_inline_dentry *inline_dentry) { struct page *page; @@ -346,49 +339,154 @@ static int f2fs_convert_inline_dir(struct inode *dir, struct page *ipage, struct f2fs_dentry_block *dentry_blk; int err; - page = grab_cache_page(dir->i_mapping, 0); - if (!page) + page = f2fs_grab_cache_page(dir->i_mapping, 0, false); + if (!page) { + f2fs_put_page(ipage, 1); return -ENOMEM; + } set_new_dnode(&dn, dir, ipage, NULL, 0); err = f2fs_reserve_block(&dn, 0); if (err) goto out; - f2fs_wait_on_page_writeback(page, DATA); - zero_user_segment(page, 0, PAGE_CACHE_SIZE); + f2fs_wait_on_page_writeback(page, DATA, true); + zero_user_segment(page, MAX_INLINE_DATA, PAGE_SIZE); dentry_blk = kmap_atomic(page); /* copy data from inline dentry block to new dentry block */ memcpy(dentry_blk->dentry_bitmap, inline_dentry->dentry_bitmap, INLINE_DENTRY_BITMAP_SIZE); + memset(dentry_blk->dentry_bitmap + INLINE_DENTRY_BITMAP_SIZE, 0, + SIZE_OF_DENTRY_BITMAP - INLINE_DENTRY_BITMAP_SIZE); + /* + * we do not need to zero out the remainder of the dentry and filename + * fields, since we use the bitmap to mark their usage status; besides, + * we can also skip copying/zeroing the reserved space of the dentry + * block, because it has not been used so far.
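For illustration (not part of the patch): the comment above is the key to the new memset() — only the tail of the dentry bitmap has to be zeroed, because a slot whose bit is clear is dead no matter what bytes its dentry and filename fields still hold. A small userspace sketch with made-up sizes (the real INLINE_DENTRY_BITMAP_SIZE and SIZE_OF_DENTRY_BITMAP are larger):

#include <stdio.h>
#include <string.h>

#define INLINE_BITMAP_SIZE 4	/* stand-in for INLINE_DENTRY_BITMAP_SIZE */
#define BLOCK_BITMAP_SIZE 8	/* stand-in for SIZE_OF_DENTRY_BITMAP */

int main(void)
{
	unsigned char inline_bitmap[INLINE_BITMAP_SIZE] = { 0xff, 0x01, 0x00, 0x00 };
	unsigned char block_bitmap[BLOCK_BITMAP_SIZE];
	int i;

	memset(block_bitmap, 0xaa, sizeof(block_bitmap));	/* stale junk */

	/* copy the inline bitmap, then zero only the bitmap remainder */
	memcpy(block_bitmap, inline_bitmap, INLINE_BITMAP_SIZE);
	memset(block_bitmap + INLINE_BITMAP_SIZE, 0,
	       BLOCK_BITMAP_SIZE - INLINE_BITMAP_SIZE);

	/* slots whose bits are clear are dead, whatever their bytes hold */
	for (i = 0; i < BLOCK_BITMAP_SIZE; i++)
		printf("%02x ", block_bitmap[i]);
	printf("\n");
	return 0;
}

The stale 0xaa bytes beyond the bitmap are never consulted, which is why the conversion can skip zeroing the dentry, filename, and reserved areas of the new block.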
+ */ memcpy(dentry_blk->dentry, inline_dentry->dentry, sizeof(struct f2fs_dir_entry) * NR_INLINE_DENTRY); memcpy(dentry_blk->filename, inline_dentry->filename, NR_INLINE_DENTRY * F2FS_SLOT_LEN); kunmap_atomic(dentry_blk); - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); set_page_dirty(page); /* clear inline dir and flag after data writeback */ truncate_inline_inode(ipage, 0); stat_dec_inline_dir(dir); - clear_inode_flag(F2FS_I(dir), FI_INLINE_DENTRY); - - if (i_size_read(dir) < PAGE_CACHE_SIZE) { - i_size_write(dir, PAGE_CACHE_SIZE); - set_inode_flag(F2FS_I(dir), FI_UPDATE_DIR); - } + clear_inode_flag(dir, FI_INLINE_DENTRY); - sync_inode_page(&dn); + f2fs_i_depth_write(dir, 1); + if (i_size_read(dir) < PAGE_SIZE) + f2fs_i_size_write(dir, PAGE_SIZE); out: f2fs_put_page(page, 1); return err; } +static int f2fs_add_inline_entries(struct inode *dir, + struct f2fs_inline_dentry *inline_dentry) +{ + struct f2fs_dentry_ptr d; + unsigned long bit_pos = 0; + int err = 0; + + make_dentry_ptr(NULL, &d, (void *)inline_dentry, 2); + + while (bit_pos < d.max) { + struct f2fs_dir_entry *de; + struct qstr new_name; + nid_t ino; + umode_t fake_mode; + + if (!test_bit_le(bit_pos, d.bitmap)) { + bit_pos++; + continue; + } + + de = &d.dentry[bit_pos]; + + if (unlikely(!de->name_len)) { + bit_pos++; + continue; + } + + new_name.name = d.filename[bit_pos]; + new_name.len = de->name_len; + + ino = le32_to_cpu(de->ino); + fake_mode = get_de_type(de) << S_SHIFT; + + err = f2fs_add_regular_entry(dir, &new_name, NULL, + ino, fake_mode); + if (err) + goto punch_dentry_pages; + + bit_pos += GET_DENTRY_SLOTS(le16_to_cpu(de->name_len)); + } + return 0; +punch_dentry_pages: + truncate_inode_pages(&dir->i_data, 0); + truncate_blocks(dir, 0, false); + remove_dirty_inode(dir); + return err; +} + +static int f2fs_move_rehashed_dirents(struct inode *dir, struct page *ipage, + struct f2fs_inline_dentry *inline_dentry) +{ + struct f2fs_inline_dentry *backup_dentry; + int err; + + backup_dentry = f2fs_kmalloc(sizeof(struct f2fs_inline_dentry), + GFP_F2FS_ZERO); + if (!backup_dentry) { + f2fs_put_page(ipage, 1); + return -ENOMEM; + } + + memcpy(backup_dentry, inline_dentry, MAX_INLINE_DATA); + truncate_inline_inode(ipage, 0); + + unlock_page(ipage); + + err = f2fs_add_inline_entries(dir, backup_dentry); + if (err) + goto recover; + + lock_page(ipage); + + stat_dec_inline_dir(dir); + clear_inode_flag(dir, FI_INLINE_DENTRY); + kfree(backup_dentry); + return 0; +recover: + lock_page(ipage); + memcpy(inline_dentry, backup_dentry, MAX_INLINE_DATA); + f2fs_i_depth_write(dir, 0); + f2fs_i_size_write(dir, MAX_INLINE_DATA); + set_page_dirty(ipage); + f2fs_put_page(ipage, 1); + + kfree(backup_dentry); + return err; +} + +static int f2fs_convert_inline_dir(struct inode *dir, struct page *ipage, + struct f2fs_inline_dentry *inline_dentry) +{ + if (!F2FS_I(dir)->i_dir_level) + return f2fs_move_inline_dirents(dir, ipage, inline_dentry); + else + return f2fs_move_rehashed_dirents(dir, ipage, inline_dentry); +} + int f2fs_add_inline_entry(struct inode *dir, const struct qstr *name, struct inode *inode, nid_t ino, umode_t mode) { @@ -412,8 +510,9 @@ int f2fs_add_inline_entry(struct inode *dir, const struct qstr *name, slots, NR_INLINE_DENTRY); if (bit_pos >= NR_INLINE_DENTRY) { err = f2fs_convert_inline_dir(dir, ipage, dentry_blk); - if (!err) - err = -EAGAIN; + if (err) + return err; + err = -EAGAIN; goto out; } @@ -426,18 +525,17 @@ int f2fs_add_inline_entry(struct inode *dir, const struct qstr *name, } } - 
f2fs_wait_on_page_writeback(ipage, NODE); + f2fs_wait_on_page_writeback(ipage, NODE, true); name_hash = f2fs_dentry_hash(name); - make_dentry_ptr(&d, (void *)dentry_blk, 2); + make_dentry_ptr(NULL, &d, (void *)dentry_blk, 2); f2fs_update_dentry(ino, mode, &d, name, name_hash, bit_pos); set_page_dirty(ipage); /* we don't need to mark_inode_dirty now */ if (inode) { - F2FS_I(inode)->i_pino = dir->i_ino; - update_inode(inode, page); + f2fs_i_pino_write(inode, dir->i_ino); f2fs_put_page(page, 1); } @@ -445,11 +543,6 @@ int f2fs_add_inline_entry(struct inode *dir, const struct qstr *name, fail: if (inode) up_write(&F2FS_I(inode)->i_sem); - - if (is_inode_flag_set(F2FS_I(dir), FI_UPDATE_DIR)) { - update_inode(dir, ipage); - clear_inode_flag(F2FS_I(dir), FI_UPDATE_DIR); - } out: f2fs_put_page(ipage, 1); return err; @@ -464,7 +557,7 @@ void f2fs_delete_inline_entry(struct f2fs_dir_entry *dentry, struct page *page, int i; lock_page(page); - f2fs_wait_on_page_writeback(page, NODE); + f2fs_wait_on_page_writeback(page, NODE, true); inline_dentry = inline_data_addr(page); bit_pos = dentry - inline_dentry->dentry; @@ -473,13 +566,13 @@ void f2fs_delete_inline_entry(struct f2fs_dir_entry *dentry, struct page *page, &inline_dentry->dentry_bitmap); set_page_dirty(page); + f2fs_put_page(page, 1); dir->i_ctime = dir->i_mtime = CURRENT_TIME; + f2fs_mark_inode_dirty_sync(dir); if (inode) - f2fs_drop_nlink(dir, inode, page); - - f2fs_put_page(page, 1); + f2fs_drop_nlink(dir, inode); } bool f2fs_empty_inline_dir(struct inode *dir) @@ -506,7 +599,8 @@ bool f2fs_empty_inline_dir(struct inode *dir) return true; } -int f2fs_read_inline_dir(struct file *file, void *dirent, filldir_t filldir) +int f2fs_read_inline_dir(struct file *file, void *dirent, filldir_t filldir, + struct fscrypt_str *fstr) { unsigned long pos = file->f_pos; unsigned int bit_pos = 0; @@ -526,11 +620,46 @@ int f2fs_read_inline_dir(struct file *file, void *dirent, filldir_t filldir) inline_dentry = inline_data_addr(ipage); - make_dentry_ptr(&d, (void *)inline_dentry, 2); + make_dentry_ptr(inode, &d, (void *)inline_dentry, 2); - if (!f2fs_fill_dentries(file, dirent, filldir, &d, 0, bit_pos)) + if (!f2fs_fill_dentries(file, dirent, filldir, &d, 0, bit_pos, fstr)) file->f_pos = NR_INLINE_DENTRY; f2fs_put_page(ipage, 1); return 0; } + +int f2fs_inline_data_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, __u64 start, __u64 len) +{ + __u64 byteaddr, ilen; + __u32 flags = FIEMAP_EXTENT_DATA_INLINE | FIEMAP_EXTENT_NOT_ALIGNED | + FIEMAP_EXTENT_LAST; + struct node_info ni; + struct page *ipage; + int err = 0; + + ipage = get_node_page(F2FS_I_SB(inode), inode->i_ino); + if (IS_ERR(ipage)) + return PTR_ERR(ipage); + + if (!f2fs_has_inline_data(inode)) { + err = -EAGAIN; + goto out; + } + + ilen = min_t(size_t, MAX_INLINE_DATA, i_size_read(inode)); + if (start >= ilen) + goto out; + if (start + len < ilen) + ilen = start + len; + ilen -= start; + + get_node_info(F2FS_I_SB(inode), inode->i_ino, &ni); + byteaddr = (__u64)ni.blk_addr << inode->i_sb->s_blocksize_bits; + byteaddr += (char *)inline_data_addr(ipage) - (char *)F2FS_INODE(ipage); + err = fiemap_fill_next_extent(fieinfo, start, byteaddr, ilen, flags); +out: + f2fs_put_page(ipage, 1); + return err; +} diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c old mode 100644 new mode 100755 index 45e67f2bcccc..b9b7459f85de --- a/fs/f2fs/inode.c +++ b/fs/f2fs/inode.c @@ -18,6 +18,13 @@ #include +void f2fs_mark_inode_dirty_sync(struct inode *inode) +{ + if (f2fs_inode_dirtied(inode)) + return; + 
mark_inode_dirty_sync(inode); +} + void f2fs_set_inode_flags(struct inode *inode) { unsigned int flags = F2FS_I(inode)->i_flags; @@ -35,6 +42,7 @@ void f2fs_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & FS_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; + f2fs_mark_inode_dirty_sync(inode); } static void __get_inode_rdev(struct inode *inode, struct f2fs_inode *ri) @@ -83,10 +91,10 @@ static void __recover_inline_status(struct inode *inode, struct page *ipage) while (start < end) { if (*start++) { - f2fs_wait_on_page_writeback(ipage, NODE); + f2fs_wait_on_page_writeback(ipage, NODE, true); - set_inode_flag(F2FS_I(inode), FI_DATA_EXIST); - set_raw_inline(F2FS_I(inode), F2FS_INODE(ipage)); + set_inode_flag(inode, FI_DATA_EXIST); + set_raw_inline(inode, F2FS_INODE(ipage)); set_page_dirty(ipage); return; } @@ -138,9 +146,10 @@ static int do_read_inode(struct inode *inode) fi->i_pino = le32_to_cpu(ri->i_pino); fi->i_dir_level = ri->i_dir_level; - f2fs_init_extent_cache(inode, &ri->i_ext); + if (f2fs_init_extent_tree(inode, &ri->i_ext)) + set_page_dirty(node_page); - get_inline_info(fi, ri); + get_inline_info(inode, ri); /* check data exist */ if (f2fs_has_inline_data(inode) && !f2fs_exist_data(inode)) @@ -150,10 +159,14 @@ static int do_read_inode(struct inode *inode) __get_inode_rdev(inode, ri); if (__written_first_block(ri)) - set_inode_flag(F2FS_I(inode), FI_FIRST_BLOCK_WRITTEN); + set_inode_flag(inode, FI_FIRST_BLOCK_WRITTEN); + + if (!need_inode_block_update(sbi, inode->i_ino)) + fi->last_disk_size = inode->i_size; f2fs_put_page(node_page, 1); + stat_inc_inline_xattr(inode); stat_inc_inline_inode(inode); stat_inc_inline_dir(inode); @@ -197,7 +210,11 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino) inode->i_mapping->a_ops = &f2fs_dblock_aops; mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_HIGH_ZERO); } else if (S_ISLNK(inode->i_mode)) { - inode->i_op = &f2fs_symlink_inode_operations; + if (f2fs_encrypted_inode(inode)) + inode->i_op = &f2fs_encrypted_symlink_inode_operations; + else + inode->i_op = &f2fs_symlink_inode_operations; + inode_nohighmem(inode); inode->i_mapping->a_ops = &f2fs_dblock_aops; } else if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) || S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) { @@ -217,11 +234,13 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino) return ERR_PTR(ret); } -void update_inode(struct inode *inode, struct page *node_page) +int update_inode(struct inode *inode, struct page *node_page) { struct f2fs_inode *ri; - f2fs_wait_on_page_writeback(node_page, NODE); + f2fs_inode_synced(inode); + + f2fs_wait_on_page_writeback(node_page, NODE, true); ri = F2FS_INODE(node_page); @@ -233,11 +252,12 @@ void update_inode(struct inode *inode, struct page *node_page) ri->i_size = cpu_to_le64(i_size_read(inode)); ri->i_blocks = cpu_to_le64(inode->i_blocks); - read_lock(&F2FS_I(inode)->ext_lock); - set_raw_extent(&F2FS_I(inode)->ext, &ri->i_ext); - read_unlock(&F2FS_I(inode)->ext_lock); - - set_raw_inline(F2FS_I(inode), ri); + if (F2FS_I(inode)->extent_tree) + set_raw_extent(&F2FS_I(inode)->extent_tree->largest, + &ri->i_ext); + else + memset(&ri->i_ext, 0, sizeof(ri->i_ext)); + set_raw_inline(inode, ri); ri->i_atime = cpu_to_le64(inode->i_atime.tv_sec); ri->i_ctime = cpu_to_le64(inode->i_ctime.tv_sec); @@ -254,15 +274,19 @@ void update_inode(struct inode *inode, struct page *node_page) __set_inode_rdev(inode, ri); set_cold_node(inode, node_page); - set_page_dirty(node_page); - 
clear_inode_flag(F2FS_I(inode), FI_DIRTY_INODE); + /* deleted inode */ + if (inode->i_nlink == 0) + clear_inline_node(node_page); + + return set_page_dirty(node_page); } -void update_inode_page(struct inode *inode) +int update_inode_page(struct inode *inode) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); struct page *node_page; + int ret = 0; retry: node_page = get_node_page(sbi, inode->i_ino); if (IS_ERR(node_page)) { @@ -271,12 +295,14 @@ void update_inode_page(struct inode *inode) cond_resched(); goto retry; } else if (err != -ENOENT) { - f2fs_stop_checkpoint(sbi); + f2fs_stop_checkpoint(sbi, false); } - return; + f2fs_inode_synced(inode); + return 0; } - update_inode(inode, node_page); + ret = update_inode(inode, node_page); f2fs_put_page(node_page, 1); + return ret; } int f2fs_write_inode(struct inode *inode, struct writeback_control *wbc) @@ -287,20 +313,15 @@ int f2fs_write_inode(struct inode *inode, struct writeback_control *wbc) inode->i_ino == F2FS_META_INO(sbi)) return 0; - if (!is_inode_flag_set(F2FS_I(inode), FI_DIRTY_INODE)) + if (!is_inode_flag_set(inode, FI_DIRTY_INODE)) return 0; /* - * We need to lock here to prevent from producing dirty node pages + * We need to balance fs here to prevent from producing dirty node pages * during the urgent cleaning time when runing out of free sections. */ - f2fs_lock_op(sbi); - update_inode_page(inode); - f2fs_unlock_op(sbi); - - if (wbc) - f2fs_balance_fs(sbi); - + if (update_inode_page(inode)) + f2fs_balance_fs(sbi, true); return 0; } @@ -311,10 +332,11 @@ void f2fs_evict_inode(struct inode *inode) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); nid_t xnid = F2FS_I(inode)->i_xattr_nid; + int err = 0; /* some remained atomic pages should discarded */ if (f2fs_is_atomic_file(inode)) - commit_inmem_pages(inode, true); + drop_inmem_pages(inode); trace_f2fs_evict_inode(inode); truncate_inode_pages(&inode->i_data, 0); @@ -324,38 +346,58 @@ void f2fs_evict_inode(struct inode *inode) goto out_clear; f2fs_bug_on(sbi, get_dirty_pages(inode)); - remove_dirty_dir_inode(inode); + remove_dirty_inode(inode); + + f2fs_destroy_extent_tree(inode); if (inode->i_nlink || is_bad_inode(inode)) goto no_delete; - set_inode_flag(F2FS_I(inode), FI_NO_ALLOC); - i_size_write(inode, 0); +#ifdef CONFIG_F2FS_FAULT_INJECTION + if (time_to_inject(FAULT_EVICT_INODE)) + goto no_delete; +#endif + set_inode_flag(inode, FI_NO_ALLOC); + i_size_write(inode, 0); +retry: if (F2FS_HAS_BLOCKS(inode)) - f2fs_truncate(inode); + err = f2fs_truncate(inode); - f2fs_lock_op(sbi); - remove_inode_page(inode); - f2fs_unlock_op(sbi); + if (!err) { + f2fs_lock_op(sbi); + err = remove_inode_page(inode); + f2fs_unlock_op(sbi); + } + /* give more chances, if ENOMEM case */ + if (err == -ENOMEM) { + err = 0; + goto retry; + } + + if (err) + update_inode_page(inode); no_delete: + stat_dec_inline_xattr(inode); stat_dec_inline_dir(inode); stat_dec_inline_inode(inode); - /* update extent info in inode */ - if (inode->i_nlink) - f2fs_preserve_extent_tree(inode); - f2fs_destroy_extent_tree(inode); - invalidate_mapping_pages(NODE_MAPPING(sbi), inode->i_ino, inode->i_ino); if (xnid) invalidate_mapping_pages(NODE_MAPPING(sbi), xnid, xnid); - if (is_inode_flag_set(F2FS_I(inode), FI_APPEND_WRITE)) - add_dirty_inode(sbi, inode->i_ino, APPEND_INO); - if (is_inode_flag_set(F2FS_I(inode), FI_UPDATE_WRITE)) - add_dirty_inode(sbi, inode->i_ino, UPDATE_INO); + if (is_inode_flag_set(inode, FI_APPEND_WRITE)) + add_ino_entry(sbi, inode->i_ino, APPEND_INO); + if (is_inode_flag_set(inode, FI_UPDATE_WRITE)) + 
add_ino_entry(sbi, inode->i_ino, UPDATE_INO); + if (is_inode_flag_set(inode, FI_FREE_NID)) { + alloc_nid_failed(sbi, inode->i_ino); + clear_inode_flag(inode, FI_FREE_NID); + } + f2fs_bug_on(sbi, err && + !exist_written_data(sbi, inode->i_ino, ORPHAN_INO)); out_clear: + fscrypt_put_encryption_info(inode, NULL); end_writeback(inode); } @@ -363,20 +405,32 @@ void f2fs_evict_inode(struct inode *inode) void handle_failed_inode(struct inode *inode) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct node_info ni; - clear_nlink(inode); - make_bad_inode(inode); + /* don't make bad inode, since it becomes a regular file. */ unlock_new_inode(inode); - i_size_write(inode, 0); - if (F2FS_HAS_BLOCKS(inode)) - f2fs_truncate(inode); - - remove_inode_page(inode); + /* + * Note: we should add the inode to the orphan list before + * f2fs_unlock_op(), so we can prevent losing this orphan when + * encountering a checkpoint followed by a sudden power-off. + */ + get_node_info(sbi, inode->i_ino, &ni); + + if (ni.blk_addr != NULL_ADDR) { + int err = acquire_orphan_inode(sbi); + if (err) { + set_sbi_flag(sbi, SBI_NEED_FSCK); + f2fs_msg(sbi->sb, KERN_WARNING, + "Too many orphan inodes, run fsck to fix."); + } else { + add_orphan_inode(inode); + } + alloc_nid_done(sbi, inode->i_ino); + } else { + set_inode_flag(inode, FI_FREE_NID); + } - clear_inode_flag(F2FS_I(inode), FI_INLINE_DATA); - clear_inode_flag(F2FS_I(inode), FI_INLINE_DENTRY); - alloc_nid_failed(sbi, inode->i_ino); f2fs_unlock_op(sbi); /* iput will drop the inode object */ diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c old mode 100644 new mode 100755 index 8d8cdf72a525..b8c97346efa3 --- a/fs/f2fs/namei.c +++ b/fs/f2fs/namei.c @@ -53,27 +53,37 @@ static struct inode *f2fs_new_inode(struct inode *dir, umode_t mode) if (err) { err = -EINVAL; nid_free = true; - goto out; + goto fail; } - if (f2fs_may_inline(inode)) - set_inode_flag(F2FS_I(inode), FI_INLINE_DATA); - if (test_opt(sbi, INLINE_DENTRY) && S_ISDIR(inode->i_mode)) - set_inode_flag(F2FS_I(inode), FI_INLINE_DENTRY); + /* If the directory is encrypted, then we should encrypt the inode. */ + if (f2fs_encrypted_inode(dir) && f2fs_may_encrypt(inode)) + f2fs_set_encrypted_inode(inode); + + set_inode_flag(inode, FI_NEW_INODE); + + if (test_opt(sbi, INLINE_XATTR)) + set_inode_flag(inode, FI_INLINE_XATTR); + if (test_opt(sbi, INLINE_DATA) && f2fs_may_inline_data(inode)) + set_inode_flag(inode, FI_INLINE_DATA); + if (f2fs_may_inline_dentry(inode)) + set_inode_flag(inode, FI_INLINE_DENTRY); + + f2fs_init_extent_tree(inode, NULL); + + stat_inc_inline_xattr(inode); + stat_inc_inline_inode(inode); + stat_inc_inline_dir(inode); trace_f2fs_new_inode(inode, 0); - mark_inode_dirty(inode); return inode; -out: - clear_nlink(inode); - unlock_new_inode(inode); fail: trace_f2fs_new_inode(inode, err); make_bad_inode(inode); - iput(inode); if (nid_free) - alloc_nid_failed(sbi, ino); + set_inode_flag(inode, FI_FREE_NID); + iput(inode); return ERR_PTR(err); } @@ -82,7 +92,14 @@ static int is_multimedia_file(const unsigned char *s, const char *sub) size_t slen = strlen(s); size_t sublen = strlen(sub); - if (sublen > slen) + /* + * the filename format of a multimedia file should be defined as: + * "filename + '.' + extension".
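For illustration (not part of the patch): the stricter check that follows this comment can be exercised in isolation. This userspace copy of is_multimedia_file() shows why both new conditions are needed — a bare extension, a dot-file, or a name merely ending in the extension letters must not match:

#include <stdio.h>
#include <string.h>
#include <strings.h>	/* strncasecmp() */

static int is_multimedia_file(const char *s, const char *sub)
{
	size_t slen = strlen(s);
	size_t sublen = strlen(sub);

	/* shortest legal form is one name byte + '.' + extension */
	if (slen < sublen + 2)
		return 0;
	/* the extension must be preceded by a literal '.' */
	if (s[slen - sublen - 1] != '.')
		return 0;
	return !strncasecmp(s + slen - sublen, sub, sublen);
}

int main(void)
{
	printf("%d\n", is_multimedia_file("a.mp4", "mp4"));	/* 1 */
	printf("%d\n", is_multimedia_file("mp4", "mp4"));	/* 0: no name, no dot */
	printf("%d\n", is_multimedia_file(".mp4", "mp4"));	/* 0: empty filename */
	printf("%d\n", is_multimedia_file("axmp4", "mp4"));	/* 0: no '.' separator */
	return 0;
}

The old "sublen > slen" test would have accepted the last three names, wrongly marking them as cold multimedia data.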
+ */ + if (slen < sublen + 2) + return 0; + + if (s[slen - sublen - 1] != '.') return 0; return !strncasecmp(s + slen - sublen, sub, sublen); @@ -114,8 +131,6 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode, nid_t ino = 0; int err; - f2fs_balance_fs(sbi); - inode = f2fs_new_inode(dir, mode); if (IS_ERR(inode)) return PTR_ERR(inode); @@ -128,6 +143,8 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode, inode->i_mapping->a_ops = &f2fs_dblock_aops; ino = inode->i_ino; + f2fs_balance_fs(sbi, true); + f2fs_lock_op(sbi); err = f2fs_add_link(dentry, inode); if (err) @@ -136,7 +153,6 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode, alloc_nid_done(sbi, ino); - stat_inc_inline_inode(inode); d_instantiate(dentry, inode); unlock_new_inode(inode); @@ -151,16 +167,20 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode, static int f2fs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) { - struct inode *inode = old_dentry->d_inode; + struct inode *inode = d_inode(old_dentry); struct f2fs_sb_info *sbi = F2FS_I_SB(dir); int err; - f2fs_balance_fs(sbi); + if (f2fs_encrypted_inode(dir) && + !fscrypt_has_permitted_context(dir, inode)) + return -EPERM; + + f2fs_balance_fs(sbi, true); inode->i_ctime = CURRENT_TIME; ihold(inode); - set_inode_flag(F2FS_I(inode), FI_INC_LINK); + set_inode_flag(inode, FI_INC_LINK); f2fs_lock_op(sbi); err = f2fs_add_link(dentry, inode); if (err) @@ -173,7 +193,7 @@ static int f2fs_link(struct dentry *old_dentry, struct inode *dir, f2fs_sync_fs(sbi->sb, 1); return 0; out: - clear_inode_flag(F2FS_I(inode), FI_INC_LINK); + clear_inode_flag(inode, FI_INC_LINK); iput(inode); f2fs_unlock_op(sbi); return err; @@ -182,27 +202,43 @@ static int f2fs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *f2fs_get_parent(struct dentry *child) { struct qstr dotdot = {.len = 2, .name = ".."}; - unsigned long ino = f2fs_inode_by_name(child->d_inode, &dotdot); - if (!ino) + struct page *page; + unsigned long ino = f2fs_inode_by_name(d_inode(child), &dotdot, &page); + if (!ino) { + if (IS_ERR(page)) + return ERR_CAST(page); return ERR_PTR(-ENOENT); - return d_obtain_alias(f2fs_iget(child->d_inode->i_sb, ino)); + } + return d_obtain_alias(f2fs_iget(child->d_sb, ino)); } static int __recover_dot_dentries(struct inode *dir, nid_t pino) { struct f2fs_sb_info *sbi = F2FS_I_SB(dir); - struct qstr dot = {.len = 1, .name = "."}; - struct qstr dotdot = {.len = 2, .name = ".."}; + struct qstr dot = QSTR_INIT(".", 1); + struct qstr dotdot = QSTR_INIT("..", 2); struct f2fs_dir_entry *de; struct page *page; int err = 0; + if (f2fs_readonly(sbi->sb)) { + f2fs_msg(sbi->sb, KERN_INFO, + "skip recovering inline_dots inode (ino:%lu, pino:%u) " + "in readonly mountpoint", dir->i_ino, pino); + return 0; + } + + f2fs_balance_fs(sbi, true); + f2fs_lock_op(sbi); de = f2fs_find_entry(dir, &dot, &page); if (de) { f2fs_dentry_kunmap(dir, page); f2fs_put_page(page, 0); + } else if (IS_ERR(page)) { + err = PTR_ERR(page); + goto out; } else { err = __f2fs_add_link(dir, &dot, NULL, dir->i_ino, S_IFDIR); if (err) @@ -213,14 +249,14 @@ static int __recover_dot_dentries(struct inode *dir, nid_t pino) if (de) { f2fs_dentry_kunmap(dir, page); f2fs_put_page(page, 0); + } else if (IS_ERR(page)) { + err = PTR_ERR(page); } else { err = __f2fs_add_link(dir, &dotdot, NULL, pino, S_IFDIR); } out: - if (!err) { - clear_inode_flag(F2FS_I(dir), FI_INLINE_DOTS); - mark_inode_dirty(dir); - 
} + if (!err) + clear_inode_flag(dir, FI_INLINE_DOTS); f2fs_unlock_op(sbi); return err; @@ -232,48 +268,87 @@ static struct dentry *f2fs_lookup(struct inode *dir, struct dentry *dentry, struct inode *inode = NULL; struct f2fs_dir_entry *de; struct page *page; + nid_t ino; + int err = 0; + unsigned int root_ino = F2FS_ROOT_INO(F2FS_I_SB(dir)); + + if (f2fs_encrypted_inode(dir)) { + int res = fscrypt_get_encryption_info(dir); + + /* + * DCACHE_ENCRYPTED_WITH_KEY is set if the dentry is + * created while the directory was encrypted and we + * don't have access to the key. + */ + if (fscrypt_has_encryption_key(dir)) + fscrypt_set_encrypted_dentry(dentry); + fscrypt_set_d_op(dentry); + if (res && res != -ENOKEY) + return ERR_PTR(res); + } if (dentry->d_name.len > F2FS_NAME_LEN) return ERR_PTR(-ENAMETOOLONG); de = f2fs_find_entry(dir, &dentry->d_name, &page); - if (de) { - nid_t ino = le32_to_cpu(de->ino); - f2fs_dentry_kunmap(dir, page); - f2fs_put_page(page, 0); + if (!de) { + if (IS_ERR(page)) + return (struct dentry *)page; + return d_splice_alias(inode, dentry); + } - inode = f2fs_iget(dir->i_sb, ino); - if (IS_ERR(inode)) - return ERR_CAST(inode); + ino = le32_to_cpu(de->ino); + f2fs_dentry_kunmap(dir, page); + f2fs_put_page(page, 0); - if (f2fs_has_inline_dots(inode)) { - int err; + inode = f2fs_iget(dir->i_sb, ino); + if (IS_ERR(inode)) + return ERR_CAST(inode); - err = __recover_dot_dentries(inode, dir->i_ino); - if (err) { - iget_failed(inode); - return ERR_PTR(err); - } - } + if ((dir->i_ino == root_ino) && f2fs_has_inline_dots(dir)) { + err = __recover_dot_dentries(dir, root_ino); + if (err) + goto err_out; } + if (f2fs_has_inline_dots(inode)) { + err = __recover_dot_dentries(inode, dir->i_ino); + if (err) + goto err_out; + } + if (!IS_ERR(inode) && f2fs_encrypted_inode(dir) && + (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode)) && + !fscrypt_has_permitted_context(dir, inode)) { + bool nokey = f2fs_encrypted_inode(inode) && + !fscrypt_has_encryption_key(inode); + err = nokey ? 
-ENOKEY : -EPERM; + goto err_out; + } return d_splice_alias(inode, dentry); + +err_out: + iput(inode); + return ERR_PTR(err); } static int f2fs_unlink(struct inode *dir, struct dentry *dentry) { struct f2fs_sb_info *sbi = F2FS_I_SB(dir); - struct inode *inode = dentry->d_inode; + struct inode *inode = d_inode(dentry); struct f2fs_dir_entry *de; struct page *page; int err = -ENOENT; trace_f2fs_unlink_enter(dir, dentry); - f2fs_balance_fs(sbi); de = f2fs_find_entry(dir, &dentry->d_name, &page); - if (!de) + if (!de) { + if (IS_ERR(page)) + err = PTR_ERR(page); goto fail; + } + + f2fs_balance_fs(sbi, true); f2fs_lock_op(sbi); err = acquire_orphan_inode(sbi); @@ -286,9 +361,6 @@ static int f2fs_unlink(struct inode *dir, struct dentry *dentry) f2fs_delete_entry(de, page, dir, inode); f2fs_unlock_op(sbi); - /* In order to evict this inode, we set it dirty */ - mark_inode_dirty(inode); - if (IS_DIRSYNC(dir)) f2fs_sync_fs(sbi->sb, 1); fail: @@ -299,13 +371,18 @@ static int f2fs_unlink(struct inode *dir, struct dentry *dentry) static void *f2fs_follow_link(struct dentry *dentry, struct nameidata *nd) { struct page *page; + char *link; page = page_follow_link_light(dentry, nd); if (IS_ERR(page)) return page; + link = nd_get_link(nd); + if (IS_ERR(link)) + return link; + /* this is broken symlink case */ - if (*nd_get_link(nd) == 0) { + if (*link == 0) { kunmap(page); page_cache_release(page); return ERR_PTR(-ENOENT); @@ -318,27 +395,78 @@ static int f2fs_symlink(struct inode *dir, struct dentry *dentry, { struct f2fs_sb_info *sbi = F2FS_I_SB(dir); struct inode *inode; - size_t symlen = strlen(symname) + 1; + size_t len = strlen(symname); + struct fscrypt_str disk_link = FSTR_INIT((char *)symname, len + 1); + struct fscrypt_symlink_data *sd = NULL; int err; - f2fs_balance_fs(sbi); + if (f2fs_encrypted_inode(dir)) { + err = fscrypt_get_encryption_info(dir); + if (err) + return err; + + if (!fscrypt_has_encryption_key(dir)) + return -EPERM; + + disk_link.len = (fscrypt_fname_encrypted_size(dir, len) + + sizeof(struct fscrypt_symlink_data)); + } + + if (disk_link.len > dir->i_sb->s_blocksize) + return -ENAMETOOLONG; inode = f2fs_new_inode(dir, S_IFLNK | S_IRWXUGO); if (IS_ERR(inode)) return PTR_ERR(inode); - inode->i_op = &f2fs_symlink_inode_operations; + if (f2fs_encrypted_inode(inode)) + inode->i_op = &f2fs_encrypted_symlink_inode_operations; + else + inode->i_op = &f2fs_symlink_inode_operations; + inode_nohighmem(inode); inode->i_mapping->a_ops = &f2fs_dblock_aops; + f2fs_balance_fs(sbi, true); + f2fs_lock_op(sbi); err = f2fs_add_link(dentry, inode); if (err) goto out; f2fs_unlock_op(sbi); - - err = page_symlink(inode, symname, symlen); alloc_nid_done(sbi, inode->i_ino); + if (f2fs_encrypted_inode(inode)) { + struct qstr istr = QSTR_INIT(symname, len); + struct fscrypt_str ostr; + + sd = kzalloc(disk_link.len, GFP_NOFS); + if (!sd) { + err = -ENOMEM; + goto err_out; + } + + err = fscrypt_get_encryption_info(inode); + if (err) + goto err_out; + + if (!fscrypt_has_encryption_key(inode)) { + err = -EPERM; + goto err_out; + } + + ostr.name = sd->encrypted_path; + ostr.len = disk_link.len; + err = fscrypt_fname_usr_to_disk(inode, &istr, &ostr); + if (err < 0) + goto err_out; + + sd->len = cpu_to_le16(ostr.len); + disk_link.name = (char *)sd; + } + + err = page_symlink(inode, disk_link.name, disk_link.len); + +err_out: d_instantiate(dentry, inode); unlock_new_inode(inode); @@ -351,10 +479,17 @@ static int f2fs_symlink(struct inode *dir, struct dentry *dentry, * If the symlink path is stored into 
inline_data, there is no * performance regression. */ - filemap_write_and_wait_range(inode->i_mapping, 0, symlen - 1); + if (!err) { + filemap_write_and_wait_range(inode->i_mapping, 0, + disk_link.len - 1); - if (IS_DIRSYNC(dir)) - f2fs_sync_fs(sbi->sb, 1); + if (IS_DIRSYNC(dir)) + f2fs_sync_fs(sbi->sb, 1); + } else { + f2fs_unlink(dir, dentry); + } + + kfree(sd); return err; out: handle_failed_inode(inode); @@ -367,8 +502,6 @@ static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) struct inode *inode; int err; - f2fs_balance_fs(sbi); - inode = f2fs_new_inode(dir, S_IFDIR | mode); if (IS_ERR(inode)) return PTR_ERR(inode); @@ -378,14 +511,15 @@ static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) inode->i_mapping->a_ops = &f2fs_dblock_aops; mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_HIGH_ZERO); - set_inode_flag(F2FS_I(inode), FI_INC_LINK); + f2fs_balance_fs(sbi, true); + + set_inode_flag(inode, FI_INC_LINK); f2fs_lock_op(sbi); err = f2fs_add_link(dentry, inode); if (err) goto out_fail; f2fs_unlock_op(sbi); - stat_inc_inline_dir(inode); alloc_nid_done(sbi, inode->i_ino); d_instantiate(dentry, inode); @@ -396,14 +530,14 @@ static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) return 0; out_fail: - clear_inode_flag(F2FS_I(inode), FI_INC_LINK); + clear_inode_flag(inode, FI_INC_LINK); handle_failed_inode(inode); return err; } static int f2fs_rmdir(struct inode *dir, struct dentry *dentry) { - struct inode *inode = dentry->d_inode; + struct inode *inode = d_inode(dentry); if (f2fs_empty_dir(inode)) return f2fs_unlink(dir, dentry); return -ENOTEMPTY; @@ -419,8 +553,6 @@ static int f2fs_mknod(struct inode *dir, struct dentry *dentry, if (!new_valid_dev(rdev)) return -EINVAL; - f2fs_balance_fs(sbi); - inode = f2fs_new_inode(dir, mode); if (IS_ERR(inode)) return PTR_ERR(inode); @@ -428,6 +560,8 @@ static int f2fs_mknod(struct inode *dir, struct dentry *dentry, init_special_inode(inode, inode->i_mode, rdev); inode->i_op = &f2fs_special_inode_operations; + f2fs_balance_fs(sbi, true); + f2fs_lock_op(sbi); err = f2fs_add_link(dentry, inode); if (err) @@ -451,26 +585,36 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) { struct f2fs_sb_info *sbi = F2FS_I_SB(old_dir); - struct inode *old_inode = old_dentry->d_inode; - struct inode *new_inode = new_dentry->d_inode; + struct inode *old_inode = d_inode(old_dentry); + struct inode *new_inode = d_inode(new_dentry); struct page *old_dir_page; struct page *old_page, *new_page; struct f2fs_dir_entry *old_dir_entry = NULL; struct f2fs_dir_entry *old_entry; struct f2fs_dir_entry *new_entry; + bool is_old_inline = f2fs_has_inline_dentry(old_dir); int err = -ENOENT; - f2fs_balance_fs(sbi); + if ((old_dir != new_dir) && f2fs_encrypted_inode(new_dir) && + !fscrypt_has_permitted_context(new_dir, old_inode)) { + err = -EPERM; + goto out; + } old_entry = f2fs_find_entry(old_dir, &old_dentry->d_name, &old_page); - if (!old_entry) + if (!old_entry) { + if (IS_ERR(old_page)) + err = PTR_ERR(old_page); goto out; + } if (S_ISDIR(old_inode->i_mode)) { - err = -EIO; old_dir_entry = f2fs_parent_dir(old_inode, &old_dir_page); - if (!old_dir_entry) + if (!old_dir_entry) { + if (IS_ERR(old_dir_page)) + err = PTR_ERR(old_dir_page); goto out_old; + } } if (new_inode) { @@ -482,8 +626,13 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry, err = -ENOENT; new_entry = f2fs_find_entry(new_dir, &new_dentry->d_name, 
&new_page); - if (!new_entry) + if (!new_entry) { + if (IS_ERR(new_page)) + err = PTR_ERR(new_page); goto out_dir; + } + + f2fs_balance_fs(sbi, true); f2fs_lock_op(sbi); @@ -491,7 +640,9 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry, if (err) goto put_out_dir; - if (update_dent_inode(old_inode, &new_dentry->d_name)) { + err = update_dent_inode(old_inode, new_inode, + &new_dentry->d_name); + if (err) { release_orphan_inode(sbi); goto put_out_dir; } @@ -501,20 +652,17 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry, new_inode->i_ctime = CURRENT_TIME; down_write(&F2FS_I(new_inode)->i_sem); if (old_dir_entry) - drop_nlink(new_inode); - drop_nlink(new_inode); + f2fs_i_links_write(new_inode, false); + f2fs_i_links_write(new_inode, false); up_write(&F2FS_I(new_inode)->i_sem); - mark_inode_dirty(new_inode); - if (!new_inode->i_nlink) - add_orphan_inode(sbi, new_inode->i_ino); + add_orphan_inode(new_inode); else release_orphan_inode(sbi); - - update_inode_page(old_inode); - update_inode_page(new_inode); } else { + f2fs_balance_fs(sbi, true); + f2fs_lock_op(sbi); err = f2fs_add_link(new_dentry, old_inode); @@ -523,18 +671,40 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry, goto out_dir; } - if (old_dir_entry) { - inc_nlink(new_dir); - update_inode_page(new_dir); + if (old_dir_entry) + f2fs_i_links_write(new_dir, true); + + /* + * old entry and new entry can locate in the same inline + * dentry in inode, when attaching new entry in inline dentry, + * it could force inline dentry conversion, after that, + * old_entry and old_page will point to wrong address, in + * order to avoid this, let's do the check and update here. + */ + if (is_old_inline && !f2fs_has_inline_dentry(old_dir)) { + f2fs_put_page(old_page, 0); + old_page = NULL; + + old_entry = f2fs_find_entry(old_dir, + &old_dentry->d_name, &old_page); + if (!old_entry) { + err = -ENOENT; + if (IS_ERR(old_page)) + err = PTR_ERR(old_page); + f2fs_unlock_op(sbi); + goto out_dir; + } } } down_write(&F2FS_I(old_inode)->i_sem); file_lost_pino(old_inode); + if (new_inode && file_enc_name(new_inode)) + file_set_enc_name(old_inode); up_write(&F2FS_I(old_inode)->i_sem); old_inode->i_ctime = CURRENT_TIME; - mark_inode_dirty(old_inode); + f2fs_mark_inode_dirty_sync(old_inode); f2fs_delete_entry(old_entry, old_page, old_dir, NULL); @@ -542,14 +712,11 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry, if (old_dir != new_dir) { f2fs_set_link(old_inode, old_dir_entry, old_dir_page, new_dir); - update_inode_page(old_inode); } else { f2fs_dentry_kunmap(old_inode, old_dir_page); f2fs_put_page(old_dir_page, 0); } - drop_nlink(old_dir); - mark_inode_dirty(old_dir); - update_inode_page(old_dir); + f2fs_i_links_write(old_dir, false); } f2fs_unlock_op(sbi); @@ -574,6 +741,97 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry, return err; } +static void *f2fs_encrypted_follow_link(struct dentry *dentry, + struct nameidata *nd) +{ + struct page *cpage = NULL; + char *caddr, *paddr = NULL; + struct fscrypt_str cstr = FSTR_INIT(NULL, 0); + struct fscrypt_str pstr = FSTR_INIT(NULL, 0); + struct fscrypt_symlink_data *sd; + struct inode *inode = d_inode(dentry); + loff_t size = min_t(loff_t, i_size_read(inode), PAGE_SIZE - 1); + u32 max_size = inode->i_sb->s_blocksize; + int res; + + res = fscrypt_get_encryption_info(inode); + if (res) + return ERR_PTR(res); + + cpage = read_mapping_page(inode->i_mapping, 0, NULL); + if (IS_ERR(cpage)) + return 
cpage; + caddr = kmap(cpage); + caddr[size] = 0; + + /* Symlink is encrypted */ + sd = (struct fscrypt_symlink_data *)caddr; + cstr.name = sd->encrypted_path; + cstr.len = le16_to_cpu(sd->len); + + /* this is broken symlink case */ + if (unlikely(cstr.len == 0)) { + res = -ENOENT; + goto errout; + } + + if ((cstr.len + sizeof(struct fscrypt_symlink_data) - 1) > max_size) { + /* Symlink data on the disk is corrupted */ + res = -EIO; + goto errout; + } + res = fscrypt_fname_alloc_buffer(inode, cstr.len, &pstr); + if (res) + goto errout; + + res = fscrypt_fname_disk_to_usr(inode, 0, 0, &cstr, &pstr); + if (res < 0) + goto errout; + + /* this is broken symlink case */ + if (unlikely(pstr.name[0] == 0)) { + res = -ENOENT; + goto errout; + } + + paddr = pstr.name; + + /* Null-terminate the name */ + paddr[res] = '\0'; + nd_set_link(nd, paddr); + + kunmap(cpage); + page_cache_release(cpage); + return NULL; +errout: + fscrypt_fname_free_buffer(&pstr); + kunmap(cpage); + page_cache_release(cpage); + return ERR_PTR(res); +} + +void kfree_put_link(struct dentry *dentry, struct nameidata *nd, + void *cookie) +{ + char *s = nd_get_link(nd); + if (!IS_ERR(s)) + kfree(s); +} + +const struct inode_operations f2fs_encrypted_symlink_inode_operations = { + .readlink = generic_readlink, + .follow_link = f2fs_encrypted_follow_link, + .put_link = kfree_put_link, + .getattr = f2fs_getattr, + .setattr = f2fs_setattr, +#ifdef CONFIG_F2FS_FS_XATTR + .setxattr = generic_setxattr, + .getxattr = generic_getxattr, + .listxattr = f2fs_listxattr, + .removexattr = generic_removexattr, +#endif +}; + const struct inode_operations f2fs_dir_inode_operations = { .create = f2fs_create, .lookup = f2fs_lookup, diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c old mode 100644 new mode 100755 index 32f4934c6a2a..c6ffdf8cb23f --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -24,6 +24,16 @@ #define on_build_free_nids(nmi) mutex_is_locked(&nm_i->build_lock) +#ifndef PTR_ERR_OR_ZERO +static inline int __must_check PTR_ERR_OR_ZERO(__force const void *ptr) +{ + if (IS_ERR(ptr)) + return PTR_ERR(ptr); + else + return 0; +} +#endif + static struct kmem_cache *nat_entry_slab; static struct kmem_cache *free_nid_slab; static struct kmem_cache *nat_entry_set_slab; @@ -46,12 +56,16 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type) */ if (type == FREE_NIDS) { mem_size = (nm_i->fcnt * sizeof(struct free_nid)) >> - PAGE_CACHE_SHIFT; + PAGE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2); } else if (type == NAT_ENTRIES) { mem_size = (nm_i->nat_cnt * sizeof(struct nat_entry)) >> - PAGE_CACHE_SHIFT; + PAGE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2); + if (excess_cached_nats(sbi)) + res = false; + if (nm_i->nat_cnt > DEF_NAT_CACHE_THRESHOLD) + res = false; } else if (type == DIRTY_DENTS) { if (sbi->sb->s_bdi->dirty_exceeded) return false; @@ -62,16 +76,17 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type) for (i = 0; i <= UPDATE_INO; i++) mem_size += (sbi->im[i].ino_num * - sizeof(struct ino_entry)) >> PAGE_CACHE_SHIFT; + sizeof(struct ino_entry)) >> PAGE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); } else if (type == EXTENT_CACHE) { - mem_size = (sbi->total_ext_tree * sizeof(struct extent_tree) + + mem_size = (atomic_read(&sbi->total_ext_tree) * + sizeof(struct extent_tree) + atomic_read(&sbi->total_ext_node) * - sizeof(struct extent_node)) >> PAGE_CACHE_SHIFT; + sizeof(struct extent_node)) >> PAGE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) 
>> 1); } else { - if (sbi->sb->s_bdi->dirty_exceeded) - return false; + if (!sbi->sb->s_bdi->dirty_exceeded) + return true; } return res; } @@ -120,7 +135,7 @@ static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid) src_addr = page_address(src_page); dst_addr = page_address(dst_page); - memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE); + memcpy(dst_addr, src_addr, PAGE_SIZE); set_page_dirty(dst_page); f2fs_put_page(src_page, 1); @@ -159,7 +174,7 @@ static void __set_nat_cache_dirty(struct f2fs_nm_info *nm_i, head = radix_tree_lookup(&nm_i->nat_set_root, set); if (!head) { - head = f2fs_kmem_cache_alloc(nat_entry_set_slab, GFP_ATOMIC); + head = f2fs_kmem_cache_alloc(nat_entry_set_slab, GFP_NOFS); INIT_LIST_HEAD(&head->entry_list); INIT_LIST_HEAD(&head->set_list); @@ -195,32 +210,35 @@ static unsigned int __gang_lookup_nat_set(struct f2fs_nm_info *nm_i, start, nr); } -bool is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid) +int need_dentry_mark(struct f2fs_sb_info *sbi, nid_t nid) { struct f2fs_nm_info *nm_i = NM_I(sbi); struct nat_entry *e; - bool is_cp = true; + bool need = false; down_read(&nm_i->nat_tree_lock); e = __lookup_nat_cache(nm_i, nid); - if (e && !get_nat_flag(e, IS_CHECKPOINTED)) - is_cp = false; + if (e) { + if (!get_nat_flag(e, IS_CHECKPOINTED) && + !get_nat_flag(e, HAS_FSYNCED_INODE)) + need = true; + } up_read(&nm_i->nat_tree_lock); - return is_cp; + return need; } -bool has_fsynced_inode(struct f2fs_sb_info *sbi, nid_t ino) +bool is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid) { struct f2fs_nm_info *nm_i = NM_I(sbi); struct nat_entry *e; - bool fsynced = false; + bool is_cp = true; down_read(&nm_i->nat_tree_lock); - e = __lookup_nat_cache(nm_i, ino); - if (e && get_nat_flag(e, HAS_FSYNCED_INODE)) - fsynced = true; + e = __lookup_nat_cache(nm_i, nid); + if (e && !get_nat_flag(e, IS_CHECKPOINTED)) + is_cp = false; up_read(&nm_i->nat_tree_lock); - return fsynced; + return is_cp; } bool need_inode_block_update(struct f2fs_sb_info *sbi, nid_t ino) @@ -243,7 +261,7 @@ static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid) { struct nat_entry *new; - new = f2fs_kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC); + new = f2fs_kmem_cache_alloc(nat_entry_slab, GFP_NOFS); f2fs_radix_tree_insert(&nm_i->nat_root, nid, new); memset(new, 0, sizeof(struct nat_entry)); nat_set_nid(new, nid); @@ -253,18 +271,21 @@ static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid) return new; } -static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid, +static void cache_nat_entry(struct f2fs_sb_info *sbi, nid_t nid, struct f2fs_nat_entry *ne) { + struct f2fs_nm_info *nm_i = NM_I(sbi); struct nat_entry *e; - down_write(&nm_i->nat_tree_lock); e = __lookup_nat_cache(nm_i, nid); if (!e) { e = grab_nat_entry(nm_i, nid); node_info_from_raw_nat(&e->ni, ne); + } else { + f2fs_bug_on(sbi, nat_get_ino(e) != ne->ino || + nat_get_blkaddr(e) != ne->block_addr || + nat_get_version(e) != ne->version); } - up_write(&nm_i->nat_tree_lock); } static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, @@ -303,6 +324,10 @@ static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) { unsigned char version = nat_get_version(e); nat_set_version(e, inc_node_version(version)); + + /* in order to reuse the nid */ + if (nm_i->next_scan_nid > ni->nid) + nm_i->next_scan_nid = ni->nid; } /* change address */ @@ -312,7 +337,8 @@ static void set_node_addr(struct 
f2fs_sb_info *sbi, struct node_info *ni, __set_nat_cache_dirty(nm_i, e); /* update fsync_mark if its inode nat entry is still alive */ - e = __lookup_nat_cache(nm_i, ni->ino); + if (ni->nid != ni->ino) + e = __lookup_nat_cache(nm_i, ni->ino); if (e) { if (fsync_done && ni->nid == ni->ino) set_nat_flag(e, HAS_FSYNCED_INODE, true); @@ -324,11 +350,11 @@ static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink) { struct f2fs_nm_info *nm_i = NM_I(sbi); + int nr = nr_shrink; - if (available_free_memory(sbi, NAT_ENTRIES)) + if (!down_write_trylock(&nm_i->nat_tree_lock)) return 0; - down_write(&nm_i->nat_tree_lock); while (nr_shrink && !list_empty(&nm_i->nat_entries)) { struct nat_entry *ne; ne = list_first_entry(&nm_i->nat_entries, @@ -337,7 +363,7 @@ int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink) nr_shrink--; } up_write(&nm_i->nat_tree_lock); - return nr_shrink; + return nr - nr_shrink; } /* @@ -347,7 +373,7 @@ void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni) { struct f2fs_nm_info *nm_i = NM_I(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + struct f2fs_journal *journal = curseg->journal; nid_t start_nid = START_NID(nid); struct f2fs_nat_block *nat_blk; struct page *page = NULL; @@ -364,21 +390,20 @@ void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni) ni->ino = nat_get_ino(e); ni->blk_addr = nat_get_blkaddr(e); ni->version = nat_get_version(e); - } - up_read(&nm_i->nat_tree_lock); - if (e) + up_read(&nm_i->nat_tree_lock); return; + } memset(&ne, 0, sizeof(struct f2fs_nat_entry)); /* Check current segment summary */ - mutex_lock(&curseg->curseg_mutex); - i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0); + down_read(&curseg->journal_rwsem); + i = lookup_journal_in_cursum(journal, NAT_JOURNAL, nid, 0); if (i >= 0) { - ne = nat_in_journal(sum, i); + ne = nat_in_journal(journal, i); node_info_from_raw_nat(ni, &ne); } - mutex_unlock(&curseg->curseg_mutex); + up_read(&curseg->journal_rwsem); if (i >= 0) goto cache; @@ -389,18 +414,75 @@ void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni) node_info_from_raw_nat(ni, &ne); f2fs_put_page(page, 1); cache: + up_read(&nm_i->nat_tree_lock); /* cache nat entry */ - cache_nat_entry(NM_I(sbi), nid, &ne); + down_write(&nm_i->nat_tree_lock); + cache_nat_entry(sbi, nid, &ne); + up_write(&nm_i->nat_tree_lock); +} + +/* + * readahead MAX_RA_NODE number of node pages. 
+ */ +static void ra_node_pages(struct page *parent, int start, int n) +{ + struct f2fs_sb_info *sbi = F2FS_P_SB(parent); + struct blk_plug plug; + int i, end; + nid_t nid; + + blk_start_plug(&plug); + + /* Then, try readahead for siblings of the desired node */ + end = start + n; + end = min(end, NIDS_PER_BLOCK); + for (i = start; i < end; i++) { + nid = get_nid(parent, i, false); + ra_node_page(sbi, nid); + } + + blk_finish_plug(&plug); +} + +pgoff_t get_next_page_offset(struct dnode_of_data *dn, pgoff_t pgofs) +{ + const long direct_index = ADDRS_PER_INODE(dn->inode); + const long direct_blks = ADDRS_PER_BLOCK; + const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK; + unsigned int skipped_unit = ADDRS_PER_BLOCK; + int cur_level = dn->cur_level; + int max_level = dn->max_level; + pgoff_t base = 0; + + if (!dn->max_level) + return pgofs + 1; + + while (max_level-- > cur_level) + skipped_unit *= NIDS_PER_BLOCK; + + switch (dn->max_level) { + case 3: + base += 2 * indirect_blks; + case 2: + base += 2 * direct_blks; + case 1: + base += direct_index; + break; + default: + f2fs_bug_on(F2FS_I_SB(dn->inode), 1); + } + + return ((pgofs - base) / skipped_unit + 1) * skipped_unit + base; } /* * The maximum depth is four. * Offset[0] will have raw inode offset. */ -static int get_node_path(struct f2fs_inode_info *fi, long block, +static int get_node_path(struct inode *inode, long block, int offset[4], unsigned int noffset[4]) { - const long direct_index = ADDRS_PER_INODE(fi); + const long direct_index = ADDRS_PER_INODE(inode); const long direct_blks = ADDRS_PER_BLOCK; const long dptrs_per_blk = NIDS_PER_BLOCK; const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK; @@ -485,10 +567,10 @@ int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode) int offset[4]; unsigned int noffset[4]; nid_t nids[4]; - int level, i; + int level, i = 0; int err = 0; - level = get_node_path(F2FS_I(dn->inode), index, offset, noffset); + level = get_node_path(dn->inode, index, offset, noffset); nids[0] = dn->inode->i_ino; npage[0] = dn->inode_page; @@ -575,6 +657,11 @@ int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode) release_out: dn->inode_page = NULL; dn->node_page = NULL; + if (err == -ENOENT) { + dn->cur_level = i; + dn->max_level = level; + dn->ofs_in_node = offset[level]; + } return err; } @@ -598,8 +685,7 @@ static void truncate_node(struct dnode_of_data *dn) if (dn->nid == dn->inode->i_ino) { remove_orphan_inode(sbi, dn->nid); dec_valid_inode_count(sbi); - } else { - sync_inode_page(dn); + f2fs_inode_synced(dn->inode); } invalidate: clear_node_page_dirty(dn->node_page); @@ -658,6 +744,8 @@ static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, return PTR_ERR(page); } + ra_node_pages(page, ofs, NIDS_PER_BLOCK); + rn = F2FS_NODE(page); if (depth < 3) { for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) { @@ -668,7 +756,8 @@ static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, ret = truncate_dnode(&rdn); if (ret < 0) goto out_err; - set_nid(page, i, 0, false); + if (set_nid(page, i, 0, false)) + dn->node_changed = true; } } else { child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1; @@ -681,7 +770,8 @@ static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, rdn.nid = child_nid; ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1); if (ret == (NIDS_PER_BLOCK + 1)) { - set_nid(page, i, 0, false); + if (set_nid(page, i, 0, false)) + dn->node_changed = true; child_nofs += ret; } else if (ret < 0 && ret != -ENOENT) { goto 
out_err; @@ -733,6 +823,8 @@ static int truncate_partial_nodes(struct dnode_of_data *dn, nid[i + 1] = get_nid(pages[i], offset[i + 1], false); } + ra_node_pages(pages[idx], offset[idx + 1], NIDS_PER_BLOCK); + /* free direct nodes linked to a partial indirect node */ for (i = offset[idx + 1]; i < NIDS_PER_BLOCK; i++) { child_nid = get_nid(pages[idx], i, false); @@ -742,7 +834,8 @@ static int truncate_partial_nodes(struct dnode_of_data *dn, err = truncate_dnode(dn); if (err < 0) goto fail; - set_nid(pages[idx], i, 0, false); + if (set_nid(pages[idx], i, 0, false)) + dn->node_changed = true; } if (offset[idx + 1] == 0) { @@ -779,8 +872,8 @@ int truncate_inode_blocks(struct inode *inode, pgoff_t from) trace_f2fs_truncate_inode_blocks_enter(inode, from); - level = get_node_path(F2FS_I(inode), from, offset, noffset); -restart: + level = get_node_path(inode, from, offset, noffset); + page = get_node_page(sbi, inode->i_ino); if (IS_ERR(page)) { trace_f2fs_truncate_inode_blocks_exit(inode, PTR_ERR(page)); @@ -844,11 +937,8 @@ int truncate_inode_blocks(struct inode *inode, pgoff_t from) if (offset[1] == 0 && ri->i_nid[offset[0] - NODE_DIR1_BLOCK]) { lock_page(page); - if (unlikely(page->mapping != NODE_MAPPING(sbi))) { - f2fs_put_page(page, 1); - goto restart; - } - f2fs_wait_on_page_writeback(page, NODE); + BUG_ON(page->mapping != NODE_MAPPING(sbi)); + f2fs_wait_on_page_writeback(page, NODE, true); ri->i_nid[offset[0] - NODE_DIR1_BLOCK] = 0; set_page_dirty(page); unlock_page(page); @@ -877,7 +967,7 @@ int truncate_xattr_node(struct inode *inode, struct page *page) if (IS_ERR(npage)) return PTR_ERR(npage); - F2FS_I(inode)->i_xattr_nid = 0; + f2fs_i_xnid_write(inode, 0); /* need to do checkpoint during fsync */ F2FS_I(inode)->xattr_ver = cur_cp_version(F2FS_CKPT(sbi)); @@ -894,17 +984,20 @@ int truncate_xattr_node(struct inode *inode, struct page *page) * Caller should grab and release a rwsem by calling f2fs_lock_op() and * f2fs_unlock_op(). 
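The hunk below also changes remove_inode_page() from void to int, so failures from get_dnode_of_data() and truncate_xattr_node() finally reach the caller instead of being silently dropped. A minimal caller sketch under the locking rule just stated (the error branch is a placeholder, not part of this patch):

	f2fs_lock_op(sbi);
	err = remove_inode_page(inode);
	f2fs_unlock_op(sbi);
	if (err)
		handle_remove_failure(inode, err);	/* hypothetical helper */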
*/ -void remove_inode_page(struct inode *inode) +int remove_inode_page(struct inode *inode) { struct dnode_of_data dn; + int err; set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino); - if (get_dnode_of_data(&dn, 0, LOOKUP_NODE)) - return; + err = get_dnode_of_data(&dn, 0, LOOKUP_NODE); + if (err) + return err; - if (truncate_xattr_node(inode, dn.inode_page)) { + err = truncate_xattr_node(inode, dn.inode_page); + if (err) { f2fs_put_dnode(&dn); - return; + return err; } /* remove potential inline_data blocks */ @@ -918,6 +1011,7 @@ void remove_inode_page(struct inode *inode) /* will put inode & node pages */ truncate_node(&dn); + return 0; } struct page *new_inode_page(struct inode *inode) @@ -939,10 +1033,10 @@ struct page *new_node_page(struct dnode_of_data *dn, struct page *page; int err; - if (unlikely(is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC))) + if (unlikely(is_inode_flag_set(dn->inode, FI_NO_ALLOC))) return ERR_PTR(-EPERM); - page = grab_cache_page(NODE_MAPPING(sbi), dn->nid); + page = f2fs_grab_cache_page(NODE_MAPPING(sbi), dn->nid, false); if (!page) return ERR_PTR(-ENOMEM); @@ -959,23 +1053,19 @@ struct page *new_node_page(struct dnode_of_data *dn, new_ni.ino = dn->inode->i_ino; set_node_addr(sbi, &new_ni, NEW_ADDR, false); - f2fs_wait_on_page_writeback(page, NODE); + f2fs_wait_on_page_writeback(page, NODE, true); fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true); set_cold_node(dn->inode, page); - SetPageUptodate(page); - set_page_dirty(page); + if (!PageUptodate(page)) + SetPageUptodate(page); + if (set_page_dirty(page)) + dn->node_changed = true; if (f2fs_has_xattr_block(ofs)) - F2FS_I(dn->inode)->i_xattr_nid = dn->nid; + f2fs_i_xnid_write(dn->inode, dn->nid); - dn->node_page = page; - if (ipage) - update_inode(dn->inode, ipage); - else - sync_inode_page(dn); if (ofs == 0) inc_valid_inode_count(sbi); - return page; fail: @@ -987,31 +1077,32 @@ struct page *new_node_page(struct dnode_of_data *dn, /* * Caller should do after getting the following values. 
* 0: f2fs_put_page(page, 0) - * LOCKED_PAGE: f2fs_put_page(page, 1) - * error: nothing + * LOCKED_PAGE or error: f2fs_put_page(page, 1) */ static int read_node_page(struct page *page, int rw) { struct f2fs_sb_info *sbi = F2FS_P_SB(page); struct node_info ni; struct f2fs_io_info fio = { + .sbi = sbi, .type = NODE, .rw = rw, + .page = page, + .encrypted_page = NULL, }; + if (PageUptodate(page)) + return LOCKED_PAGE; + get_node_info(sbi, page->index, &ni); if (unlikely(ni.blk_addr == NULL_ADDR)) { ClearPageUptodate(page); - f2fs_put_page(page, 1); return -ENOENT; } - if (PageUptodate(page)) - return LOCKED_PAGE; - - fio.blk_addr = ni.blk_addr; - return f2fs_submit_page_bio(sbi, page, &fio); + fio.new_blkaddr = fio.old_blkaddr = ni.blk_addr; + return f2fs_submit_page_bio(&fio); } /* @@ -1022,135 +1113,333 @@ void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid) struct page *apage; int err; - apage = find_get_page(NODE_MAPPING(sbi), nid); - if (apage && PageUptodate(apage)) { - f2fs_put_page(apage, 0); + if (!nid) return; - } - f2fs_put_page(apage, 0); + f2fs_bug_on(sbi, check_nid_range(sbi, nid)); - apage = grab_cache_page(NODE_MAPPING(sbi), nid); + rcu_read_lock(); + apage = radix_tree_lookup(&NODE_MAPPING(sbi)->page_tree, nid); + rcu_read_unlock(); + if (apage) + return; + + apage = f2fs_grab_cache_page(NODE_MAPPING(sbi), nid, false); if (!apage) return; err = read_node_page(apage, READA); - if (err == 0) - f2fs_put_page(apage, 0); - else if (err == LOCKED_PAGE) - f2fs_put_page(apage, 1); + f2fs_put_page(apage, err ? 1 : 0); } -struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid) +static struct page *__get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid, + struct page *parent, int start) { struct page *page; int err; + + if (!nid) + return ERR_PTR(-ENOENT); + f2fs_bug_on(sbi, check_nid_range(sbi, nid)); repeat: - page = grab_cache_page(NODE_MAPPING(sbi), nid); + page = f2fs_grab_cache_page(NODE_MAPPING(sbi), nid, false); if (!page) return ERR_PTR(-ENOMEM); err = read_node_page(page, READ_SYNC); - if (err < 0) - return ERR_PTR(err); - else if (err != LOCKED_PAGE) - lock_page(page); - - if (unlikely(!PageUptodate(page) || nid != nid_of_node(page))) { - ClearPageUptodate(page); + if (err < 0) { f2fs_put_page(page, 1); - return ERR_PTR(-EIO); + return ERR_PTR(err); + } else if (err == LOCKED_PAGE) { + goto page_hit; } + + if (parent) + ra_node_pages(parent, start + 1, MAX_RA_NODE); + + lock_page(page); + if (unlikely(page->mapping != NODE_MAPPING(sbi))) { f2fs_put_page(page, 1); goto repeat; } + + if (unlikely(!PageUptodate(page))) + goto out_err; +page_hit: mark_page_accessed(page); + + if(unlikely(nid != nid_of_node(page))) { + f2fs_bug_on(sbi, 1); + ClearPageUptodate(page); +out_err: + f2fs_put_page(page, 1); + return ERR_PTR(-EIO); + } return page; } -/* - * Return a locked page for the desired node page. - * And, readahead MAX_RA_NODE number of node pages. - */ +struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid) +{ + return __get_node_page(sbi, nid, NULL, 0); +} + struct page *get_node_page_ra(struct page *parent, int start) { struct f2fs_sb_info *sbi = F2FS_P_SB(parent); - struct blk_plug plug; + nid_t nid = get_nid(parent, start, false); + + return __get_node_page(sbi, nid, parent, start); +} + +static void flush_inline_data(struct f2fs_sb_info *sbi, nid_t ino) +{ + struct inode *inode; struct page *page; - int err, i, end; - nid_t nid; + int ret; - /* First, try getting the desired direct node. 
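With this rework both get_node_page() and get_node_page_ra() funnel into the new __get_node_page() above, and sibling readahead is clamped to the nid slots of a single node block. A self-contained sketch of the window arithmetic (MAX_RA_NODE is from the node.h hunk below; NIDS_PER_BLOCK = 1018 is quoted from f2fs_fs.h as an assumption):

	#include <stdio.h>

	#define MAX_RA_NODE	128	/* fs/f2fs/node.h */
	#define NIDS_PER_BLOCK	1018	/* nids per node block; assumption */

	/* mirrors the clamp in ra_node_pages(): [start, min(start + n, NIDS_PER_BLOCK)) */
	static int ra_window(int start, int n)
	{
		int end = start + n;

		if (end > NIDS_PER_BLOCK)
			end = NIDS_PER_BLOCK;
		return end - start;
	}

	int main(void)
	{
		printf("%d\n", ra_window(0, MAX_RA_NODE));	/* 128 siblings */
		printf("%d\n", ra_window(1000, MAX_RA_NODE));	/* only 18 remain in the block */
		return 0;
	}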
*/ - nid = get_nid(parent, start, false); - if (!nid) - return ERR_PTR(-ENOENT); -repeat: - page = grab_cache_page(NODE_MAPPING(sbi), nid); + /* should flush inline_data before evict_inode */ + inode = ilookup(sbi->sb, ino); + if (!inode) + return; + + page = find_get_page(inode->i_mapping, 0); if (!page) - return ERR_PTR(-ENOMEM); + goto iput_out; - err = read_node_page(page, READ_SYNC); - if (err < 0) - return ERR_PTR(err); - else if (err == LOCKED_PAGE) - goto page_hit; + if (!trylock_page(page)) + goto release_out; - blk_start_plug(&plug); + if (!PageUptodate(page)) + goto page_out; - /* Then, try readahead for siblings of the desired node */ - end = start + MAX_RA_NODE; - end = min(end, NIDS_PER_BLOCK); - for (i = start + 1; i < end; i++) { - nid = get_nid(parent, i, false); - if (!nid) - continue; - ra_node_page(sbi, nid); - } + if (!PageDirty(page)) + goto page_out; - blk_finish_plug(&plug); + if (!clear_page_dirty_for_io(page)) + goto page_out; - lock_page(page); - if (unlikely(page->mapping != NODE_MAPPING(sbi))) { - f2fs_put_page(page, 1); - goto repeat; - } -page_hit: - if (unlikely(!PageUptodate(page))) { - f2fs_put_page(page, 1); - return ERR_PTR(-EIO); - } - mark_page_accessed(page); - return page; + ret = f2fs_write_inline_data(inode, page); + inode_dec_dirty_pages(inode); + if (ret) + set_page_dirty(page); +page_out: + unlock_page(page); +release_out: + f2fs_put_page(page, 0); +iput_out: + iput(inode); } -void sync_inode_page(struct dnode_of_data *dn) +void move_node_page(struct page *node_page, int gc_type) { - if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) { - update_inode(dn->inode, dn->node_page); - } else if (dn->inode_page) { - if (!dn->inode_page_locked) - lock_page(dn->inode_page); - update_inode(dn->inode, dn->inode_page); - if (!dn->inode_page_locked) - unlock_page(dn->inode_page); + if (gc_type == FG_GC) { + struct f2fs_sb_info *sbi = F2FS_P_SB(node_page); + struct writeback_control wbc = { + .sync_mode = WB_SYNC_ALL, + .nr_to_write = 1, + .for_reclaim = 0, + }; + + set_page_dirty(node_page); + f2fs_wait_on_page_writeback(node_page, NODE, true); + + f2fs_bug_on(sbi, PageWriteback(node_page)); + if (!clear_page_dirty_for_io(node_page)) + goto out_page; + + if (NODE_MAPPING(sbi)->a_ops->writepage(node_page, &wbc)) + unlock_page(node_page); + goto release_page; } else { - update_inode_page(dn->inode); + /* set page dirty and write it */ + if (!PageWriteback(node_page)) + set_page_dirty(node_page); + } +out_page: + unlock_page(node_page); +release_page: + f2fs_put_page(node_page, 0); +} + +static struct page *last_fsync_dnode(struct f2fs_sb_info *sbi, nid_t ino) +{ + pgoff_t index, end; + struct pagevec pvec; + struct page *last_page = NULL; + + pagevec_init(&pvec, 0); + index = 0; + end = ULONG_MAX; + + while (index <= end) { + int i, nr_pages; + nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index, + PAGECACHE_TAG_DIRTY, + min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); + if (nr_pages == 0) + break; + + for (i = 0; i < nr_pages; i++) { + struct page *page = pvec.pages[i]; + + if (unlikely(f2fs_cp_error(sbi))) { + f2fs_put_page(last_page, 0); + pagevec_release(&pvec); + return ERR_PTR(-EIO); + } + + if (!IS_DNODE(page) || !is_cold_node(page)) + continue; + if (ino_of_node(page) != ino) + continue; + + lock_page(page); + + if (unlikely(page->mapping != NODE_MAPPING(sbi))) { +continue_unlock: + unlock_page(page); + continue; + } + if (ino_of_node(page) != ino) + goto continue_unlock; + + if (!PageDirty(page)) { + /* someone wrote it for us 
*/ + goto continue_unlock; + } + + if (last_page) + f2fs_put_page(last_page, 0); + + get_page(page); + last_page = page; + unlock_page(page); + } + pagevec_release(&pvec); + cond_resched(); } + return last_page; } -int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, - struct writeback_control *wbc) +int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode, + struct writeback_control *wbc, bool atomic) { pgoff_t index, end; struct pagevec pvec; - int step = ino ? 2 : 0; - int nwritten = 0, wrote = 0; + int ret = 0; + struct page *last_page = NULL; + bool marked = false; + nid_t ino = inode->i_ino; + + if (atomic) { + last_page = last_fsync_dnode(sbi, ino); + if (IS_ERR_OR_NULL(last_page)) + return PTR_ERR_OR_ZERO(last_page); + } +retry: + pagevec_init(&pvec, 0); + index = 0; + end = ULONG_MAX; + + while (index <= end) { + int i, nr_pages; + nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index, + PAGECACHE_TAG_DIRTY, + min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); + if (nr_pages == 0) + break; + + for (i = 0; i < nr_pages; i++) { + struct page *page = pvec.pages[i]; + + if (unlikely(f2fs_cp_error(sbi))) { + f2fs_put_page(last_page, 0); + pagevec_release(&pvec); + return -EIO; + } + + if (!IS_DNODE(page) || !is_cold_node(page)) + continue; + if (ino_of_node(page) != ino) + continue; + + lock_page(page); + + if (unlikely(page->mapping != NODE_MAPPING(sbi))) { +continue_unlock: + unlock_page(page); + continue; + } + if (ino_of_node(page) != ino) + goto continue_unlock; + + if (!PageDirty(page) && page != last_page) { + /* someone wrote it for us */ + goto continue_unlock; + } + + f2fs_wait_on_page_writeback(page, NODE, true); + BUG_ON(PageWriteback(page)); + + if (!atomic || page == last_page) { + set_fsync_mark(page, 1); + if (IS_INODE(page)) { + if (is_inode_flag_set(inode, + FI_DIRTY_INODE)) + update_inode(inode, page); + set_dentry_mark(page, + need_dentry_mark(sbi, ino)); + } + /* may be written by other thread */ + if (!PageDirty(page)) + set_page_dirty(page); + } + + if (!clear_page_dirty_for_io(page)) + goto continue_unlock; + + ret = NODE_MAPPING(sbi)->a_ops->writepage(page, wbc); + if (ret) { + unlock_page(page); + f2fs_put_page(last_page, 0); + break; + } + if (page == last_page) { + f2fs_put_page(page, 0); + marked = true; + break; + } + } + pagevec_release(&pvec); + cond_resched(); + + if (ret || marked) + break; + } + if (!ret && atomic && !marked) { + f2fs_msg(sbi->sb, KERN_DEBUG, + "Retry to write fsync mark: ino=%u, idx=%lx", + ino, last_page->index); + lock_page(last_page); + set_page_dirty(last_page); + unlock_page(last_page); + goto retry; + } + return ret ? -EIO: 0; +} + +int sync_node_pages(struct f2fs_sb_info *sbi, struct writeback_control *wbc) +{ + pgoff_t index, end; + struct pagevec pvec; + int step = 0; + int nwritten = 0; pagevec_init(&pvec, 0); next_step: index = 0; - end = LONG_MAX; + end = ULONG_MAX; while (index <= end) { int i, nr_pages; @@ -1163,6 +1452,11 @@ int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; + if (unlikely(f2fs_cp_error(sbi))) { + pagevec_release(&pvec); + return -EIO; + } + /* * flushing sequence with step: * 0. indirect nodes @@ -1177,14 +1471,8 @@ int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, if (step == 2 && (!IS_DNODE(page) || !is_cold_node(page))) continue; - - /* - * If an fsync mode, - * we should not skip writing node pages. 
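With the fsync-specific marking moved into fsync_node_pages() above, the per-ino branches are removed and sync_node_pages() is left with the plain three-pass walk named in its comment. A standalone sketch of the pass selection; the step-0 and step-1 guards are inferred from the step-2 guard visible in this hunk:

	#include <stdbool.h>
	#include <stdio.h>

	/* 0: indirect/inode node pages, 1: dentry dnodes (hot), 2: file dnodes (cold) */
	static int flush_step(bool is_dnode, bool is_cold)
	{
		if (!is_dnode)
			return 0;
		if (!is_cold)
			return 1;
		return 2;
	}

	int main(void)
	{
		printf("indirect:%d dentry dnode:%d file dnode:%d\n",
		       flush_step(false, false), flush_step(true, false),
		       flush_step(true, true));
		return 0;
	}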
- */ - if (ino && ino_of_node(page) == ino) - lock_page(page); - else if (!trylock_page(page)) +lock_node: + if (!trylock_page(page)) continue; if (unlikely(page->mapping != NODE_MAPPING(sbi))) { @@ -1192,37 +1480,31 @@ int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, unlock_page(page); continue; } - if (ino && ino_of_node(page) != ino) - goto continue_unlock; if (!PageDirty(page)) { /* someone wrote it for us */ goto continue_unlock; } + /* flush inline_data */ + if (is_inline_node(page)) { + clear_inline_node(page); + unlock_page(page); + flush_inline_data(sbi, ino_of_node(page)); + goto lock_node; + } + + f2fs_wait_on_page_writeback(page, NODE, true); + + BUG_ON(PageWriteback(page)); if (!clear_page_dirty_for_io(page)) goto continue_unlock; - /* called by fsync() */ - if (ino && IS_DNODE(page)) { - set_fsync_mark(page, 1); - if (IS_INODE(page)) { - if (!is_checkpointed_node(sbi, ino) && - !has_fsynced_inode(sbi, ino)) - set_dentry_mark(page, 1); - else - set_dentry_mark(page, 0); - } - nwritten++; - } else { - set_fsync_mark(page, 0); - set_dentry_mark(page, 0); - } + set_fsync_mark(page, 0); + set_dentry_mark(page, 0); if (NODE_MAPPING(sbi)->a_ops->writepage(page, wbc)) unlock_page(page); - else - wrote++; if (--wbc->nr_to_write == 0) break; @@ -1240,15 +1522,12 @@ int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, step++; goto next_step; } - - if (wrote) - f2fs_submit_merged_bio(sbi, NODE, WRITE); return nwritten; } int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino) { - pgoff_t index = 0, end = LONG_MAX; + pgoff_t index = 0, end = ULONG_MAX; struct pagevec pvec; int ret2 = 0, ret = 0; @@ -1270,7 +1549,7 @@ int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino) continue; if (ino && ino_of_node(page) == ino) { - f2fs_wait_on_page_writeback(page, NODE); + f2fs_wait_on_page_writeback(page, NODE, true); if (TestClearPageError(page)) ret = -EIO; } @@ -1295,8 +1574,11 @@ static int f2fs_write_node_page(struct page *page, nid_t nid; struct node_info ni; struct f2fs_io_info fio = { + .sbi = sbi, .type = NODE, .rw = (wbc->sync_mode == WB_SYNC_ALL) ? 
WRITE_SYNC : WRITE, + .page = page, + .encrypted_page = NULL, }; trace_f2fs_writepage(page, NODE); @@ -1306,38 +1588,41 @@ static int f2fs_write_node_page(struct page *page, if (unlikely(f2fs_cp_error(sbi))) goto redirty_out; - f2fs_wait_on_page_writeback(page, NODE); - /* get old block addr of this node page */ nid = nid_of_node(page); f2fs_bug_on(sbi, page->index != nid); + if (wbc->for_reclaim) { + if (!down_read_trylock(&sbi->node_write)) + goto redirty_out; + } else { + down_read(&sbi->node_write); + } + get_node_info(sbi, nid, &ni); /* This page is already truncated */ if (unlikely(ni.blk_addr == NULL_ADDR)) { ClearPageUptodate(page); dec_page_count(sbi, F2FS_DIRTY_NODES); + up_read(&sbi->node_write); unlock_page(page); return 0; } - if (wbc->for_reclaim) { - if (!down_read_trylock(&sbi->node_write)) - goto redirty_out; - } else { - down_read(&sbi->node_write); - } - set_page_writeback(page); - fio.blk_addr = ni.blk_addr; - write_node_page(sbi, page, nid, &fio); - set_node_addr(sbi, &ni, fio.blk_addr, is_fsync_dnode(page)); + fio.old_blkaddr = ni.blk_addr; + write_node_page(nid, &fio); + set_node_addr(sbi, &ni, fio.new_blkaddr, is_fsync_dnode(page)); dec_page_count(sbi, F2FS_DIRTY_NODES); up_read(&sbi->node_write); - unlock_page(page); if (wbc->for_reclaim) + f2fs_submit_merged_bio_cond(sbi, NULL, page, 0, NODE, WRITE); + + unlock_page(page); + + if (unlikely(f2fs_cp_error(sbi))) f2fs_submit_merged_bio(sbi, NODE, WRITE); return 0; @@ -1351,10 +1636,9 @@ static int f2fs_write_node_pages(struct address_space *mapping, struct writeback_control *wbc) { struct f2fs_sb_info *sbi = F2FS_M_SB(mapping); + struct blk_plug plug; long diff; - trace_f2fs_writepages(mapping->host, wbc, NODE); - /* balancing f2fs's metadata in background */ f2fs_balance_fs_bg(sbi); @@ -1362,14 +1646,19 @@ static int f2fs_write_node_pages(struct address_space *mapping, if (get_pages(sbi, F2FS_DIRTY_NODES) < nr_pages_to_skip(sbi, NODE)) goto skip_write; + trace_f2fs_writepages(mapping->host, wbc, NODE); + diff = nr_pages_to_write(sbi, NODE, wbc); wbc->sync_mode = WB_SYNC_NONE; - sync_node_pages(sbi, 0, wbc); + blk_start_plug(&plug); + sync_node_pages(sbi, wbc); + blk_finish_plug(&plug); wbc->nr_to_write = max((long)0, wbc->nr_to_write - diff); return 0; skip_write: wbc->pages_skipped += get_pages(sbi, F2FS_DIRTY_NODES); + trace_f2fs_writepages(mapping->host, wbc, NODE); return 0; } @@ -1377,9 +1666,10 @@ static int f2fs_set_node_page_dirty(struct page *page) { trace_f2fs_set_page_dirty(page, NODE); - SetPageUptodate(page); + if (!PageUptodate(page)) + SetPageUptodate(page); if (!PageDirty(page)) { - __set_page_dirty_nobuffers(page); + f2fs_set_page_dirty_nobuffers(page); inc_page_count(F2FS_P_SB(page), F2FS_DIRTY_NODES); SetPagePrivate(page); f2fs_trace_pid(page); @@ -1417,7 +1707,6 @@ static int add_free_nid(struct f2fs_sb_info *sbi, nid_t nid, bool build) struct f2fs_nm_info *nm_i = NM_I(sbi); struct free_nid *i; struct nat_entry *ne; - bool allocated = false; if (!available_free_memory(sbi, FREE_NIDS)) return -1; @@ -1428,14 +1717,9 @@ static int add_free_nid(struct f2fs_sb_info *sbi, nid_t nid, bool build) if (build) { /* do not add allocated nids */ - down_read(&nm_i->nat_tree_lock); ne = __lookup_nat_cache(nm_i, nid); - if (ne && - (!get_nat_flag(ne, IS_CHECKPOINTED) || + if (ne && (!get_nat_flag(ne, IS_CHECKPOINTED) || nat_get_blkaddr(ne) != NULL_ADDR)) - allocated = true; - up_read(&nm_i->nat_tree_lock); - if (allocated) return 0; } @@ -1504,20 +1788,23 @@ static void scan_nat_page(struct f2fs_sb_info *sbi, } 
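For scale, the free-nid bookkeeping maintained by build_free_nids() and try_to_free_nids() below works out as follows (a self-contained sketch; NAT_ENTRY_PER_BLOCK = 455 is taken from f2fs_fs.h as an assumption, the other two constants appear in the node.h hunk further down):

	#include <stdio.h>

	#define NAT_ENTRY_PER_BLOCK	455	/* 4KB block / 9-byte raw NAT entry; assumption */
	#define FREE_NID_PAGES		8	/* raised from 4 by this patch */
	#define MAX_FREE_NIDS		(NAT_ENTRY_PER_BLOCK * FREE_NID_PAGES)

	int main(void)
	{
		/* build_free_nids() returns early once fcnt >= NAT_ENTRY_PER_BLOCK,
		 * and try_to_free_nids() trims the list back toward MAX_FREE_NIDS */
		printf("scan target: %d nids, list ceiling: %d nids\n",
		       NAT_ENTRY_PER_BLOCK, MAX_FREE_NIDS);	/* 455, 3640 */
		return 0;
	}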
} -static void build_free_nids(struct f2fs_sb_info *sbi) +void build_free_nids(struct f2fs_sb_info *sbi) { struct f2fs_nm_info *nm_i = NM_I(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + struct f2fs_journal *journal = curseg->journal; int i = 0; nid_t nid = nm_i->next_scan_nid; /* Enough entries */ - if (nm_i->fcnt > NAT_ENTRY_PER_BLOCK) + if (nm_i->fcnt >= NAT_ENTRY_PER_BLOCK) return; /* readahead nat pages to be scanned */ - ra_meta_pages(sbi, NAT_BLOCK_OFFSET(nid), FREE_NID_PAGES, META_NAT); + ra_meta_pages(sbi, NAT_BLOCK_OFFSET(nid), FREE_NID_PAGES, + META_NAT, true); + + down_read(&nm_i->nat_tree_lock); while (1) { struct page *page = get_current_nat_page(sbi, nid); @@ -1529,7 +1816,7 @@ static void build_free_nids(struct f2fs_sb_info *sbi) if (unlikely(nid >= nm_i->max_nid)) nid = 0; - if (i++ == FREE_NID_PAGES) + if (++i >= FREE_NID_PAGES) break; } @@ -1537,16 +1824,22 @@ static void build_free_nids(struct f2fs_sb_info *sbi) nm_i->next_scan_nid = nid; /* find free nids from current sum_pages */ - mutex_lock(&curseg->curseg_mutex); - for (i = 0; i < nats_in_cursum(sum); i++) { - block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr); - nid = le32_to_cpu(nid_in_journal(sum, i)); + down_read(&curseg->journal_rwsem); + for (i = 0; i < nats_in_cursum(journal); i++) { + block_t addr; + + addr = le32_to_cpu(nat_in_journal(journal, i).block_addr); + nid = le32_to_cpu(nid_in_journal(journal, i)); if (addr == NULL_ADDR) add_free_nid(sbi, nid, true); else remove_free_nid(nm_i, nid); } - mutex_unlock(&curseg->curseg_mutex); + up_read(&curseg->journal_rwsem); + up_read(&nm_i->nat_tree_lock); + + ra_meta_pages(sbi, NAT_BLOCK_OFFSET(nm_i->next_scan_nid), + nm_i->ra_nid_pages, META_NAT, false); } /* @@ -1559,6 +1852,10 @@ bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid) struct f2fs_nm_info *nm_i = NM_I(sbi); struct free_nid *i = NULL; retry: +#ifdef CONFIG_F2FS_FAULT_INJECTION + if (time_to_inject(FAULT_ALLOC_NID)) + return false; +#endif if (unlikely(sbi->total_valid_node_count + 1 > nm_i->available_nids)) return false; @@ -1632,6 +1929,35 @@ void alloc_nid_failed(struct f2fs_sb_info *sbi, nid_t nid) kmem_cache_free(free_nid_slab, i); } +int try_to_free_nids(struct f2fs_sb_info *sbi, int nr_shrink) +{ + struct f2fs_nm_info *nm_i = NM_I(sbi); + struct free_nid *i, *next; + int nr = nr_shrink; + + if (nm_i->fcnt <= MAX_FREE_NIDS) + return 0; + + if (!mutex_trylock(&nm_i->build_lock)) + return 0; + + spin_lock(&nm_i->free_nid_list_lock); + list_for_each_entry_safe(i, next, &nm_i->free_nid_list, list) { + if (nr_shrink <= 0 || nm_i->fcnt <= MAX_FREE_NIDS) + break; + if (i->state == NID_ALLOC) + continue; + __del_from_free_nid_list(nm_i, i); + kmem_cache_free(free_nid_slab, i); + nm_i->fcnt--; + nr_shrink--; + } + spin_unlock(&nm_i->free_nid_list_lock); + mutex_unlock(&nm_i->build_lock); + + return nr - nr_shrink; +} + void recover_inline_xattr(struct inode *inode, struct page *page) { void *src_addr, *dst_addr; @@ -1644,7 +1970,7 @@ void recover_inline_xattr(struct inode *inode, struct page *page) ri = F2FS_INODE(page); if (!(ri->i_inline & F2FS_INLINE_XATTR)) { - clear_inode_flag(F2FS_I(inode), FI_INLINE_XATTR); + clear_inode_flag(inode, FI_INLINE_XATTR); goto update_inode; } @@ -1652,7 +1978,7 @@ void recover_inline_xattr(struct inode *inode, struct page *page) src_addr = inline_xattr_addr(page); inline_size = inline_xattr_size(inode); - f2fs_wait_on_page_writeback(ipage, NODE); + f2fs_wait_on_page_writeback(ipage, 
NODE, true); memcpy(dst_addr, src_addr, inline_size); update_inode: update_inode(inode, ipage); @@ -1686,13 +2012,11 @@ void recover_xattr_data(struct inode *inode, struct page *page, block_t blkaddr) get_node_info(sbi, new_xnid, &ni); ni.ino = inode->i_ino; set_node_addr(sbi, &ni, NEW_ADDR, false); - F2FS_I(inode)->i_xattr_nid = new_xnid; + f2fs_i_xnid_write(inode, new_xnid); /* 3: update xattr blkaddr */ refresh_sit_entry(sbi, NEW_ADDR, blkaddr); set_node_addr(sbi, &ni, blkaddr, false); - - update_inode_page(inode); } int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page) @@ -1707,14 +2031,15 @@ int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page) if (unlikely(old_ni.blk_addr != NULL_ADDR)) return -EINVAL; - ipage = grab_cache_page(NODE_MAPPING(sbi), ino); + ipage = f2fs_grab_cache_page(NODE_MAPPING(sbi), ino, false); if (!ipage) return -ENOMEM; /* Should not use this inode from free nid list */ remove_free_nid(NM_I(sbi), ino); - SetPageUptodate(ipage); + if (!PageUptodate(ipage)) + SetPageUptodate(ipage); fill_node_footer(ipage, ino, ino, 0, true); src = F2FS_INODE(page); @@ -1757,10 +2082,10 @@ int restore_node_summary(struct f2fs_sb_info *sbi, nrpages = min(last_offset - i, bio_blocks); /* readahead node pages */ - ra_meta_pages(sbi, addr, nrpages, META_POR); + ra_meta_pages(sbi, addr, nrpages, META_POR, true); for (idx = addr; idx < addr + nrpages; idx++) { - struct page *page = get_meta_page(sbi, idx); + struct page *page = get_tmp_page(sbi, idx); rn = F2FS_NODE(page); sum_entry->nid = rn->footer.nid; @@ -1780,28 +2105,26 @@ static void remove_nats_in_journal(struct f2fs_sb_info *sbi) { struct f2fs_nm_info *nm_i = NM_I(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + struct f2fs_journal *journal = curseg->journal; int i; - mutex_lock(&curseg->curseg_mutex); - for (i = 0; i < nats_in_cursum(sum); i++) { + down_write(&curseg->journal_rwsem); + for (i = 0; i < nats_in_cursum(journal); i++) { struct nat_entry *ne; struct f2fs_nat_entry raw_ne; - nid_t nid = le32_to_cpu(nid_in_journal(sum, i)); + nid_t nid = le32_to_cpu(nid_in_journal(journal, i)); - raw_ne = nat_in_journal(sum, i); + raw_ne = nat_in_journal(journal, i); - down_write(&nm_i->nat_tree_lock); ne = __lookup_nat_cache(nm_i, nid); if (!ne) { ne = grab_nat_entry(nm_i, nid); node_info_from_raw_nat(&ne->ni, &raw_ne); } __set_nat_cache_dirty(nm_i, ne); - up_write(&nm_i->nat_tree_lock); } - update_nats_in_cursum(sum, -i); - mutex_unlock(&curseg->curseg_mutex); + update_nats_in_cursum(journal, -i); + up_write(&curseg->journal_rwsem); } static void __adjust_nat_entry_set(struct nat_entry_set *nes, @@ -1826,24 +2149,23 @@ static void __flush_nat_entry_set(struct f2fs_sb_info *sbi, struct nat_entry_set *set) { struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + struct f2fs_journal *journal = curseg->journal; nid_t start_nid = set->set * NAT_ENTRY_PER_BLOCK; bool to_journal = true; struct f2fs_nat_block *nat_blk; struct nat_entry *ne, *cur; struct page *page = NULL; - struct f2fs_nm_info *nm_i = NM_I(sbi); /* * there are two steps to flush nat entries: * #1, flush nat entries to journal in current hot data summary block. * #2, flush nat entries to nat page. 
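In code, the choice between the two steps is a single capacity test: a dirty set is journaled only if every one of its entries still fits, otherwise the whole set is spilled to its on-disk NAT block. A condensed restatement of the logic below (not compilable on its own):

	to_journal = __has_cursum_space(journal, set->entry_cnt, NAT_JOURNAL);
	if (to_journal)
		down_write(&curseg->journal_rwsem);		/* step #1 */
	else
		page = get_next_nat_page(sbi, start_nid);	/* step #2 */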
*/ - if (!__has_cursum_space(sum, set->entry_cnt, NAT_JOURNAL)) + if (!__has_cursum_space(journal, set->entry_cnt, NAT_JOURNAL)) to_journal = false; if (to_journal) { - mutex_lock(&curseg->curseg_mutex); + down_write(&curseg->journal_rwsem); } else { page = get_next_nat_page(sbi, start_nid); nat_blk = page_address(page); @@ -1860,35 +2182,29 @@ static void __flush_nat_entry_set(struct f2fs_sb_info *sbi, continue; if (to_journal) { - offset = lookup_journal_in_cursum(sum, + offset = lookup_journal_in_cursum(journal, NAT_JOURNAL, nid, 1); f2fs_bug_on(sbi, offset < 0); - raw_ne = &nat_in_journal(sum, offset); - nid_in_journal(sum, offset) = cpu_to_le32(nid); + raw_ne = &nat_in_journal(journal, offset); + nid_in_journal(journal, offset) = cpu_to_le32(nid); } else { raw_ne = &nat_blk->entries[nid - start_nid]; } raw_nat_from_node_info(raw_ne, &ne->ni); - - down_write(&NM_I(sbi)->nat_tree_lock); nat_reset_flag(ne); __clear_nat_cache_dirty(NM_I(sbi), ne); - up_write(&NM_I(sbi)->nat_tree_lock); - if (nat_get_blkaddr(ne) == NULL_ADDR) add_free_nid(sbi, nid, false); } if (to_journal) - mutex_unlock(&curseg->curseg_mutex); + up_write(&curseg->journal_rwsem); else f2fs_put_page(page, 1); f2fs_bug_on(sbi, set->entry_cnt); - down_write(&nm_i->nat_tree_lock); radix_tree_delete(&NM_I(sbi)->nat_set_root, set->set); - up_write(&nm_i->nat_tree_lock); kmem_cache_free(nat_entry_set_slab, set); } @@ -1899,7 +2215,7 @@ void flush_nat_entries(struct f2fs_sb_info *sbi) { struct f2fs_nm_info *nm_i = NM_I(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + struct f2fs_journal *journal = curseg->journal; struct nat_entry_set *setvec[SETVEC_SIZE]; struct nat_entry_set *set, *tmp; unsigned int found; @@ -1908,29 +2224,32 @@ void flush_nat_entries(struct f2fs_sb_info *sbi) if (!nm_i->dirty_nat_cnt) return; + + down_write(&nm_i->nat_tree_lock); + /* * if there is not enough space in the journal to store dirty nat * entries, remove all entries from journal and merge them * into nat entry set.
*/ - if (!__has_cursum_space(sum, nm_i->dirty_nat_cnt, NAT_JOURNAL)) + if (!__has_cursum_space(journal, nm_i->dirty_nat_cnt, NAT_JOURNAL)) remove_nats_in_journal(sbi); - down_write(&nm_i->nat_tree_lock); while ((found = __gang_lookup_nat_set(nm_i, set_idx, SETVEC_SIZE, setvec))) { unsigned idx; set_idx = setvec[found - 1]->set + 1; for (idx = 0; idx < found; idx++) __adjust_nat_entry_set(setvec[idx], &sets, - MAX_NAT_JENTRIES(sum)); + MAX_NAT_JENTRIES(journal)); } - up_write(&nm_i->nat_tree_lock); /* flush dirty nats in nat entry set */ list_for_each_entry_safe(set, tmp, &sets, set_list) __flush_nat_entry_set(sbi, set); + up_write(&nm_i->nat_tree_lock); + f2fs_bug_on(sbi, nm_i->dirty_nat_cnt); } @@ -1954,6 +2273,8 @@ static int init_node_manager(struct f2fs_sb_info *sbi) nm_i->fcnt = 0; nm_i->nat_cnt = 0; nm_i->ram_thresh = DEF_RAM_THRESHOLD; + nm_i->ra_nid_pages = DEF_RA_NID_PAGES; + nm_i->dirty_nats_ratio = DEF_DIRTY_NAT_RATIO_THRESHOLD; INIT_RADIX_TREE(&nm_i->free_nid_root, GFP_ATOMIC); INIT_LIST_HEAD(&nm_i->free_nid_list); diff --git a/fs/f2fs/node.h b/fs/f2fs/node.h old mode 100644 new mode 100755 index c56026f1725c..fc7684554b1a --- a/fs/f2fs/node.h +++ b/fs/f2fs/node.h @@ -14,14 +14,22 @@ /* node block offset on the NAT area dedicated to the given start node id */ #define NAT_BLOCK_OFFSET(start_nid) (start_nid / NAT_ENTRY_PER_BLOCK) -/* # of pages to perform readahead before building free nids */ -#define FREE_NID_PAGES 4 +/* # of pages to perform synchronous readahead before building free nids */ +#define FREE_NID_PAGES 8 +#define MAX_FREE_NIDS (NAT_ENTRY_PER_BLOCK * FREE_NID_PAGES) + +#define DEF_RA_NID_PAGES 0 /* # of nid pages to be readaheaded */ /* maximum readahead size for node during getting data blocks */ #define MAX_RA_NODE 128 /* control the memory footprint threshold (10MB per 1GB ram) */ -#define DEF_RAM_THRESHOLD 10 +#define DEF_RAM_THRESHOLD 1 + +/* control dirty nats ratio threshold (default: 10% over max nid count) */ +#define DEF_DIRTY_NAT_RATIO_THRESHOLD 10 +/* control total # of nats */ +#define DEF_NAT_CACHE_THRESHOLD 100000 /* vector size for gang look-up from nat cache that consists of radix tree */ #define NATVEC_SIZE 64 @@ -115,6 +123,17 @@ static inline void raw_nat_from_node_info(struct f2fs_nat_entry *raw_ne, raw_ne->version = ni->version; } +static inline bool excess_dirty_nats(struct f2fs_sb_info *sbi) +{ + return NM_I(sbi)->dirty_nat_cnt >= NM_I(sbi)->max_nid * + NM_I(sbi)->dirty_nats_ratio / 100; +} + +static inline bool excess_cached_nats(struct f2fs_sb_info *sbi) +{ + return NM_I(sbi)->nat_cnt >= DEF_NAT_CACHE_THRESHOLD; +} + enum mem_type { FREE_NIDS, /* indicates the free nid list */ NAT_ENTRIES, /* indicates the cached nat entry */ @@ -181,7 +200,7 @@ static inline pgoff_t current_nat_addr(struct f2fs_sb_info *sbi, nid_t start) block_addr = (pgoff_t)(nm_i->nat_blkaddr + (seg_off << sbi->log_blocks_per_seg << 1) + - (block_off & ((1 << sbi->log_blocks_per_seg) - 1))); + (block_off & (sbi->blocks_per_seg - 1))); if (f2fs_test_bit(block_off, nm_i->nat_bitmap)) block_addr += sbi->blocks_per_seg; @@ -315,17 +334,17 @@ static inline bool IS_DNODE(struct page *node_page) return true; } -static inline void set_nid(struct page *p, int off, nid_t nid, bool i) +static inline int set_nid(struct page *p, int off, nid_t nid, bool i) { struct f2fs_node *rn = F2FS_NODE(p); - f2fs_wait_on_page_writeback(p, NODE); + f2fs_wait_on_page_writeback(p, NODE, true); if (i) rn->i.i_nid[off - NODE_DIR1_BLOCK] = cpu_to_le32(nid); else rn->in.nid[off] = cpu_to_le32(nid); - 
set_page_dirty(p); + return set_page_dirty(p); } static inline nid_t get_nid(struct page *p, int off, bool i) @@ -343,28 +362,6 @@ static inline nid_t get_nid(struct page *p, int off, bool i) * - Mark cold node blocks in their node footer * - Mark cold data pages in page cache */ -static inline int is_file(struct inode *inode, int type) -{ - return F2FS_I(inode)->i_advise & type; -} - -static inline void set_file(struct inode *inode, int type) -{ - F2FS_I(inode)->i_advise |= type; -} - -static inline void clear_file(struct inode *inode, int type) -{ - F2FS_I(inode)->i_advise &= ~type; -} - -#define file_is_cold(inode) is_file(inode, FADVISE_COLD_BIT) -#define file_wrong_pino(inode) is_file(inode, FADVISE_LOST_PINO_BIT) -#define file_set_cold(inode) set_file(inode, FADVISE_COLD_BIT) -#define file_lost_pino(inode) set_file(inode, FADVISE_LOST_PINO_BIT) -#define file_clear_cold(inode) clear_file(inode, FADVISE_COLD_BIT) -#define file_got_pino(inode) clear_file(inode, FADVISE_LOST_PINO_BIT) - static inline int is_cold_data(struct page *page) { return PageChecked(page); @@ -390,6 +387,21 @@ static inline int is_node(struct page *page, int type) #define is_fsync_dnode(page) is_node(page, FSYNC_BIT_SHIFT) #define is_dent_dnode(page) is_node(page, DENT_BIT_SHIFT) +static inline int is_inline_node(struct page *page) +{ + return PageChecked(page); +} + +static inline void set_inline_node(struct page *page) +{ + SetPageChecked(page); +} + +static inline void clear_inline_node(struct page *page) +{ + ClearPageChecked(page); +} + static inline void set_cold_node(struct inode *inode, struct page *page) { struct f2fs_node *rn = F2FS_NODE(page); diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c old mode 100644 new mode 100755 index 0d4866b7c8b5..cb591900257a --- a/fs/f2fs/recovery.c +++ b/fs/f2fs/recovery.c @@ -49,8 +49,9 @@ static struct kmem_cache *fsync_entry_slab; bool space_for_roll_forward(struct f2fs_sb_info *sbi) { - if (sbi->last_valid_block_count + sbi->alloc_valid_block_count - > sbi->user_block_count) + s64 nalloc = percpu_counter_sum_positive(&sbi->alloc_valid_block_count); + + if (sbi->last_valid_block_count + nalloc > sbi->user_block_count) return false; return true; } @@ -67,7 +68,30 @@ static struct fsync_inode_entry *get_fsync_inode(struct list_head *head, return NULL; } -static int recover_dentry(struct inode *inode, struct page *ipage) +static struct fsync_inode_entry *add_fsync_inode(struct list_head *head, + struct inode *inode) +{ + struct fsync_inode_entry *entry; + + entry = kmem_cache_alloc(fsync_entry_slab, GFP_F2FS_ZERO); + if (!entry) + return NULL; + + entry->inode = inode; + list_add_tail(&entry->list, head); + + return entry; +} + +static void del_fsync_inode(struct fsync_inode_entry *entry) +{ + iput(entry->inode); + list_del(&entry->list); + kmem_cache_free(fsync_entry_slab, entry); +} + +static int recover_dentry(struct inode *inode, struct page *ipage, + struct list_head *dir_list) { struct f2fs_inode *raw_inode = F2FS_INODE(ipage); nid_t pino = le32_to_cpu(raw_inode->i_pino); @@ -75,21 +99,37 @@ static int recover_dentry(struct inode *inode, struct page *ipage) struct qstr name; struct page *page; struct inode *dir, *einode; + struct fsync_inode_entry *entry; int err = 0; - dir = f2fs_iget(inode->i_sb, pino); - if (IS_ERR(dir)) { - err = PTR_ERR(dir); - goto out; + entry = get_fsync_inode(dir_list, pino); + if (!entry) { + dir = f2fs_iget(inode->i_sb, pino); + if (IS_ERR(dir)) { + err = PTR_ERR(dir); + goto out; + } + + entry = add_fsync_inode(dir_list, dir); + if 
(!entry) { + err = -ENOMEM; + iput(dir); + goto out; + } } + dir = entry->inode; + + if (file_enc_name(inode)) + return 0; + name.len = le32_to_cpu(raw_inode->i_namelen); name.name = raw_inode->i_name; if (unlikely(name.len > F2FS_NAME_LEN)) { WARN_ON(1); err = -ENAMETOOLONG; - goto out_err; + goto out; } retry: de = f2fs_find_entry(dir, &name, &page); @@ -113,25 +153,17 @@ static int recover_dentry(struct inode *inode, struct page *ipage) f2fs_delete_entry(de, page, dir, einode); iput(einode); goto retry; - } - err = __f2fs_add_link(dir, &name, inode, inode->i_ino, inode->i_mode); - if (err) - goto out_err; - - if (is_inode_flag_set(F2FS_I(dir), FI_DELAY_IPUT)) { - iput(dir); + } else if (IS_ERR(page)) { + err = PTR_ERR(page); } else { - add_dirty_dir_inode(dir); - set_inode_flag(F2FS_I(dir), FI_DELAY_IPUT); + err = __f2fs_add_link(dir, &name, inode, + inode->i_ino, inode->i_mode); } - goto out; out_unmap_put: f2fs_dentry_kunmap(dir, page); f2fs_put_page(page, 0); -out_err: - iput(dir); out: f2fs_msg(inode->i_sb, KERN_NOTICE, "%s: ino = %x, name = %s, dir = %lx, err = %d", @@ -143,9 +175,10 @@ static int recover_dentry(struct inode *inode, struct page *ipage) static void recover_inode(struct inode *inode, struct page *page) { struct f2fs_inode *raw = F2FS_INODE(page); + char *name; inode->i_mode = le16_to_cpu(raw->i_mode); - i_size_write(inode, le64_to_cpu(raw->i_size)); + f2fs_i_size_write(inode, le64_to_cpu(raw->i_size)); inode->i_atime.tv_sec = le64_to_cpu(raw->i_mtime); inode->i_ctime.tv_sec = le64_to_cpu(raw->i_ctime); inode->i_mtime.tv_sec = le64_to_cpu(raw->i_mtime); @@ -153,14 +186,46 @@ static void recover_inode(struct inode *inode, struct page *page) inode->i_ctime.tv_nsec = le32_to_cpu(raw->i_ctime_nsec); inode->i_mtime.tv_nsec = le32_to_cpu(raw->i_mtime_nsec); + if (file_enc_name(inode)) + name = ""; + else + name = F2FS_INODE(page)->i_name; + f2fs_msg(inode->i_sb, KERN_NOTICE, "recover_inode: ino = %x, name = %s", - ino_of_node(page), F2FS_INODE(page)->i_name); + ino_of_node(page), name); +} + +static bool is_same_inode(struct inode *inode, struct page *ipage) +{ + struct f2fs_inode *ri = F2FS_INODE(ipage); + struct timespec disk; + + if (!IS_INODE(ipage)) + return true; + + disk.tv_sec = le64_to_cpu(ri->i_ctime); + disk.tv_nsec = le32_to_cpu(ri->i_ctime_nsec); + if (timespec_compare(&inode->i_ctime, &disk) > 0) + return false; + + disk.tv_sec = le64_to_cpu(ri->i_atime); + disk.tv_nsec = le32_to_cpu(ri->i_atime_nsec); + if (timespec_compare(&inode->i_atime, &disk) > 0) + return false; + + disk.tv_sec = le64_to_cpu(ri->i_mtime); + disk.tv_nsec = le32_to_cpu(ri->i_mtime_nsec); + if (timespec_compare(&inode->i_mtime, &disk) > 0) + return false; + + return true; } static int find_fsync_dnodes(struct f2fs_sb_info *sbi, struct list_head *head) { unsigned long long cp_ver = cur_cp_version(F2FS_CKPT(sbi)); struct curseg_info *curseg; + struct inode *inode; struct page *page = NULL; block_t blkaddr; int err = 0; @@ -169,15 +234,13 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, struct list_head *head) curseg = CURSEG_I(sbi, CURSEG_WARM_NODE); blkaddr = NEXT_FREE_BLKADDR(sbi, curseg); - ra_meta_pages(sbi, blkaddr, 1, META_POR); - while (1) { struct fsync_inode_entry *entry; - if (blkaddr < MAIN_BLKADDR(sbi) || blkaddr >= MAX_BLKADDR(sbi)) + if (!is_valid_blkaddr(sbi, blkaddr, META_POR)) return 0; - page = get_meta_page(sbi, blkaddr); + page = get_tmp_page(sbi, blkaddr); if (cp_ver != cpver_of_node(page)) break; @@ -186,42 +249,42 @@ static int find_fsync_dnodes(struct 
f2fs_sb_info *sbi, struct list_head *head) goto next; entry = get_fsync_inode(head, ino_of_node(page)); - if (!entry) { + if (entry) { + if (!is_same_inode(entry->inode, page)) + goto next; + } else { if (IS_INODE(page) && is_dent_dnode(page)) { err = recover_inode_page(sbi, page); if (err) break; } - /* add this fsync inode to the list */ - entry = kmem_cache_alloc(fsync_entry_slab, GFP_F2FS_ZERO); - if (!entry) { - err = -ENOMEM; - break; - } /* * CP | dnode(F) | inode(DF) * For this case, we should not give up now. */ - entry->inode = f2fs_iget(sbi->sb, ino_of_node(page)); - if (IS_ERR(entry->inode)) { - err = PTR_ERR(entry->inode); - kmem_cache_free(fsync_entry_slab, entry); + inode = f2fs_iget(sbi->sb, ino_of_node(page)); + if (IS_ERR(inode)) { + err = PTR_ERR(inode); if (err == -ENOENT) { err = 0; goto next; } break; } - list_add_tail(&entry->list, head); + + /* add this fsync inode to the list */ + entry = add_fsync_inode(head, inode); + if (!entry) { + err = -ENOMEM; + iput(inode); + break; + } } entry->blkaddr = blkaddr; - if (IS_INODE(page)) { - entry->last_inode = blkaddr; - if (is_dent_dnode(page)) - entry->last_dentry = blkaddr; - } + if (IS_INODE(page) && is_dent_dnode(page)) + entry->last_dentry = blkaddr; next: /* check next segment */ blkaddr = next_blkaddr_of_node(page); @@ -237,11 +300,8 @@ static void destroy_fsync_dnodes(struct list_head *head) { struct fsync_inode_entry *entry, *tmp; - list_for_each_entry_safe(entry, tmp, head, list) { - iput(entry->inode); - list_del(&entry->list); - kmem_cache_free(fsync_entry_slab, entry); - } + list_for_each_entry_safe(entry, tmp, head, list) + del_fsync_inode(entry); } static int check_index_in_prev_nodes(struct f2fs_sb_info *sbi, @@ -310,8 +370,7 @@ static int check_index_in_prev_nodes(struct f2fs_sb_info *sbi, inode = dn->inode; } - bidx = start_bidx_of_node(offset, F2FS_I(inode)) + - le16_to_cpu(sum.ofs_in_node); + bidx = start_bidx_of_node(offset, inode) + le16_to_cpu(sum.ofs_in_node); /* * if inode page is locked, unlock temporarily, but its reference @@ -346,11 +405,9 @@ static int check_index_in_prev_nodes(struct f2fs_sb_info *sbi, static int do_recover_data(struct f2fs_sb_info *sbi, struct inode *inode, struct page *page, block_t blkaddr) { - struct f2fs_inode_info *fi = F2FS_I(inode); - unsigned int start, end; struct dnode_of_data dn; - struct f2fs_summary sum; struct node_info ni; + unsigned int start, end; int err = 0, recovered = 0; /* step 1: recover xattr */ @@ -370,38 +427,63 @@ static int do_recover_data(struct f2fs_sb_info *sbi, struct inode *inode, goto out; /* step 3: recover data indices */ - start = start_bidx_of_node(ofs_of_node(page), fi); - end = start + ADDRS_PER_PAGE(page, fi); - - f2fs_lock_op(sbi); + start = start_bidx_of_node(ofs_of_node(page), inode); + end = start + ADDRS_PER_PAGE(page, inode); set_new_dnode(&dn, inode, NULL, NULL, 0); err = get_dnode_of_data(&dn, start, ALLOC_NODE); - if (err) { - f2fs_unlock_op(sbi); + if (err) goto out; - } - f2fs_wait_on_page_writeback(dn.node_page, NODE); + f2fs_wait_on_page_writeback(dn.node_page, NODE, true); get_node_info(sbi, dn.nid, &ni); f2fs_bug_on(sbi, ni.ino != ino_of_node(page)); f2fs_bug_on(sbi, ofs_of_node(dn.node_page) != ofs_of_node(page)); - for (; start < end; start++) { + for (; start < end; start++, dn.ofs_in_node++) { block_t src, dest; src = datablock_addr(dn.node_page, dn.ofs_in_node); dest = datablock_addr(page, dn.ofs_in_node); - if (src != dest && dest != NEW_ADDR && dest != NULL_ADDR && - dest >= MAIN_BLKADDR(sbi) && dest < 
MAX_BLKADDR(sbi)) { + /* skip recovering if dest is the same as src */ + if (src == dest) + continue; + + /* dest is invalid, just invalidate src block */ + if (dest == NULL_ADDR) { + truncate_data_blocks_range(&dn, 1); + continue; + } + + if ((start + 1) << PAGE_SHIFT > i_size_read(inode)) + f2fs_i_size_write(inode, (start + 1) << PAGE_SHIFT); + + /* + * dest is reserved block, invalidate src block + * and then reserve one new block in dnode page. + */ + if (dest == NEW_ADDR) { + truncate_data_blocks_range(&dn, 1); + reserve_new_block(&dn); + continue; + } + + /* dest is valid block, try to recover from src to dest */ + if (is_valid_blkaddr(sbi, dest, META_POR)) { if (src == NULL_ADDR) { err = reserve_new_block(&dn); +#ifdef CONFIG_F2FS_FAULT_INJECTION + while (err) + err = reserve_new_block(&dn); +#endif /* We should not get -ENOSPC */ f2fs_bug_on(sbi, err); + if (err) + goto err; } /* Check the previous node page having this index */ @@ -409,28 +491,19 @@ static int do_recover_data(struct f2fs_sb_info *sbi, struct inode *inode, if (err) goto err; - set_summary(&sum, dn.nid, dn.ofs_in_node, ni.version); - /* write dummy data page */ - recover_data_page(sbi, NULL, &sum, src, dest); - dn.data_blkaddr = dest; - set_data_blkaddr(&dn); - f2fs_update_extent_cache(&dn); + f2fs_replace_block(sbi, &dn, src, dest, + ni.version, false, false); recovered++; } - dn.ofs_in_node++; } - if (IS_INODE(dn.node_page)) - sync_inode_page(&dn); - copy_node_footer(dn.node_page, page); fill_node_footer(dn.node_page, dn.nid, ni.ino, ofs_of_node(page), false); set_page_dirty(dn.node_page); err: f2fs_put_dnode(&dn); - f2fs_unlock_op(sbi); out: f2fs_msg(sbi->sb, KERN_NOTICE, "recover_data: ino = %lx, recovered = %d blocks, err = %d", @@ -438,8 +511,8 @@ static int do_recover_data(struct f2fs_sb_info *sbi, struct inode *inode, return err; } -static int recover_data(struct f2fs_sb_info *sbi, - struct list_head *head, int type) +static int recover_data(struct f2fs_sb_info *sbi, struct list_head *inode_list, + struct list_head *dir_list) { unsigned long long cp_ver = cur_cp_version(F2FS_CKPT(sbi)); struct curseg_info *curseg; @@ -448,25 +521,25 @@ static int recover_data(struct f2fs_sb_info *sbi, block_t blkaddr; /* get node pages in the current segment */ - curseg = CURSEG_I(sbi, type); + curseg = CURSEG_I(sbi, CURSEG_WARM_NODE); blkaddr = NEXT_FREE_BLKADDR(sbi, curseg); while (1) { struct fsync_inode_entry *entry; - if (blkaddr < MAIN_BLKADDR(sbi) || blkaddr >= MAX_BLKADDR(sbi)) + if (!is_valid_blkaddr(sbi, blkaddr, META_POR)) break; ra_meta_pages_cond(sbi, blkaddr); - page = get_meta_page(sbi, blkaddr); + page = get_tmp_page(sbi, blkaddr); if (cp_ver != cpver_of_node(page)) { f2fs_put_page(page, 1); break; } - entry = get_fsync_inode(head, ino_of_node(page)); + entry = get_fsync_inode(inode_list, ino_of_node(page)); if (!entry) goto next; /* @@ -474,10 +547,10 @@ static int recover_data(struct f2fs_sb_info *sbi, * In this case, we can lose the latest inode(x). * So, call recover_inode for the inode update. 
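Taken together, the rewritten per-index loop in do_recover_data() above reduces to a four-way decision per block address. A self-contained summary (NULL_ADDR and NEW_ADDR follow the usual f2fs conventions of 0 and -1, stated here as an assumption):

	#include <stdio.h>

	#define NULL_ADDR	0x0U		/* unallocated */
	#define NEW_ADDR	0xffffffffU	/* reserved, not yet written */

	/* mirrors the src/dest cases of do_recover_data() */
	static const char *recover_action(unsigned int src, unsigned int dest)
	{
		if (src == dest)
			return "skip, already consistent";
		if (dest == NULL_ADDR)
			return "invalidate src";
		if (dest == NEW_ADDR)
			return "invalidate src, reserve a new block";
		return "f2fs_replace_block: remap src to dest";
	}

	int main(void)
	{
		printf("%s\n", recover_action(0x100, 0x100));
		printf("%s\n", recover_action(0x100, NULL_ADDR));
		printf("%s\n", recover_action(NULL_ADDR, NEW_ADDR));
		printf("%s\n", recover_action(NULL_ADDR, 0x200));
		return 0;
	}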
*/ - if (entry->last_inode == blkaddr) + if (IS_INODE(page)) recover_inode(entry->inode, page); if (entry->last_dentry == blkaddr) { - err = recover_dentry(entry->inode, page); + err = recover_dentry(entry->inode, page, dir_list); if (err) { f2fs_put_page(page, 1); break; @@ -489,11 +562,8 @@ static int recover_data(struct f2fs_sb_info *sbi, break; } - if (entry->blkaddr == blkaddr) { - iput(entry->inode); - list_del(&entry->list); - kmem_cache_free(fsync_entry_slab, entry); - } + if (entry->blkaddr == blkaddr) + del_fsync_inode(entry); next: /* check next segment */ blkaddr = next_blkaddr_of_node(page); @@ -504,12 +574,14 @@ static int recover_data(struct f2fs_sb_info *sbi, return err; } -int recover_fsync_data(struct f2fs_sb_info *sbi) +int recover_fsync_data(struct f2fs_sb_info *sbi, bool check_only) { struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_WARM_NODE); struct list_head inode_list; + struct list_head dir_list; block_t blkaddr; int err; + int ret = 0; bool need_writecp = false; fsync_entry_slab = f2fs_kmem_cache_create("f2fs_fsync_inode_entry", @@ -518,35 +590,35 @@ int recover_fsync_data(struct f2fs_sb_info *sbi) return -ENOMEM; INIT_LIST_HEAD(&inode_list); - - /* step #1: find fsynced inode numbers */ - set_sbi_flag(sbi, SBI_POR_DOING); + INIT_LIST_HEAD(&dir_list); /* prevent checkpoint */ mutex_lock(&sbi->cp_mutex); blkaddr = NEXT_FREE_BLKADDR(sbi, curseg); + /* step #1: find fsynced inode numbers */ err = find_fsync_dnodes(sbi, &inode_list); - if (err) + if (err || list_empty(&inode_list)) goto out; - if (list_empty(&inode_list)) + if (check_only) { + ret = 1; goto out; + } need_writecp = true; /* step #2: recover data */ - err = recover_data(sbi, &inode_list, CURSEG_WARM_NODE); + err = recover_data(sbi, &inode_list, &dir_list); if (!err) f2fs_bug_on(sbi, !list_empty(&inode_list)); out: destroy_fsync_dnodes(&inode_list); - kmem_cache_destroy(fsync_entry_slab); /* truncate meta pages to be used by the recovery */ truncate_inode_pages_range(META_MAPPING(sbi), - MAIN_BLKADDR(sbi) << PAGE_CACHE_SHIFT, -1); + (loff_t)MAIN_BLKADDR(sbi) << PAGE_SHIFT, -1); if (err) { truncate_inode_pages(NODE_MAPPING(sbi), 0); @@ -555,11 +627,24 @@ int recover_fsync_data(struct f2fs_sb_info *sbi) clear_sbi_flag(sbi, SBI_POR_DOING); if (err) { - discard_next_dnode(sbi, blkaddr); + bool invalidate = false; + + if (test_opt(sbi, LFS)) { + update_meta_page(sbi, NULL, blkaddr); + invalidate = true; + } else if (discard_next_dnode(sbi, blkaddr)) { + invalidate = true; + } /* Flush all the NAT/SIT pages */ while (get_pages(sbi, F2FS_DIRTY_META)) sync_meta_pages(sbi, META, LONG_MAX); + + /* invalidate temporary meta page */ + if (invalidate) + invalidate_mapping_pages(META_MAPPING(sbi), + blkaddr, blkaddr); + set_ckpt_flags(sbi->ckpt, CP_ERROR_FLAG); mutex_unlock(&sbi->cp_mutex); } else if (need_writecp) { @@ -567,9 +652,12 @@ int recover_fsync_data(struct f2fs_sb_info *sbi) .reason = CP_RECOVERY, }; mutex_unlock(&sbi->cp_mutex); - write_checkpoint(sbi, &cpc); + err = write_checkpoint(sbi, &cpc); } else { mutex_unlock(&sbi->cp_mutex); } - return err; + + destroy_fsync_dnodes(&dir_list); + kmem_cache_destroy(fsync_entry_slab); + return ret ? 
ret: err; } diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c old mode 100644 new mode 100755 index 4aff9ea564ad..011646af340c --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -14,8 +14,8 @@ #include #include #include -#include #include +#include #include "f2fs.h" #include "segment.h" @@ -29,6 +29,21 @@ static struct kmem_cache *discard_entry_slab; static struct kmem_cache *sit_entry_set_slab; static struct kmem_cache *inmem_entry_slab; +static unsigned long __reverse_ulong(unsigned char *str) +{ + unsigned long tmp = 0; + int shift = 24, idx = 0; + +#if BITS_PER_LONG == 64 + shift = 56; +#endif + while (shift >= 0) { + tmp |= (unsigned long)str[idx++] << shift; + shift -= BITS_PER_BYTE; + } + return tmp; +} + /** * Copied from latest lib/llist.c * llist_for_each_entry_safe - iterate over some deleted entries of @@ -79,9 +94,9 @@ struct llist_node *llist_reverse_order(struct llist_node *head) /** * Copied from latest linux/list.h * list_last_entry - get the last element from a list - * @ptr: the list head to take the element from. - * @type: the type of the struct this is embedded in. - * @member: the name of the list_struct within the struct. + * @ptr: the list head to take the element from. + * @type: the type of the struct this is embedded in. + * @member: the name of the list_struct within the struct. * * Note that the list is expected to be non-empty. */ @@ -97,27 +112,31 @@ static inline unsigned long __reverse_ffs(unsigned long word) int num = 0; #if BITS_PER_LONG == 64 - if ((word & 0xffffffff) == 0) { + if ((word & 0xffffffff00000000UL) == 0) num += 32; + else word >>= 32; - } #endif - if ((word & 0xffff) == 0) { + if ((word & 0xffff0000) == 0) num += 16; + else word >>= 16; - } - if ((word & 0xff) == 0) { + + if ((word & 0xff00) == 0) num += 8; + else word >>= 8; - } + if ((word & 0xf0) == 0) num += 4; else word >>= 4; + if ((word & 0xc) == 0) num += 2; else word >>= 2; + if ((word & 0x2) == 0) num += 1; return num; @@ -126,140 +145,103 @@ static inline unsigned long __reverse_ffs(unsigned long word) /* * __find_rev_next(_zero)_bit is copied from lib/find_next_bit.c because * f2fs_set_bit makes MSB and LSB reversed in a byte. + * @size must be an integral multiple of BITS_PER_LONG.
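The reversed byte order can be exercised in userspace with a short program (a sketch; f2fs_set_bit() is re-implemented here in a form equivalent to its f2fs.h definition, which stores bit 0 in the MSB of byte 0):

	#include <stdio.h>

	static void f2fs_set_bit(unsigned int nr, unsigned char *addr)
	{
		addr[nr >> 3] |= 0x80 >> (nr & 7);	/* MSB-first within each byte */
	}

	/* userspace twin of __reverse_ulong(): load bytes most-significant first */
	static unsigned long reverse_ulong(const unsigned char *str)
	{
		unsigned long tmp = 0;
		int shift = 8 * (int)sizeof(unsigned long) - 8, idx = 0;

		while (shift >= 0) {
			tmp |= (unsigned long)str[idx++] << shift;
			shift -= 8;
		}
		return tmp;
	}

	int main(void)
	{
		unsigned char bitmap[sizeof(unsigned long)] = { 0 };

		f2fs_set_bit(0, bitmap);
		f2fs_set_bit(7, bitmap);
		/* byte 0 is now 1000 0001; the word's top bit is f2fs bit 0, so
		 * __reverse_ffs() on the fixed-up word reports bit 0 first */
		printf("byte 0 = 0x%02x, word = 0x%0*lx\n",
		       bitmap[0], (int)(2 * sizeof(unsigned long)),
		       reverse_ulong(bitmap));
		return 0;
	}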
* Example: - * LSB <--> MSB - * f2fs_set_bit(0, bitmap) => 0000 0001 - * f2fs_set_bit(7, bitmap) => 1000 0000 + * MSB <--> LSB + * f2fs_set_bit(0, bitmap) => 1000 0000 + * f2fs_set_bit(7, bitmap) => 0000 0001 */ static unsigned long __find_rev_next_bit(const unsigned long *addr, unsigned long size, unsigned long offset) { const unsigned long *p = addr + BIT_WORD(offset); - unsigned long result = offset & ~(BITS_PER_LONG - 1); + unsigned long result = size; unsigned long tmp; - unsigned long mask, submask; - unsigned long quot, rest; if (offset >= size) return size; - size -= result; + size -= (offset & ~(BITS_PER_LONG - 1)); offset %= BITS_PER_LONG; - if (!offset) - goto aligned; - - tmp = *(p++); - quot = (offset >> 3) << 3; - rest = offset & 0x7; - mask = ~0UL << quot; - submask = (unsigned char)(0xff << rest) >> rest; - submask <<= quot; - mask &= submask; - tmp &= mask; - if (size < BITS_PER_LONG) - goto found_first; - if (tmp) - goto found_middle; - - size -= BITS_PER_LONG; - result += BITS_PER_LONG; -aligned: - while (size & ~(BITS_PER_LONG-1)) { - tmp = *(p++); + + while (1) { + if (*p == 0) + goto pass; + + tmp = __reverse_ulong((unsigned char *)p); + + tmp &= ~0UL >> offset; + if (size < BITS_PER_LONG) + tmp &= (~0UL << (BITS_PER_LONG - size)); if (tmp) - goto found_middle; - result += BITS_PER_LONG; + goto found; +pass: + if (size <= BITS_PER_LONG) + break; size -= BITS_PER_LONG; + offset = 0; + p++; } - if (!size) - return result; - tmp = *p; -found_first: - tmp &= (~0UL >> (BITS_PER_LONG - size)); - if (tmp == 0UL) /* Are any bits set? */ - return result + size; /* Nope. */ -found_middle: - return result + __reverse_ffs(tmp); + return result; +found: + return result - size + __reverse_ffs(tmp); } static unsigned long __find_rev_next_zero_bit(const unsigned long *addr, unsigned long size, unsigned long offset) { const unsigned long *p = addr + BIT_WORD(offset); - unsigned long result = offset & ~(BITS_PER_LONG - 1); + unsigned long result = size; unsigned long tmp; - unsigned long mask, submask; - unsigned long quot, rest; if (offset >= size) return size; - size -= result; + size -= (offset & ~(BITS_PER_LONG - 1)); offset %= BITS_PER_LONG; - if (!offset) - goto aligned; - - tmp = *(p++); - quot = (offset >> 3) << 3; - rest = offset & 0x7; - mask = ~(~0UL << quot); - submask = (unsigned char)~((unsigned char)(0xff << rest) >> rest); - submask <<= quot; - mask += submask; - tmp |= mask; - if (size < BITS_PER_LONG) - goto found_first; - if (~tmp) - goto found_middle; - - size -= BITS_PER_LONG; - result += BITS_PER_LONG; -aligned: - while (size & ~(BITS_PER_LONG - 1)) { - tmp = *(p++); - if (~tmp) - goto found_middle; - result += BITS_PER_LONG; + + while (1) { + if (*p == ~0UL) + goto pass; + + tmp = __reverse_ulong((unsigned char *)p); + + if (offset) + tmp |= ~0UL << (BITS_PER_LONG - offset); + if (size < BITS_PER_LONG) + tmp |= ~0UL >> size; + if (tmp != ~0UL) + goto found; +pass: + if (size <= BITS_PER_LONG) + break; size -= BITS_PER_LONG; + offset = 0; + p++; } - if (!size) - return result; - tmp = *p; - -found_first: - tmp |= ~0UL << size; - if (tmp == ~0UL) /* Are any bits zero? */ - return result + size; /* Nope. 
*/ -found_middle: - return result + __reverse_ffz(tmp); + return result; +found: + return result - size + __reverse_ffz(tmp); } void register_inmem_page(struct inode *inode, struct page *page) { struct f2fs_inode_info *fi = F2FS_I(inode); struct inmem_pages *new; - int err; - SetPagePrivate(page); f2fs_trace_pid(page); + set_page_private(page, (unsigned long)ATOMIC_WRITTEN_PAGE); + SetPagePrivate(page); + new = f2fs_kmem_cache_alloc(inmem_entry_slab, GFP_NOFS); /* add atomic page indices to the list */ new->page = page; INIT_LIST_HEAD(&new->list); -retry: + /* increase reference count with clean state */ mutex_lock(&fi->inmem_lock); - err = radix_tree_insert(&fi->inmem_root, page->index, new); - if (err == -EEXIST) { - mutex_unlock(&fi->inmem_lock); - kmem_cache_free(inmem_entry_slab, new); - return; - } else if (err) { - mutex_unlock(&fi->inmem_lock); - goto retry; - } get_page(page); list_add_tail(&new->list, &fi->inmem_pages); inc_page_count(F2FS_I_SB(inode), F2FS_INMEM_PAGES); @@ -268,86 +250,206 @@ void register_inmem_page(struct inode *inode, struct page *page) trace_f2fs_register_inmem_page(page, INMEM); } -void commit_inmem_pages(struct inode *inode, bool abort) +static int __revoke_inmem_pages(struct inode *inode, + struct list_head *head, bool drop, bool recover) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct inmem_pages *cur, *tmp; + int err = 0; + + list_for_each_entry_safe(cur, tmp, head, list) { + struct page *page = cur->page; + + if (drop) + trace_f2fs_commit_inmem_page(page, INMEM_DROP); + + lock_page(page); + + if (recover) { + struct dnode_of_data dn; + struct node_info ni; + + trace_f2fs_commit_inmem_page(page, INMEM_REVOKE); + + set_new_dnode(&dn, inode, NULL, NULL, 0); + if (get_dnode_of_data(&dn, page->index, LOOKUP_NODE)) { + err = -EAGAIN; + goto next; + } + get_node_info(sbi, dn.nid, &ni); + f2fs_replace_block(sbi, &dn, dn.data_blkaddr, + cur->old_addr, ni.version, true, true); + f2fs_put_dnode(&dn); + } +next: + /* we don't need to invalidate this in the successful case */ + if (drop || recover) + ClearPageUptodate(page); + set_page_private(page, 0); + ClearPagePrivate(page); + f2fs_put_page(page, 1); + + list_del(&cur->list); + kmem_cache_free(inmem_entry_slab, cur); + dec_page_count(F2FS_I_SB(inode), F2FS_INMEM_PAGES); + } + return err; +} + +void drop_inmem_pages(struct inode *inode) +{ + struct f2fs_inode_info *fi = F2FS_I(inode); + + clear_inode_flag(inode, FI_ATOMIC_FILE); + + mutex_lock(&fi->inmem_lock); + __revoke_inmem_pages(inode, &fi->inmem_pages, true, false); + mutex_unlock(&fi->inmem_lock); +} + +static int __commit_inmem_pages(struct inode *inode, + struct list_head *revoke_list) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); struct f2fs_inode_info *fi = F2FS_I(inode); struct inmem_pages *cur, *tmp; - bool submit_bio = false; struct f2fs_io_info fio = { + .sbi = sbi, .type = DATA, .rw = WRITE_SYNC | REQ_PRIO, + .encrypted_page = NULL, }; + bool submit_bio = false; + int err = 0; - /* - * The abort is true only when f2fs_evict_inode is called. - * Basically, the f2fs_evict_inode doesn't produce any data writes, so - * that we don't need to call f2fs_balance_fs. - * Otherwise, f2fs_gc in f2fs_balance_fs can wait forever until this - * inode becomes free by iget_locked in f2fs_iget.
- */ - if (!abort) { - f2fs_balance_fs(sbi); - f2fs_lock_op(sbi); - } - - mutex_lock(&fi->inmem_lock); list_for_each_entry_safe(cur, tmp, &fi->inmem_pages, list) { - if (!abort) { - lock_page(cur->page); - if (cur->page->mapping == inode->i_mapping) { - f2fs_wait_on_page_writeback(cur->page, DATA); - if (clear_page_dirty_for_io(cur->page)) - inode_dec_dirty_pages(inode); - trace_f2fs_commit_inmem_page(cur->page, INMEM); - do_write_data_page(cur->page, &fio); - submit_bio = true; + struct page *page = cur->page; + + lock_page(page); + if (page->mapping == inode->i_mapping) { + trace_f2fs_commit_inmem_page(page, INMEM); + + set_page_dirty(page); + f2fs_wait_on_page_writeback(page, DATA, true); + if (clear_page_dirty_for_io(page)) + inode_dec_dirty_pages(inode); + + fio.page = page; + err = do_write_data_page(&fio); + if (err) { + unlock_page(page); + break; } - f2fs_put_page(cur->page, 1); - } else { - trace_f2fs_commit_inmem_page(cur->page, INMEM_DROP); - put_page(cur->page); + + /* record old blkaddr for revoking */ + cur->old_addr = fio.old_blkaddr; + + clear_cold_data(page); + submit_bio = true; } - radix_tree_delete(&fi->inmem_root, cur->page->index); - list_del(&cur->list); - kmem_cache_free(inmem_entry_slab, cur); - dec_page_count(F2FS_I_SB(inode), F2FS_INMEM_PAGES); + unlock_page(page); + list_move_tail(&cur->list, revoke_list); } - mutex_unlock(&fi->inmem_lock); - if (!abort) { - f2fs_unlock_op(sbi); - if (submit_bio) - f2fs_submit_merged_bio(sbi, DATA, WRITE); + if (submit_bio) + f2fs_submit_merged_bio_cond(sbi, inode, NULL, 0, DATA, WRITE); + + if (!err) + __revoke_inmem_pages(inode, revoke_list, false, false); + + return err; +} + +int commit_inmem_pages(struct inode *inode) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + struct f2fs_inode_info *fi = F2FS_I(inode); + struct list_head revoke_list; + int err; + + INIT_LIST_HEAD(&revoke_list); + f2fs_balance_fs(sbi, true); + f2fs_lock_op(sbi); + + mutex_lock(&fi->inmem_lock); + err = __commit_inmem_pages(inode, &revoke_list); + if (err) { + int ret; + /* + * Try to revoke all the committed pages. This can still fail, + * e.g. due to lack of memory; in that case EAGAIN is returned, + * meaning the transaction is no longer intact and the caller + * should use a journal to do the recovery, or rewrite and + * commit the last transaction. For any other error number, + * the revoke was completed by the filesystem itself. + */ + ret = __revoke_inmem_pages(inode, &revoke_list, false, true); + if (ret) + err = ret; + + /* drop all uncommitted pages */ + __revoke_inmem_pages(inode, &fi->inmem_pages, true, false); } + mutex_unlock(&fi->inmem_lock); + + f2fs_unlock_op(sbi); + return err; } /* * This function balances dirty node and dentry pages. * In addition, it controls garbage collection. */ -void f2fs_balance_fs(struct f2fs_sb_info *sbi) +void f2fs_balance_fs(struct f2fs_sb_info *sbi, bool need) { + if (!need) + return; + + /* f2fs_balance_fs_bg() may still be pending */ + if (excess_cached_nats(sbi)) + f2fs_balance_fs_bg(sbi); + /* * We should do GC or end up with checkpoint, if there are so many dirty * dir/node pages without enough free segments.
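The atomic-write path above is a small two-phase transaction: register_inmem_page() tags each cached page by storing ATOMIC_WRITTEN_PAGE in page->private, __commit_inmem_pages() moves each page onto revoke_list as it is written and remembers the pre-write address in cur->old_addr, and on failure __revoke_inmem_pages() walks revoke_list restoring the old addresses via f2fs_replace_block(). A stripped-down user-space model of that control flow (all names below are illustrative, not kernel API):

#include <stdio.h>

#define NPAGES 4

/* per-page state: current "on-disk" address plus the saved old one */
static int blkaddr[NPAGES];
static int old_addr[NPAGES];

/* simulate writing page i to a freshly allocated block */
static int write_page(int i, int fail_at)
{
	if (i == fail_at)
		return -1;                /* injected I/O error */
	old_addr[i] = blkaddr[i];         /* record old blkaddr for revoking */
	blkaddr[i] = 100 + i;             /* "allocate" a new address */
	return 0;
}

static int commit(int fail_at)
{
	int done = 0;

	while (done < NPAGES && write_page(done, fail_at) == 0)
		done++;                   /* page joins the revoke list */

	if (done < NPAGES) {              /* failure: roll back written pages */
		while (done-- > 0)
			blkaddr[done] = old_addr[done];
		return -1;
	}
	return 0;
}

int main(void)
{
	if (commit(2) != 0)
		printf("commit failed; previously written pages revoked\n");
	return 0;
}

The kernel version has the extra wrinkle the rewritten comment describes: the rollback itself can fail, in which case EAGAIN is surfaced and recovery falls to the application's own journal.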
*/ if (has_not_enough_free_secs(sbi, 0)) { mutex_lock(&sbi->gc_mutex); - f2fs_gc(sbi); + f2fs_gc(sbi, false); } } void f2fs_balance_fs_bg(struct f2fs_sb_info *sbi) { /* try to shrink extent cache when there is no enough memory */ - f2fs_shrink_extent_tree(sbi, EXTENT_CACHE_SHRINK_NUMBER); + if (!available_free_memory(sbi, EXTENT_CACHE)) + f2fs_shrink_extent_tree(sbi, EXTENT_CACHE_SHRINK_NUMBER); + + /* check the # of cached NAT entries */ + if (!available_free_memory(sbi, NAT_ENTRIES)) + try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK); + + if (!available_free_memory(sbi, FREE_NIDS)) + try_to_free_nids(sbi, MAX_FREE_NIDS); + else + build_free_nids(sbi); - /* check the # of cached NAT entries and prefree segments */ - if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK) || + /* checkpoint is the only way to shrink partial cached entries */ + if (!available_free_memory(sbi, NAT_ENTRIES) || + !available_free_memory(sbi, INO_ENTRIES) || excess_prefree_segs(sbi) || - !available_free_memory(sbi, INO_ENTRIES)) + excess_dirty_nats(sbi) || + (is_idle(sbi) && f2fs_time_over(sbi, CP_TIME))) { + if (test_opt(sbi, DATA_FLUSH)) { + struct blk_plug plug; + + blk_start_plug(&plug); + sync_dirty_inodes(sbi, FILE_INODE); + blk_finish_plug(&plug); + } f2fs_sync_fs(sbi->sb, true); + stat_inc_bg_cp_count(sbi->stat_info); + } } struct __submit_bio_ret { @@ -387,10 +489,12 @@ static int issue_flush_thread(void *data) return 0; if (!llist_empty(&fcc->issue_list)) { - struct bio *bio = bio_alloc(GFP_NOIO, 0); + struct bio *bio; struct flush_cmd *cmd, *next; int ret; + bio = f2fs_bio_alloc(0); + fcc->dispatch_list = llist_del_all(&fcc->issue_list); fcc->dispatch_list = llist_reverse_order(fcc->dispatch_list); @@ -422,17 +526,28 @@ int f2fs_issue_flush(struct f2fs_sb_info *sbi) if (test_opt(sbi, NOBARRIER)) return 0; - if (!test_opt(sbi, FLUSH_MERGE)) - return blkdev_issue_flush(sbi->sb->s_bdev, GFP_KERNEL, NULL); + if (!test_opt(sbi, FLUSH_MERGE) || !atomic_read(&fcc->submit_flush)) { + struct bio *bio = f2fs_bio_alloc(0); + int ret; + + atomic_inc(&fcc->submit_flush); + bio->bi_bdev = sbi->sb->s_bdev; + ret = __submit_bio_wait(WRITE_FLUSH, bio); + atomic_dec(&fcc->submit_flush); + bio_put(bio); + return ret; + } init_completion(&cmd.wait); + atomic_inc(&fcc->submit_flush); llist_add(&cmd.llnode, &fcc->issue_list); if (!fcc->dispatch_list) wake_up(&fcc->flush_wait_queue); wait_for_completion(&cmd.wait); + atomic_dec(&fcc->submit_flush); return cmd.ret; } @@ -446,6 +561,7 @@ int create_flush_cmd_control(struct f2fs_sb_info *sbi) fcc = kzalloc(sizeof(struct flush_cmd_control), GFP_KERNEL); if (!fcc) return -ENOMEM; + atomic_set(&fcc->submit_flush, 0); init_waitqueue_head(&fcc->flush_wait_queue); init_llist_head(&fcc->issue_list); SM_I(sbi)->cmd_control_info = fcc; @@ -552,22 +668,46 @@ static int f2fs_issue_discard(struct f2fs_sb_info *sbi, { sector_t start = SECTOR_FROM_BLOCK(blkstart); sector_t len = SECTOR_FROM_BLOCK(blklen); + struct seg_entry *se; + unsigned int offset; + block_t i; + + for (i = blkstart; i < blkstart + blklen; i++) { + se = get_seg_entry(sbi, GET_SEGNO(sbi, i)); + offset = GET_BLKOFF_FROM_SEG0(sbi, i); + + if (!f2fs_test_and_set_bit(offset, se->discard_map)) + sbi->discard_blks--; + } trace_f2fs_issue_discard(sbi->sb, blkstart, blklen); return blkdev_issue_discard(sbi->sb->s_bdev, start, len, GFP_NOFS, 0); } -void discard_next_dnode(struct f2fs_sb_info *sbi, block_t blkaddr) +bool discard_next_dnode(struct f2fs_sb_info *sbi, block_t blkaddr) { - if (f2fs_issue_discard(sbi, blkaddr, 1)) { - struct page 
*page = grab_meta_page(sbi, blkaddr); - /* zero-filled page */ - set_page_dirty(page); - f2fs_put_page(page, 1); + int err = -EOPNOTSUPP; + + if (test_opt(sbi, DISCARD)) { + struct seg_entry *se = get_seg_entry(sbi, + GET_SEGNO(sbi, blkaddr)); + unsigned int offset = GET_BLKOFF_FROM_SEG0(sbi, blkaddr); + + if (f2fs_test_bit(offset, se->discard_map)) + return false; + + err = f2fs_issue_discard(sbi, blkaddr, 1); } + + if (err) { + update_meta_page(sbi, NULL, blkaddr); + return true; + } + return false; } static void __add_discard_entry(struct f2fs_sb_info *sbi, - struct cp_control *cpc, unsigned int start, unsigned int end) + struct cp_control *cpc, struct seg_entry *se, + unsigned int start, unsigned int end) { struct list_head *head = &SM_I(sbi)->discard_list; struct discard_entry *new, *last; @@ -588,7 +728,6 @@ static void __add_discard_entry(struct f2fs_sb_info *sbi, list_add_tail(&new->list, head); done: SM_I(sbi)->nr_discards += end - start; - cpc->trimmed += end - start; } static void add_discard_addrs(struct f2fs_sb_info *sbi, struct cp_control *cpc) @@ -598,41 +737,24 @@ static void add_discard_addrs(struct f2fs_sb_info *sbi, struct cp_control *cpc) struct seg_entry *se = get_seg_entry(sbi, cpc->trim_start); unsigned long *cur_map = (unsigned long *)se->cur_valid_map; unsigned long *ckpt_map = (unsigned long *)se->ckpt_valid_map; + unsigned long *discard_map = (unsigned long *)se->discard_map; unsigned long *dmap = SIT_I(sbi)->tmp_map; unsigned int start = 0, end = -1; bool force = (cpc->reason == CP_DISCARD); int i; - if (!force && (!test_opt(sbi, DISCARD) || - SM_I(sbi)->nr_discards >= SM_I(sbi)->max_discards)) + if (se->valid_blocks == max_blocks) return; - if (force && !se->valid_blocks) { - struct dirty_seglist_info *dirty_i = DIRTY_I(sbi); - /* - * if this segment is registered in the prefree list, then - * we should skip adding a discard candidate, and let the - * checkpoint do that later. - */ - mutex_lock(&dirty_i->seglist_lock); - if (test_bit(cpc->trim_start, dirty_i->dirty_segmap[PRE])) { - mutex_unlock(&dirty_i->seglist_lock); - cpc->trimmed += sbi->blocks_per_seg; + if (!force) { + if (!test_opt(sbi, DISCARD) || !se->valid_blocks || + SM_I(sbi)->nr_discards >= SM_I(sbi)->max_discards) return; - } - mutex_unlock(&dirty_i->seglist_lock); - - __add_discard_entry(sbi, cpc, 0, sbi->blocks_per_seg); - return; } - /* zero block will be discarded through the prefree list */ - if (!se->valid_blocks || se->valid_blocks == max_blocks) - return; - /* SIT_VBLOCK_MAP_SIZE should be multiple of sizeof(unsigned long) */ for (i = 0; i < entries; i++) - dmap[i] = force ? ~ckpt_map[i] : + dmap[i] = force ? 
~ckpt_map[i] & ~discard_map[i] : (cur_map[i] ^ ckpt_map[i]) & ckpt_map[i]; while (force || SM_I(sbi)->nr_discards <= SM_I(sbi)->max_discards) { @@ -641,11 +763,11 @@ static void add_discard_addrs(struct f2fs_sb_info *sbi, struct cp_control *cpc) break; end = __find_rev_next_zero_bit(dmap, max_blocks, start + 1); - - if (force && end - start < cpc->trim_minlen) + if (force && start && end != max_blocks + && (end - start) < cpc->trim_minlen) continue; - __add_discard_entry(sbi, cpc, start, end); + __add_discard_entry(sbi, cpc, se, start, end); } } @@ -675,13 +797,15 @@ static void set_prefree_as_free_segments(struct f2fs_sb_info *sbi) mutex_unlock(&dirty_i->seglist_lock); } -void clear_prefree_segments(struct f2fs_sb_info *sbi) +void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc) { struct list_head *head = &(SM_I(sbi)->discard_list); struct discard_entry *entry, *this; struct dirty_seglist_info *dirty_i = DIRTY_I(sbi); unsigned long *prefree_map = dirty_i->dirty_segmap[PRE]; unsigned int start = 0, end = -1; + unsigned int secno, start_segno; + bool force = (cpc->reason == CP_DISCARD); mutex_lock(&dirty_i->seglist_lock); @@ -698,17 +822,35 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi) dirty_i->nr_dirty[PRE] -= end - start; - if (!test_opt(sbi, DISCARD)) + if (force || !test_opt(sbi, DISCARD)) continue; - f2fs_issue_discard(sbi, START_BLOCK(sbi, start), + if (!test_opt(sbi, LFS) || sbi->segs_per_sec == 1) { + f2fs_issue_discard(sbi, START_BLOCK(sbi, start), (end - start) << sbi->log_blocks_per_seg); + continue; + } +next: + secno = GET_SECNO(sbi, start); + start_segno = secno * sbi->segs_per_sec; + if (!IS_CURSEC(sbi, secno) && + !get_valid_blocks(sbi, start, sbi->segs_per_sec)) + f2fs_issue_discard(sbi, START_BLOCK(sbi, start_segno), + sbi->segs_per_sec << sbi->log_blocks_per_seg); + + start = start_segno + sbi->segs_per_sec; + if (start < end) + goto next; } mutex_unlock(&dirty_i->seglist_lock); /* send small discards */ list_for_each_entry_safe(entry, this, head, list) { + if (force && entry->len < cpc->trim_minlen) + goto skip; f2fs_issue_discard(sbi, entry->blkaddr, entry->len); + cpc->trimmed += entry->len; +skip: list_del(&entry->list); SM_I(sbi)->nr_discards -= entry->len; kmem_cache_free(discard_entry_slab, entry); @@ -759,9 +901,13 @@ static void update_sit_entry(struct f2fs_sb_info *sbi, block_t blkaddr, int del) if (del > 0) { if (f2fs_test_and_set_bit(offset, se->cur_valid_map)) f2fs_bug_on(sbi, 1); + if (!f2fs_test_and_set_bit(offset, se->discard_map)) + sbi->discard_blks--; } else { if (!f2fs_test_and_clear_bit(offset, se->cur_valid_map)) f2fs_bug_on(sbi, 1); + if (f2fs_test_and_clear_bit(offset, se->discard_map)) + sbi->discard_blks++; } if (!f2fs_test_bit(offset, se->ckpt_valid_map)) se->ckpt_valid_blocks += del; @@ -805,6 +951,30 @@ void invalidate_blocks(struct f2fs_sb_info *sbi, block_t addr) mutex_unlock(&sit_i->sentry_lock); } +bool is_checkpointed_data(struct f2fs_sb_info *sbi, block_t blkaddr) +{ + struct sit_info *sit_i = SIT_I(sbi); + unsigned int segno, offset; + struct seg_entry *se; + bool is_cp = false; + + if (blkaddr == NEW_ADDR || blkaddr == NULL_ADDR) + return true; + + mutex_lock(&sit_i->sentry_lock); + + segno = GET_SEGNO(sbi, blkaddr); + se = get_seg_entry(sbi, segno); + offset = GET_BLKOFF_FROM_SEG0(sbi, blkaddr); + + if (f2fs_test_bit(offset, se->ckpt_valid_map)) + is_cp = true; + + mutex_unlock(&sit_i->sentry_lock); + + return is_cp; +} + /* * This function should be resided under the curseg_mutex lock */ @@ -837,12 
+1007,12 @@ int npages_for_summary_flush(struct f2fs_sb_info *sbi, bool for_ra) } } - sum_in_page = (PAGE_CACHE_SIZE - 2 * SUM_JOURNAL_SIZE - + sum_in_page = (PAGE_SIZE - 2 * SUM_JOURNAL_SIZE - SUM_FOOTER_SIZE) / SUMMARY_SIZE; if (valid_sum_count <= sum_in_page) return 1; else if ((valid_sum_count - sum_in_page) <= - (PAGE_CACHE_SIZE - SUM_FOOTER_SIZE) / SUMMARY_SIZE) + (PAGE_SIZE - SUM_FOOTER_SIZE) / SUMMARY_SIZE) return 2; return 3; } @@ -855,12 +1025,46 @@ struct page *get_sum_page(struct f2fs_sb_info *sbi, unsigned int segno) return get_meta_page(sbi, GET_SUM_BLOCK(sbi, segno)); } +void update_meta_page(struct f2fs_sb_info *sbi, void *src, block_t blk_addr) +{ + struct page *page = grab_meta_page(sbi, blk_addr); + void *dst = page_address(page); + + if (src) + memcpy(dst, src, PAGE_SIZE); + else + memset(dst, 0, PAGE_SIZE); + set_page_dirty(page); + f2fs_put_page(page, 1); +} + static void write_sum_page(struct f2fs_sb_info *sbi, struct f2fs_summary_block *sum_blk, block_t blk_addr) { + update_meta_page(sbi, (void *)sum_blk, blk_addr); +} + +static void write_current_sum_page(struct f2fs_sb_info *sbi, + int type, block_t blk_addr) +{ + struct curseg_info *curseg = CURSEG_I(sbi, type); struct page *page = grab_meta_page(sbi, blk_addr); - void *kaddr = page_address(page); - memcpy(kaddr, sum_blk, PAGE_CACHE_SIZE); + struct f2fs_summary_block *src = curseg->sum_blk; + struct f2fs_summary_block *dst; + + dst = (struct f2fs_summary_block *)page_address(page); + + mutex_lock(&curseg->curseg_mutex); + + down_read(&curseg->journal_rwsem); + memcpy(&dst->journal, curseg->journal, SUM_JOURNAL_SIZE); + up_read(&curseg->journal_rwsem); + + memcpy(dst->entries, src->entries, SUM_ENTRY_SIZE); + memcpy(&dst->footer, &src->footer, SUM_FOOTER_SIZE); + + mutex_unlock(&curseg->curseg_mutex); + set_page_dirty(page); f2fs_put_page(page, 1); } @@ -897,9 +1101,8 @@ static void get_new_segment(struct f2fs_sb_info *sbi, if (!new_sec && ((*newseg + 1) % sbi->segs_per_sec)) { segno = find_next_zero_bit(free_i->free_segmap, - MAIN_SEGS(sbi), *newseg + 1); - if (segno - *newseg < sbi->segs_per_sec - - (*newseg % sbi->segs_per_sec)) + (hint + 1) * sbi->segs_per_sec, *newseg + 1); + if (segno < (hint + 1) * sbi->segs_per_sec) goto got_it; } find_other_zone: @@ -1131,6 +1334,9 @@ void allocate_new_segments(struct f2fs_sb_info *sbi) { int i; + if (test_opt(sbi, LFS)) + return; + for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_DATA; i++) __allocate_new_segments(sbi, i); } @@ -1145,9 +1351,9 @@ int f2fs_trim_fs(struct f2fs_sb_info *sbi, struct fstrim_range *range) __u64 end = start + F2FS_BYTES_TO_BLK(range->len) - 1; unsigned int start_segno, end_segno; struct cp_control cpc; + int err = 0; - if (range->minlen > SEGMENT_SIZE(sbi) || start >= MAX_BLKADDR(sbi) || - range->len < sbi->blocksize) + if (start >= MAX_BLKADDR(sbi) || range->len < sbi->blocksize) return -EINVAL; cpc.trimmed = 0; @@ -1159,22 +1365,29 @@ int f2fs_trim_fs(struct f2fs_sb_info *sbi, struct fstrim_range *range) end_segno = (end >= MAX_BLKADDR(sbi)) ? 
MAIN_SEGS(sbi) - 1 : GET_SEGNO(sbi, end); cpc.reason = CP_DISCARD; - cpc.trim_minlen = F2FS_BYTES_TO_BLK(range->minlen); + cpc.trim_minlen = max_t(__u64, 1, F2FS_BYTES_TO_BLK(range->minlen)); /* do checkpoint to issue discard commands safely */ for (; start_segno <= end_segno; start_segno = cpc.trim_end + 1) { cpc.trim_start = start_segno; - cpc.trim_end = min_t(unsigned int, rounddown(start_segno + + + if (sbi->discard_blks == 0) + break; + else if (sbi->discard_blks < BATCHED_TRIM_BLOCKS(sbi)) + cpc.trim_end = end_segno; + else + cpc.trim_end = min_t(unsigned int, + rounddown(start_segno + BATCHED_TRIM_SEGMENTS(sbi), sbi->segs_per_sec) - 1, end_segno); mutex_lock(&sbi->gc_mutex); - write_checkpoint(sbi, &cpc); + err = write_checkpoint(sbi, &cpc); mutex_unlock(&sbi->gc_mutex); } out: range->len = F2FS_BLK_TO_BYTES(cpc.trimmed); - return 0; + return err; } static bool __has_curseg_space(struct f2fs_sb_info *sbi, int type) @@ -1260,7 +1473,8 @@ void allocate_data_block(struct f2fs_sb_info *sbi, struct page *page, mutex_lock(&sit_i->sentry_lock); /* direct_io'ed data is aligned to the segment for better performance */ - if (direct_io && curseg->next_blkoff) + if (direct_io && curseg->next_blkoff && + !has_not_enough_free_secs(sbi, 0)) __allocate_new_segments(sbi, type); *new_blkaddr = NEXT_FREE_BLKADDR(sbi, curseg); @@ -1292,84 +1506,105 @@ void allocate_data_block(struct f2fs_sb_info *sbi, struct page *page, mutex_unlock(&curseg->curseg_mutex); } -static void do_write_page(struct f2fs_sb_info *sbi, struct page *page, - struct f2fs_summary *sum, - struct f2fs_io_info *fio) +static void do_write_page(struct f2fs_summary *sum, struct f2fs_io_info *fio) { - int type = __get_segment_type(page, fio->type); + int type = __get_segment_type(fio->page, fio->type); + + if (fio->type == NODE || fio->type == DATA) + mutex_lock(&fio->sbi->wio_mutex[fio->type]); - allocate_data_block(sbi, page, fio->blk_addr, &fio->blk_addr, sum, type); + allocate_data_block(fio->sbi, fio->page, fio->old_blkaddr, + &fio->new_blkaddr, sum, type); /* writeout dirty page into bdev */ - f2fs_submit_page_mbio(sbi, page, fio); + f2fs_submit_page_mbio(fio); + + if (fio->type == NODE || fio->type == DATA) + mutex_unlock(&fio->sbi->wio_mutex[fio->type]); } void write_meta_page(struct f2fs_sb_info *sbi, struct page *page) { struct f2fs_io_info fio = { + .sbi = sbi, .type = META, .rw = WRITE_SYNC | REQ_META | REQ_PRIO, - .blk_addr = page->index, + .old_blkaddr = page->index, + .new_blkaddr = page->index, + .page = page, + .encrypted_page = NULL, }; + if (unlikely(page->index >= MAIN_BLKADDR(sbi))) + fio.rw &= ~REQ_META; + set_page_writeback(page); - f2fs_submit_page_mbio(sbi, page, &fio); + f2fs_submit_page_mbio(&fio); } -void write_node_page(struct f2fs_sb_info *sbi, struct page *page, - unsigned int nid, struct f2fs_io_info *fio) +void write_node_page(unsigned int nid, struct f2fs_io_info *fio) { struct f2fs_summary sum; + set_summary(&sum, nid, 0, 0); - do_write_page(sbi, page, &sum, fio); + do_write_page(&sum, fio); } -void write_data_page(struct page *page, struct dnode_of_data *dn, - struct f2fs_io_info *fio) +void write_data_page(struct dnode_of_data *dn, struct f2fs_io_info *fio) { - struct f2fs_sb_info *sbi = F2FS_I_SB(dn->inode); + struct f2fs_sb_info *sbi = fio->sbi; struct f2fs_summary sum; struct node_info ni; f2fs_bug_on(sbi, dn->data_blkaddr == NULL_ADDR); get_node_info(sbi, dn->nid, &ni); set_summary(&sum, dn->nid, dn->ofs_in_node, ni.version); - do_write_page(sbi, page, &sum, fio); - dn->data_blkaddr = 
fio->blk_addr; + do_write_page(&sum, fio); + f2fs_update_data_blkaddr(dn, fio->new_blkaddr); } -void rewrite_data_page(struct page *page, struct f2fs_io_info *fio) +void rewrite_data_page(struct f2fs_io_info *fio) { - stat_inc_inplace_blocks(F2FS_P_SB(page)); - f2fs_submit_page_mbio(F2FS_P_SB(page), page, fio); + fio->new_blkaddr = fio->old_blkaddr; + stat_inc_inplace_blocks(fio->sbi); + f2fs_submit_page_mbio(fio); } -void recover_data_page(struct f2fs_sb_info *sbi, - struct page *page, struct f2fs_summary *sum, - block_t old_blkaddr, block_t new_blkaddr) +void __f2fs_replace_block(struct f2fs_sb_info *sbi, struct f2fs_summary *sum, + block_t old_blkaddr, block_t new_blkaddr, + bool recover_curseg, bool recover_newaddr) { struct sit_info *sit_i = SIT_I(sbi); struct curseg_info *curseg; unsigned int segno, old_cursegno; struct seg_entry *se; int type; + unsigned short old_blkoff; segno = GET_SEGNO(sbi, new_blkaddr); se = get_seg_entry(sbi, segno); type = se->type; - if (se->valid_blocks == 0 && !IS_CURSEG(sbi, segno)) { - if (old_blkaddr == NULL_ADDR) - type = CURSEG_COLD_DATA; - else + if (!recover_curseg) { + /* for recovery flow */ + if (se->valid_blocks == 0 && !IS_CURSEG(sbi, segno)) { + if (old_blkaddr == NULL_ADDR) + type = CURSEG_COLD_DATA; + else + type = CURSEG_WARM_DATA; + } + } else { + if (!IS_CURSEG(sbi, segno)) type = CURSEG_WARM_DATA; } + curseg = CURSEG_I(sbi, type); mutex_lock(&curseg->curseg_mutex); mutex_lock(&sit_i->sentry_lock); old_cursegno = curseg->segno; + old_blkoff = curseg->next_blkoff; /* change the current segment */ if (segno != curseg->segno) { @@ -1380,46 +1615,72 @@ void recover_data_page(struct f2fs_sb_info *sbi, curseg->next_blkoff = GET_BLKOFF_FROM_SEG0(sbi, new_blkaddr); __add_sum_entry(sbi, type, sum); - refresh_sit_entry(sbi, old_blkaddr, new_blkaddr); + if (!recover_curseg || recover_newaddr) + update_sit_entry(sbi, new_blkaddr, 1); + if (GET_SEGNO(sbi, old_blkaddr) != NULL_SEGNO) + update_sit_entry(sbi, old_blkaddr, -1); + + locate_dirty_segment(sbi, GET_SEGNO(sbi, old_blkaddr)); + locate_dirty_segment(sbi, GET_SEGNO(sbi, new_blkaddr)); + locate_dirty_segment(sbi, old_cursegno); + if (recover_curseg) { + if (old_cursegno != curseg->segno) { + curseg->next_segno = old_cursegno; + change_curseg(sbi, type, true); + } + curseg->next_blkoff = old_blkoff; + } + mutex_unlock(&sit_i->sentry_lock); mutex_unlock(&curseg->curseg_mutex); } -static inline bool is_merged_page(struct f2fs_sb_info *sbi, - struct page *page, enum page_type type) +void f2fs_replace_block(struct f2fs_sb_info *sbi, struct dnode_of_data *dn, + block_t old_addr, block_t new_addr, + unsigned char version, bool recover_curseg, + bool recover_newaddr) { - enum page_type btype = PAGE_TYPE_OF_BIO(type); - struct f2fs_bio_info *io = &sbi->write_io[btype]; - struct bio_vec *bvec; - int i; + struct f2fs_summary sum; - down_read(&io->io_rwsem); - if (!io->bio) - goto out; + set_summary(&sum, dn->nid, dn->ofs_in_node, version); - __bio_for_each_segment(bvec, io->bio, i, 0) { - if (page == bvec->bv_page) { - up_read(&io->io_rwsem); - return true; - } - } + __f2fs_replace_block(sbi, &sum, old_addr, new_addr, + recover_curseg, recover_newaddr); -out: - up_read(&io->io_rwsem); - return false; + f2fs_update_data_blkaddr(dn, new_addr); } void f2fs_wait_on_page_writeback(struct page *page, - enum page_type type) + enum page_type type, bool ordered) { if (PageWriteback(page)) { struct f2fs_sb_info *sbi = F2FS_P_SB(page); - if (is_merged_page(sbi, page, type)) - f2fs_submit_merged_bio(sbi, type, WRITE); - 
wait_on_page_writeback(page); + f2fs_submit_merged_bio_cond(sbi, NULL, page, 0, type, WRITE); + if (ordered) + wait_on_page_writeback(page); + else + /* wait_for_stable_page(page) doesn't support */ + wait_on_page_writeback(page); + } +} + +void f2fs_wait_on_encrypted_page_writeback(struct f2fs_sb_info *sbi, + block_t blkaddr) +{ + struct page *cpage; + + if (blkaddr == NEW_ADDR) + return; + + f2fs_bug_on(sbi, blkaddr == NULL_ADDR); + + cpage = find_lock_page(META_MAPPING(sbi), blkaddr); + if (cpage) { + f2fs_wait_on_page_writeback(cpage, DATA, true); + f2fs_put_page(cpage, 1); } } @@ -1439,12 +1700,11 @@ static int read_compacted_summaries(struct f2fs_sb_info *sbi) /* Step 1: restore nat cache */ seg_i = CURSEG_I(sbi, CURSEG_HOT_DATA); - memcpy(&seg_i->sum_blk->n_nats, kaddr, SUM_JOURNAL_SIZE); + memcpy(seg_i->journal, kaddr, SUM_JOURNAL_SIZE); /* Step 2: restore sit cache */ seg_i = CURSEG_I(sbi, CURSEG_COLD_DATA); - memcpy(&seg_i->sum_blk->n_sits, kaddr + SUM_JOURNAL_SIZE, - SUM_JOURNAL_SIZE); + memcpy(seg_i->journal, kaddr + SUM_JOURNAL_SIZE, SUM_JOURNAL_SIZE); offset = 2 * SUM_JOURNAL_SIZE; /* Step 3: restore summary entries */ @@ -1468,7 +1728,7 @@ static int read_compacted_summaries(struct f2fs_sb_info *sbi) s = (struct f2fs_summary *)(kaddr + offset); seg_i->sum_blk->entries[j] = *s; offset += SUMMARY_SIZE; - if (offset + SUMMARY_SIZE <= PAGE_CACHE_SIZE - + if (offset + SUMMARY_SIZE <= PAGE_SIZE - SUM_FOOTER_SIZE) continue; @@ -1540,7 +1800,14 @@ static int read_normal_summaries(struct f2fs_sb_info *sbi, int type) /* set uncompleted segment to curseg */ curseg = CURSEG_I(sbi, type); mutex_lock(&curseg->curseg_mutex); - memcpy(curseg->sum_blk, sum, PAGE_CACHE_SIZE); + + /* update journal info */ + down_write(&curseg->journal_rwsem); + memcpy(curseg->journal, &sum->journal, SUM_JOURNAL_SIZE); + up_write(&curseg->journal_rwsem); + + memcpy(curseg->sum_blk->entries, sum->entries, SUM_ENTRY_SIZE); + memcpy(&curseg->sum_blk->footer, &sum->footer, SUM_FOOTER_SIZE); curseg->next_segno = segno; reset_curseg(sbi, type, 0); curseg->alloc_type = ckpt->alloc_type[type]; @@ -1560,7 +1827,7 @@ static int restore_curseg_summaries(struct f2fs_sb_info *sbi) if (npages >= 2) ra_meta_pages(sbi, start_sum_block(sbi), npages, - META_CP); + META_CP, true); /* restore for compacted data summary */ if (read_compacted_summaries(sbi)) @@ -1570,7 +1837,7 @@ static int restore_curseg_summaries(struct f2fs_sb_info *sbi) if (__exist_node_summaries(sbi)) ra_meta_pages(sbi, sum_blk_addr(sbi, NR_CURSEG_TYPE, type), - NR_CURSEG_TYPE - type, META_CP); + NR_CURSEG_TYPE - type, META_CP, true); for (; type <= CURSEG_COLD_NODE; type++) { err = read_normal_summaries(sbi, type); @@ -1595,13 +1862,12 @@ static void write_compacted_summaries(struct f2fs_sb_info *sbi, block_t blkaddr) /* Step 1: write nat cache */ seg_i = CURSEG_I(sbi, CURSEG_HOT_DATA); - memcpy(kaddr, &seg_i->sum_blk->n_nats, SUM_JOURNAL_SIZE); + memcpy(kaddr, seg_i->journal, SUM_JOURNAL_SIZE); written_size += SUM_JOURNAL_SIZE; /* Step 2: write sit cache */ seg_i = CURSEG_I(sbi, CURSEG_COLD_DATA); - memcpy(kaddr + written_size, &seg_i->sum_blk->n_sits, - SUM_JOURNAL_SIZE); + memcpy(kaddr + written_size, seg_i->journal, SUM_JOURNAL_SIZE); written_size += SUM_JOURNAL_SIZE; /* Step 3: write summary entries */ @@ -1623,7 +1889,7 @@ static void write_compacted_summaries(struct f2fs_sb_info *sbi, block_t blkaddr) *summary = seg_i->sum_blk->entries[j]; written_size += SUMMARY_SIZE; - if (written_size + SUMMARY_SIZE <= PAGE_CACHE_SIZE - + if (written_size + 
SUMMARY_SIZE <= PAGE_SIZE - SUM_FOOTER_SIZE) continue; @@ -1647,12 +1913,8 @@ static void write_normal_summaries(struct f2fs_sb_info *sbi, else end = type + NR_CURSEG_NODE_TYPE; - for (i = type; i < end; i++) { - struct curseg_info *sum = CURSEG_I(sbi, i); - mutex_lock(&sum->curseg_mutex); - write_sum_page(sbi, sum->sum_blk, blkaddr + (i - type)); - mutex_unlock(&sum->curseg_mutex); - } + for (i = type; i < end; i++) + write_current_sum_page(sbi, i, blkaddr + (i - type)); } void write_data_summaries(struct f2fs_sb_info *sbi, block_t start_blk) @@ -1668,24 +1930,24 @@ void write_node_summaries(struct f2fs_sb_info *sbi, block_t start_blk) write_normal_summaries(sbi, start_blk, CURSEG_HOT_NODE); } -int lookup_journal_in_cursum(struct f2fs_summary_block *sum, int type, +int lookup_journal_in_cursum(struct f2fs_journal *journal, int type, unsigned int val, int alloc) { int i; if (type == NAT_JOURNAL) { - for (i = 0; i < nats_in_cursum(sum); i++) { - if (le32_to_cpu(nid_in_journal(sum, i)) == val) + for (i = 0; i < nats_in_cursum(journal); i++) { + if (le32_to_cpu(nid_in_journal(journal, i)) == val) return i; } - if (alloc && nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES) - return update_nats_in_cursum(sum, 1); + if (alloc && __has_cursum_space(journal, 1, NAT_JOURNAL)) + return update_nats_in_cursum(journal, 1); } else if (type == SIT_JOURNAL) { - for (i = 0; i < sits_in_cursum(sum); i++) - if (le32_to_cpu(segno_in_journal(sum, i)) == val) + for (i = 0; i < sits_in_cursum(journal); i++) + if (le32_to_cpu(segno_in_journal(journal, i)) == val) return i; - if (alloc && sits_in_cursum(sum) < SIT_JOURNAL_ENTRIES) - return update_sits_in_cursum(sum, 1); + if (alloc && __has_cursum_space(journal, 1, SIT_JOURNAL)) + return update_sits_in_cursum(journal, 1); } return -1; } @@ -1714,7 +1976,7 @@ static struct page *get_next_sit_page(struct f2fs_sb_info *sbi, src_addr = page_address(src_page); dst_addr = page_address(dst_page); - memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE); + memcpy(dst_addr, src_addr, PAGE_SIZE); set_page_dirty(dst_page); f2fs_put_page(src_page, 1); @@ -1727,7 +1989,7 @@ static struct page *get_next_sit_page(struct f2fs_sb_info *sbi, static struct sit_entry_set *grab_sit_entry_set(void) { struct sit_entry_set *ses = - f2fs_kmem_cache_alloc(sit_entry_set_slab, GFP_ATOMIC); + f2fs_kmem_cache_alloc(sit_entry_set_slab, GFP_NOFS); ses->entry_cnt = 0; INIT_LIST_HEAD(&ses->set_list); @@ -1789,20 +2051,22 @@ static void add_sits_in_set(struct f2fs_sb_info *sbi) static void remove_sits_in_journal(struct f2fs_sb_info *sbi) { struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + struct f2fs_journal *journal = curseg->journal; int i; - for (i = sits_in_cursum(sum) - 1; i >= 0; i--) { + down_write(&curseg->journal_rwsem); + for (i = 0; i < sits_in_cursum(journal); i++) { unsigned int segno; bool dirtied; - segno = le32_to_cpu(segno_in_journal(sum, i)); + segno = le32_to_cpu(segno_in_journal(journal, i)); dirtied = __mark_sit_entry_dirty(sbi, segno); if (!dirtied) add_sit_entry(segno, &SM_I(sbi)->sit_entry_set); } - update_sits_in_cursum(sum, -sits_in_cursum(sum)); + update_sits_in_cursum(journal, -i); + up_write(&curseg->journal_rwsem); } /* @@ -1814,13 +2078,12 @@ void flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc) struct sit_info *sit_i = SIT_I(sbi); unsigned long *bitmap = sit_i->dirty_sentries_bitmap; struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + 
struct f2fs_journal *journal = curseg->journal; struct sit_entry_set *ses, *tmp; struct list_head *head = &SM_I(sbi)->sit_entry_set; bool to_journal = true; struct seg_entry *se; - mutex_lock(&curseg->curseg_mutex); mutex_lock(&sit_i->sentry_lock); if (!sit_i->dirty_sentries) @@ -1837,7 +2100,7 @@ void flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc) * entries, remove all entries from journal and add and account * them in sit entry set. */ - if (!__has_cursum_space(sum, sit_i->dirty_sentries, SIT_JOURNAL)) + if (!__has_cursum_space(journal, sit_i->dirty_sentries, SIT_JOURNAL)) remove_sits_in_journal(sbi); /* @@ -1854,10 +2117,12 @@ void flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc) unsigned int segno = start_segno; if (to_journal && - !__has_cursum_space(sum, ses->entry_cnt, SIT_JOURNAL)) + !__has_cursum_space(journal, ses->entry_cnt, SIT_JOURNAL)) to_journal = false; - if (!to_journal) { + if (to_journal) { + down_write(&curseg->journal_rwsem); + } else { page = get_next_sit_page(sbi, start_segno); raw_sit = page_address(page); } @@ -1875,13 +2140,13 @@ void flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc) } if (to_journal) { - offset = lookup_journal_in_cursum(sum, + offset = lookup_journal_in_cursum(journal, SIT_JOURNAL, segno, 1); f2fs_bug_on(sbi, offset < 0); - segno_in_journal(sum, offset) = + segno_in_journal(journal, offset) = cpu_to_le32(segno); seg_info_to_raw_sit(se, - &sit_in_journal(sum, offset)); + &sit_in_journal(journal, offset)); } else { sit_offset = SIT_ENTRY_OFFSET(sit_i, segno); seg_info_to_raw_sit(se, @@ -1893,7 +2158,9 @@ void flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc) ses->entry_cnt--; } - if (!to_journal) + if (to_journal) + up_write(&curseg->journal_rwsem); + else f2fs_put_page(page, 1); f2fs_bug_on(sbi, ses->entry_cnt); @@ -1908,7 +2175,6 @@ void flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc) add_discard_addrs(sbi, cpc); } mutex_unlock(&sit_i->sentry_lock); - mutex_unlock(&curseg->curseg_mutex); set_prefree_as_free_segments(sbi); } @@ -1929,12 +2195,13 @@ static int build_sit_info(struct f2fs_sb_info *sbi) SM_I(sbi)->sit_info = sit_i; - sit_i->sentries = vzalloc(MAIN_SEGS(sbi) * sizeof(struct seg_entry)); + sit_i->sentries = f2fs_kvzalloc(MAIN_SEGS(sbi) * + sizeof(struct seg_entry), GFP_KERNEL); if (!sit_i->sentries) return -ENOMEM; bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi)); - sit_i->dirty_sentries_bitmap = kzalloc(bitmap_size, GFP_KERNEL); + sit_i->dirty_sentries_bitmap = f2fs_kvzalloc(bitmap_size, GFP_KERNEL); if (!sit_i->dirty_sentries_bitmap) return -ENOMEM; @@ -1943,8 +2210,11 @@ static int build_sit_info(struct f2fs_sb_info *sbi) = kzalloc(SIT_VBLOCK_MAP_SIZE, GFP_KERNEL); sit_i->sentries[start].ckpt_valid_map = kzalloc(SIT_VBLOCK_MAP_SIZE, GFP_KERNEL); - if (!sit_i->sentries[start].cur_valid_map - || !sit_i->sentries[start].ckpt_valid_map) + sit_i->sentries[start].discard_map + = kzalloc(SIT_VBLOCK_MAP_SIZE, GFP_KERNEL); + if (!sit_i->sentries[start].cur_valid_map || + !sit_i->sentries[start].ckpt_valid_map || + !sit_i->sentries[start].discard_map) return -ENOMEM; } @@ -1953,8 +2223,8 @@ static int build_sit_info(struct f2fs_sb_info *sbi) return -ENOMEM; if (sbi->segs_per_sec > 1) { - sit_i->sec_entries = vzalloc(MAIN_SECS(sbi) * - sizeof(struct sec_entry)); + sit_i->sec_entries = f2fs_kvzalloc(MAIN_SECS(sbi) * + sizeof(struct sec_entry), GFP_KERNEL); if (!sit_i->sec_entries) return -ENOMEM; } @@ -1999,12 +2269,12 @@ static int 
build_free_segmap(struct f2fs_sb_info *sbi) SM_I(sbi)->free_info = free_i; bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi)); - free_i->free_segmap = kmalloc(bitmap_size, GFP_KERNEL); + free_i->free_segmap = f2fs_kvmalloc(bitmap_size, GFP_KERNEL); if (!free_i->free_segmap) return -ENOMEM; sec_bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi)); - free_i->free_secmap = kmalloc(sec_bitmap_size, GFP_KERNEL); + free_i->free_secmap = f2fs_kvmalloc(sec_bitmap_size, GFP_KERNEL); if (!free_i->free_secmap) return -ENOMEM; @@ -2033,9 +2303,14 @@ static int build_curseg(struct f2fs_sb_info *sbi) for (i = 0; i < NR_CURSEG_TYPE; i++) { mutex_init(&array[i].curseg_mutex); - array[i].sum_blk = kzalloc(PAGE_CACHE_SIZE, GFP_KERNEL); + array[i].sum_blk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!array[i].sum_blk) return -ENOMEM; + init_rwsem(&array[i].journal_rwsem); + array[i].journal = kzalloc(sizeof(struct f2fs_journal), + GFP_KERNEL); + if (!array[i].journal) + return -ENOMEM; array[i].segno = NULL_SEGNO; array[i].next_blkoff = 0; } @@ -2046,14 +2321,14 @@ static void build_sit_entries(struct f2fs_sb_info *sbi) { struct sit_info *sit_i = SIT_I(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA); - struct f2fs_summary_block *sum = curseg->sum_blk; + struct f2fs_journal *journal = curseg->journal; int sit_blk_cnt = SIT_BLK_CNT(sbi); unsigned int i, start, end; unsigned int readed, start_blk = 0; - int nrpages = MAX_BIO_BLOCKS(sbi); + int nrpages = MAX_BIO_BLOCKS(sbi) * 8; do { - readed = ra_meta_pages(sbi, start_blk, nrpages, META_SIT); + readed = ra_meta_pages(sbi, start_blk, nrpages, META_SIT, true); start = start_blk * sit_i->sents_per_block; end = (start_blk + readed) * sit_i->sents_per_block; @@ -2064,16 +2339,16 @@ static void build_sit_entries(struct f2fs_sb_info *sbi) struct f2fs_sit_entry sit; struct page *page; - mutex_lock(&curseg->curseg_mutex); - for (i = 0; i < sits_in_cursum(sum); i++) { - if (le32_to_cpu(segno_in_journal(sum, i)) + down_read(&curseg->journal_rwsem); + for (i = 0; i < sits_in_cursum(journal); i++) { + if (le32_to_cpu(segno_in_journal(journal, i)) == start) { - sit = sit_in_journal(sum, i); - mutex_unlock(&curseg->curseg_mutex); + sit = sit_in_journal(journal, i); + up_read(&curseg->journal_rwsem); goto got_it; } } - mutex_unlock(&curseg->curseg_mutex); + up_read(&curseg->journal_rwsem); page = get_current_sit_page(sbi, start); sit_blk = (struct f2fs_sit_block *)page_address(page); @@ -2082,6 +2357,11 @@ static void build_sit_entries(struct f2fs_sb_info *sbi) got_it: check_block_count(sbi, start, &sit); seg_info_from_raw_sit(se, &sit); + + /* build discard map only one time */ + memcpy(se->discard_map, se->cur_valid_map, SIT_VBLOCK_MAP_SIZE); + sbi->discard_blks += sbi->blocks_per_seg - se->valid_blocks; + if (sbi->segs_per_sec > 1) { struct sec_entry *e = get_sec_entry(sbi, start); e->valid_blocks += se->valid_blocks; @@ -2140,7 +2420,7 @@ static int init_victim_secmap(struct f2fs_sb_info *sbi) struct dirty_seglist_info *dirty_i = DIRTY_I(sbi); unsigned int bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi)); - dirty_i->victim_secmap = kzalloc(bitmap_size, GFP_KERNEL); + dirty_i->victim_secmap = f2fs_kvzalloc(bitmap_size, GFP_KERNEL); if (!dirty_i->victim_secmap) return -ENOMEM; return 0; @@ -2162,7 +2442,7 @@ static int build_dirty_segmap(struct f2fs_sb_info *sbi) bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi)); for (i = 0; i < NR_DIRTY_TYPE; i++) { - dirty_i->dirty_segmap[i] = kzalloc(bitmap_size, GFP_KERNEL); + dirty_i->dirty_segmap[i] = f2fs_kvzalloc(bitmap_size, 
GFP_KERNEL); if (!dirty_i->dirty_segmap[i]) return -ENOMEM; } @@ -2221,7 +2501,11 @@ int build_segment_manager(struct f2fs_sb_info *sbi) sm_info->ssa_blkaddr = le32_to_cpu(raw_super->ssa_blkaddr); sm_info->rec_prefree_segments = sm_info->main_segments * DEF_RECLAIM_PREFREE_SEGMENTS / 100; - sm_info->ipu_policy = 1 << F2FS_IPU_FSYNC; + if (sm_info->rec_prefree_segments > DEF_MAX_RECLAIM_PREFREE_SEGMENTS) + sm_info->rec_prefree_segments = DEF_MAX_RECLAIM_PREFREE_SEGMENTS; + + if (!test_opt(sbi, LFS)) + sm_info->ipu_policy = 1 << F2FS_IPU_FSYNC; sm_info->min_ipu_util = DEF_MIN_IPU_UTIL; sm_info->min_fsync_blocks = DEF_MIN_FSYNC_BLOCKS; @@ -2267,7 +2551,7 @@ static void discard_dirty_segmap(struct f2fs_sb_info *sbi, struct dirty_seglist_info *dirty_i = DIRTY_I(sbi); mutex_lock(&dirty_i->seglist_lock); - kfree(dirty_i->dirty_segmap[dirty_type]); + f2fs_kvfree(dirty_i->dirty_segmap[dirty_type]); dirty_i->nr_dirty[dirty_type] = 0; mutex_unlock(&dirty_i->seglist_lock); } @@ -2275,7 +2559,7 @@ static void discard_dirty_segmap(struct f2fs_sb_info *sbi, static void destroy_victim_secmap(struct f2fs_sb_info *sbi) { struct dirty_seglist_info *dirty_i = DIRTY_I(sbi); - kfree(dirty_i->victim_secmap); + f2fs_kvfree(dirty_i->victim_secmap); } static void destroy_dirty_segmap(struct f2fs_sb_info *sbi) @@ -2303,8 +2587,10 @@ static void destroy_curseg(struct f2fs_sb_info *sbi) if (!array) return; SM_I(sbi)->curseg_array = NULL; - for (i = 0; i < NR_CURSEG_TYPE; i++) + for (i = 0; i < NR_CURSEG_TYPE; i++) { kfree(array[i].sum_blk); + kfree(array[i].journal); + } kfree(array); } @@ -2314,8 +2600,8 @@ static void destroy_free_segmap(struct f2fs_sb_info *sbi) if (!free_i) return; SM_I(sbi)->free_info = NULL; - kfree(free_i->free_segmap); - kfree(free_i->free_secmap); + f2fs_kvfree(free_i->free_segmap); + f2fs_kvfree(free_i->free_secmap); kfree(free_i); } @@ -2331,13 +2617,14 @@ static void destroy_sit_info(struct f2fs_sb_info *sbi) for (start = 0; start < MAIN_SEGS(sbi); start++) { kfree(sit_i->sentries[start].cur_valid_map); kfree(sit_i->sentries[start].ckpt_valid_map); + kfree(sit_i->sentries[start].discard_map); } } kfree(sit_i->tmp_map); - vfree(sit_i->sentries); - vfree(sit_i->sec_entries); - kfree(sit_i->dirty_sentries_bitmap); + f2fs_kvfree(sit_i->sentries); + f2fs_kvfree(sit_i->sec_entries); + f2fs_kvfree(sit_i->dirty_sentries_bitmap); SM_I(sbi)->sit_info = NULL; kfree(sit_i->sit_bitmap); diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h old mode 100644 new mode 100755 index 85d7fa7514b2..7073bd1549ff --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -15,6 +15,7 @@ #define NULL_SECNO ((unsigned int)(~0)) #define DEF_RECLAIM_PREFREE_SEGMENTS 5 /* 5% over total segments */ +#define DEF_MAX_RECLAIM_PREFREE_SEGMENTS 4096 /* 8GB in maximum */ /* L: Logical segment # in volume, R: Relative segment # in main area */ #define GET_L2R_SEGNO(free_i, segno) (segno - free_i->start_segno) @@ -136,10 +137,12 @@ enum { /* * BG_GC means the background cleaning job. * FG_GC means the on-demand cleaning job. + * FORCE_FG_GC means on-demand cleaning job in background. 
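For scale, the new DEF_MAX_RECLAIM_PREFREE_SEGMENTS cap matches its "8GB in maximum" comment under f2fs's default geometry: a segment is 512 blocks of 4 KB, i.e. 2 MB, so 4096 segments x 2 MB/segment = 8 GB. rec_prefree_segments therefore stays at 5% of the main segments on small volumes, but can no longer grow without bound on very large ones.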
*/ enum { BG_GC = 0, - FG_GC + FG_GC, + FORCE_FG_GC, }; /* for a function parameter to select a victim segment */ @@ -155,15 +158,17 @@ struct victim_sel_policy { }; struct seg_entry { - unsigned short valid_blocks; /* # of valid blocks */ + unsigned int type:6; /* segment type like CURSEG_XXX_TYPE */ + unsigned int valid_blocks:10; /* # of valid blocks */ + unsigned int ckpt_valid_blocks:10; /* # of valid blocks last cp */ + unsigned int padding:6; /* padding */ unsigned char *cur_valid_map; /* validity bitmap of blocks */ /* * # of valid blocks and the validity bitmap stored in the last * checkpoint pack. This information is used by the SSR mode. */ - unsigned short ckpt_valid_blocks; - unsigned char *ckpt_valid_map; - unsigned char type; /* segment type like CURSEG_XXX_TYPE */ + unsigned char *ckpt_valid_map; /* validity bitmap of blocks last cp */ + unsigned char *discard_map; unsigned long long mtime; /* modification time of the segment */ }; @@ -175,9 +180,19 @@ struct segment_allocation { void (*allocate_segment)(struct f2fs_sb_info *, int, bool); }; +/* + * this value is stored in the page's private data to indicate that + * the page was written atomically and is on the inmem_pages list. + */ +#define ATOMIC_WRITTEN_PAGE ((unsigned long)-1) + +#define IS_ATOMIC_WRITTEN_PAGE(page) \ + (page_private(page) == (unsigned long)ATOMIC_WRITTEN_PAGE) + struct inmem_pages { struct list_head list; struct page *page; + block_t old_addr; /* for revoking when commit fails */ }; struct sit_info { @@ -244,6 +259,8 @@ struct victim_selection { struct curseg_info { struct mutex curseg_mutex; /* lock for consistency */ struct f2fs_summary_block *sum_blk; /* cached summary block */ + struct rw_semaphore journal_rwsem; /* protect journal area */ + struct f2fs_journal *journal; /* cached journal info */ unsigned char alloc_type; /* current allocation type */ unsigned int segno; /* current segment number */ unsigned short next_blkoff; /* next block offset to write */ @@ -453,6 +470,10 @@ static inline bool need_SSR(struct f2fs_sb_info *sbi) { int node_secs = get_blocktype_secs(sbi, F2FS_DIRTY_NODES); int dent_secs = get_blocktype_secs(sbi, F2FS_DIRTY_DENTS); + + if (test_opt(sbi, LFS)) + return false; + return free_sections(sbi) <= (node_secs + 2 * dent_secs + reserved_sections(sbi) + 1); } @@ -462,6 +483,8 @@ static inline bool has_not_enough_free_secs(struct f2fs_sb_info *sbi, int freed) { int node_secs = get_blocktype_secs(sbi, F2FS_DIRTY_NODES); int dent_secs = get_blocktype_secs(sbi, F2FS_DIRTY_DENTS); + + node_secs += get_blocktype_secs(sbi, F2FS_DIRTY_IMETA); + if (unlikely(is_sbi_flag_set(sbi, SBI_POR_DOING))) return false; @@ -514,6 +537,9 @@ static inline bool need_inplace_update(struct inode *inode) if (S_ISDIR(inode->i_mode) || f2fs_is_atomic_file(inode)) return false; + if (test_opt(sbi, LFS)) + return false; + if (policy & (0x1 << F2FS_IPU_FORCE)) return true; if (policy & (0x1 << F2FS_IPU_SSR) && need_SSR(sbi)) @@ -527,7 +553,7 @@ static inline bool need_inplace_update(struct inode *inode) /* this is only set during fdatasync */ if (policy & (0x1 << F2FS_IPU_FSYNC) && - is_inode_flag_set(F2FS_I(inode), FI_NEED_IPU)) + is_inode_flag_set(inode, FI_NEED_IPU)) return true; return false; @@ -553,16 +579,15 @@ static inline unsigned short curseg_blkoff(struct f2fs_sb_info *sbi, int type) return curseg->next_blkoff; } -#ifdef CONFIG_F2FS_CHECK_FS static inline void check_seg_range(struct f2fs_sb_info *sbi, unsigned int segno) { - BUG_ON(segno > TOTAL_SEGS(sbi) - 1); + f2fs_bug_on(sbi, segno >
TOTAL_SEGS(sbi) - 1); } static inline void verify_block_addr(struct f2fs_sb_info *sbi, block_t blk_addr) { - BUG_ON(blk_addr < SEG0_BLKADDR(sbi)); - BUG_ON(blk_addr >= MAX_BLKADDR(sbi)); + f2fs_bug_on(sbi, blk_addr < SEG0_BLKADDR(sbi) + || blk_addr >= MAX_BLKADDR(sbi)); } /* @@ -571,16 +596,11 @@ static inline void verify_block_addr(struct f2fs_sb_info *sbi, block_t blk_addr) static inline void check_block_count(struct f2fs_sb_info *sbi, int segno, struct f2fs_sit_entry *raw_sit) { +#ifdef CONFIG_F2FS_CHECK_FS bool is_valid = test_bit_le(0, raw_sit->valid_map) ? true : false; int valid_blocks = 0; int cur_pos = 0, next_pos; - /* check segment usage */ - BUG_ON(GET_SIT_VBLOCKS(raw_sit) > sbi->blocks_per_seg); - - /* check boundary of a given segment number */ - BUG_ON(segno > TOTAL_SEGS(sbi) - 1); - /* check bitmap with valid block count */ do { if (is_valid) { @@ -596,35 +616,11 @@ static inline void check_block_count(struct f2fs_sb_info *sbi, is_valid = !is_valid; } while (cur_pos < sbi->blocks_per_seg); BUG_ON(GET_SIT_VBLOCKS(raw_sit) != valid_blocks); -} -#else -static inline void check_seg_range(struct f2fs_sb_info *sbi, unsigned int segno) -{ - if (segno > TOTAL_SEGS(sbi) - 1) - set_sbi_flag(sbi, SBI_NEED_FSCK); -} - -static inline void verify_block_addr(struct f2fs_sb_info *sbi, block_t blk_addr) -{ - if (blk_addr < SEG0_BLKADDR(sbi) || blk_addr >= MAX_BLKADDR(sbi)) - set_sbi_flag(sbi, SBI_NEED_FSCK); -} - -/* - * Summary block is always treated as an invalid block - */ -static inline void check_block_count(struct f2fs_sb_info *sbi, - int segno, struct f2fs_sit_entry *raw_sit) -{ - /* check segment usage */ - if (GET_SIT_VBLOCKS(raw_sit) > sbi->blocks_per_seg) - set_sbi_flag(sbi, SBI_NEED_FSCK); - - /* check boundary of a given segment number */ - if (segno > TOTAL_SEGS(sbi) - 1) - set_sbi_flag(sbi, SBI_NEED_FSCK); -} #endif + /* check segment usage, and check boundary of a given segment number */ + f2fs_bug_on(sbi, GET_SIT_VBLOCKS(raw_sit) > sbi->blocks_per_seg + || segno > TOTAL_SEGS(sbi) - 1); +} static inline pgoff_t current_sit_addr(struct f2fs_sb_info *sbi, unsigned int start) @@ -719,9 +715,9 @@ static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type) if (type == DATA) return sbi->blocks_per_seg; else if (type == NODE) - return 3 * sbi->blocks_per_seg; + return 8 * sbi->blocks_per_seg; else if (type == META) - return MAX_BIO_BLOCKS(sbi); + return 8 * MAX_BIO_BLOCKS(sbi); else return 0; } @@ -739,10 +735,8 @@ static inline long nr_pages_to_write(struct f2fs_sb_info *sbi, int type, nr_to_write = wbc->nr_to_write; - if (type == DATA) - desired = 4096; - else if (type == NODE) - desired = 3 * max_hw_blocks(sbi); + if (type == NODE) + desired = 2 * max_hw_blocks(sbi); else desired = MAX_BIO_BLOCKS(sbi); diff --git a/fs/f2fs/shrinker.c b/fs/f2fs/shrinker.c new file mode 100755 index 000000000000..30fb48137e46 --- /dev/null +++ b/fs/f2fs/shrinker.c @@ -0,0 +1,141 @@ +/* + * f2fs shrinker support + * the basic infra was copied from fs/ubifs/shrinker.c + * + * Copyright (c) 2015 Motorola Mobility + * Copyright (c) 2015 Jaegeuk Kim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ +#include +#include + +#include "f2fs.h" +#include "node.h" + +static LIST_HEAD(f2fs_list); +static DEFINE_SPINLOCK(f2fs_list_lock); +static unsigned int shrinker_run_no; + +static unsigned long __count_nat_entries(struct f2fs_sb_info *sbi) +{ + return NM_I(sbi)->nat_cnt - NM_I(sbi)->dirty_nat_cnt; +} + +static unsigned long __count_free_nids(struct f2fs_sb_info *sbi) +{ + if (NM_I(sbi)->fcnt > MAX_FREE_NIDS) + return NM_I(sbi)->fcnt - MAX_FREE_NIDS; + return 0; +} + +static unsigned long __count_extent_cache(struct f2fs_sb_info *sbi) +{ + return atomic_read(&sbi->total_zombie_tree) + + atomic_read(&sbi->total_ext_node); +} + +int f2fs_shrink_count(struct shrinker *shrink, + struct shrink_control *sc) +{ + struct f2fs_sb_info *sbi; + struct list_head *p; + unsigned long count = 0; + + spin_lock(&f2fs_list_lock); + p = f2fs_list.next; + while (p != &f2fs_list) { + sbi = list_entry(p, struct f2fs_sb_info, s_list); + + /* stop f2fs_put_super */ + if (!mutex_trylock(&sbi->umount_mutex)) { + p = p->next; + continue; + } + spin_unlock(&f2fs_list_lock); + + /* count extent cache entries */ + count += __count_extent_cache(sbi); + + /* shrink clean nat cache entries */ + count += __count_nat_entries(sbi); + + /* count free nids cache entries */ + count += __count_free_nids(sbi); + + spin_lock(&f2fs_list_lock); + p = p->next; + mutex_unlock(&sbi->umount_mutex); + } + spin_unlock(&f2fs_list_lock); + return count; +} + +int f2fs_shrink_scan(struct shrinker *shrink, + struct shrink_control *sc) +{ + unsigned long nr = sc->nr_to_scan; + struct f2fs_sb_info *sbi; + struct list_head *p; + unsigned int run_no; + unsigned long freed = 0; + + spin_lock(&f2fs_list_lock); + do { + run_no = ++shrinker_run_no; + } while (run_no == 0); + p = f2fs_list.next; + while (p != &f2fs_list) { + sbi = list_entry(p, struct f2fs_sb_info, s_list); + + if (sbi->shrinker_run_no == run_no) + break; + + /* stop f2fs_put_super */ + if (!mutex_trylock(&sbi->umount_mutex)) { + p = p->next; + continue; + } + spin_unlock(&f2fs_list_lock); + + sbi->shrinker_run_no = run_no; + + /* shrink extent cache entries */ + freed += f2fs_shrink_extent_tree(sbi, nr >> 1); + + /* shrink clean nat cache entries */ + if (freed < nr) + freed += try_to_free_nats(sbi, nr - freed); + + /* shrink free nids cache entries */ + if (freed < nr) + freed += try_to_free_nids(sbi, nr - freed); + + spin_lock(&f2fs_list_lock); + p = p->next; + list_move_tail(&sbi->s_list, &f2fs_list); + mutex_unlock(&sbi->umount_mutex); + if (freed >= nr) + break; + } + spin_unlock(&f2fs_list_lock); + return f2fs_shrink_count(NULL, NULL); +} + +void f2fs_join_shrinker(struct f2fs_sb_info *sbi) +{ + spin_lock(&f2fs_list_lock); + list_add_tail(&sbi->s_list, &f2fs_list); + spin_unlock(&f2fs_list_lock); +} + +void f2fs_leave_shrinker(struct f2fs_sb_info *sbi) +{ + f2fs_shrink_extent_tree(sbi, __count_extent_cache(sbi)); + + spin_lock(&f2fs_list_lock); + list_del(&sbi->s_list); + spin_unlock(&f2fs_list_lock); +} diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c old mode 100644 new mode 100755 index 413791a6d51a..7b5d7aa7c9cf --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -39,11 +39,43 @@ static struct proc_dir_entry *f2fs_proc_root; static struct kmem_cache *f2fs_inode_cachep; static struct kset *f2fs_kset; +#ifdef CONFIG_F2FS_FAULT_INJECTION +struct f2fs_fault_info f2fs_fault; + +char *fault_name[FAULT_MAX] = { + [FAULT_KMALLOC] = "kmalloc", + [FAULT_PAGE_ALLOC] = "page alloc", + [FAULT_ALLOC_NID] = "alloc nid", + [FAULT_ORPHAN] = "orphan", + [FAULT_BLOCK] = "no more block", + 
[FAULT_DIR_DEPTH] = "too big dir depth", + [FAULT_EVICT_INODE] = "evict_inode fail", +}; + +static void f2fs_build_fault_attr(unsigned int rate) +{ + if (rate) { + atomic_set(&f2fs_fault.inject_ops, 0); + f2fs_fault.inject_rate = rate; + f2fs_fault.inject_type = (1 << FAULT_MAX) - 1; + } else { + memset(&f2fs_fault, 0, sizeof(struct f2fs_fault_info)); + } +} +#endif + +/* f2fs-wide shrinker description */ +static struct shrinker f2fs_shrinker_info = { + .shrink = f2fs_shrink_scan, + .seeks = DEFAULT_SEEKS, +}; + enum { Opt_gc_background, Opt_disable_roll_forward, Opt_norecovery, Opt_discard, + Opt_nodiscard, Opt_noheap, Opt_user_xattr, Opt_nouser_xattr, @@ -55,10 +87,15 @@ enum { Opt_inline_data, Opt_inline_dentry, Opt_flush_merge, + Opt_noflush_merge, Opt_nobarrier, Opt_fastboot, Opt_extent_cache, + Opt_noextent_cache, Opt_noinline_data, + Opt_data_flush, + Opt_mode, + Opt_fault_injection, Opt_err, }; @@ -67,6 +104,7 @@ static match_table_t f2fs_tokens = { {Opt_disable_roll_forward, "disable_roll_forward"}, {Opt_norecovery, "norecovery"}, {Opt_discard, "discard"}, + {Opt_nodiscard, "nodiscard"}, {Opt_noheap, "no_heap"}, {Opt_user_xattr, "user_xattr"}, {Opt_nouser_xattr, "nouser_xattr"}, @@ -78,10 +116,15 @@ static match_table_t f2fs_tokens = { {Opt_inline_data, "inline_data"}, {Opt_inline_dentry, "inline_dentry"}, {Opt_flush_merge, "flush_merge"}, + {Opt_noflush_merge, "noflush_merge"}, {Opt_nobarrier, "nobarrier"}, {Opt_fastboot, "fastboot"}, {Opt_extent_cache, "extent_cache"}, + {Opt_noextent_cache, "noextent_cache"}, {Opt_noinline_data, "noinline_data"}, + {Opt_data_flush, "data_flush"}, + {Opt_mode, "mode=%s"}, + {Opt_fault_injection, "fault_injection=%u"}, {Opt_err, NULL}, }; @@ -91,6 +134,10 @@ enum { SM_INFO, /* struct f2fs_sm_info */ NM_INFO, /* struct f2fs_nm_info */ F2FS_SBI, /* struct f2fs_sb_info */ +#ifdef CONFIG_F2FS_FAULT_INJECTION + FAULT_INFO_RATE, /* struct f2fs_fault_info */ + FAULT_INFO_TYPE, /* struct f2fs_fault_info */ +#endif }; struct f2fs_attr { @@ -112,9 +159,27 @@ static unsigned char *__struct_ptr(struct f2fs_sb_info *sbi, int struct_type) return (unsigned char *)NM_I(sbi); else if (struct_type == F2FS_SBI) return (unsigned char *)sbi; +#ifdef CONFIG_F2FS_FAULT_INJECTION + else if (struct_type == FAULT_INFO_RATE || + struct_type == FAULT_INFO_TYPE) + return (unsigned char *)&f2fs_fault; +#endif return NULL; } +static ssize_t lifetime_write_kbytes_show(struct f2fs_attr *a, + struct f2fs_sb_info *sbi, char *buf) +{ + struct super_block *sb = sbi->sb; + + if (!sb->s_bdev->bd_part) + return snprintf(buf, PAGE_SIZE, "0\n"); + + return snprintf(buf, PAGE_SIZE, "%llu\n", + (unsigned long long)(sbi->kbytes_written + + BD_PART_WRITTEN(sbi))); +} + static ssize_t f2fs_sbi_show(struct f2fs_attr *a, struct f2fs_sb_info *sbi, char *buf) { @@ -148,6 +213,10 @@ static ssize_t f2fs_sbi_store(struct f2fs_attr *a, ret = kstrtoul(skip_spaces(buf), 0, &t); if (ret < 0) return ret; +#ifdef CONFIG_F2FS_FAULT_INJECTION + if (a->struct_type == FAULT_INFO_TYPE && t >= (1 << FAULT_MAX)) + return -EINVAL; +#endif *ui = t; return count; } @@ -193,6 +262,9 @@ static struct f2fs_attr f2fs_attr_##_name = { \ f2fs_sbi_show, f2fs_sbi_store, \ offsetof(struct struct_name, elname)) +#define F2FS_GENERAL_RO_ATTR(name) \ +static struct f2fs_attr f2fs_attr_##name = __ATTR(name, 0444, name##_show, NULL) + F2FS_RW_ATTR(GC_THREAD, f2fs_gc_kthread, gc_min_sleep_time, min_sleep_time); F2FS_RW_ATTR(GC_THREAD, f2fs_gc_kthread, gc_max_sleep_time, max_sleep_time); F2FS_RW_ATTR(GC_THREAD, f2fs_gc_kthread, 
gc_no_gc_sleep_time, no_gc_sleep_time); @@ -204,8 +276,17 @@ F2FS_RW_ATTR(SM_INFO, f2fs_sm_info, ipu_policy, ipu_policy); F2FS_RW_ATTR(SM_INFO, f2fs_sm_info, min_ipu_util, min_ipu_util); F2FS_RW_ATTR(SM_INFO, f2fs_sm_info, min_fsync_blocks, min_fsync_blocks); F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, ram_thresh, ram_thresh); +F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, ra_nid_pages, ra_nid_pages); +F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, dirty_nats_ratio, dirty_nats_ratio); F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, max_victim_search, max_victim_search); F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, dir_level, dir_level); +F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, cp_interval, interval_time[CP_TIME]); +F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, idle_interval, interval_time[REQ_TIME]); +#ifdef CONFIG_F2FS_FAULT_INJECTION +F2FS_RW_ATTR(FAULT_INFO_RATE, f2fs_fault_info, inject_rate, inject_rate); +F2FS_RW_ATTR(FAULT_INFO_TYPE, f2fs_fault_info, inject_type, inject_type); +#endif +F2FS_GENERAL_RO_ATTR(lifetime_write_kbytes); #define ATTR_LIST(name) (&f2fs_attr_##name.attr) static struct attribute *f2fs_attrs[] = { @@ -222,6 +303,11 @@ static struct attribute *f2fs_attrs[] = { ATTR_LIST(max_victim_search), ATTR_LIST(dir_level), ATTR_LIST(ram_thresh), + ATTR_LIST(ra_nid_pages), + ATTR_LIST(dirty_nats_ratio), + ATTR_LIST(cp_interval), + ATTR_LIST(idle_interval), + ATTR_LIST(lifetime_write_kbytes), NULL, }; @@ -236,6 +322,22 @@ static struct kobj_type f2fs_ktype = { .release = f2fs_sb_release, }; +#ifdef CONFIG_F2FS_FAULT_INJECTION +/* sysfs for f2fs fault injection */ +static struct kobject f2fs_fault_inject; + +static struct attribute *f2fs_fault_attrs[] = { + ATTR_LIST(inject_rate), + ATTR_LIST(inject_type), + NULL +}; + +static struct kobj_type f2fs_fault_ktype = { + .default_attrs = f2fs_fault_attrs, + .sysfs_ops = &f2fs_attr_ops, +}; +#endif + void f2fs_msg(struct super_block *sb, const char *level, const char *fmt, ...) 
{ struct va_format vaf; @@ -258,10 +360,15 @@ static void init_once(void *foo) static int parse_options(struct super_block *sb, char *options) { struct f2fs_sb_info *sbi = F2FS_SB(sb); + struct request_queue *q; substring_t args[MAX_OPT_ARGS]; char *p, *name; int arg = 0; +#ifdef CONFIG_F2FS_FAULT_INJECTION + f2fs_build_fault_attr(0); +#endif + if (!options) return 0; @@ -282,11 +389,16 @@ static int parse_options(struct super_block *sb, char *options) if (!name) return -ENOMEM; - if (strlen(name) == 2 && !strncmp(name, "on", 2)) + if (strlen(name) == 2 && !strncmp(name, "on", 2)) { set_opt(sbi, BG_GC); - else if (strlen(name) == 3 && !strncmp(name, "off", 3)) + clear_opt(sbi, FORCE_FG_GC); + } else if (strlen(name) == 3 && !strncmp(name, "off", 3)) { clear_opt(sbi, BG_GC); - else { + clear_opt(sbi, FORCE_FG_GC); + } else if (strlen(name) == 4 && !strncmp(name, "sync", 4)) { + set_opt(sbi, BG_GC); + set_opt(sbi, FORCE_FG_GC); + } else { kfree(name); return -EINVAL; } @@ -302,8 +414,17 @@ static int parse_options(struct super_block *sb, char *options) return -EINVAL; break; case Opt_discard: - set_opt(sbi, DISCARD); + q = bdev_get_queue(sb->s_bdev); + if (blk_queue_discard(q)) { + set_opt(sbi, DISCARD); + } else { + f2fs_msg(sb, KERN_WARNING, + "mounting with \"discard\" option, but " + "the device does not support discard"); + } break; + case Opt_nodiscard: + clear_opt(sbi, DISCARD); + break; case Opt_noheap: set_opt(sbi, NOHEAP); break; @@ -365,6 +486,9 @@ static int parse_options(struct super_block *sb, char *options) case Opt_flush_merge: set_opt(sbi, FLUSH_MERGE); break; + case Opt_noflush_merge: + clear_opt(sbi, FLUSH_MERGE); + break; case Opt_nobarrier: set_opt(sbi, NOBARRIER); break; @@ -374,9 +498,42 @@ static int parse_options(struct super_block *sb, char *options) case Opt_extent_cache: set_opt(sbi, EXTENT_CACHE); break; + case Opt_noextent_cache: + clear_opt(sbi, EXTENT_CACHE); + break; case Opt_noinline_data: clear_opt(sbi, INLINE_DATA); break; + case Opt_data_flush: + set_opt(sbi, DATA_FLUSH); + break; + case Opt_mode: + name = match_strdup(&args[0]); + + if (!name) + return -ENOMEM; + if (strlen(name) == 8 && + !strncmp(name, "adaptive", 8)) { + set_opt_mode(sbi, F2FS_MOUNT_ADAPTIVE); + } else if (strlen(name) == 3 && + !strncmp(name, "lfs", 3)) { + set_opt_mode(sbi, F2FS_MOUNT_LFS); + } else { + kfree(name); + return -EINVAL; + } + kfree(name); + break; + case Opt_fault_injection: + if (args->from && match_int(args, &arg)) + return -EINVAL; +#ifdef CONFIG_F2FS_FAULT_INJECTION + f2fs_build_fault_attr(arg); +#else + f2fs_msg(sb, KERN_INFO, + "FAULT_INJECTION was not selected"); +#endif + break; default: f2fs_msg(sb, KERN_ERR, "Unrecognized mount option \"%s\" or missing value", @@ -397,25 +554,25 @@ static struct inode *f2fs_alloc_inode(struct super_block *sb) init_once((void *) fi); + if (percpu_counter_init(&fi->dirty_pages, 0)) { + kmem_cache_free(f2fs_inode_cachep, fi); + return NULL; + } + /* Initialize f2fs-specific inode info */ fi->vfs_inode.i_version = 1; - atomic_set(&fi->dirty_pages, 0); fi->i_current_depth = 1; fi->i_advise = 0; - rwlock_init(&fi->ext_lock); init_rwsem(&fi->i_sem); - INIT_RADIX_TREE(&fi->inmem_root, GFP_NOFS); + INIT_LIST_HEAD(&fi->dirty_list); + INIT_LIST_HEAD(&fi->gdirty_list); INIT_LIST_HEAD(&fi->inmem_pages); mutex_init(&fi->inmem_lock); - - set_inode_flag(fi, FI_NEW_INODE); - - if (test_opt(F2FS_SB(sb), INLINE_XATTR)) - set_inode_flag(fi, FI_INLINE_XATTR); + init_rwsem(&fi->dio_rwsem[READ]); + init_rwsem(&fi->dio_rwsem[WRITE]); /* Will be used by
directory only */ fi->i_dir_level = F2FS_SB(sb)->dir_level; - return &fi->vfs_inode; } @@ -428,11 +585,71 @@ static int f2fs_drop_inode(struct inode *inode) * - f2fs_gc -> iput -> evict * - inode_wait_for_writeback(inode) */ - if (!inode_unhashed(inode) && inode->i_state & I_SYNC) + if (!inode_unhashed(inode) && inode->i_state & I_SYNC) { + if (!inode->i_nlink && !is_bad_inode(inode)) { + /* to avoid a simultaneous evict_inode call */ + atomic_inc(&inode->i_count); + spin_unlock(&inode->i_lock); + + /* some remaining atomic pages should be discarded */ + if (f2fs_is_atomic_file(inode)) + drop_inmem_pages(inode); + + /* should remain fi->extent_tree for writepage */ + f2fs_destroy_extent_node(inode); + + f2fs_i_size_write(inode, 0); + + if (F2FS_HAS_BLOCKS(inode)) + f2fs_truncate(inode); + + fscrypt_put_encryption_info(inode, NULL); + spin_lock(&inode->i_lock); + atomic_dec(&inode->i_count); + } return 0; + } + return generic_drop_inode(inode); } +int f2fs_inode_dirtied(struct inode *inode) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + + spin_lock(&sbi->inode_lock[DIRTY_META]); + if (is_inode_flag_set(inode, FI_DIRTY_INODE)) { + spin_unlock(&sbi->inode_lock[DIRTY_META]); + return 1; + } + + set_inode_flag(inode, FI_DIRTY_INODE); + list_add_tail(&F2FS_I(inode)->gdirty_list, + &sbi->inode_list[DIRTY_META]); + inc_page_count(sbi, F2FS_DIRTY_IMETA); + stat_inc_dirty_inode(sbi, DIRTY_META); + spin_unlock(&sbi->inode_lock[DIRTY_META]); + + return 0; +} + +void f2fs_inode_synced(struct inode *inode) +{ + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + + spin_lock(&sbi->inode_lock[DIRTY_META]); + if (!is_inode_flag_set(inode, FI_DIRTY_INODE)) { + spin_unlock(&sbi->inode_lock[DIRTY_META]); + return; + } + list_del_init(&F2FS_I(inode)->gdirty_list); + clear_inode_flag(inode, FI_DIRTY_INODE); + clear_inode_flag(inode, FI_AUTO_RECOVER); + dec_page_count(sbi, F2FS_DIRTY_IMETA); + stat_dec_dirty_inode(F2FS_I_SB(inode), DIRTY_META); + spin_unlock(&sbi->inode_lock[DIRTY_META]); +} + /* * f2fs_dirty_inode() is called from __mark_inode_dirty() * @@ -440,7 +657,16 @@ static int f2fs_drop_inode(struct inode *inode) */ static void f2fs_dirty_inode(struct inode *inode, int flags) { - set_inode_flag(F2FS_I(inode), FI_DIRTY_INODE); + struct f2fs_sb_info *sbi = F2FS_I_SB(inode); + + if (inode->i_ino == F2FS_NODE_INO(sbi) || + inode->i_ino == F2FS_META_INO(sbi)) + return; + + if (is_inode_flag_set(inode, FI_AUTO_RECOVER)) + clear_inode_flag(inode, FI_AUTO_RECOVER); + + f2fs_inode_dirtied(inode); } static void f2fs_i_callback(struct rcu_head *head) @@ -451,22 +677,36 @@ static void f2fs_i_callback(struct rcu_head *head) static void f2fs_destroy_inode(struct inode *inode) { + percpu_counter_destroy(&F2FS_I(inode)->dirty_pages); call_rcu(&inode->i_rcu, f2fs_i_callback); } +static void destroy_percpu_info(struct f2fs_sb_info *sbi) +{ + int i; + + for (i = 0; i < NR_COUNT_TYPE; i++) + percpu_counter_destroy(&sbi->nr_pages[i]); + percpu_counter_destroy(&sbi->alloc_valid_block_count); + percpu_counter_destroy(&sbi->total_valid_inode_count); +} + static void f2fs_put_super(struct super_block *sb) { struct f2fs_sb_info *sbi = F2FS_SB(sb); if (sbi->s_proc) { remove_proc_entry("segment_info", sbi->s_proc); + remove_proc_entry("segment_bits", sbi->s_proc); remove_proc_entry(sb->s_id, f2fs_proc_root); } kobject_del(&sbi->s_kobj); - f2fs_destroy_stats(sbi); stop_gc_thread(sbi); + /* prevent remaining shrinker jobs */ + mutex_lock(&sbi->umount_mutex); + + /* * We don't need to do checkpoint when superblock is clean.
* But, the previous checkpoint was not done by umount, it needs to do @@ -480,13 +720,22 @@ static void f2fs_put_super(struct super_block *sb) write_checkpoint(sbi, &cpc); } + /* write_checkpoint can update stat information */ + f2fs_destroy_stats(sbi); + /* * normally superblock is clean, so we need to release this. * In addition, EIO will skip do checkpoint, we need this as well. */ - release_dirty_inode(sbi); + release_ino_entry(sbi, true); release_discard_addrs(sbi); + f2fs_leave_shrinker(sbi); + mutex_unlock(&sbi->umount_mutex); + + /* our cp_error case, we can wait for any writeback page */ + f2fs_flush_merged_bios(sbi); + iput(sbi->node_inode); iput(sbi->meta_inode); @@ -499,13 +748,16 @@ static void f2fs_put_super(struct super_block *sb) wait_for_completion(&sbi->s_kobj_unregister); sb->s_fs_info = NULL; - brelse(sbi->raw_super_buf); + kfree(sbi->raw_super); + + destroy_percpu_info(sbi); kfree(sbi); } int f2fs_sync_fs(struct super_block *sb, int sync) { struct f2fs_sb_info *sbi = F2FS_SB(sb); + int err = 0; trace_f2fs_sync_fs(sb, sync); @@ -515,14 +767,12 @@ int f2fs_sync_fs(struct super_block *sb, int sync) cpc.reason = __get_cp_reason(sbi); mutex_lock(&sbi->gc_mutex); - write_checkpoint(sbi, &cpc); + err = write_checkpoint(sbi, &cpc); mutex_unlock(&sbi->gc_mutex); - } else { - f2fs_balance_fs(sbi); } - f2fs_trace_ios(NULL, NULL, 1); + f2fs_trace_ios(NULL, 1); - return 0; + return err; } static int f2fs_freeze(struct super_block *sb) @@ -556,7 +806,7 @@ static int f2fs_statfs(struct dentry *dentry, struct kstatfs *buf) buf->f_bsize = sbi->blocksize; buf->f_blocks = total_count - start_count; - buf->f_bfree = buf->f_blocks - valid_user_blocks(sbi) - ovp_count; + buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count; buf->f_bavail = user_block_count - valid_user_blocks(sbi); buf->f_files = sbi->total_node_count - F2FS_RESERVED_NODE_NUM; @@ -573,10 +823,14 @@ static int f2fs_show_options(struct seq_file *seq, struct dentry *root) { struct f2fs_sb_info *sbi = F2FS_SB(root->d_sb); - if (!f2fs_readonly(sbi->sb) && test_opt(sbi, BG_GC)) - seq_printf(seq, ",background_gc=%s", "on"); - else + if (!f2fs_readonly(sbi->sb) && test_opt(sbi, BG_GC)) { + if (test_opt(sbi, FORCE_FG_GC)) + seq_printf(seq, ",background_gc=%s", "sync"); + else + seq_printf(seq, ",background_gc=%s", "on"); + } else { seq_printf(seq, ",background_gc=%s", "off"); + } if (test_opt(sbi, DISABLE_ROLL_FORWARD)) seq_puts(seq, ",disable_roll_forward"); if (test_opt(sbi, DISCARD)) @@ -613,6 +867,16 @@ static int f2fs_show_options(struct seq_file *seq, struct dentry *root) seq_puts(seq, ",fastboot"); if (test_opt(sbi, EXTENT_CACHE)) seq_puts(seq, ",extent_cache"); + else + seq_puts(seq, ",noextent_cache"); + if (test_opt(sbi, DATA_FLUSH)) + seq_puts(seq, ",data_flush"); + + seq_puts(seq, ",mode="); + if (test_opt(sbi, ADAPTIVE)) + seq_puts(seq, "adaptive"); + else if (test_opt(sbi, LFS)) + seq_puts(seq, "lfs"); seq_printf(seq, ",active_logs=%u", sbi->active_logs); return 0; @@ -633,7 +897,7 @@ static int segment_info_seq_show(struct seq_file *seq, void *offset) struct seg_entry *se = get_seg_entry(sbi, i); if ((i % 10) == 0) - seq_printf(seq, "%-5d", i); + seq_printf(seq, "%-10d", i); seq_printf(seq, "%d|%-3u", se->type, get_valid_blocks(sbi, i, 1)); if ((i % 10) == 9 || i == (total_segs - 1)) @@ -645,19 +909,71 @@ static int segment_info_seq_show(struct seq_file *seq, void *offset) return 0; } -static int segment_info_open_fs(struct inode *inode, struct file *file) +static int segment_bits_seq_show(struct
seq_file *seq, void *offset) { - return single_open(file, segment_info_seq_show, PDE(inode)->data); + struct super_block *sb = seq->private; + struct f2fs_sb_info *sbi = F2FS_SB(sb); + unsigned int total_segs = + le32_to_cpu(sbi->raw_super->segment_count_main); + int i, j; + + seq_puts(seq, "format: segment_type|valid_blocks|bitmaps\n" + "segment_type(0:HD, 1:WD, 2:CD, 3:HN, 4:WN, 5:CN)\n"); + + for (i = 0; i < total_segs; i++) { + struct seg_entry *se = get_seg_entry(sbi, i); + + seq_printf(seq, "%-10d", i); + seq_printf(seq, "%d|%-3u|", se->type, + get_valid_blocks(sbi, i, 1)); + for (j = 0; j < SIT_VBLOCK_MAP_SIZE; j++) + seq_printf(seq, "%x ", se->cur_valid_map[j]); + seq_putc(seq, '\n'); + } + return 0; } -static const struct file_operations f2fs_seq_segment_info_fops = { - .owner = THIS_MODULE, - .open = segment_info_open_fs, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, +#define F2FS_PROC_FILE_DEF(_name) \ +static int _name##_open_fs(struct inode *inode, struct file *file) \ +{ \ + return single_open(file, _name##_seq_show, PDE(inode)->data); \ +} \ + \ +static const struct file_operations f2fs_seq_##_name##_fops = { \ + .owner = THIS_MODULE, \ + .open = _name##_open_fs, \ + .read = seq_read, \ + .llseek = seq_lseek, \ + .release = single_release, \ }; +F2FS_PROC_FILE_DEF(segment_info); +F2FS_PROC_FILE_DEF(segment_bits); + +static void default_options(struct f2fs_sb_info *sbi) +{ + /* init some FS parameters */ + sbi->active_logs = NR_CURSEG_TYPE; + + set_opt(sbi, BG_GC); + set_opt(sbi, INLINE_DATA); + set_opt(sbi, EXTENT_CACHE); + set_opt(sbi, FLUSH_MERGE); + if (f2fs_sb_mounted_hmsmr(sbi->sb)) { + set_opt_mode(sbi, F2FS_MOUNT_LFS); + set_opt(sbi, DISCARD); + } else { + set_opt_mode(sbi, F2FS_MOUNT_ADAPTIVE); + } + +#ifdef CONFIG_F2FS_FS_XATTR + set_opt(sbi, XATTR_USER); +#endif +#ifdef CONFIG_F2FS_FS_POSIX_ACL + set_opt(sbi, POSIX_ACL); +#endif +} + static int f2fs_remount(struct super_block *sb, int *flags, char *data) { struct f2fs_sb_info *sbi = F2FS_SB(sb); @@ -665,8 +981,7 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data) int err, active_logs; bool need_restart_gc = false; bool need_stop_gc = false; - - sync_filesystem(sb); + bool no_extent_cache = !test_opt(sbi, EXTENT_CACHE); /* * Save the old mount options in case we @@ -675,8 +990,17 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data) org_mount_opt = sbi->mount_opt; active_logs = sbi->active_logs; + /* recover superblocks we couldn't write due to previous RO mount */ + if (!(*flags & MS_RDONLY) && is_sbi_flag_set(sbi, SBI_NEED_SB_WRITE)) { + err = f2fs_commit_super(sbi, false); + f2fs_msg(sb, KERN_INFO, + "Try to recover all the superblocks, ret: %d", err); + if (!err) + clear_sbi_flag(sbi, SBI_NEED_SB_WRITE); + } + sbi->mount_opt.opt = 0; - sbi->active_logs = NR_CURSEG_TYPE; + default_options(sbi); /* parse mount options */ err = parse_options(sb, data); @@ -690,6 +1014,14 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data) if (f2fs_readonly(sb) && (*flags & MS_RDONLY)) goto skip; + /* disallow enable/disable extent_cache dynamically */ + if (no_extent_cache == !!test_opt(sbi, EXTENT_CACHE)) { + err = -EINVAL; + f2fs_msg(sbi->sb, KERN_WARNING, + "switch extent_cache option is not allowed"); + goto restore_opts; + } + /* * We stop the GC thread if FS is mounted as RO * or if background_gc = off is passed in mount @@ -698,7 +1030,6 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data) if ((*flags & 
MS_RDONLY) || !test_opt(sbi, BG_GC)) { if (sbi->gc_thread) { stop_gc_thread(sbi); - f2fs_sync_fs(sb, 1); need_restart_gc = true; } } else if (!sbi->gc_thread) { @@ -708,6 +1039,16 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data) need_stop_gc = true; } + if (*flags & MS_RDONLY) { + writeback_inodes_sb(sb, WB_REASON_SYNC); + sync_inodes_sb(sb); + + set_sbi_flag(sbi, SBI_IS_DIRTY); + set_sbi_flag(sbi, SBI_IS_CLOSE); + f2fs_sync_fs(sb, 1); + clear_sbi_flag(sbi, SBI_IS_CLOSE); + } + /* * We stop issue flush thread if FS is mounted as RO * or if flush_merge is not passed in mount option. @@ -721,8 +1062,9 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data) } skip: /* Update the POSIXACL Flag */ - sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | + sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | (test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0); + return 0; restore_gc: if (need_restart_gc) { @@ -754,6 +1096,48 @@ static struct super_operations f2fs_sops = { .remount_fs = f2fs_remount, }; +#ifdef CONFIG_F2FS_FS_ENCRYPTION +static int f2fs_get_context(struct inode *inode, void *ctx, size_t len) +{ + return f2fs_getxattr(inode, F2FS_XATTR_INDEX_ENCRYPTION, + F2FS_XATTR_NAME_ENCRYPTION_CONTEXT, + ctx, len, NULL); +} + +static int f2fs_key_prefix(struct inode *inode, u8 **key) +{ + *key = F2FS_I_SB(inode)->key_prefix; + return F2FS_I_SB(inode)->key_prefix_size; +} + +static int f2fs_set_context(struct inode *inode, const void *ctx, size_t len, + void *fs_data) +{ + return f2fs_setxattr(inode, F2FS_XATTR_INDEX_ENCRYPTION, + F2FS_XATTR_NAME_ENCRYPTION_CONTEXT, + ctx, len, fs_data, XATTR_CREATE); +} + +static unsigned f2fs_max_namelen(struct inode *inode) +{ + return S_ISLNK(inode->i_mode) ? + inode->i_sb->s_blocksize : F2FS_NAME_LEN; +} + +static struct fscrypt_operations f2fs_cryptops = { + .get_context = f2fs_get_context, + .key_prefix = f2fs_key_prefix, + .set_context = f2fs_set_context, + .is_encrypted = f2fs_encrypted_inode, + .empty_dir = f2fs_empty_dir, + .max_namelen = f2fs_max_namelen, +}; +#else +static struct fscrypt_operations f2fs_cryptops = { + .is_encrypted = f2fs_encrypted_inode, +}; +#endif + static struct inode *f2fs_nfs_get_inode(struct super_block *sb, u64 ino, u32 generation) { @@ -799,7 +1183,7 @@ static const struct export_operations f2fs_export_ops = { .get_parent = f2fs_get_parent, }; -static loff_t max_file_size(unsigned bits) +static loff_t max_file_blocks(void) { loff_t result = (DEF_ADDRS_PER_INODE - F2FS_INLINE_XATTR_ADDRS); loff_t leaf_count = ADDRS_PER_BLOCK; @@ -815,13 +1199,131 @@ static loff_t max_file_size(unsigned bits) leaf_count *= NIDS_PER_BLOCK; result += leaf_count; - result <<= bits; return result; } -static int sanity_check_raw_super(struct super_block *sb, - struct f2fs_super_block *raw_super) +static int __f2fs_commit_super(struct buffer_head *bh, + struct f2fs_super_block *super) +{ + lock_buffer(bh); + if (super) + memcpy(bh->b_data + F2FS_SUPER_OFFSET, super, sizeof(*super)); + set_buffer_uptodate(bh); + set_buffer_dirty(bh); + unlock_buffer(bh); + + /* it's rare case, we can do fua all the time */ + return __sync_dirty_buffer(bh, WRITE_FLUSH_FUA); +} + +static inline bool sanity_check_area_boundary(struct f2fs_sb_info *sbi, + struct buffer_head *bh) +{ + struct f2fs_super_block *raw_super = (struct f2fs_super_block *) + (bh->b_data + F2FS_SUPER_OFFSET); + struct super_block *sb = sbi->sb; + u32 segment0_blkaddr = le32_to_cpu(raw_super->segment0_blkaddr); + u32 cp_blkaddr = le32_to_cpu(raw_super->cp_blkaddr); + u32 
sit_blkaddr = le32_to_cpu(raw_super->sit_blkaddr); + u32 nat_blkaddr = le32_to_cpu(raw_super->nat_blkaddr); + u32 ssa_blkaddr = le32_to_cpu(raw_super->ssa_blkaddr); + u32 main_blkaddr = le32_to_cpu(raw_super->main_blkaddr); + u32 segment_count_ckpt = le32_to_cpu(raw_super->segment_count_ckpt); + u32 segment_count_sit = le32_to_cpu(raw_super->segment_count_sit); + u32 segment_count_nat = le32_to_cpu(raw_super->segment_count_nat); + u32 segment_count_ssa = le32_to_cpu(raw_super->segment_count_ssa); + u32 segment_count_main = le32_to_cpu(raw_super->segment_count_main); + u32 segment_count = le32_to_cpu(raw_super->segment_count); + u32 log_blocks_per_seg = le32_to_cpu(raw_super->log_blocks_per_seg); + u64 main_end_blkaddr = main_blkaddr + + (segment_count_main << log_blocks_per_seg); + u64 seg_end_blkaddr = segment0_blkaddr + + (segment_count << log_blocks_per_seg); + + if (segment0_blkaddr != cp_blkaddr) { + f2fs_msg(sb, KERN_INFO, + "Mismatch start address, segment0(%u) cp_blkaddr(%u)", + segment0_blkaddr, cp_blkaddr); + return true; + } + + if (cp_blkaddr + (segment_count_ckpt << log_blocks_per_seg) != + sit_blkaddr) { + f2fs_msg(sb, KERN_INFO, + "Wrong CP boundary, start(%u) end(%u) blocks(%u)", + cp_blkaddr, sit_blkaddr, + segment_count_ckpt << log_blocks_per_seg); + return true; + } + + if (sit_blkaddr + (segment_count_sit << log_blocks_per_seg) != + nat_blkaddr) { + f2fs_msg(sb, KERN_INFO, + "Wrong SIT boundary, start(%u) end(%u) blocks(%u)", + sit_blkaddr, nat_blkaddr, + segment_count_sit << log_blocks_per_seg); + return true; + } + + if (nat_blkaddr + (segment_count_nat << log_blocks_per_seg) != + ssa_blkaddr) { + f2fs_msg(sb, KERN_INFO, + "Wrong NAT boundary, start(%u) end(%u) blocks(%u)", + nat_blkaddr, ssa_blkaddr, + segment_count_nat << log_blocks_per_seg); + return true; + } + + if (ssa_blkaddr + (segment_count_ssa << log_blocks_per_seg) != + main_blkaddr) { + f2fs_msg(sb, KERN_INFO, + "Wrong SSA boundary, start(%u) end(%u) blocks(%u)", + ssa_blkaddr, main_blkaddr, + segment_count_ssa << log_blocks_per_seg); + return true; + } + + if (main_end_blkaddr > seg_end_blkaddr) { + f2fs_msg(sb, KERN_INFO, + "Wrong MAIN_AREA boundary, start(%u) end(%u) block(%u)", + main_blkaddr, + segment0_blkaddr + + (segment_count << log_blocks_per_seg), + segment_count_main << log_blocks_per_seg); + return true; + } else if (main_end_blkaddr < seg_end_blkaddr) { + int err = 0; + char *res; + + /* fix in-memory information all the time */ + raw_super->segment_count = cpu_to_le32((main_end_blkaddr - + segment0_blkaddr) >> log_blocks_per_seg); + + if (f2fs_readonly(sb) || bdev_read_only(sb->s_bdev)) { + set_sbi_flag(sbi, SBI_NEED_SB_WRITE); + res = "internally"; + } else { + err = __f2fs_commit_super(bh, NULL); + res = err ? 
"failed" : "done"; + } + f2fs_msg(sb, KERN_INFO, + "Fix alignment : %s, start(%u) end(%u) block(%u)", + res, main_blkaddr, + segment0_blkaddr + + (segment_count << log_blocks_per_seg), + segment_count_main << log_blocks_per_seg); + if (err) + return true; + } + return false; +} + +static int sanity_check_raw_super(struct f2fs_sb_info *sbi, + struct buffer_head *bh) { + struct f2fs_super_block *raw_super = (struct f2fs_super_block *) + (bh->b_data + F2FS_SUPER_OFFSET); + struct super_block *sb = sbi->sb; unsigned int blocksize; if (F2FS_SUPER_MAGIC != le32_to_cpu(raw_super->magic)) { @@ -832,10 +1334,10 @@ static int sanity_check_raw_super(struct super_block *sb, } /* Currently, support only 4KB page cache size */ - if (F2FS_BLKSIZE != PAGE_CACHE_SIZE) { + if (F2FS_BLKSIZE != PAGE_SIZE) { f2fs_msg(sb, KERN_INFO, "Invalid page_cache_size (%lu), supports only 4KB\n", - PAGE_CACHE_SIZE); + PAGE_SIZE); return 1; } @@ -848,6 +1350,14 @@ static int sanity_check_raw_super(struct super_block *sb, return 1; } + /* check log blocks per segment */ + if (le32_to_cpu(raw_super->log_blocks_per_seg) != 9) { + f2fs_msg(sb, KERN_INFO, + "Invalid log blocks per segment (%u)\n", + le32_to_cpu(raw_super->log_blocks_per_seg)); + return 1; + } + /* Currently, support 512/1024/2048/4096 bytes sector size */ if (le32_to_cpu(raw_super->log_sectorsize) > F2FS_MAX_LOG_SECTOR_SIZE || @@ -866,10 +1376,27 @@ static int sanity_check_raw_super(struct super_block *sb, le32_to_cpu(raw_super->log_sectorsize)); return 1; } + + /* check reserved ino info */ + if (le32_to_cpu(raw_super->node_ino) != 1 || + le32_to_cpu(raw_super->meta_ino) != 2 || + le32_to_cpu(raw_super->root_ino) != 3) { + f2fs_msg(sb, KERN_INFO, + "Invalid Fs Meta Ino: node(%u) meta(%u) root(%u)", + le32_to_cpu(raw_super->node_ino), + le32_to_cpu(raw_super->meta_ino), + le32_to_cpu(raw_super->root_ino)); + return 1; + } + + /* check CP/SIT/NAT/SSA/MAIN_AREA area boundary */ + if (sanity_check_area_boundary(sbi, bh)) + return 1; + return 0; } -static int sanity_check_ckpt(struct f2fs_sb_info *sbi) +int sanity_check_ckpt(struct f2fs_sb_info *sbi) { unsigned int total, fsmeta; struct f2fs_super_block *raw_super = F2FS_RAW_SUPER(sbi); @@ -895,7 +1422,6 @@ static int sanity_check_ckpt(struct f2fs_sb_info *sbi) static void init_sb_info(struct f2fs_sb_info *sbi) { struct f2fs_super_block *raw_super = sbi->raw_super; - int i; sbi->log_sectors_per_block = le32_to_cpu(raw_super->log_sectors_per_block); @@ -915,97 +1441,171 @@ static void init_sb_info(struct f2fs_sb_info *sbi) sbi->cur_victim_sec = NULL_SECNO; sbi->max_victim_search = DEF_MAX_VICTIM_SEARCH; - for (i = 0; i < NR_COUNT_TYPE; i++) - atomic_set(&sbi->nr_pages[i], 0); - sbi->dir_level = DEF_DIR_LEVEL; + sbi->interval_time[CP_TIME] = DEF_CP_INTERVAL; + sbi->interval_time[REQ_TIME] = DEF_IDLE_INTERVAL; clear_sbi_flag(sbi, SBI_NEED_FSCK); + + INIT_LIST_HEAD(&sbi->s_list); + mutex_init(&sbi->umount_mutex); + mutex_init(&sbi->wio_mutex[NODE]); + mutex_init(&sbi->wio_mutex[DATA]); + +#ifdef CONFIG_F2FS_FS_ENCRYPTION + memcpy(sbi->key_prefix, F2FS_KEY_DESC_PREFIX, + F2FS_KEY_DESC_PREFIX_SIZE); + sbi->key_prefix_size = F2FS_KEY_DESC_PREFIX_SIZE; +#endif +} + +static int init_percpu_info(struct f2fs_sb_info *sbi) +{ + int i, err; + + for (i = 0; i < NR_COUNT_TYPE; i++) { + err = percpu_counter_init(&sbi->nr_pages[i], 0); + if (err) + return err; + } + + err = percpu_counter_init(&sbi->alloc_valid_block_count, 0); + if (err) + return err; + + return percpu_counter_init(&sbi->total_valid_inode_count, 0); } /* * Read 
f2fs raw super block. - Because we have two copies of super block, so read the first one at first, - if the first one is invalid, move to read the second one. + Because we have two copies of the super block, read both of them + to get the first valid one. If either one is broken, we pass a + recovery flag back to the caller. */ -static int read_raw_super_block(struct super_block *sb, +static int read_raw_super_block(struct f2fs_sb_info *sbi, struct f2fs_super_block **raw_super, - struct buffer_head **raw_super_buf) + int *valid_super_block, int *recovery) { - int block = 0; + struct super_block *sb = sbi->sb; + int block; + struct buffer_head *bh; + struct f2fs_super_block *super; + int err = 0; + + super = kzalloc(sizeof(struct f2fs_super_block), GFP_KERNEL); + if (!super) + return -ENOMEM; -retry: - *raw_super_buf = sb_bread(sb, block); - if (!*raw_super_buf) { - f2fs_msg(sb, KERN_ERR, "Unable to read %dth superblock", + for (block = 0; block < 2; block++) { + bh = sb_bread(sb, block); + if (!bh) { + f2fs_msg(sb, KERN_ERR, "Unable to read %dth superblock", block + 1); - if (block == 0) { - block++; - goto retry; - } else { - return -EIO; + err = -EIO; + continue; } - } - *raw_super = (struct f2fs_super_block *) - ((char *)(*raw_super_buf)->b_data + F2FS_SUPER_OFFSET); + /* sanity checking of raw super */ + if (sanity_check_raw_super(sbi, bh)) { + f2fs_msg(sb, KERN_ERR, + "Can't find valid F2FS filesystem in %dth superblock", + block + 1); + err = -EINVAL; + brelse(bh); + continue; + } - /* sanity checking of raw super */ - if (sanity_check_raw_super(sb, *raw_super)) { - brelse(*raw_super_buf); - f2fs_msg(sb, KERN_ERR, - "Can't find valid F2FS filesystem in %dth superblock", - block + 1); - if (block == 0) { - block++; - goto retry; - } else { - return -EINVAL; + if (!*raw_super) { + memcpy(super, bh->b_data + F2FS_SUPER_OFFSET, + sizeof(*super)); + *valid_super_block = block; + *raw_super = super; } + brelse(bh); } - return 0; + /* Fail to read any one of the superblocks */ + if (err < 0) + *recovery = 1; + + /* No valid superblock */ + if (!*raw_super) + kfree(super); + else + err = 0; + + return err; +} + +int f2fs_commit_super(struct f2fs_sb_info *sbi, bool recover) +{ + struct buffer_head *bh; + int err; + + if ((recover && f2fs_readonly(sbi->sb)) || + bdev_read_only(sbi->sb->s_bdev)) { + set_sbi_flag(sbi, SBI_NEED_SB_WRITE); + return -EROFS; + } + + /* write back-up superblock first */ + bh = sb_getblk(sbi->sb, sbi->valid_super_block ?
0: 1); + if (!bh) + return -EIO; + err = __f2fs_commit_super(bh, F2FS_RAW_SUPER(sbi)); + brelse(bh); + + /* if we are in recovery path, skip writing valid superblock */ + if (recover || err) + return err; + + /* write current valid superblock */ + bh = sb_getblk(sbi->sb, sbi->valid_super_block); + if (!bh) + return -EIO; + err = __f2fs_commit_super(bh, F2FS_RAW_SUPER(sbi)); + brelse(bh); + return err; } static int f2fs_fill_super(struct super_block *sb, void *data, int silent) { struct f2fs_sb_info *sbi; - struct f2fs_super_block *raw_super = NULL; - struct buffer_head *raw_super_buf; + struct f2fs_super_block *raw_super; struct inode *root; - long err = -EINVAL; + int err; bool retry = true, need_fsck = false; char *options = NULL; - int i; + int recovery, i, valid_super_block; + struct curseg_info *seg_i; try_onemore: + err = -EINVAL; + raw_super = NULL; + valid_super_block = -1; + recovery = 0; + /* allocate memory for f2fs-specific super block info */ sbi = kzalloc(sizeof(struct f2fs_sb_info), GFP_KERNEL); if (!sbi) return -ENOMEM; + sbi->sb = sb; + /* set a block size */ if (unlikely(!sb_set_blocksize(sb, F2FS_BLKSIZE))) { f2fs_msg(sb, KERN_ERR, "unable to set blocksize"); goto free_sbi; } - err = read_raw_super_block(sb, &raw_super, &raw_super_buf); + err = read_raw_super_block(sbi, &raw_super, &valid_super_block, + &recovery); if (err) goto free_sbi; sb->s_fs_info = sbi; - /* init some FS parameters */ - sbi->active_logs = NR_CURSEG_TYPE; - - set_opt(sbi, BG_GC); - set_opt(sbi, INLINE_DATA); + sbi->raw_super = raw_super; -#ifdef CONFIG_F2FS_FS_XATTR - set_opt(sbi, XATTR_USER); -#endif -#ifdef CONFIG_F2FS_FS_POSIX_ACL - set_opt(sbi, POSIX_ACL); -#endif + default_options(sbi); /* parse mount options */ options = kstrdup((const char *)data, GFP_KERNEL); if (data && !options) { @@ -1017,11 +1617,14 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) if (err) goto free_options; - sb->s_maxbytes = max_file_size(le32_to_cpu(raw_super->log_blocksize)); + sbi->max_file_blocks = max_file_blocks(); + sb->s_maxbytes = sbi->max_file_blocks << + le32_to_cpu(raw_super->log_blocksize); sb->s_max_links = F2FS_LINK_MAX; get_random_bytes(&sbi->s_next_generation, sizeof(u32)); sb->s_op = &f2fs_sops; + sb->s_cop = &f2fs_cryptops; sb->s_xattr = f2fs_xattr_handlers; sb->s_export_op = &f2fs_export_ops; sb->s_magic = F2FS_SUPER_MAGIC; @@ -1031,13 +1634,13 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) memcpy(sb->s_uuid, raw_super->uuid, sizeof(raw_super->uuid)); /* init f2fs-specific super block info */ - sbi->sb = sb; - sbi->raw_super = raw_super; - sbi->raw_super_buf = raw_super_buf; + sbi->valid_super_block = valid_super_block; mutex_init(&sbi->gc_mutex); mutex_init(&sbi->cp_mutex); init_rwsem(&sbi->node_write); - clear_sbi_flag(sbi, SBI_POR_DOING); + + /* disallow all the data/node/meta page writes */ + set_sbi_flag(sbi, SBI_POR_DOING); spin_lock_init(&sbi->stat_lock); init_rwsem(&sbi->read_io.io_rwsem); @@ -1053,6 +1656,10 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) init_waitqueue_head(&sbi->cp_wait); init_sb_info(sbi); + err = init_percpu_info(sbi); + if (err) + goto free_options; + /* get an inode for meta space */ sbi->meta_inode = f2fs_iget(sb, F2FS_META_INO(sbi)); if (IS_ERR(sbi->meta_inode)) { @@ -1067,24 +1674,19 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) goto free_meta_inode; } - /* sanity checking of checkpoint */ - err = -EINVAL; - if (sanity_check_ckpt(sbi)) { - 
f2fs_msg(sb, KERN_ERR, "Invalid F2FS checkpoint"); - goto free_cp; - } - sbi->total_valid_node_count = le32_to_cpu(sbi->ckpt->valid_node_count); - sbi->total_valid_inode_count = - le32_to_cpu(sbi->ckpt->valid_inode_count); + percpu_counter_set(&sbi->total_valid_inode_count, + le32_to_cpu(sbi->ckpt->valid_inode_count)); sbi->user_block_count = le64_to_cpu(sbi->ckpt->user_block_count); sbi->total_valid_block_count = le64_to_cpu(sbi->ckpt->valid_block_count); sbi->last_valid_block_count = sbi->total_valid_block_count; - sbi->alloc_valid_block_count = 0; - INIT_LIST_HEAD(&sbi->dir_inode_list); - spin_lock_init(&sbi->dir_inode_lock); + + for (i = 0; i < NR_INODE_TYPE; i++) { + INIT_LIST_HEAD(&sbi->inode_list[i]); + spin_lock_init(&sbi->inode_lock[i]); + } init_extent_cache_info(sbi); @@ -1104,6 +1706,17 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) goto free_nm; } + /* For write statistics */ + if (sb->s_bdev->bd_part) + sbi->sectors_written_start = + (u64)part_stat_read(sb->s_bdev->bd_part, sectors[1]); + + /* Read accumulated write IO statistics if they exist */ + seg_i = CURSEG_I(sbi, CURSEG_HOT_NODE); + if (__exist_node_summaries(sbi)) + sbi->kbytes_written = + le64_to_cpu(seg_i->journal->info.kbytes_written); + build_gc_manager(sbi); /* get an inode for node space */ @@ -1114,8 +1727,12 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) goto free_nm; } + f2fs_join_shrinker(sbi); + /* if there are any orphan nodes, free them */ - recover_orphan_inodes(sbi); + err = recover_orphan_inodes(sbi); + if (err) + goto free_node_inode; /* read root inode and dentry */ root = f2fs_iget(sb, F2FS_ROOT_INO(sbi)); @@ -1143,16 +1760,11 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) if (f2fs_proc_root) sbi->s_proc = proc_mkdir(sb->s_id, f2fs_proc_root); - if (sbi->s_proc) + if (sbi->s_proc) { proc_create_data("segment_info", S_IRUGO, sbi->s_proc, &f2fs_seq_segment_info_fops, sb); - - if (test_opt(sbi, DISCARD)) { - struct request_queue *q = bdev_get_queue(sb->s_bdev); - if (!blk_queue_discard(q)) - f2fs_msg(sb, KERN_WARNING, - "mounting with \"discard\" option, but " - "the device does not support discard"); + proc_create_data("segment_bits", S_IRUGO, sbi->s_proc, + &f2fs_seq_segment_bits_fops, sb); } sbi->s_kobj.kset = f2fs_kset; @@ -1177,15 +1789,27 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) if (need_fsck) set_sbi_flag(sbi, SBI_NEED_FSCK); - err = recover_fsync_data(sbi); - if (err) { + err = recover_fsync_data(sbi, false); + if (err < 0) { need_fsck = true; f2fs_msg(sb, KERN_ERR, - "Cannot recover all fsync data errno=%ld", err); + "Cannot recover all fsync data errno=%d", err); + goto free_kobj; + } else { + err = recover_fsync_data(sbi, true); + + if (!f2fs_readonly(sb) && err > 0) { + err = -EINVAL; + f2fs_msg(sb, KERN_ERR, + "Need to recover fsync data"); goto free_kobj; } } + /* recover_fsync_data() cleared this already */ + clear_sbi_flag(sbi, SBI_POR_DOING); + /* * If filesystem is not mounted as read-only then * do start the gc_thread. @@ -1197,13 +1821,28 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) goto free_kobj; } kfree(options); + + /* recover broken superblock */ + if (recovery) { + err = f2fs_commit_super(sbi, true); + f2fs_msg(sb, KERN_INFO, + "Try to recover %dth superblock, ret: %d", + sbi->valid_super_block ?
1 : 2, err); + } + + f2fs_update_time(sbi, CP_TIME); + f2fs_update_time(sbi, REQ_TIME); return 0; free_kobj: + f2fs_sync_inode_meta(sbi); kobject_del(&sbi->s_kobj); + kobject_put(&sbi->s_kobj); + wait_for_completion(&sbi->s_kobj_unregister); free_proc: if (sbi->s_proc) { remove_proc_entry("segment_info", sbi->s_proc); + remove_proc_entry("segment_bits", sbi->s_proc); remove_proc_entry(sb->s_id, f2fs_proc_root); } f2fs_destroy_stats(sbi); @@ -1211,20 +1850,23 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) dput(sb->s_root); sb->s_root = NULL; free_node_inode: + mutex_lock(&sbi->umount_mutex); + f2fs_leave_shrinker(sbi); iput(sbi->node_inode); + mutex_unlock(&sbi->umount_mutex); free_nm: destroy_node_manager(sbi); free_sm: destroy_segment_manager(sbi); -free_cp: kfree(sbi->ckpt); free_meta_inode: make_bad_inode(sbi->meta_inode); iput(sbi->meta_inode); free_options: + destroy_percpu_info(sbi); kfree(options); free_sb_buf: - brelse(raw_super_buf); + kfree(raw_super); free_sbi: kfree(sbi); @@ -1303,14 +1945,37 @@ static int __init init_f2fs_fs(void) err = -ENOMEM; goto free_extent_cache; } + +#ifdef CONFIG_F2FS_FAULT_INJECTION + f2fs_fault_inject.kset = f2fs_kset; + f2fs_build_fault_attr(0); + err = kobject_init_and_add(&f2fs_fault_inject, &f2fs_fault_ktype, + NULL, "fault_injection"); + if (err) { + f2fs_fault_inject.kset = NULL; + goto free_kset; + } +#endif + register_shrinker(&f2fs_shrinker_info); + err = register_filesystem(&f2fs_fs_type); if (err) - goto free_kset; - f2fs_create_root_stats(); + goto free_shrinker; + err = f2fs_create_root_stats(); + if (err) + goto free_filesystem; f2fs_proc_root = proc_mkdir("fs/f2fs", NULL); return 0; +free_filesystem: + unregister_filesystem(&f2fs_fs_type); +free_shrinker: + unregister_shrinker(&f2fs_shrinker_info); +#ifdef CONFIG_F2FS_FAULT_INJECTION free_kset: + if (f2fs_fault_inject.kset) + kobject_put(&f2fs_fault_inject); +#endif kset_unregister(f2fs_kset); free_extent_cache: destroy_extent_cache(); @@ -1331,12 +1996,16 @@ static void __exit exit_f2fs_fs(void) remove_proc_entry("fs/f2fs", NULL); f2fs_destroy_root_stats(); unregister_filesystem(&f2fs_fs_type); + unregister_shrinker(&f2fs_shrinker_info); +#ifdef CONFIG_F2FS_FAULT_INJECTION + kobject_put(&f2fs_fault_inject); +#endif + kset_unregister(f2fs_kset); destroy_extent_cache(); destroy_checkpoint_caches(); destroy_segment_manager_caches(); destroy_node_manager_caches(); destroy_inodecache(); - kset_unregister(f2fs_kset); f2fs_destroy_trace_ios(); } diff --git a/fs/f2fs/trace.c b/fs/f2fs/trace.c old mode 100644 new mode 100755 index 875aa8179bc1..562ce0821559 --- a/fs/f2fs/trace.c +++ b/fs/f2fs/trace.c @@ -29,7 +29,8 @@ static inline void __print_last_io(void) last_io.major, last_io.minor, last_io.pid, "----------------", last_io.type, - last_io.fio.rw, last_io.fio.blk_addr, + last_io.fio.rw, + last_io.fio.new_blkaddr, last_io.len); memset(&last_io, 0, sizeof(last_io)); } @@ -80,7 +81,7 @@ void f2fs_trace_pid(struct page *page) radix_tree_preload_end(); } -void f2fs_trace_ios(struct page *page, struct f2fs_io_info *fio, int flush) +void f2fs_trace_ios(struct f2fs_io_info *fio, int flush) { struct inode *inode; pid_t pid; @@ -91,8 +92,8 @@ void f2fs_trace_ios(struct page *page, struct f2fs_io_info *fio, int flush) return; } - inode = page->mapping->host; - pid = page_private(page); + inode = fio->page->mapping->host; + pid = page_private(fio->page); major = MAJOR(inode->i_sb->s_dev); minor = MINOR(inode->i_sb->s_dev); @@ -101,7 +102,8 @@ void f2fs_trace_ios(struct 
page *page, struct f2fs_io_info *fio, int flush) last_io.pid == pid && last_io.type == __file_type(inode, pid) && last_io.fio.rw == fio->rw && - last_io.fio.blk_addr + last_io.len == fio->blk_addr) { + last_io.fio.new_blkaddr + last_io.len == + fio->new_blkaddr) { last_io.len++; return; } diff --git a/fs/f2fs/trace.h b/fs/f2fs/trace.h old mode 100644 new mode 100755 index 1041dbeb52ae..67db24ac1e85 --- a/fs/f2fs/trace.h +++ b/fs/f2fs/trace.h @@ -33,12 +33,12 @@ struct last_io_info { }; extern void f2fs_trace_pid(struct page *); -extern void f2fs_trace_ios(struct page *, struct f2fs_io_info *, int); +extern void f2fs_trace_ios(struct f2fs_io_info *, int); extern void f2fs_build_trace_ios(void); extern void f2fs_destroy_trace_ios(void); #else #define f2fs_trace_pid(p) -#define f2fs_trace_ios(p, i, n) +#define f2fs_trace_ios(i, n) #define f2fs_build_trace_ios() #define f2fs_destroy_trace_ios() diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c old mode 100644 new mode 100755 index 0569d5f8b552..d9dfe530b1a7 --- a/fs/f2fs/xattr.c +++ b/fs/f2fs/xattr.c @@ -82,7 +82,7 @@ static int f2fs_xattr_generic_get(struct dentry *dentry, const char *name, } if (strcmp(name, "") == 0) return -EINVAL; - return f2fs_getxattr(dentry->d_inode, type, name, buffer, size, NULL); + return f2fs_getxattr(d_inode(dentry), type, name, buffer, size, NULL); } static int f2fs_xattr_generic_set(struct dentry *dentry, const char *name, @@ -107,7 +107,7 @@ static int f2fs_xattr_generic_set(struct dentry *dentry, const char *name, if (strcmp(name, "") == 0) return -EINVAL; - return f2fs_setxattr(dentry->d_inode, type, name, + return f2fs_setxattr(d_inode(dentry), type, name, value, size, NULL, flags); } @@ -129,7 +129,7 @@ static size_t f2fs_xattr_advise_list(struct dentry *dentry, char *list, static int f2fs_xattr_advise_get(struct dentry *dentry, const char *name, void *buffer, size_t size, int type) { - struct inode *inode = dentry->d_inode; + struct inode *inode = d_inode(dentry); if (strcmp(name, "") != 0) return -EINVAL; @@ -142,7 +142,7 @@ static int f2fs_xattr_advise_get(struct dentry *dentry, const char *name, static int f2fs_xattr_advise_set(struct dentry *dentry, const char *name, const void *value, size_t size, int flags, int type) { - struct inode *inode = dentry->d_inode; + struct inode *inode = d_inode(dentry); if (strcmp(name, "") != 0) return -EINVAL; @@ -152,7 +152,7 @@ static int f2fs_xattr_advise_set(struct dentry *dentry, const char *name, return -EINVAL; F2FS_I(inode)->i_advise |= *(char *)value; - mark_inode_dirty(inode); + f2fs_mark_inode_dirty_sync(inode); return 0; } @@ -346,7 +346,8 @@ static inline int write_all_xattrs(struct inode *inode, __u32 hsize, if (ipage) { inline_addr = inline_xattr_addr(ipage); - f2fs_wait_on_page_writeback(ipage, NODE); + f2fs_wait_on_page_writeback(ipage, NODE, true); + set_page_dirty(ipage); } else { page = get_node_page(sbi, inode->i_ino); if (IS_ERR(page)) { @@ -354,7 +355,7 @@ static inline int write_all_xattrs(struct inode *inode, __u32 hsize, return PTR_ERR(page); } inline_addr = inline_xattr_addr(page); - f2fs_wait_on_page_writeback(page, NODE); + f2fs_wait_on_page_writeback(page, NODE, true); } memcpy(inline_addr, txattr_addr, inline_size); f2fs_put_page(page, 1); @@ -375,7 +376,7 @@ static inline int write_all_xattrs(struct inode *inode, __u32 hsize, return PTR_ERR(xpage); } f2fs_bug_on(sbi, new_nid); - f2fs_wait_on_page_writeback(xpage, NODE); + f2fs_wait_on_page_writeback(xpage, NODE, true); } else { struct dnode_of_data dn; set_new_dnode(&dn, inode, NULL, NULL, 
new_nid); @@ -443,7 +444,7 @@ int f2fs_getxattr(struct inode *inode, int index, const char *name, ssize_t f2fs_listxattr(struct dentry *dentry, char *buffer, size_t buffer_size) { - struct inode *inode = dentry->d_inode; + struct inode *inode = d_inode(dentry); struct f2fs_xattr_entry *entry; void *base_addr; int error = 0; @@ -482,13 +483,12 @@ static int __f2fs_setxattr(struct inode *inode, int index, const char *name, const void *value, size_t size, struct page *ipage, int flags) { - struct f2fs_inode_info *fi = F2FS_I(inode); struct f2fs_xattr_entry *here, *last; void *base_addr; int found, newsize; size_t len; __u32 new_hsize; - int error = -ENOMEM; + int error = 0; if (name == NULL) return -EINVAL; @@ -498,12 +498,15 @@ static int __f2fs_setxattr(struct inode *inode, int index, len = strlen(name); - if (len > F2FS_NAME_LEN || size > MAX_VALUE_LEN(inode)) + if (len > F2FS_NAME_LEN) return -ERANGE; + if (size > MAX_VALUE_LEN(inode)) + return -E2BIG; + base_addr = read_all_xattrs(inode, ipage); if (!base_addr) - goto exit; + return -ENOMEM; /* find entry with wanted name. */ here = __find_xattr(base_addr, index, len, name); @@ -536,7 +539,7 @@ static int __f2fs_setxattr(struct inode *inode, int index, free = free + ENTRY_SIZE(here); if (unlikely(free < newsize)) { - error = -ENOSPC; + error = -E2BIG; goto exit; } } @@ -564,7 +567,6 @@ static int __f2fs_setxattr(struct inode *inode, int index, * Before we come here, old entry is removed. * We just write new entry. */ - memset(last, 0, newsize); last->e_name_index = index; last->e_name_len = len; memcpy(last->e_name, name, len); @@ -578,16 +580,15 @@ static int __f2fs_setxattr(struct inode *inode, int index, if (error) goto exit; - if (is_inode_flag_set(fi, FI_ACL_MODE)) { - inode->i_mode = fi->i_acl_mode; + if (is_inode_flag_set(inode, FI_ACL_MODE)) { + inode->i_mode = F2FS_I(inode)->i_acl_mode; inode->i_ctime = CURRENT_TIME; - clear_inode_flag(fi, FI_ACL_MODE); + clear_inode_flag(inode, FI_ACL_MODE); } - - if (ipage) - update_inode(inode, ipage); - else - update_inode_page(inode); + if (index == F2FS_XATTR_INDEX_ENCRYPTION && + !strcmp(name, F2FS_XATTR_NAME_ENCRYPTION_CONTEXT)) + f2fs_set_encrypted_inode(inode); + f2fs_mark_inode_dirty_sync(inode); exit: kzfree(base_addr); return error; @@ -604,7 +605,7 @@ int f2fs_setxattr(struct inode *inode, int index, const char *name, if (ipage) return __f2fs_setxattr(inode, index, name, value, size, ipage, flags); - f2fs_balance_fs(sbi); + f2fs_balance_fs(sbi, true); f2fs_lock_op(sbi); /* protect xattr_ver */ @@ -613,5 +614,6 @@ int f2fs_setxattr(struct inode *inode, int index, const char *name, up_write(&F2FS_I(inode)->i_sem); f2fs_unlock_op(sbi); + f2fs_update_time(sbi, REQ_TIME); return err; } diff --git a/fs/f2fs/xattr.h b/fs/f2fs/xattr.h old mode 100644 new mode 100755 index 95b55a0c55cd..f3f8181b5474 --- a/fs/f2fs/xattr.h +++ b/fs/f2fs/xattr.h @@ -35,6 +35,10 @@ #define F2FS_XATTR_INDEX_LUSTRE 5 #define F2FS_XATTR_INDEX_SECURITY 6 #define F2FS_XATTR_INDEX_ADVISE 7 +/* Should be same as EXT4_XATTR_INDEX_ENCRYPTION */ +#define F2FS_XATTR_INDEX_ENCRYPTION 9 + +#define F2FS_XATTR_NAME_ENCRYPTION_CONTEXT "c" struct f2fs_xattr_header { __le32 h_magic; /* magic number for identification */ @@ -124,7 +128,8 @@ extern ssize_t f2fs_listxattr(struct dentry *, char *, size_t); #define f2fs_xattr_handlers NULL static inline int f2fs_setxattr(struct inode *inode, int index, - const char *name, const void *value, size_t size, int flags) + const char *name, const void *value, size_t size, + struct page 
*page, int flags) { return -EOPNOTSUPP; } diff --git a/include/linux/dcache.h b/include/linux/dcache.h index ed96aa5647c9..5d165342166b 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -192,6 +192,8 @@ struct dentry_operations { #define DCACHE_MANAGED_DENTRY \ (DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT) +#define DCACHE_ENCRYPTED_WITH_KEY 0x04000000 /* dir is encrypted with a valid key */ + extern seqlock_t rename_lock; static inline int dname_external(struct dentry *dentry) diff --git a/include/linux/f2fs_fs.h b/include/linux/f2fs_fs.h old mode 100644 new mode 100755 index 591f8c3ef410..1a1d318f373f --- a/include/linux/f2fs_fs.h +++ b/include/linux/f2fs_fs.h @@ -11,6 +11,10 @@ #ifndef _LINUX_F2FS_FS_H #define _LINUX_F2FS_FS_H +#ifdef CONFIG_F2FS_FS_ENCRYPTION +#undef CONFIG_F2FS_FS_ENCRYPTION +#endif + #include <linux/pagemap.h> #include <linux/types.h> @@ -21,7 +25,7 @@ #define F2FS_BLKSIZE 4096 /* support only 4KB block */ #define F2FS_BLKSIZE_BITS 12 /* bits for F2FS_BLKSIZE */ #define F2FS_MAX_EXTENSION 64 /* # of extension entries */ -#define F2FS_BLK_ALIGN(x) (((x) + F2FS_BLKSIZE - 1) / F2FS_BLKSIZE) +#define F2FS_BLK_ALIGN(x) (((x) + F2FS_BLKSIZE - 1) >> F2FS_BLKSIZE_BITS) #define NULL_ADDR ((block_t)0) /* used as block_t addresses */ #define NEW_ADDR ((block_t)-1) /* used as block_t addresses */ @@ -50,6 +54,9 @@ #define MAX_ACTIVE_NODE_LOGS 8 #define MAX_ACTIVE_DATA_LOGS 8 +#define VERSION_LEN 256 +#define MAX_VOLUME_NAME 512 + /* * For superblock */ @@ -82,10 +89,16 @@ struct f2fs_super_block { __le32 node_ino; /* node inode number */ __le32 meta_ino; /* meta inode number */ __u8 uuid[16]; /* 128-bit uuid for volume */ - __le16 volume_name[512]; /* volume name */ + __le16 volume_name[MAX_VOLUME_NAME]; /* volume name */ __le32 extension_count; /* # of extensions below */ __u8 extension_list[F2FS_MAX_EXTENSION][8]; /* extension array */ __le32 cp_payload; + __u8 version[VERSION_LEN]; /* the kernel version */ + __u8 init_version[VERSION_LEN]; /* the initial kernel version */ + __le32 feature; /* defined features */ + __u8 encryption_level; /* versioning level for encryption */ + __u8 encrypt_pw_salt[16]; /* Salt used for string2key algorithm */ + __u8 reserved[871]; /* valid reserved region */ } __packed; /* @@ -161,12 +174,12 @@ struct f2fs_extent { #define F2FS_INLINE_XATTR_ADDRS 50 /* 200 bytes for inline xattrs */ #define DEF_ADDRS_PER_INODE 923 /* Address Pointers in an Inode */ #define DEF_NIDS_PER_INODE 5 /* Node IDs in an Inode */ -#define ADDRS_PER_INODE(fi) addrs_per_inode(fi) +#define ADDRS_PER_INODE(inode) addrs_per_inode(inode) #define ADDRS_PER_BLOCK 1018 /* Address Pointers in a Direct Block */ #define NIDS_PER_BLOCK 1018 /* Node IDs in an Indirect Block */ -#define ADDRS_PER_PAGE(page, fi) \ - (IS_INODE(page) ? ADDRS_PER_INODE(fi) : ADDRS_PER_BLOCK) +#define ADDRS_PER_PAGE(page, inode) \ + (IS_INODE(page) ?
ADDRS_PER_INODE(inode) : ADDRS_PER_BLOCK) #define NODE_DIR1_BLOCK (DEF_ADDRS_PER_INODE + 1) #define NODE_DIR2_BLOCK (DEF_ADDRS_PER_INODE + 2) @@ -336,7 +349,7 @@ struct f2fs_summary { struct summary_footer { unsigned char entry_type; /* SUM_TYPE_XXX */ - __u32 check_sum; /* summary checksum */ + __le32 check_sum; /* summary checksum */ } __packed; #define SUM_JOURNAL_SIZE (F2FS_BLKSIZE - SUM_FOOTER_SIZE -\ @@ -349,6 +362,12 @@ struct summary_footer { sizeof(struct sit_journal_entry)) #define SIT_JOURNAL_RESERVED ((SUM_JOURNAL_SIZE - 2) %\ sizeof(struct sit_journal_entry)) + +/* Reserved area should make size of f2fs_extra_info equal to + * that of nat_journal and sit_journal. + */ +#define EXTRA_INFO_RESERVED (SUM_JOURNAL_SIZE - 2 - 8) + /* * frequently updated NAT/SIT entries can be stored in the spare area in * summary blocks @@ -378,18 +397,28 @@ struct sit_journal { __u8 reserved[SIT_JOURNAL_RESERVED]; } __packed; -/* 4KB-sized summary block structure */ -struct f2fs_summary_block { - struct f2fs_summary entries[ENTRIES_IN_SUM]; +struct f2fs_extra_info { + __le64 kbytes_written; + __u8 reserved[EXTRA_INFO_RESERVED]; +} __packed; + +struct f2fs_journal { union { __le16 n_nats; __le16 n_sits; }; - /* spare area is used by NAT or SIT journals */ + /* spare area is used by NAT or SIT journals or extra info */ union { struct nat_journal nat_j; struct sit_journal sit_j; + struct f2fs_extra_info info; }; +} __packed; + +/* 4KB-sized summary block structure */ +struct f2fs_summary_block { + struct f2fs_summary entries[ENTRIES_IN_SUM]; + struct f2fs_journal journal; struct summary_footer footer; } __packed; @@ -409,15 +438,25 @@ typedef __le32 f2fs_hash_t; #define GET_DENTRY_SLOTS(x) ((x + F2FS_SLOT_LEN - 1) >> F2FS_SLOT_LEN_BITS) -/* the number of dentry in a block */ -#define NR_DENTRY_IN_BLOCK 214 - /* MAX level for dir lookup */ #define MAX_DIR_HASH_DEPTH 63 /* MAX buckets in one level of dir */ #define MAX_DIR_BUCKETS (1 << ((MAX_DIR_HASH_DEPTH / 2) - 1)) +/* + * space utilization of regular dentry and inline dentry + * regular dentry inline dentry + * bitmap 1 * 27 = 27 1 * 23 = 23 + * reserved 1 * 3 = 3 1 * 7 = 7 + * dentry 11 * 214 = 2354 11 * 182 = 2002 + * filename 8 * 214 = 1712 8 * 182 = 1456 + * total 4096 3488 + * + * Note: there is more reserved space in inline dentry than in regular + * dentry, when converting inline dentry we should handle this carefully.
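The byte totals in the table above follow from NR_DENTRY_IN_BLOCK (214), the 182 slots of an inline dentry, SIZE_OF_DIR_ENTRY (11), and the 8-byte filename slot; a quick standalone arithmetic check (not part of the header itself):

	int bitmap = (214 + 7) / 8;	/* 27 bytes: one valid bit per slot */
	int dentry = 11 * 214;		/* 2354 bytes of directory entries  */
	int fname  = 8 * 214;		/* 1712 bytes of filename slots     */
	/* 27 + 3 (reserved) + 2354 + 1712 = 4096 = one 4KB block */

The inline variant adds up the same way: 23 + 7 + 2002 + 1456 = 3488 bytes.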
+ */ +#define NR_DENTRY_IN_BLOCK 214 /* the number of dentry in a block */ #define SIZE_OF_DIR_ENTRY 11 /* by byte */ #define SIZE_OF_DENTRY_BITMAP ((NR_DENTRY_IN_BLOCK + BITS_PER_BYTE - 1) / \ BITS_PER_BYTE) @@ -473,4 +512,6 @@ enum { F2FS_FT_MAX }; +#define S_SHIFT 12 + #endif /* _LINUX_F2FS_FS_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 3eba4e4af6d0..46b56fce9c86 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -355,6 +355,24 @@ struct inodes_stat_t { #define FS_IOC32_GETVERSION _IOR('v', 1, int) #define FS_IOC32_SETVERSION _IOW('v', 2, int) +/* + * File system encryption support + */ +/* Policy provided via an ioctl on the topmost directory */ +#define FS_KEY_DESCRIPTOR_SIZE 8 + +struct fscrypt_policy { + __u8 version; + __u8 contents_encryption_mode; + __u8 filenames_encryption_mode; + __u8 flags; + __u8 master_key_descriptor[FS_KEY_DESCRIPTOR_SIZE]; +} __packed; + +#define FS_IOC_SET_ENCRYPTION_POLICY _IOR('f', 19, struct fscrypt_policy) +#define FS_IOC_GET_ENCRYPTION_PWSALT _IOW('f', 20, __u8[16]) +#define FS_IOC_GET_ENCRYPTION_POLICY _IOW('f', 21, struct fscrypt_policy) + /* * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS) */ @@ -437,6 +455,8 @@ struct kstatfs; struct vm_area_struct; struct vfsmount; struct cred; +struct fscrypt_info; +struct fscrypt_operations; extern void __init inode_init(void); extern void __init inode_init_early(void); @@ -867,6 +887,11 @@ struct inode { #ifdef CONFIG_IMA atomic_t i_readcount; /* struct files open RO */ #endif + +#if IS_ENABLED(CONFIG_FS_ENCRYPTION) + struct fscrypt_info *i_crypt_info; +#endif + void *i_private; /* fs or device private pointer */ }; @@ -1453,6 +1478,9 @@ struct super_block { const struct xattr_handler **s_xattr; struct list_head s_inodes; /* all inodes */ + + const struct fscrypt_operations *s_cop; + struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */ struct list_head s_mounts; /* list of mounts; _not_ for fs use */ /* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */ diff --git a/include/linux/fscrypto.h b/include/linux/fscrypto.h new file mode 100755 index 000000000000..446ae03608e0 --- /dev/null +++ b/include/linux/fscrypto.h @@ -0,0 +1,435 @@ +/* + * General per-file encryption definition + * + * Copyright (C) 2015, Google, Inc. + * + * Written by Michael Halcrow, 2015. + * Modified by Jaegeuk Kim, 2015. 
diff --git a/include/linux/fscrypto.h b/include/linux/fscrypto.h
new file mode 100755
index 000000000000..446ae03608e0
--- /dev/null
+++ b/include/linux/fscrypto.h
@@ -0,0 +1,435 @@
+/*
+ * General per-file encryption definition
+ *
+ * Copyright (C) 2015, Google, Inc.
+ *
+ * Written by Michael Halcrow, 2015.
+ * Modified by Jaegeuk Kim, 2015.
+ */
+
+#ifndef _LINUX_FSCRYPTO_H
+#define _LINUX_FSCRYPTO_H
+
+#include <linux/key.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/bio.h>
+#include <linux/dcache.h>
+
+#define FS_KEY_DERIVATION_NONCE_SIZE		16
+#define FS_ENCRYPTION_CONTEXT_FORMAT_V1		1
+
+#define FS_POLICY_FLAGS_PAD_4		0x00
+#define FS_POLICY_FLAGS_PAD_8		0x01
+#define FS_POLICY_FLAGS_PAD_16		0x02
+#define FS_POLICY_FLAGS_PAD_32		0x03
+#define FS_POLICY_FLAGS_PAD_MASK	0x03
+#define FS_POLICY_FLAGS_VALID		0x03
+
+/* Encryption algorithms */
+#define FS_ENCRYPTION_MODE_INVALID		0
+#define FS_ENCRYPTION_MODE_AES_256_XTS		1
+#define FS_ENCRYPTION_MODE_AES_256_GCM		2
+#define FS_ENCRYPTION_MODE_AES_256_CBC		3
+#define FS_ENCRYPTION_MODE_AES_256_CTS		4
+
+/**
+ * Encryption context for inode
+ *
+ * Protector format:
+ * 1 byte: Protector format (1 = this version)
+ * 1 byte: File contents encryption mode
+ * 1 byte: File names encryption mode
+ * 1 byte: Flags
+ * 8 bytes: Master Key descriptor
+ * 16 bytes: Encryption Key derivation nonce
+ */
+struct fscrypt_context {
+	u8 format;
+	u8 contents_encryption_mode;
+	u8 filenames_encryption_mode;
+	u8 flags;
+	u8 master_key_descriptor[FS_KEY_DESCRIPTOR_SIZE];
+	u8 nonce[FS_KEY_DERIVATION_NONCE_SIZE];
+} __packed;
+
+/* Encryption parameters */
+#define FS_XTS_TWEAK_SIZE		16
+#define FS_AES_128_ECB_KEY_SIZE		16
+#define FS_AES_256_GCM_KEY_SIZE		32
+#define FS_AES_256_CBC_KEY_SIZE		32
+#define FS_AES_256_CTS_KEY_SIZE		32
+#define FS_AES_256_XTS_KEY_SIZE		64
+#define FS_MAX_KEY_SIZE			64
+
+#define FS_KEY_DESC_PREFIX		"fscrypt:"
+#define FS_KEY_DESC_PREFIX_SIZE		8
+
+/* This is passed in from userspace into the kernel keyring */
+struct fscrypt_key {
+	u32 mode;
+	u8 raw[FS_MAX_KEY_SIZE];
+	u32 size;
+} __packed;
+
+struct fscrypt_info {
+	u8 ci_data_mode;
+	u8 ci_filename_mode;
+	u8 ci_flags;
+	struct crypto_ablkcipher *ci_ctfm;
+	struct key *ci_keyring_key;
+	u8 ci_master_key[FS_KEY_DESCRIPTOR_SIZE];
+};
+
+#define FS_CTX_REQUIRES_FREE_ENCRYPT_FL		0x00000001
+#define FS_WRITE_PATH_FL			0x00000002
+
+struct fscrypt_ctx {
+	union {
+		struct {
+			struct page *bounce_page;	/* Ciphertext page */
+			struct page *control_page;	/* Original page */
+		} w;
+		struct {
+			struct bio *bio;
+			struct work_struct work;
+		} r;
+		struct list_head free_list;	/* Free list */
+	};
+	u8 flags;				/* Flags */
+	u8 mode;				/* Encryption mode for tfm */
+};
+
+struct fscrypt_completion_result {
+	struct completion completion;
+	int res;
+};
+
+#define DECLARE_FS_COMPLETION_RESULT(ecr) \
+	struct fscrypt_completion_result ecr = { \
+		COMPLETION_INITIALIZER((ecr).completion), 0 }
+
+static inline int fscrypt_key_size(int mode)
+{
+	switch (mode) {
+	case FS_ENCRYPTION_MODE_AES_256_XTS:
+		return FS_AES_256_XTS_KEY_SIZE;
+	case FS_ENCRYPTION_MODE_AES_256_GCM:
+		return FS_AES_256_GCM_KEY_SIZE;
+	case FS_ENCRYPTION_MODE_AES_256_CBC:
+		return FS_AES_256_CBC_KEY_SIZE;
+	case FS_ENCRYPTION_MODE_AES_256_CTS:
+		return FS_AES_256_CTS_KEY_SIZE;
+	default:
+		BUG();
+	}
+	return 0;
+}
+
+#define FS_FNAME_NUM_SCATTER_ENTRIES	4
+#define FS_CRYPTO_BLOCK_SIZE		16
+#define FS_FNAME_CRYPTO_DIGEST_SIZE	32
+
+/**
+ * For encrypted symlinks, the ciphertext length is stored at the beginning
+ * of the string in little-endian format.
+ */
+struct fscrypt_symlink_data {
+	__le16 len;
+	char encrypted_path[1];
+} __packed;
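The helper defined just below pads short ciphertext up to one FS_CRYPTO_BLOCK_SIZE (16-byte) block, then adds sizeof(struct fscrypt_symlink_data) - 1 = 2 bytes for the __le16 length prefix. A standalone sketch that mirrors the logic rather than calling the kernel helper (illustrative only):

	#include <assert.h>
	#include <stdint.h>

	/* Mirrors fscrypt_symlink_data_len(): pad the ciphertext length up
	 * to one 16-byte crypto block, then add the 2-byte length prefix
	 * (sizeof(struct fscrypt_symlink_data) - 1 under the packed layout). */
	static uint32_t symlink_disk_len(uint32_t l)
	{
		if (l < 16)			/* FS_CRYPTO_BLOCK_SIZE */
			l = 16;
		return l + 2;
	}

	int main(void)
	{
		assert(symlink_disk_len(11) == 18);	/* short name pads to a block */
		assert(symlink_disk_len(20) == 22);	/* longer name only gains the prefix */
		return 0;
	}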
+
+/**
+ * This function is used to calculate the disk space required to
+ * store a filename of length l in encrypted symlink format.
+ */
+static inline u32 fscrypt_symlink_data_len(u32 l)
+{
+	if (l < FS_CRYPTO_BLOCK_SIZE)
+		l = FS_CRYPTO_BLOCK_SIZE;
+	return (l + sizeof(struct fscrypt_symlink_data) - 1);
+}
+
+struct fscrypt_str {
+	unsigned char *name;
+	u32 len;
+};
+
+struct fscrypt_name {
+	const struct qstr *usr_fname;
+	struct fscrypt_str disk_name;
+	u32 hash;
+	u32 minor_hash;
+	struct fscrypt_str crypto_buf;
+};
+
+#define QSTR_INIT(n, l)		{ .name = n, .len = l }
+#define FSTR_INIT(n, l)		{ .name = n, .len = l }
+#define FSTR_TO_QSTR(f)		QSTR_INIT((f)->name, (f)->len)
+#define fname_name(p)		((p)->disk_name.name)
+#define fname_len(p)		((p)->disk_name.len)
+
+/*
+ * crypto operations for filesystems
+ */
+struct fscrypt_operations {
+	int (*get_context)(struct inode *, void *, size_t);
+	int (*key_prefix)(struct inode *, u8 **);
+	int (*prepare_context)(struct inode *);
+	int (*set_context)(struct inode *, const void *, size_t, void *);
+	int (*dummy_context)(struct inode *);
+	bool (*is_encrypted)(struct inode *);
+	bool (*empty_dir)(struct inode *);
+	unsigned (*max_namelen)(struct inode *);
+};
+
+static inline bool fscrypt_dummy_context_enabled(struct inode *inode)
+{
+	if (inode->i_sb->s_cop->dummy_context &&
+				inode->i_sb->s_cop->dummy_context(inode))
+		return true;
+	return false;
+}
+
+static inline bool fscrypt_valid_contents_enc_mode(u32 mode)
+{
+	return (mode == FS_ENCRYPTION_MODE_AES_256_XTS);
+}
+
+static inline bool fscrypt_valid_filenames_enc_mode(u32 mode)
+{
+	return (mode == FS_ENCRYPTION_MODE_AES_256_CTS);
+}
+
+static inline u32 fscrypt_validate_encryption_key_size(u32 mode, u32 size)
+{
+	if (size == fscrypt_key_size(mode))
+		return size;
+	return 0;
+}
+
+static inline bool fscrypt_is_dot_dotdot(const struct qstr *str)
+{
+	if (str->len == 1 && str->name[0] == '.')
+		return true;
+
+	if (str->len == 2 && str->name[0] == '.'
+					&& str->name[1] == '.')
+		return true;
+
+	return false;
+}
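With fscrypt_operations defined above, a filesystem opts in by filling a table and pointing sb->s_cop at it during mount. A hedged sketch against this header, where every myfs_* name is a hypothetical stand-in rather than a function from this patch:

	/* Every myfs_* name below is a hypothetical stand-in. */
	static bool myfs_is_encrypted(struct inode *inode)
	{
		return false;		/* e.g. test an on-disk inode flag */
	}

	static const struct fscrypt_operations myfs_cryptops = {
		.get_context	= NULL,	/* would read an fscrypt_context xattr */
		.set_context	= NULL,	/* would write an fscrypt_context xattr */
		.is_encrypted	= myfs_is_encrypted,
	};

	/* in fill_super(): sb->s_cop = &myfs_cryptops; */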
+
+static inline struct page *fscrypt_control_page(struct page *page)
+{
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+	return ((struct fscrypt_ctx *)page_private(page))->w.control_page;
+#else
+	WARN_ON_ONCE(1);
+	return ERR_PTR(-EINVAL);
+#endif
+}
+
+static inline int fscrypt_has_encryption_key(struct inode *inode)
+{
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+	return (inode->i_crypt_info != NULL);
+#else
+	return 0;
+#endif
+}
+
+static inline void fscrypt_set_encrypted_dentry(struct dentry *dentry)
+{
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags |= DCACHE_ENCRYPTED_WITH_KEY;
+	spin_unlock(&dentry->d_lock);
+#endif
+}
+
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+extern const struct dentry_operations fscrypt_d_ops;
+#endif
+
+static inline void fscrypt_set_d_op(struct dentry *dentry)
+{
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+	d_set_d_op(dentry, &fscrypt_d_ops);
+#endif
+}
+
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+/* crypto.c */
+extern struct kmem_cache *fscrypt_info_cachep;
+int fscrypt_initialize(void);
+
+extern struct fscrypt_ctx *fscrypt_get_ctx(struct inode *, gfp_t);
+extern void fscrypt_release_ctx(struct fscrypt_ctx *);
+extern struct page *fscrypt_encrypt_page(struct inode *, struct page *, gfp_t);
+extern int fscrypt_decrypt_page(struct page *);
+extern void fscrypt_decrypt_bio_pages(struct fscrypt_ctx *, struct bio *);
+extern void fscrypt_pullback_bio_page(struct page **, bool);
+extern void fscrypt_restore_control_page(struct page *);
+extern int fscrypt_zeroout_range(struct inode *, pgoff_t, sector_t,
+						unsigned int);
+/* policy.c */
+extern int fscrypt_process_policy(struct inode *,
+					const struct fscrypt_policy *);
+extern int fscrypt_get_policy(struct inode *, struct fscrypt_policy *);
+extern int fscrypt_has_permitted_context(struct inode *, struct inode *);
+extern int fscrypt_inherit_context(struct inode *, struct inode *,
+					void *, bool);
+/* keyinfo.c */
+extern int get_crypt_info(struct inode *);
+extern int fscrypt_get_encryption_info(struct inode *);
+extern void fscrypt_put_encryption_info(struct inode *, struct fscrypt_info *);
+
+/* fname.c */
+extern int fscrypt_setup_filename(struct inode *, const struct qstr *,
+				int lookup, struct fscrypt_name *);
+extern void fscrypt_free_filename(struct fscrypt_name *);
+extern u32 fscrypt_fname_encrypted_size(struct inode *, u32);
+extern int fscrypt_fname_alloc_buffer(struct inode *, u32,
+				struct fscrypt_str *);
+extern void fscrypt_fname_free_buffer(struct fscrypt_str *);
+extern int fscrypt_fname_disk_to_usr(struct inode *, u32, u32,
+			const struct fscrypt_str *, struct fscrypt_str *);
+extern int fscrypt_fname_usr_to_disk(struct inode *, const struct qstr *,
+			struct fscrypt_str *);
+#endif
+
+/* crypto.c */
+static inline struct fscrypt_ctx *fscrypt_notsupp_get_ctx(struct inode *i,
+							gfp_t f)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void fscrypt_notsupp_release_ctx(struct fscrypt_ctx *c)
+{
+	return;
+}
+
+static inline struct page *fscrypt_notsupp_encrypt_page(struct inode *i,
+						struct page *p, gfp_t f)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline int fscrypt_notsupp_decrypt_page(struct page *p)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void fscrypt_notsupp_decrypt_bio_pages(struct fscrypt_ctx *c,
+						struct bio *b)
+{
+	return;
+}
+
+static inline void fscrypt_notsupp_pullback_bio_page(struct page **p, bool b)
+{
+	return;
+}
+
+static inline void fscrypt_notsupp_restore_control_page(struct page *p)
+{
+	return;
+}
+
+static inline int fscrypt_notsupp_zeroout_range(struct inode *i, pgoff_t p,
+					sector_t s, unsigned int f)
+{
+	return -EOPNOTSUPP;
+}
+
+/* policy.c */
+static inline int fscrypt_notsupp_process_policy(struct inode *i,
+				const struct fscrypt_policy *p)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int fscrypt_notsupp_get_policy(struct inode *i,
+				struct fscrypt_policy *p)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int fscrypt_notsupp_has_permitted_context(struct inode *p,
+				struct inode *i)
+{
+	return 0;
+}
+
+static inline int fscrypt_notsupp_inherit_context(struct inode *p,
+				struct inode *i, void *v, bool b)
+{
+	return -EOPNOTSUPP;
+}
+
+/* keyinfo.c */
+static inline int fscrypt_notsupp_get_encryption_info(struct inode *i)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void fscrypt_notsupp_put_encryption_info(struct inode *i,
+					struct fscrypt_info *f)
+{
+	return;
+}
+
+/* fname.c */
+static inline int fscrypt_notsupp_setup_filename(struct inode *dir,
+			const struct qstr *iname,
+			int lookup, struct fscrypt_name *fname)
+{
+	if (dir->i_sb->s_cop->is_encrypted(dir))
+		return -EOPNOTSUPP;
+
+	memset(fname, 0, sizeof(struct fscrypt_name));
+	fname->usr_fname = iname;
+	fname->disk_name.name = (unsigned char *)iname->name;
+	fname->disk_name.len = iname->len;
+	return 0;
+}
+
+static inline void fscrypt_notsupp_free_filename(struct fscrypt_name *fname)
+{
+	return;
+}
+
+static inline u32 fscrypt_notsupp_fname_encrypted_size(struct inode *i, u32 s)
+{
+	/* never happens */
+	WARN_ON(1);
+	return 0;
+}
+
+static inline int fscrypt_notsupp_fname_alloc_buffer(struct inode *inode,
+				u32 ilen, struct fscrypt_str *crypto_str)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void fscrypt_notsupp_fname_free_buffer(struct fscrypt_str *c)
+{
+	return;
+}
+
+static inline int fscrypt_notsupp_fname_disk_to_usr(struct inode *inode,
+			u32 hash, u32 minor_hash,
+			const struct fscrypt_str *iname,
+			struct fscrypt_str *oname)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int fscrypt_notsupp_fname_usr_to_disk(struct inode *inode,
+			const struct qstr *iname,
+			struct fscrypt_str *oname)
+{
+	return -EOPNOTSUPP;
+}
+#endif	/* _LINUX_FSCRYPTO_H */
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
old mode 100644
new mode 100755
index f622e782ffab..4c3d40258e7f
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -17,6 +17,7 @@
 	{ META_FLUSH,	"META_FLUSH" },				\
 	{ INMEM,	"INMEM" },				\
 	{ INMEM_DROP,	"INMEM_DROP" },				\
+	{ INMEM_REVOKE,	"INMEM_REVOKE" },			\
 	{ IPU,		"IN-PLACE" },				\
 	{ OPU,		"OUT-OF-PLACE" })
 
@@ -82,6 +83,7 @@
 	{ CP_DISCARD,	"Discard" })
 
 struct victim_sel_policy;
+struct f2fs_map_blocks;
 
 DECLARE_EVENT_CLASS(f2fs__inode,
 
@@ -446,39 +448,66 @@ TRACE_EVENT(f2fs_truncate_partial_nodes,
 		__entry->err)
 );
 
-TRACE_EVENT(f2fs_get_data_block,
-	TP_PROTO(struct inode *inode, sector_t iblock,
-				struct buffer_head *bh, int ret),
+TRACE_EVENT(f2fs_map_blocks,
+	TP_PROTO(struct inode *inode, struct f2fs_map_blocks *map, int ret),
 
-	TP_ARGS(inode, iblock, bh, ret),
+	TP_ARGS(inode, map, ret),
 
 	TP_STRUCT__entry(
 		__field(dev_t,	dev)
 		__field(ino_t,	ino)
-		__field(sector_t,	iblock)
-		__field(sector_t,	bh_start)
-		__field(size_t,	bh_size)
+		__field(block_t,	m_lblk)
+		__field(block_t,	m_pblk)
+		__field(unsigned int,	m_len)
 		__field(int,	ret)
 	),
 
 	TP_fast_assign(
 		__entry->dev		= inode->i_sb->s_dev;
 		__entry->ino		= inode->i_ino;
-		__entry->iblock		= iblock;
-		__entry->bh_start	= bh->b_blocknr;
-		__entry->bh_size	= bh->b_size;
+		__entry->m_lblk		= map->m_lblk;
+		__entry->m_pblk		= map->m_pblk;
+		__entry->m_len		= map->m_len;
 		__entry->ret		= ret;
 	),
 
 	TP_printk("dev = (%d,%d), ino = %lu, file offset = %llu, "
-		"start blkaddr = 0x%llx, len = 0x%llx bytes, err = %d",
+		"start blkaddr = 0x%llx, len = 0x%llx, err = %d",
 		show_dev_ino(__entry),
-		(unsigned long long)__entry->iblock,
-		(unsigned long long)__entry->bh_start,
-		(unsigned long long)__entry->bh_size,
+		(unsigned long long)__entry->m_lblk,
+		(unsigned long long)__entry->m_pblk,
+		(unsigned long long)__entry->m_len,
 		__entry->ret)
 );
 
+TRACE_EVENT(f2fs_background_gc,
+
+	TP_PROTO(struct super_block *sb, long wait_ms,
+			unsigned int prefree, unsigned int free),
+
+	TP_ARGS(sb, wait_ms, prefree, free),
+
+	TP_STRUCT__entry(
+		__field(dev_t,	dev)
+		__field(long,	wait_ms)
+		__field(unsigned int,	prefree)
+		__field(unsigned int,	free)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= sb->s_dev;
+		__entry->wait_ms	= wait_ms;
+		__entry->prefree	= prefree;
+		__entry->free		= free;
+	),
+
+	TP_printk("dev = (%d,%d), wait_ms = %ld, prefree = %u, free = %u",
+		show_dev(__entry),
+		__entry->wait_ms,
+		__entry->prefree,
+		__entry->free)
+);
+
 TRACE_EVENT(f2fs_get_victim,
 
 	TP_PROTO(struct super_block *sb, int type, int gc_type,
@@ -630,28 +659,32 @@ TRACE_EVENT(f2fs_direct_IO_exit,
 		__entry->ret)
 );
 
-TRACE_EVENT(f2fs_reserve_new_block,
+TRACE_EVENT(f2fs_reserve_new_blocks,
 
-	TP_PROTO(struct inode *inode, nid_t nid, unsigned int ofs_in_node),
+	TP_PROTO(struct inode *inode, nid_t nid, unsigned int ofs_in_node,
+							blkcnt_t count),
 
-	TP_ARGS(inode, nid, ofs_in_node),
+	TP_ARGS(inode, nid, ofs_in_node, count),
 
 	TP_STRUCT__entry(
 		__field(dev_t,	dev)
 		__field(nid_t, nid)
 		__field(unsigned int, ofs_in_node)
+		__field(blkcnt_t, count)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->nid	= nid;
 		__entry->ofs_in_node = ofs_in_node;
+		__entry->count	= count;
 	),
 
-	TP_printk("dev = (%d,%d), nid = %u, ofs_in_node = %u",
+	TP_printk("dev = (%d,%d), nid = %u, ofs_in_node = %u, count = %llu",
 		show_dev(__entry),
 		(unsigned int)__entry->nid,
-		__entry->ofs_in_node)
+		__entry->ofs_in_node,
+		(unsigned long long)__entry->count)
 );
 
 DECLARE_EVENT_CLASS(f2fs__submit_page_bio,
@@ -664,7 +697,8 @@ DECLARE_EVENT_CLASS(f2fs__submit_page_bio,
 		__field(dev_t, dev)
 		__field(ino_t, ino)
 		__field(pgoff_t, index)
-		__field(block_t, blkaddr)
+		__field(block_t, old_blkaddr)
+		__field(block_t, new_blkaddr)
 		__field(int, rw)
 		__field(int, type)
 	),
@@ -673,16 +707,18 @@ DECLARE_EVENT_CLASS(f2fs__submit_page_bio,
 		__entry->dev		= page->mapping->host->i_sb->s_dev;
 		__entry->ino		= page->mapping->host->i_ino;
 		__entry->index		= page->index;
-		__entry->blkaddr	= fio->blk_addr;
+		__entry->old_blkaddr	= fio->old_blkaddr;
+		__entry->new_blkaddr	= fio->new_blkaddr;
 		__entry->rw		= fio->rw;
 		__entry->type		= fio->type;
 	),
 
 	TP_printk("dev = (%d,%d), ino = %lu, page_index = 0x%lx, "
-		"blkaddr = 0x%llx, rw = %s%s, type = %s",
+		"oldaddr = 0x%llx, newaddr = 0x%llx rw = %s%s, type = %s",
 		show_dev_ino(__entry),
 		(unsigned long)__entry->index,
-		(unsigned long long)__entry->blkaddr,
+		(unsigned long long)__entry->old_blkaddr,
+		(unsigned long long)__entry->new_blkaddr,
 		show_bio_type(__entry->rw),
 		show_block_type(__entry->type))
 );
@@ -962,6 +998,32 @@ TRACE_EVENT(f2fs_writepages,
 		__entry->range_cyclic)
 );
 
+TRACE_EVENT(f2fs_readpages,
+
+	TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage),
+
+	TP_ARGS(inode, page, nrpage),
+
+	TP_STRUCT__entry(
+		__field(dev_t,	dev)
+		__field(ino_t,	ino)
+		__field(pgoff_t,	start)
+		__field(unsigned int,	nrpage)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->start	= page->index;
+		__entry->nrpage	= nrpage;
+	),
+
+	TP_printk("dev = (%d,%d), ino = %lu, start = %lu nrpage = %u",
+		show_dev_ino(__entry),
+		(unsigned long)__entry->start,
+		__entry->nrpage)
+);
+
 TRACE_EVENT(f2fs_write_checkpoint,
 
 	TP_PROTO(struct super_block *sb, int reason, char *msg),
@@ -1061,11 +1123,11 @@ TRACE_EVENT(f2fs_lookup_extent_tree_start,
 TRACE_EVENT_CONDITION(f2fs_lookup_extent_tree_end,
 
 	TP_PROTO(struct inode *inode, unsigned int pgofs,
-						struct extent_node *en),
+						struct extent_info *ei),
 
-	TP_ARGS(inode, pgofs, en),
+	TP_ARGS(inode, pgofs, ei),
 
-	TP_CONDITION(en),
+	TP_CONDITION(ei),
 
 	TP_STRUCT__entry(
 		__field(dev_t,	dev)
@@ -1080,9 +1142,9 @@ TRACE_EVENT_CONDITION(f2fs_lookup_extent_tree_end,
 		__entry->dev = inode->i_sb->s_dev;
 		__entry->ino = inode->i_ino;
 		__entry->pgofs = pgofs;
-		__entry->fofs = en->ei.fofs;
-		__entry->blk = en->ei.blk;
-		__entry->len = en->ei.len;
+		__entry->fofs = ei->fofs;
+		__entry->blk = ei->blk;
+		__entry->len = ei->len;
 	),
 
 	TP_printk("dev = (%d,%d), ino = %lu, pgofs = %u, "
@@ -1094,17 +1156,19 @@ TRACE_EVENT_CONDITION(f2fs_lookup_extent_tree_end,
 		__entry->len)
 );
 
-TRACE_EVENT(f2fs_update_extent_tree,
+TRACE_EVENT(f2fs_update_extent_tree_range,
 
-	TP_PROTO(struct inode *inode, unsigned int pgofs, block_t blkaddr),
+	TP_PROTO(struct inode *inode, unsigned int pgofs, block_t blkaddr,
+						unsigned int len),
 
-	TP_ARGS(inode, pgofs, blkaddr),
+	TP_ARGS(inode, pgofs, blkaddr, len),
 
 	TP_STRUCT__entry(
 		__field(dev_t,	dev)
 		__field(ino_t,	ino)
 		__field(unsigned int, pgofs)
 		__field(u32, blk)
+		__field(unsigned int, len)
 	),
 
 	TP_fast_assign(
@@ -1112,12 +1176,15 @@ TRACE_EVENT(f2fs_update_extent_tree,
 		__entry->ino = inode->i_ino;
 		__entry->pgofs = pgofs;
 		__entry->blk = blkaddr;
+		__entry->len = len;
 	),
 
-	TP_printk("dev = (%d,%d), ino = %lu, pgofs = %u, blkaddr = %u",
+	TP_printk("dev = (%d,%d), ino = %lu, pgofs = %u, "
+			"blkaddr = %u, len = %u",
 		show_dev_ino(__entry),
 		__entry->pgofs,
-		__entry->blk)
+		__entry->blk,
+		__entry->len)
 );
 
 TRACE_EVENT(f2fs_shrink_extent_tree,
@@ -1168,6 +1235,44 @@ TRACE_EVENT(f2fs_destroy_extent_tree,
 		__entry->node_cnt)
 );
 
+DECLARE_EVENT_CLASS(f2fs_sync_dirty_inodes,
+
+	TP_PROTO(struct super_block *sb, int type, s64 count),
+
+	TP_ARGS(sb, type, count),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, type)
+		__field(s64, count)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= sb->s_dev;
+		__entry->type	= type;
+		__entry->count	= count;
+	),
+
+	TP_printk("dev = (%d,%d), %s, dirty count = %lld",
+		show_dev(__entry),
+		show_file_type(__entry->type),
+		__entry->count)
+);
+
+DEFINE_EVENT(f2fs_sync_dirty_inodes, f2fs_sync_dirty_inodes_enter,
+
+	TP_PROTO(struct super_block *sb, int type, s64 count),
+
+	TP_ARGS(sb, type, count)
+);
+
+DEFINE_EVENT(f2fs_sync_dirty_inodes, f2fs_sync_dirty_inodes_exit,
+
+	TP_PROTO(struct super_block *sb, int type, s64 count),
+
+	TP_ARGS(sb, type, count)
+);
+
 #endif /* _TRACE_F2FS_H */
 
 /* This part must be outside protection */
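The tracepoints added above are consumed through the standard ftrace interface; a small userspace sketch follows (illustrative only; it assumes debugfs is mounted at /sys/kernel/debug, which is typical for a 3.4-era kernel):

	#include <stdio.h>

	int main(void)
	{
		char line[512];
		FILE *f = fopen("/sys/kernel/debug/tracing/events/f2fs/enable", "w");

		if (!f)
			return 1;
		fputs("1", f);		/* enable every f2fs event */
		fclose(f);

		f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
		if (!f)
			return 1;
		/* stream events, e.g. f2fs_map_blocks / f2fs_readpages lines */
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
		return 0;
	}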