When Ext3 needs to write new data to the disk, there's a block allocator that decides which free blocks will be used to write the data. But the Ext3 block allocator only allocates one block (4KB) at a time. That means that if the system needs to write the 100 MB data mentioned in the previous point, it will need to call the block allocator 25600 times (and it was just 100 MB!). Not only this is inefficient, it doesn't allow the block allocator to optimize the allocation policy because it doesn't knows how many total data is being allocated, it only knows about a single block. Ext4 uses a "multiblock allocator" (mballoc) which allocates many blocks in a single call, instead of a single block per call, avoiding a lot of overhead. This improves the performance, and it's particularly useful with delayed allocation and extents. This feature doesn't affect the disk format. Also, note that the Ext4 block/inode allocator has other improvements, described in detail in this paper.
Delayed allocation is a performance feature (it doesn't change the disk format) found in a few modern filesystems such as XFS, ZFS, btrfs or Reiser 4, and it consists in delaying the allocation of blocks as much as possible, contrary to what traditionally filesystems (such as Ext3, reiser3, etc) do: allocate the blocks as soon as possible. For example, if a process write()s, the filesystem code will allocate immediately the blocks where the data will be placed - even if the data is not being written right now to the disk and it's going to be kept in the cache for some time. This approach has disadvantages. For example when a process is writing continually to a file that grows, successive write()s allocate blocks for the data, but they don't know if the file will keep growing. Delayed allocation, on the other hand, does not allocate the blocks immediately when the process write()s, rather, it delays the allocation of the blocks while the file is kept in cache, until it is really going to be written to the disk. This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't. Delayed allocation plays very nicely with the two previous features mentioned, extents and multiblock allocation, because in many workloads when the file is written finally to the disk it will be allocated in extents whose block allocation is done with the mballoc allocator. The performance is much better, and the fragmentation is much improved in some workloads.
The journal is the most used part of the disk, making the blocks that form part of it more prone to hardware failure. And recovering from a corrupted journal can lead to massive corruption. Ext4 checksums the journal data to know if the journal blocks are failing or corrupted. But journal checksumming has a bonus: it allows one to convert the two-phase commit system of Ext3's journaling to a single phase, speeding the filesystem operation up to 20% in some cases - so reliability and performance are improved at the same time.
While delayed allocation, extents and multiblock allocation help to reduce the fragmentation, with usage filesystems can still fragment. For example: You write three files in a directory and continually on the disk. Some day you need to update the file of the middle, but the updated file has grown a bit, so there's not enough room for it. You have no option but fragment the excess of data to another place of the disk, which will cause a seek, or allocate the updated file continually in another place, far from the other two files, resulting in seeks if an application needs to read all the files on a directory (say, a file manager doing thumbnails on a directory full of images). Besides, the filesystem can only care about certain types of fragmentation, it can't know, for example, that it must keep all the boot-related files contiguous, because it doesn't know which files are boot-related. To solve this issue, Ext4 will support online fragmentation, and there's a e4defrag tool which can defragment individual files or the whole filesystem.
This feature, available in Ext3 in the latest kernel versions, and emulated by glibc in the filesystems that don't support it, allows applications to preallocate disk space: Applications tell the filesystem to preallocate the space, and the filesystem preallocates the necessary blocks and data structures, but there's no data on it until the application really needs to write the data in the future. This is what P2P applications do in their own when they "preallocate" the necessary space for a download that will last hours or days, but implemented much more efficiently by the filesystem and with a generic API. This has several uses: first, to avoid applications (like P2P apps) doing it themselves inefficiently by filling a file with zeros. Second, to improve fragmentation, since the blocks will be allocated at one time, as contiguously as possible. Third, to ensure that applications always have the space they know they will need, which is important for RT-ish applications, since without preallocation the filesystem could get full in the middle of an important operation. The feature is available via the libc posix_fallocate() interface.
Post a Comment