Lookup operations are so numerous on active systems that a mechanism is required to speed them up, so that they do not become a bottleneck for system performance. Modern flavors of UNIX(tm) maintain a cache of results from recent file system directory lookup operations [9]. In this report we refer to this cache as the operating system's directory name lookup cache, or DNLC for short. In the Linux kernel, this cache is called the directory entry cache, or dcache [1]. In most UNIX(tm) systems, the DNLC is used only by the pathname resolution logic, but the dcache is integrated throughout the Linux kernel's virtual file system (VFS) layer.
For file systems where data is accessed on the same system where it is stored permanently, entries in a system's DNLC can last as long as there is room to keep them in the cache. In this instance, applications run on the same operating system that controls the disk and file system metadata. The operating system is fully aware of any change to local filenames, so the DNLC is always kept up-to-date.
However when files are stored on remote systems, some kind of cache coherency must be maintained for any file metadata stored on systems where remote file data is accessed and modified. Clients of NFSv2 and v3 file servers, for example, usually expire file system metadata periodically so that it will be revalidated the next time it is accessed. This applies to any entries in a client's DNLC, and to file attributes cached by the client. Network file systems such as AFS go to great lengths to help a client maintain a coherent view of file systems it shares with other clients [2, 7].
On Linux, every lookup operation that results in a DNLC hit invokes a file system dependent operation to revalidate the cached entry before the entry is made available to other parts of the operating system. Most file systems that maintain file data locally do not need any cache entry revalidation. The Linux NFS client, however, takes this opportunity to revalidate the cached entry. If the entry is considered invalid, the NFS client requests a fresh on-the-wire lookup to validate the file's name and parent directory, its file handle, and any file attributes corresponding to the cached entry.
To support certain aspects of the NFS standard, the Linux client aggressively times out its DNLC entries under certain circumstances. This is not enough to guarantee cache consistency, however. In this report, we describe the current Linux dcache entry revalidation mechanism, compare the network behavior of the Linux NFS client with other client implementations, and discuss possible improvements.
Dcache entries are also inserted into a hash table, which constitutes the system's directory entry cache. They are hashed via their name and the address of their parent dcache entry. A lookup operation in this cache starts with the parent's dentry and the name to be looked up, and returns a dentry that matches on the name and the parent.
Inodes can outlive dcache entries. If a file is renamed, its inode remains, but its dcache entry is deleted and a new one is created for the new name. Dcache entries can also represent negative lookup results; such an entry's inode pointer is NULL.
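As a rough illustration of how a lookup in this cache is keyed by a parent dentry and a name component, the following sketch uses the 2.4 VFS helpers d_lookup and full_name_hash. It is a simplified example, not code from the kernel's pathname resolution path, and a NULL return here means a cache miss rather than a cached negative result.

#include <linux/dcache.h>
#include <linux/string.h>

/* Simplified sketch: look up one name component under a parent dentry. */
static struct dentry *example_dcache_lookup(struct dentry *parent,
                                            const unsigned char *name)
{
        struct qstr q;

        q.name = name;
        q.len  = strlen((const char *) name);
        /* The stored hash covers only the component name; d_lookup combines
         * it with the parent dentry, so the same name under two different
         * directories lands in different hash chains. */
        q.hash = full_name_hash(q.name, q.len);

        /* Returns a dentry whose name and parent both match, or NULL
         * (a miss, which forces a real file system lookup). */
        return d_lookup(parent, &q);
}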
In the Linux kernel current as of this writing (2.4.2), this vector contains six virtual functions:
d_revalidate: This function returns an integer value: one if the dcache entry is valid or has been revalidated, and zero if the dcache entry should be invalidated and refreshed with a file system dependent lookup operation. When the VFS layer encounters a zero return from d_revalidate, it unhashes the dentry from its parent and attempts a fresh real lookup to replace it. Most file systems leave this operation NULL, because all their dentries in the dcache are always valid. The NFS client defines this operation, using it to expire and revalidate dcache entries.
d_hash: The NFS client leaves this operation as NULL, since it can use the POSIX naming conventions supported by the VFS layer by default.
d_compare: The NFS client likewise leaves this operation as NULL, relying on the VFS layer's default POSIX naming support.
d_delete: The NFS client defines this operation to clean up after silly renames.
d_iput: The NFS client also defines this operation to clean up after silly renames.
d_release: The NFS client, like most file systems, leaves this operation as NULL.
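For reference, the six operations described above, and the subset of them the NFS client fills in, look roughly like this in 2.4. The declarations are paraphrased from <linux/dcache.h> and fs/nfs/dir.c; treat the exact signatures as approximate.

struct dentry_operations {
        int  (*d_revalidate)(struct dentry *, int);
        int  (*d_hash)(struct dentry *, struct qstr *);
        int  (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
        int  (*d_delete)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
};

/* The NFS client's instance: only three of the six are defined. */
struct dentry_operations nfs_dentry_operations = {
        d_revalidate:   nfs_lookup_revalidate,  /* expire/revalidate dcache entries */
        d_delete:       nfs_dentry_delete,      /* silly-rename cleanup */
        d_iput:         nfs_dentry_iput,        /* silly-rename cleanup */
        /* d_hash, d_compare, and d_release are left NULL */
};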
The file attribute cache and the dcache use separate time out values. Attribute cache time out logic uses time out values stored in the inode field nfs_i.attrtimeo. Dcache time out logic uses time out values stored in the dentry field d_time. This field is reserved specifically for use by file systems; the VFS layer does not touch this field. In certain special cases, of course, the time out values can be ignored. These time out values themselves vary as the client discovers how often an object changes.
The ext2 file system is the standard local file system on Linux, and is the local file system most frequently exported by the Linux NFS server. It uses 32-bit timestamps that count seconds since January 1, 1970, so ext2 on-disk inodes cannot resolve changes that occur within the same second. This is acceptable for local file access. It means, however, that the Linux NFS server exports timestamps with a resolution of only one second; changes to a file or directory that occur within the same second are not reflected in its timestamps. To detect sub-second changes to directories, the Linux NFS client currently relies on dcache entry revalidation to achieve close-to-open cache coherency.
Each operation that involves altering a directory, such as rmdir, create, and so on, time stamps the parent's directory entry. These operations store updated directory attributes returned by server requests into the attribute cache. Whenever a directory's inode attributes are updated as a result of one of these operations, its dcache entry time stamp is updated to the current time on the client.
When a dcache entry is revalidated, its time stamp is compared with the current time on the client. In most cases, if the difference is larger than the directory's attribute timeout value, the dcache entry is revalidated by issuing an on-the-wire lookup request and comparing the result with information cached on the client. Normally this information doesn't change, so the dentry may be used as-is. If the information has changed (for example, if the file has been renamed), the dentry is invalidated, and another on-the-wire lookup is requested by the VFS layer to acquire the new information.
The last component of a pathname lookup is a special case, however. If the last component's parent directory has changed recently, the time out value is set to zero, causing the dcache entries of files in active directories to be revalidated immediately.
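A minimal sketch of this comparison, using the 2.4 field and macro names mentioned above (d_time, and nfs_i.attrtimeo via NFS_ATTRTIMEO), might look like the following. It paraphrases the check performed during dcache entry revalidation rather than reproducing the kernel's nfs_dentry_force_reval code, and the two flag parameters are illustrative stand-ins for state the real client derives from the lookup flags and cached directory attributes.

#include <linux/sched.h>        /* jiffies */
#include <linux/nfs_fs.h>       /* NFS_ATTRTIMEO() */

/* Return non-zero if this dcache entry's time stamp has expired and the
 * entry should be revalidated with an on-the-wire lookup. */
static int nfs_dentry_timed_out(struct dentry *dentry,
                                int is_last_component,
                                int parent_changed_recently)
{
        struct inode *dir = dentry->d_parent->d_inode;
        unsigned long timeout = NFS_ATTRTIMEO(dir);

        /* Last-component special case described above: if the parent
         * directory changed recently, force immediate revalidation. */
        if (is_last_component && parent_changed_recently)
                timeout = 0;

        return time_after(jiffies, dentry->d_time + timeout);
}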
Our workload is generated by building the Linux kernel in an NFS-mounted file system. The build configuration is held fixed for each build. The kernel is built immediately after the system is booted, providing a cold cache. NFS statistics are gathered via nfsstat.
The modification consists of removing logic in nfs_dentry_force_reval that shortens the attribute time-out value of the last component of pathnames.
Operation type | Linux kernel, unmodified | Linux kernel, modified |
Total packets | 108,810 | 62,718 |
Fragments | 16,388 | 16,407 |
Lookup requests | 54,405 | 11,176 |
Getattr requests | 284 | 384 |
Write requests | 7,913 | 7,914 |
Read requests | 3,361 | 3,364 |
Create requests | 770 | 770 |
Readdir requests | 409 | 410 |
While the numbers of read, write, create, and readdir requests remain about equal for both runs, the modified kernel generated considerably fewer lookup requests, resulting in a packet count nearly half that of the unmodified client.
This test illustrates an artificial lower bound for client packet count. In the case of a single client and a single server, the client can trust that it is the only accessor of these files, and thus can safely dispense with the extra lookup operations. Our results show how low network traffic could be without these operations. Our goal is to create a client that approaches this lower bound while still effectively implementing close-to-open cache coherency.
We feel that, in fact, most common workloads don't share directories between NFS clients, and that when they do, the applications themselves can easily take responsibility for notifying remote instances of changes. Thus this kind of aggressive time-out is likely unnecessary for all but a few unusual workloads.
The principle of least surprise, however, requires that close-to-open cache consistency be maintained by default. System administrators might find it useful to have a mount option that identifies file systems that don't require strict close-to-open cache consistency.
Summary: | Remove faulty timeout trashing logic from nfs_dentry_force_reval. Add logic to nfs_open to revalidate the dcache entry before returning to the VFS layer. |
Pros: | This is the most straight-forward design. It makes it clear that a file's attributes are refreshed immediately and unconditionally whenever a file is opened on a client. |
Cons: | The f_op->open method in the case of the NFS client is nfs_open. This function is invoked late during open processing; the dentry has already been looked up. If nfs_open finds that the file handle is stale, it must dissociate the dcache entry's current inode and obtain a new one; the only safe way for this to happen is for nfs_open to return -ESTALE and have the VFS layer handle the problem. Note that the Solaris VFS layer recovers from this by invalidating the file's DNLC entries and dropping its cached attributes, then reacquiring them. If we don't expect recovery from stale file handles during open processing to be a performance path, this might be the cleanest solution. A sketch of this approach appears below. |
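A rough sketch of what this change might look like follows. The function and helper names (nfs_open, __nfs_revalidate_inode, NFS_SERVER) follow the 2.4-era client, but the placement and error handling here are assumptions, not a finished implementation.

#include <linux/fs.h>
#include <linux/nfs_fs.h>

/* Solution 1 sketch: fetch fresh attributes unconditionally at open time. */
static int nfs_open(struct inode *inode, struct file *filp)
{
        int error;

        /* Go to the server for attributes now, so the open sees a fresh
         * view of the file (the open half of close-to-open semantics). */
        error = __nfs_revalidate_inode(NFS_SERVER(inode), inode);

        /* If the cached file handle turned out to be stale, the only safe
         * recovery is to return -ESTALE and let the VFS layer drop the
         * dentry and repeat the lookup. */
        return error;
}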
Summary: | Remove faulty timeout trashing logic from nfs_dentry_force_reval. Add nfs_revalidate to open_namei and open_exec. |
Pros: | This takes a familiar approach. |
Cons: | The VFS layer already invokes nfs_revalidate before calls such as stat. (Why doesn't it do so before open?) If nfs_revalidate discovers a stale file handle, it must dissociate the dcache entry's current inode and obtain a new one; extra logic must be added to recover a new file handle, as described above. Finally, nfs_revalidate uses the normal timeout mechanism, so some indication that the timeout should be ignored must be passed to it. A sketch of a helper for this approach appears below. |
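A minimal sketch of the helper such a change would add to open_namei and open_exec follows. The i_op->revalidate hook is the 2.4 VFS interface the NFS client already implements; placing this call in the open path, and how the timeout override would be communicated, are assumptions.

#include <linux/fs.h>

/* Solution 2 sketch: revalidate the resolved inode before the open proceeds. */
static int revalidate_for_open(struct dentry *dentry)
{
        struct inode *inode = dentry->d_inode;

        /* Only network file systems define i_op->revalidate; local file
         * systems fall straight through with no extra work. */
        if (inode && inode->i_op && inode->i_op->revalidate)
                return inode->i_op->revalidate(dentry);
        return 0;
}

open_namei and open_exec would call this helper once path_walk has produced a dentry, returning any error (such as -ESTALE) to their callers.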
Summary: | Remove faulty timeout trashing logic from nfs_dentry_force_reval. Zero the d_time field when closing a file to force d_revalidate to revalidate a dcache entry immediately if it is looked up again. |
Pros: | No changes are necessary to the VFS layer. |
Cons: | The first open of a file will find no dcache entry, so the entry will be looked up properly, and a close on one client will cause that client to retrieve the file's attributes again properly. However, a second open that is not preceded by a close will not cause the file's attributes to be retrieved from the server, so close-to-open semantics are not fully guaranteed. A sketch of the change appears below. |
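The change itself is tiny; a sketch follows. Hooking it into the client's flush (close) method is an assumption about placement in the 2.4 client; the essential line is simply zeroing d_time.

#include <linux/fs.h>

/* Solution 3 sketch: expire the dentry's time stamp when a file is closed. */
static int nfs_file_flush(struct file *file)
{
        struct dentry *dentry = file->f_dentry;

        /* The next d_revalidate of this entry sees an expired time stamp
         * and goes to the server immediately. */
        dentry->d_time = 0;

        /* (The real method would also write back dirty data here.) */
        return 0;
}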
Summary: | Define a flag to d_revalidate that open_namei and open_exec can use to indicate to file system specific routines that when looking up a dentry, it will need immediate revalidation. Replace faulty timeout trashing logic from nfs_dentry_force_reval with a check for the new flag. If the new flag is present, trigger an immediate revalidation. |
Pros: | This is an easy-to-implement solution, requiring few changes to the VFS layer and NFS client. |
Cons: | This solution relies on a side effect of on-the-wire lookup requests: a lookup primarily revalidates cached filename information, and only incidentally returns the fresh set of file attributes that an open requires. A sketch of the flag check appears below. |
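A sketch of the flag check follows. The flag name and value are hypothetical; d_revalidate already receives the lookup flags as its second argument in 2.4, so the NFS client could test the new flag where the timeout trashing logic used to be.

#include <linux/sched.h>
#include <linux/nfs_fs.h>

#define LOOKUP_OPEN_REVAL  0x1000   /* hypothetical new lookup flag */

/* Solution 4 sketch: honor the new flag instead of trashing the timeout. */
static int nfs_dentry_needs_reval(struct dentry *dentry, int flags)
{
        struct inode *dir = dentry->d_parent->d_inode;

        /* An open is in progress: revalidate now, regardless of the timeout. */
        if (flags & LOOKUP_OPEN_REVAL)
                return 1;

        return time_after(jiffies, dentry->d_time + NFS_ATTRTIMEO(dir));
}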
Note that only open and fopen need to guarantee that they get a consistent handle to a particular file for reading and writing; stat and friends, in fact, are not required to retrieve fresh attributes. Thus, for the sake of close-to-open cache coherency, only open and fopen are considered an "open event" where fresh attributes need to be fetched immediately from the server.
Solaris handles the case where client A has an open file, and tries to open the same file again, but discovers that it has been replaced by client B, thereby making the file handle cached on client A "stale." In this case, Solaris's VFS layer invalidates any DNLC entries and attributes for the file, then rebuilds its state.
This is a problem both for local file system implementations such as ext2 and for NFS. If a directory that is the current working directory of some process is deleted, that process is still allowed to open("."). If the directory is deleted by a remote client, there is no way for the local client to tell it is gone until something tries to use the directory.
No lookup for "." also means that the NFS client implementation is not invoked to retrieve or refresh the directory's attributes. With the current implementation of pathname resolution on Linux, it is impossible to guarantee close-to-open cache consistency for current working directories.
We also note that the extra do_revalidate code in the VFS layer's support for stat and friends is, at this time, redundant. Each of these system calls uses path_walk to find the dentry for the target object, and path_walk eventually invokes cached_lookup, which revalidates both the DNLC and inode caches. Following the path_walk call in each of these system routines, there is a do_revalidate call, which invokes the inode's i_op->revalidate method.
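The pattern looks roughly like the following paraphrase of a stat-family system call in 2.4; it is simplified, not verbatim kernel code. Because cached_lookup has already run d_revalidate on the way to this point, the do_revalidate call duplicates work for file systems like NFS.

/* Paraphrased outline of a stat-family system call in 2.4. */
asmlinkage long sys_newstat(char *filename, struct stat *statbuf)
{
        struct nameidata nd;
        int error;

        error = user_path_walk(filename, &nd);   /* path_walk -> cached_lookup */
        if (!error) {
                /* Redundant for NFS: cached_lookup already revalidated
                 * the dentry and its inode via d_revalidate. */
                error = do_revalidate(nd.dentry);
                if (!error)
                        error = cp_new_stat(nd.dentry->d_inode, statbuf);
                path_release(&nd);
        }
        return error;
}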
We mitigated the performance problem by implementing support for a pre-existing client mount option called "nocto" (which stands for "no close-to-open"). For certain workloads where we know there will be little or no data sharing, we can dispense with extra lookup operations to verify file attributes during open() processing, and rely simply on attribute and dcache timeouts. Using this mount option, we obtain very close to optimal on-the-wire lookup counts.
Next we tried implementing the first solution from above; namely, adding logic to nfs_open to retrieve attributes from the server during open processing. This solution was easier to implement than we had estimated, and provided three benefits over our first attempt. First, open(".") is correctly supported. Second, we are closer to removing nfs_lookup_revalidate entirely. Finally, instead of on-the-wire lookups, this client implementation uses on-the-wire GETATTR requests, which results in a measurable performance improvement for lookup-intensive workloads.
Allowing servers and clients to form a special agreement about directories such that clients can have exclusive access to them might help tremendously. Clients would no longer be burdened by checking back with the server to see if directories have changed, reducing the number of on-the-wire lookup requests significantly.
1. | Gooch, Richard. "Overview of the Virtual File System." www.atnf.csiro.au/~rgooch/linux/vfs.txt. |
2. | Kazar, Michael Leon. "Synchronization and Caching Issues in the Andrew File System." USENIX Conference Proceedings, pp. 27-36, Winter 1988. |
3. | Ritchie, Dennis M., and Thompson, Ken. "The UNIX Time-Sharing System." Communications of the ACM, 17(7):365-375, July 1974. |
4. | Kleiman, S. R. "Vnodes: An Architecture for Multiple File System Types in Sun UNIX." USENIX Conference Proceedings, Atlanta, 1986. |
5. | Sun Microsystems, Inc. "RFC 1094 - NFS: Network File System Protocol Specification." IETF Network Working Group, March 1989. |
6. | Sun Microsystems, Inc. "RFC 1813 - NFS: Network File System Version 3 Protocol Specification." IETF Network Working Group, June 1995. |
7. | Howard, John H.; Kazar, Michael L.; Menees, Sherri G.; Nichols, David A.; Satyanarayanan, M.; Sidebotham, Robert N.; and West, Michael J. "Scale and Performance in a Distributed File System." ACM Transactions on Computer Systems, 6(1), February 1988. |
8. | McKusick, Marshall Kirk; Joy, William N.; Leffler, Samuel J.; and Fabry, Robert S. "A Fast File System for UNIX." ACM Transactions on Computer Systems, 2(3):181-197, August 1984. |
9. | Leffler, Samuel J.; McKusick, Marshall Kirk; Karels, Michael J.; and Quarterman, John S. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, 1990. |
10. | Callaghan, Brent. NFS Illustrated. Addison-Wesley Longman, Inc., 2000. |