On-disk indexing for "Project Ideas" page

Sat Sep 8 03:07:54 PDT 2007

Recently while reading "Design and Implementation of FreeBSD operation 
system" by Marshall Kirk McKusick and gnn i have found a very intresting 
paragraph regarding UFS2 implementation, indexing and B-trees. According 
to it on-disk indexes was not implemented, but some structures was 
reserved for future development. May be some SOC students could implement 
this in future. How about to adding this into Project Ideas page? 
Let me quote from the paragraph "8.3 Naming":

Finding of Names in Directories

A common request to the filesystem is to lookup a specific name in a 
directory. The kernel usually does the lookup by starting at the beginning 
of the directory and going through, comparing each entry in turn. First, 
the length of the sought-after name is compared with the length of the 
name being checked. If the lengths are identical, a string comparison of 
the name being sought and the directory entry is made. If they match, the 
search is complete; if they fail, either in the length or in the string 
comparison, the search continues with the next entry. Whenever a name is 
found, its name and containing directory are entered into the systemwide 
name cache described in Section 6.6. Whenever a search is unsuccessful, an 
entry is made in the cache showing that the name does not exist in the 
particular directory. Before starting a directory scan, the kernel looks 
for the name in the cache. If either a positive or negative entry is 
found, the directory scan can be avoided.

Another common operation is to lookup all the entries in a directory. For 
example, many programs do a stat system call on each name in a directory 
in the order that the names appear in the directory. To improve 
performance for these programs, the kernel maintains the directory offset 
of the last successful lookup for each directory. Each time that a lookup 
is done in that directory, the search is started from the offset at which 
the previous name was found (instead of from the beginning of the 
directory). For programs that step sequentially through a directory with n 
files, search time decreases from Order(n2) to Order(n).

One quick benchmark that demonstrates the maximum effectiveness of the 
cache is running the ls -l command on a directory containing 600 files. On 
a system that retains the most recent directory offset, the amount of 
system time for this test is reduced by 85 percent. Unfortunately, the 
maximum effectiveness is much greater than the average effectiveness. 
Although the cache is 90 percent effective when hit, it is applicable to 
only about 25 percent of the names being looked up. Despite the amount of 
time spent in the lookup routine itself decreasing substantially, the 
improvement is diminished because more time is spent in the routines that 
that routine calls. Each cache miss causes a directory to be accessed 
twice—once to search from the middle to the end and once to search from 
the beginning to the middle.

These caches provide good directory lookup performance but are ineffective 
for large directories that have a high rate of entry creation and 
deletion. Each time a new directory entry is created, the kernel must scan 
the entire directory to ensure that the entry does not already exist. When 
an existing entry is deleted, the kernel must scan the directory to find 
the entry to be removed. For directories with many entries these linear 
scans are time-consuming.

The approach to solving this problem in FreeBSD 5.2 is to introduce dynamic 
directory hashing that retrofits a directory indexing system to UFS [Dowse 
& Malone, 2002]. To avoid repeated linear searches of large directories, 
the dynamic directory hashing builds a hash table of directory entries on 
the fly when the directory is first accessed. This table avoids directory 
scans on later lookups, creates, and deletes. Unlike filesystems 
originally designed with large directories in mind, these indices are not 
saved on disk and so the system is backward compatible. The drawback is 
that the indices need to be built the first time that a large directory is 
encountered after each system reboot. The effect of the dynamic directory 
hashing is that large directories in UFS cause minimal performance 
problems.

When we built UFS2, we contemplated solving the large directory update 
problem by changing to a more complex directory structure such as one that 
uses B-trees. This technique is used in many modern filesystems such as 
XFS [Sweeney et al., 1996], JFS [Best & Kleikamp, 2003], ReiserFS [Reiser, 
2001], and in later versions of Ext2 [Phillips, 2001]. We decided not to 
make the change at the time that UFS2 was first implemented for several 
reasons. First, we had limited time and resources, and we wanted to get 
something working and stable that could be used in the time frame of 
FreeBSD 5.2. By keeping the same directory format, we were able to reuse 
all the directory code from UFS1, did not have to change numerous 
filesystem utilities to understand and maintain a new directory format, 
and were able to produce a stable and reliable filesystem in the time 
frame available to us. The other reason that we felt that we could retain 
the existing directory structure is because of the dynamic directory 
hashing that was added to FreeBSD.

Borrowing the technique used by the Ext2 filesystem a flag was also added 
to show that an on-disk indexing structure is supported for directories 
[Phillips, 2001]. This flag is unconditionally turned off by the existing 
implementation of UFS. In the future, if an implementation of an on-disk 
directory-indexing structure is added, the implementations that support it 
will not turn the flag off. Index-supporting kernels will maintain the 
indices and leave the flag on. If an old non-index-supporting kernel is 
run, it will turn off the flag so that when the filesystem is once again 
run under a new kernel, the new kernel will discover that the indexing 
flag has been turned off and will know that the indices may be out date 
and have to be rebuilt before being used. The only constraint on an 
implementation of the indices is that they have to be an auxiliary data 
structure that references the old linear directory format.

-- 
======================================================================  
- Best regards, Nikolay Pavlov. <<<-----------------------------------    
======================================================================  

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20070908/00697ff9/attachment.pgp