you are in an fs with millions of small files
Bakul Shah
bakul at BitBlocks.com
Thu Jun 9 19:11:19 GMT 2005
> Giorgos Keramidas <keramida at freebsd.org> writes:
> > Is there a better way to sort a linked list (not necessarily a
> > singly-linked list, like the one fts_link is used for).
>
> Don't build a linked list to begin with. The comparison function is
> known at the time the directory entries are read, so it should be a
> simple matter to read them into a red-black tree instead of a singly-
> linked list. I'm working on a patch.
Unless the tree needs to grow dynamically, a red-black tree
won't buy you anything. It is faster to use mergesort on an
array than a red black tree implementation. They have the
same O n log n complexity but per item overhead is higher in
r-b perhaps due to more mallocs. Even a skewheap will be
faster than r-b since you don't need to search, but just
return items in sort order.
Start with a small array. When it gets full double it (or
grow by half if you want to waste less memory).
In my tests, with input file constructed from multiple copies
of /usr/share/dict/words rather than a directory, the
mergesort+array was 3 times faster than red-black for upto at
least 250K entries. At about 10M entries and 100MB of data,
mergesort+array was still about twice as fast and used less
memory. Using mmap the performance doubled and memory use
went down!
Benchmark program (but not with the mmap code) attached:-)
% wc z
10000000 10000000 105733020 z
% time a.out redblack < z
15.445u 1.314s 0:16.76 99.9% 10+320297k 0+0io 0pf+0w
% time a.out array < z
9.524u 0.691s 0:10.21 100.0% 10+240289k 0+0io 0pf+0w
% time a.out -m array < z
4.632u 0.653s 0:05.28 100.0% 10+151192k 0+0io 0pf+0w
-- bakul
[source compacted to fit on one page]
#include <stdio.h>
#include <string.h>
#include <sys/tree.h>
int debug;
#define strndup(s,n) strncpy(malloc(n+1), s, n)
int compare(const void*a, const void* b) {
return strcmp(*(char**)a, *(char**)b);
}
do_ar() {
char* line;
size_t len;
int c = 0, i;
int s = 16;
char** a = (char**)malloc(s*sizeof(char*));
if (debug) fprintf(stderr, "Using array+mergesort\n");
while (line = fgetln(stdin, &len)) {
if (c >= s) a = (char**)realloc(a, (s *= 2)*sizeof(char*));
a[c++] = strndup(line, len-1);
}
mergesort(a, c, sizeof(char*), compare);
if (debug) for (i = 0; i < c; i++) printf("%s\n", a[i]);
}
typedef struct node { char *name; RB_ENTRY(node) links; } Node;
int cmp(Node* a, Node* b) { return strcmp(a->name, b->name); }
RB_HEAD(rb_tree, node);
RB_GENERATE(rb_tree, node, links, cmp);
do_rb() {
char* line;
size_t len;
struct rb_tree rb;
Node* n;
if (debug) fprintf(stderr, "Using red-back tree\n");
RB_INIT(&rb);
while (line = fgetln(stdin, &len)) {
struct node* node = malloc(sizeof *node);
node->name = strndup(line, len-1);
RB_INSERT(rb_tree, &rb, node);
}
if (debug) RB_FOREACH(n, rb_tree, &rb) printf("%s\n", n->name);
}
int
main(int c, char** v) {
if (c > 1 && strcmp(v[1], "-d") == 0) { debug = 1; c--; v++; }
if (c > 1 && v[1][0] == 'r') do_rb(); else do_ar();
}
More information about the freebsd-fs
mailing list