From freebsdlists at bsdunix.ch Wed Oct 1 00:06:45 2008 From: freebsdlists at bsdunix.ch (Thomas Vogt) Date: Wed Oct 1 00:06:52 2008 Subject: filebench on freebsd? Message-ID: <48E2BAC2.909@bsdunix.ch> Hello Has someone ever tried to compile filebench on current (64bit)? http://sourceforge.net/projects/filebench filebench is a multithreaded file system benchmark similar to postmark (single threaded). Maybe someone can help to make it work if it's not that difficult. I can't compile it: In file included from ipc.h:33, from filebench.h:55, from misc.c:35: threadflow.h:66: error: field 'al_aiocb' has incomplete type *** Error code 1 Stop in /root/filebench-1.3.4/filebench. *** Error code 1 This is: #ifdef HAVE_AIO typedef struct aiolist { int al_type; struct aiolist *al_next; struct aiolist *al_worknext; struct aiocb64 al_aiocb; } aiolist_t; #endif Regards, Thomas From rmacklem at uoguelph.ca Wed Oct 1 14:22:23 2008 From: rmacklem at uoguelph.ca (Rick Macklem) Date: Wed Oct 1 14:22:29 2008 Subject: filebench on freebsd? In-Reply-To: <48E2BAC2.909@bsdunix.ch> References: <48E2BAC2.909@bsdunix.ch> Message-ID: On Wed, 1 Oct 2008, Thomas Vogt wrote: > Hello > > Has someone ever tried to compile filebench on current (64bit)? > http://sourceforge.net/projects/filebench > > filebench is a multithreaded file system benchmark similar to postmark > (single threaded). Maybe someone can help to make it work if it's not that > difficult. > > I can't compile it: > > In file included from ipc.h:33, > from filebench.h:55, > from misc.c:35: > threadflow.h:66: error: field 'al_aiocb' has incomplete type > *** Error code 1 Which probably means that "struct aiocb64" isn't properly defined at this point. > > Stop in /root/filebench-1.3.4/filebench. 
> *** Error code 1 > > This is: > #ifdef HAVE_AIO > typedef struct aiolist { > int al_type; > struct aiolist *al_next; > struct aiolist *al_worknext; > struct aiocb64 al_aiocb; > } aiolist_t; > #endif > > > Regards, > Thomas > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > From ed at 80386.nl Wed Oct 1 19:07:32 2008 From: ed at 80386.nl (Ed Schouten) Date: Wed Oct 1 19:07:46 2008 Subject: Expanding vops in vop_vectors during startup In-Reply-To: <20080912182722.GK1191@hoeg.nl> References: <20080912182722.GK1191@hoeg.nl> Message-ID: <20081001190728.GL16837@hoeg.nl> Skipped content of type multipart/mixed-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20081001/68bc982c/attachment.pgp From phk at phk.freebsd.dk Wed Oct 1 19:47:00 2008 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Wed Oct 1 19:47:07 2008 Subject: Expanding vops in vop_vectors during startup In-Reply-To: Your message of "Wed, 01 Oct 2008 21:07:28 +0200." <20081001190728.GL16837@hoeg.nl> Message-ID: <69186.1222890415@critter.freebsd.dk> In message <20081001190728.GL16837@hoeg.nl>, Ed Schouten writes: >The reason I'm sending this message, is because based on discussions I >had with several people on IRC we've basically got two different >opinions on this patch: > >- One group of people liked the idea of the patch. Some people even said > the patch looks good enough to be committed. > >- Another group of people also liked the idea, but thought it would make > no sense to commit it, because it's not like it's a bottleneck right > now. It should only be committed if an increase in performance is > notable. 
> >I did some tests with the patch set, by running tens of millions of >fstat(), fchown(), etc. calls to see how performance was affected. It >turns out on a kernel without any debugging options enabled, the >performance gain is only 1-2%, which sounds pretty valid to me. My resistance to this patch is not quite what you describe above: By factoring the vop vectors out, you remove the ability to let default vectors pick up new functionality later. I will admit that I have no knowledge of this level of generality, dating back from Heidemans Phd. dissertation, being used for anything sensible. Furthermore, if I am not mistaken, your patch increases the kernel size. Absent a plausible performance improvement, I don't see any point of your change. And that brings me to your "1-2%" measurement quoted in IRC and above. I have earlier ranted about the difference between benchmarking and random number generators, and you may have joined the project after the latest of these. Please search the mail-archives for that topic, and then use the handy ministat(1) program, to see if you have actually show any net speed benefit. Once and if you cross that threshold, I am going to raise my next objection: Benchmarking "tens of millions of fstat(), fchown(), etc. calls" and showing a 1-2% difference is patently bogus, and certainly no reason for the change you propose. Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From ed at 80386.nl Wed Oct 1 20:21:10 2008 From: ed at 80386.nl (Ed Schouten) Date: Wed Oct 1 20:21:22 2008 Subject: Expanding vops in vop_vectors during startup In-Reply-To: <69186.1222890415@critter.freebsd.dk> References: <69186.1222890415@critter.freebsd.dk> Message-ID: <20081001202108.GO16837@hoeg.nl> Hello Poul-Henning, * Poul-Henning Kamp wrote: > In message <20081001190728.GL16837@hoeg.nl>, Ed Schouten writes: > > >The reason I'm sending this message, is because based on discussions I > >had with several people on IRC we've basically got two different > >opinions on this patch: > > > >- One group of people liked the idea of the patch. Some people even said > > the patch looks good enough to be committed. > > > >- Another group of people also liked the idea, but thought it would make > > no sense to commit it, because it's not like it's a bottleneck right > > now. It should only be committed if an increase in performance is > > notable. > > > >I did some tests with the patch set, by running tens of millions of > >fstat(), fchown(), etc. calls to see how performance was affected. It > >turns out on a kernel without any debugging options enabled, the > >performance gain is only 1-2%, which sounds pretty valid to me. > > > My resistance to this patch is not quite what you describe above: > > By factoring the vop vectors out, you remove the ability to let > default vectors pick up new functionality later. Could you give me an example of such functionality? You mean extending a vop_vector? That shouldn't be a problem, right? If such functionality really seems to be in conflict with this modification, it's not like we're carving things in stone here. > I will admit that I have no knowledge of this level of generality, > dating back from Heidemans Phd. dissertation, being used for anything > sensible. > > Furthermore, if I am not mistaken, your patch increases the kernel > size. 
Even though I admit I don't have that many file system types compiled into my kernel, binary size is 2203 bytes smaller with my patch applied. If you have a whole bunch of filesystems compiled into your kernel, these numbers might be a little different. We now need an extra SYSINIT per struct vop_vector. > Absent a plausible performance improvement, I don't see any point > of your change. > > And that brings me to your "1-2%" measurement quoted in IRC and > above. > > I have earlier ranted about the difference between benchmarking > and random number generators, and you may have joined the project > after the latest of these. > > Please search the mail-archives for that topic, and then use > the handy ministat(1) program, to see if you have actually > show any net speed benefit. > > Once and if you cross that threshold, I am going to raise my next > objection: > > Benchmarking "tens of millions of fstat(), fchown(), etc. calls" > and showing a 1-2% difference is patently bogus, and certainly > no reason for the change you propose. ministat(1) also mentions a 2% improvement with 95.0% confidence. Quite a nifty tool. Thanks for pointing me to it. About the benchmarks: the reason why I decided to test these things, was because I didn't want to measure disk I/O performance. I just wanted to see how performance was different with respect to VOP_*() calls. This means most of the cases I tested scenario's when data would already be available from cache or on pseudo-filesystems, where real disk I/O would not occur. But as I said: I am not going to commit it. End of discussion. -- Ed Schouten WWW: http://80386.nl/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20081001/9b5ea7bd/attachment.pgp

From 000.fbsd at quip.cz Wed Oct 1 23:01:34 2008
From: 000.fbsd at quip.cz (Miroslav Lachman)
Date: Wed Oct 1 23:01:41 2008
Subject: ZFS inodes issue (0 reported by df -hi)
Message-ID: <48E4016C.5000909@quip.cz>

I am stress-testing a ZFS filesystem on my test machine and sometimes see wrong output from df in the inode columns: it reports zero used inodes when more than 53 million are in use.

Below are df -hi reports taken a few seconds to a few minutes apart (some heavy copying tasks are running in the background).
[other partitions were stripped for clarity]

root@cage ~/# df -hi
Filesystem  Size  Used  Avail  Capacity  iused     ifree  %iused  Mounted
tank        364G  362G  1.9G   99%       0         15915  0%      /tank
                                         ^^^^^^           ^^^^^^

root@cage ~/# df -hi
Filesystem  Size  Used  Avail  Capacity  iused     ifree  %iused  Mounted
tank        364G  362G  1.9G   99%       53382659  15757  100%    /tank

root@cage ~/# df -hi
Filesystem  Size  Used  Avail  Capacity  iused     ifree  %iused  Mounted
tank        364G  362G  1.9G   99%       53391685  15503  100%    /tank

root@cage ~/# df -hi
Filesystem  Size  Used  Avail  Capacity  iused     ifree  %iused  Mounted
tank        364G  363G  1.3G   100%      0         10965  0%      /tank
                                         ^^^^^^           ^^^^^^

root@cage ~/# df -hi
Filesystem  Size  Used  Avail  Capacity  iused     ifree  %iused  Mounted
tank        364G  363G  1.3G   100%      53591981  10817  100%    /tank

root@cage ~/# df -hi
Filesystem  Size  Used  Avail  Capacity  iused     ifree  %iused  Mounted
tank        364G  363G  1.0G   100%      0         8267   0%      /tank
                                         ^^^^^^           ^^^^^^

root@cage ~/# df -hi
Filesystem  Size  Used  Avail  Capacity  iused     ifree  %iused  Mounted
tank        364G  363G  1.0G   100%      53672433  8245   100%    /tank

The next thing I do not understand is how ZFS uses inodes. The total number of inodes (iused+ifree) grows over time as the filesystem fills up.

Version: FreeBSD 7.0-RELEASE-p2 #0: Wed Jun 18 06:48:16 UTC 2008 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64 Miroslav Lachman From roberto at keltia.freenix.fr Wed Oct 1 23:08:58 2008 From: roberto at keltia.freenix.fr (Ollivier Robert) Date: Wed Oct 1 23:09:04 2008 Subject: ZFS inodes issue (0 reported by df -hi) In-Reply-To: <48E4016C.5000909@quip.cz> References: <48E4016C.5000909@quip.cz> Message-ID: <20081001230855.GA70612@keltia.freenix.fr> According to Miroslav Lachman: > Next thing that I do not understand is how ZFS uses inodes? The total > number of inodes (iused+ifree) grows by the time as filesystem is more > and more filled. znodes in ZFS are automatically created when needed (and I assume garbage collected as well). -- Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- roberto@keltia.freenix.fr Darwin sidhe.keltia.net Version 9.4.0: Mon Jun 9 19:30:53 PDT 2008; i386 From artis.caune at gmail.com Fri Oct 3 09:10:27 2008 From: artis.caune at gmail.com (Artis Caune) Date: Fri Oct 3 09:10:34 2008 Subject: ZFS on root with atime=off Message-ID: <9e20d71e0810030148p30cb5f4xb8fe368dccaeb87@mail.gmail.com> Hi everyone, I install ZFS on root just like in Andrew ZFSOnRoot wiki page. I don't use legacy mount points. I also set "atime=off" on tank and all partitions inherit it from tank. When I reboot after install, root file system is mounted with atime option: # zfs get atime tank NAME PROPERTY VALUE SOURCE tank atime on temporary I can fix this with creating entry in fstab for root fs with noatime, but maybe there is some way how to pass options to vfs.root.mountfrom? -- regards, Artis Caune <----. CCNA <----|==================== <----' didii FreeBSD From mboxindia at gmail.com Fri Oct 3 11:25:08 2008 From: mboxindia at gmail.com (Srinivas Srinivas) Date: Fri Oct 3 11:25:14 2008 Subject: options of configuration file Message-ID: Hello, May be these are beginner questions ... could you plz answer the following questions? 
I think "options" line and "device" line are in the configuration file, in order to support those features and devices. I see that sed script will parse that file. Could you plz let me know what will be done in this phase and how these lines will be transferred into gcc define directives(in the case of options) and inclusion of source files for compilation(in case of device). I have seen lint in some Makefiles, but dont know why it was used. Why is lint used? The "device" line adds device support to the kernel. What exactly does this mean. A more basic question is: how the devices are detected initially by the FreeBSD with the aid of hardware and bios? I think this is a broad topic. Could you plz provide a link if there is any info, you know, in the net? Thanks, Srinivas From des at des.no Fri Oct 3 11:52:13 2008 From: des at des.no (=?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?=) Date: Fri Oct 3 11:52:20 2008 Subject: options of configuration file In-Reply-To: (Srinivas Srinivas's message of "Fri, 3 Oct 2008 16:39:07 +0530") References: Message-ID: <86fxndvrq3.fsf@ds4.des.no> "Srinivas Srinivas" writes: > I think "options" line and "device" line are in the configuration file, in > order to support those features and devices. I see that sed script will > parse that file. No, it's a C program: config(8). DES -- Dag-Erling Sm?rgrav - des@des.no From nejc at skoberne.net Sat Oct 4 21:29:12 2008 From: nejc at skoberne.net (Nejc Skoberne) Date: Sat Oct 4 21:29:33 2008 Subject: Rewinding on unionfs and Subversion Message-ID: <48E7DC0E.1060008@skoberne.net> Hello, with my friend we tried to install Subversion in a jail on a unionfs filesystem. Unfortunately, the subversion FreeBSD port doesn't do "make check" before installing, so we spent quite some time debugging to find out it was actually a filesystem bug. 1. The bug in real world. 
-------------------------

When trying to do an "svn commit" or "svn import", errors similar to these show up in the Apache error log:

[Wed Oct 01 21:04:35 2008] [error] [client 10.1.1.11] Could not MERGE resource "/svn/test2/!svn/act/e28480c9-eb8f-dd11-808c-0018fe7759ca" into "/svn/test2". [409, #0]
[Wed Oct 01 21:04:35 2008] [error] [client 10.1.1.11] An error occurred while committing the transaction. [409, #2]
[Wed Oct 01 21:04:35 2008] [error] [client 10.1.1.11] Can't remove '/usr/local/svn/test2/db/transactions/2-2.txn/node.0.0' [409, #2]
[Wed Oct 01 21:04:35 2008] [error] [client 10.1.1.11] Can't remove file '/usr/local/svn/test2/db/transactions/2-2.txn/node.0.0': No such file or directory [409, #2]

The last error also shows up on the client side (svn binary). However, all actions complete successfully, so the error shouldn't appear at all. Running "make check" after building Subversion from source also fails with identical errors.

2. Bug location
---------------

After some ktracing and tracing the function calls back to the "root", we discovered that the bug is probably in the standard C library. We are not yet sure where exactly in libc it is located.

3. Proof of concept code
------------------------

The code below works fine on UFS but fails on unionfs. It was taken from the Subversion codebase and rewritten to call libc functions directly instead of going through the Apache APR library. It just shows that UFS and unionfs behave differently. If the bug in libc were fixed so that this example works, we believe Subversion would also work normally on unionfs.

-----------------------------------------------------------------------
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

typedef struct dirent direntry;

/* Recursively remove a directory tree. Returns 0 on success, 1 on error. */
int remove_dir(char *path)
{
    int need_rewind;

    if (path[0] == '\0')
        path = ".";

    DIR *dir;
    if ((dir = opendir(path)) == NULL) {
        perror("opendir");
        return 1;
    }

    do {
        need_rewind = 0;
        int ret;
        long position;
        direntry entry;
        direntry *result;

        while (1) {
            if (readdir_r(dir, &entry, &result) != 0) {
                perror("readdir_r");
                return 1;
            }
            if (result == NULL) {
                break;    /* end of directory */
            }
            printf("Working on '%s'\n", entry.d_name);
            printf(" entry.d_fileno is %lu\n", (unsigned long)entry.d_fileno);

            if ((entry.d_type == DT_DIR) &&
                ((strcmp(entry.d_name, ".") == 0) ||
                 (strcmp(entry.d_name, "..") == 0))) {
                continue;
            } else {
                char fullpath[1000];

                /* Something is about to be removed, so rescan afterwards. */
                need_rewind = 1;
                snprintf(fullpath, sizeof(fullpath), "%s/%s",
                    path, entry.d_name);
                printf(" fullpath is '%s'\n", fullpath);

                if (entry.d_type == DT_DIR) {
                    if (remove_dir(fullpath)) {
                        return 1;
                    }
                } else {
                    if (unlink(fullpath) == -1) {
                        perror("unlink");
                        return 1;
                    }
                }
            }
        }

        /* Rewind and rescan until a pass removes nothing. */
        if (need_rewind) {
            printf("Rewinding\n");
            rewinddir(dir);
        }
    } while (need_rewind);

    if (closedir(dir) == -1) {
        perror("closedir");
        return 1;
    }
    if (rmdir(path) == -1) {
        perror("rmdir");
        return 1;
    }
    return 0;
}

int main(int argc, const char *const argv[])
{
    /* Create test/ containing a single empty file, then remove the tree. */
    if (mkdir("test", 0755) == -1) {
        perror("mkdir");
        return 1;
    }

    int file = -1;
    if ((file = open("test/file", O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1) {
        perror("open");
        return 1;
    }
    if (close(file) == -1) {
        perror("close");
        return 1;
    }

    if (remove_dir("test")) {
        printf("Test failed\n");
        return 1;
    } else {
        printf("It works as it should\n");
        return 0;
    }
}
-------------------------------------------------------------------------

4. Conclusion
-------------

We will also file this as a PR, but I wanted to share it with this list too. Also, does anyone know who is "in charge" of the unionfs implementation on FreeBSD? Who is the "leading" developer?
This is not the only bug left to be fixed for unionfs to be production-ready, and we are really looking forward to cooperate with (a) commiter(s), who would be prepared to fix at least the most nasty ones. We are also preparing the list of the most annoying unionfs bugs, which still need to be fixed. This issue is of course one of them. Thanks, Nejc From vwe at FreeBSD.org Sun Oct 5 17:26:08 2008 From: vwe at FreeBSD.org (vwe@FreeBSD.org) Date: Sun Oct 5 17:26:19 2008 Subject: kern/125149: [nfs][panic] changing into .zfs dir from nfs client causes panic Message-ID: <200810051726.m95HQ8mk011664@freefall.freebsd.org> Old Synopsis: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop New Synopsis: [nfs][panic] changing into .zfs dir from nfs client causes panic State-Changed-From-To: feedback->open State-Changed-By: vwe State-Changed-When: Sun Oct 5 17:24:30 UTC 2008 State-Changed-Why: Over to maintainer(s). Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: vwe Responsible-Changed-When: Sun Oct 5 17:24:30 UTC 2008 Responsible-Changed-Why: Over to maintainer(s). http://www.freebsd.org/cgi/query-pr.cgi?pr=125149 From dimitar.vassilev at gmail.com Sun Oct 5 19:39:48 2008 From: dimitar.vassilev at gmail.com (Dimitar Vasilev) Date: Sun Oct 5 19:39:54 2008 Subject: zfs as layer distributor Message-ID: <59adc1a0810051210t4a3503aci2bc06ba0aa5376c3@mail.gmail.com> Hi all, Does someone use zfs as layer distributor on the top of hardware raid - (RAID10,RAID6,etc)? Could you give feedback on benefits and downsides. thanks in advance! 
From andrew at modulus.org Mon Oct 6 00:02:09 2008 From: andrew at modulus.org (Andrew Snow) Date: Mon Oct 6 00:02:21 2008 Subject: zfs as layer distributor In-Reply-To: <59adc1a0810051210t4a3503aci2bc06ba0aa5376c3@mail.gmail.com> References: <59adc1a0810051210t4a3503aci2bc06ba0aa5376c3@mail.gmail.com> Message-ID: <48E9556C.9060004@modulus.org> Dimitar Vasilev wrote: > Hi all, > Does someone use zfs as layer distributor on the top of hardware raid - > (RAID10,RAID6,etc)? I've found ZFS works faster when given more than one disk device. The reason being, it is smart about writing journal logs and metadata copies to different devices, resulting in higher performance by using idle disks. It also provides more "channels" for write clustering so higher throughput on write-heavy loads. Secondly if you use ZFS to provide RAID1 or RAID5, due to checksumming it can be smarter about which data it chooses in the event of a checksum failure. Hardware RAID can only do this with RAID6. Finally, when ZFS issues "flush cache" command to the disk for metadata and journal logs, there is less data to flush when you give it multiple smaller devices. If you have a single monolithic RAID device with a large (eg. 256mb) cache, it can ruin performance while the RAID card flushes its entire cache. (This can be disabled with a sysctl). - Andrew From dimitar.vassilev at gmail.com Mon Oct 6 05:25:48 2008 From: dimitar.vassilev at gmail.com (Dimitar Vasilev) Date: Mon Oct 6 05:25:54 2008 Subject: zfs as layer distributor In-Reply-To: <48E9556C.9060004@modulus.org> References: <59adc1a0810051210t4a3503aci2bc06ba0aa5376c3@mail.gmail.com> <48E9556C.9060004@modulus.org> Message-ID: <59adc1a0810052225w6d8c1b78r226ffe8ca4ccf35d@mail.gmail.com> 2008/10/6 Andrew Snow > Dimitar Vasilev wrote: > >> Hi all, >> Does someone use zfs as layer distributor on the top of hardware raid - >> (RAID10,RAID6,etc)? >> > > I've found ZFS works faster when given more than one disk device. 
The > reason being, it is smart about writing journal logs and metadata copies to > different devices, resulting in higher performance by using idle disks. It > also provides more "channels" for write clustering so higher throughput on > write-heavy loads. > > Secondly if you use ZFS to provide RAID1 or RAID5, due to checksumming it > can be smarter about which data it chooses in the event of a checksum > failure. Hardware RAID can only do this with RAID6. > > Finally, when ZFS issues "flush cache" command to the disk for metadata and > journal logs, there is less data to flush when you give it multiple smaller > devices. If you have a single monolithic RAID device with a large (eg. > 256mb) cache, it can ruin performance while the RAID card flushes its entire > cache. (This can be disabled with a sysctl). > > - Andrew Thanks Andrew, I have an Areca 1120 with RAID-6 and on the top of it a zfs as a layer distributor. So far I can tell the following: 1)works nice and fast 2)can be pain in the rear if your controller spits one of the disks due to power surge/etc. 3) zfs snapshots caused some crashes and bad descriptors on 7.0-stable as of 3 months behind- but it's somewhat expected. I'm thinking of raidz2 and setting the disks as pass-through. Would love if someone to hear if someone has tested hardware raid6 and zfs over it. Best regards, Dimitar From neldredge at math.ucsd.edu Mon Oct 6 06:40:02 2008 From: neldredge at math.ucsd.edu (Nate Eldredge) Date: Mon Oct 6 06:40:09 2008 Subject: kern/127213: [tmpfs] sendfile on tmpfs data corruption Message-ID: <200810060640.m966e2qg084501@freefall.freebsd.org> The following reply was made to PR kern/127213; it has been noted by GNATS. From: Nate Eldredge To: bug-followup@FreeBSD.org, citrin@citrin.ru Cc: Subject: Re: kern/127213: [tmpfs] sendfile on tmpfs data corruption Date: Sun, 5 Oct 2008 23:20:42 -0700 (PDT) Hi, I investigated this a bit. 
First, note that this bug has some security implications, because it appears that the garbage written by sendfile is kernel memory contents, which could contain something sensitive. It is sufficient for an attacker to have read access to a file on a mounted tmpfs. So it should really get fixed. I'm not terribly familiar with vfs or vm internals, but it appears that sendfile causes VOP_READ to be called with the IO_VMIO flag and a dummy uio. tmpfs_read (in sys/fs/tmpfs/tmpfs_vnops.c) doesn't handle this correctly; it always just copies the data to the supplied uio, which in this case does nothing. It looks like the data is supposed to make it into vn->v_object, and tmpfs_read doesn't do that. (If I understand it correctly, on a normal filesystem this is taken care of by bread().) I am not sure what the correct semantics of IO_VMIO are supposed to be, so I don't know what the correct fix would be. However, a quick fix is to not have a v_object at all; remove the call to vnode_create_vobject in tmpfs_open. This seems to be legal since procfs, etc, work that way. It does however mean that sendfile doesn't work at all. I am curious what was the point of having a v_object in the first place, since the data is already in virtual memory. Unless the goal was just to make sendfile work, which evidently wasn't successful. Incidentally, to the initial reporter, what application do you have that requires sendfile? In my experience, most things will fall back to a read/write loop if sendfile fails, since sendfile isn't available on all systems or under all circumstances. So if you apply the quick fix so that sendfile always fails, it might at least get your application working again. 
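
The read/write fallback mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from nginx, ftpd(8) or the PR: copy_fd_loop() is a hypothetical helper that an application might call after sendfile(2) returns an error, and it uses only plain read(2) and write(2).

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical fallback (not from the PR): copy everything from in_fd to
 * out_fd using read(2)/write(2). Returns the number of bytes copied, or -1
 * on error with errno set.
 */
static long long copy_fd_loop(int in_fd, int out_fd)
{
    char buf[65536];
    long long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        if (n == 0)
            return total;           /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the read */
            return -1;
        }

        /* write(2) may accept fewer bytes than requested, so loop. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted, retry the write */
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Unlike sendfile(2), this path copies the data through a userland buffer, so it is slower but does not depend on the filesystem providing pages through the vnode's VM object.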
-- Nate Eldredge neldredge@math.ucsd.edu From andrew at modulus.org Mon Oct 6 07:39:17 2008 From: andrew at modulus.org (Andrew Snow) Date: Mon Oct 6 07:39:25 2008 Subject: zfs as layer distributor In-Reply-To: <59adc1a0810052225w6d8c1b78r226ffe8ca4ccf35d@mail.gmail.com> References: <59adc1a0810051210t4a3503aci2bc06ba0aa5376c3@mail.gmail.com> <48E9556C.9060004@modulus.org> <59adc1a0810052225w6d8c1b78r226ffe8ca4ccf35d@mail.gmail.com> Message-ID: <48E9C08C.8040108@modulus.org> Dimitar Vasilev wrote: > Would love if someone to hear if someone has tested hardware raid6 and > zfs over it. Yes, I am using 3ware RAID6 over 16 disks as a single volume, because we also had UFS partitions that we wanted to keep. The performance is more than adequate, but not anywhere near if you used them as single disks. Personally - based on prior experience with certain hardware - I'd trust ZFS software raid over Areca hardware :-) How many disks do you have? If you can split up your disk pack into a group of between 5 and 10 smaller RAIDs, that is the optimal range for ZFS performance. - Andrew From maxim at macomnet.ru Mon Oct 6 07:40:04 2008 From: maxim at macomnet.ru (Maxim Konovalov) Date: Mon Oct 6 07:40:09 2008 Subject: kern/127213: [tmpfs] sendfile on tmpfs data corruption Message-ID: <200810060740.m967e3jt015065@freefall.freebsd.org> The following reply was made to PR kern/127213; it has been noted by GNATS. From: Maxim Konovalov To: Nate Eldredge Cc: bug-followup@freebsd.org Subject: Re: kern/127213: [tmpfs] sendfile on tmpfs data corruption Date: Mon, 6 Oct 2008 11:36:38 +0400 (MSD) Hello, On Mon, 6 Oct 2008, 06:40-0000, Nate Eldredge wrote: [...] > Incidentally, to the initial reporter, what application do you have > that requires sendfile? In my experience, most things will fall > back to a read/write loop if sendfile fails, since sendfile isn't > available on all systems or under all circumstances. 
So if you > apply the quick fix so that sendfile always fails, it might at > least get your application working again. > As stated in the PR Andrey used nginx (ports/www/nginx). But I could easily reproduce the bug with the stock ftpd(8) with the ftproot on tmpfs. -- Maxim Konovalov From dimitar.vassilev at gmail.com Mon Oct 6 08:30:43 2008 From: dimitar.vassilev at gmail.com (Dimitar Vasilev) Date: Mon Oct 6 08:30:50 2008 Subject: zfs as layer distributor In-Reply-To: <48E9C08C.8040108@modulus.org> References: <59adc1a0810051210t4a3503aci2bc06ba0aa5376c3@mail.gmail.com> <48E9556C.9060004@modulus.org> <59adc1a0810052225w6d8c1b78r226ffe8ca4ccf35d@mail.gmail.com> <48E9C08C.8040108@modulus.org> Message-ID: <59adc1a0810060130j14a1707ao132e980dc5f17138@mail.gmail.com> 2008/10/6 Andrew Snow > Dimitar Vasilev wrote: > >> Would love if someone to hear if someone has tested hardware raid6 and zfs >> over it. >> > > Yes, I am using 3ware RAID6 over 16 disks as a single volume, because we > also had UFS partitions that we wanted to keep. > > The performance is more than adequate, but not anywhere near if you used > them as single disks. Personally - based on prior experience with certain > hardware - I'd trust ZFS software raid over Areca hardware :-) > > How many disks do you have? If you can split up your disk pack into a group > of between 5 and 10 smaller RAIDs, that is the optimal range for ZFS > performance. > > > - Andrew I got 8 disks out of 12 possible. As the local reps of Areca told me - RAID6 is good over 12 disks. So next time I think I will go with raidz2 and pass-through. Best regards, Dimitar From bugmaster at FreeBSD.org Mon Oct 6 11:06:54 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Oct 6 11:07:46 2008 Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org Message-ID: <200810061106.m96B6sKP035463@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). 
The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/127420  fs         [gjournal] [panic] Journal overflow on gmirrored gjour
o kern/127213  fs         [tmpfs] sendfile on tmpfs data corruption
o kern/127029  fs         [panic] mount(8): trying to mount a write protected zi
o kern/126287  fs         [ufs] [panic] Kernel panics while mounting an UFS file
o kern/125536  fs         [ext2fs] ext 2 mounts cleanly but fails on commands li
o kern/125149  fs         [nfs][panic] changing into .zfs dir from nfs client ca
o kern/124621  fs         [ext3] Cannot mount ext2fs partition
o kern/122888  fs         [zfs] zfs hang w/ prefetch on, zil off while running t
o bin/122172   fs         [fs]: amd(8) automount daemon dies on 6.3-STABLE i386,
o bin/121072   fs         [smbfs] mount_smbfs(8) cannot normally convert the cha
o bin/118249   fs         mv(1): moving a directory changes its mtime
o kern/116170  fs         [panic] Kernel panic when mounting /tmp
o kern/114955  fs         [cd9660] [patch] [request] support for mask,dirmask,ui
o kern/114847  fs         [ntfs] [patch] [request] dirmask support for NTFS ala
o kern/114676  fs         [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o bin/114468   fs         [patch] [request] add -d option to umount(8) to detach
o bin/113838   fs         [patch] [request] mount(8): add support for relative p
o bin/113049   fs         [patch] [request] make quot(8) use getopt(3) and show
o kern/112658  fs         [smbfs] [patch] smbfs and caching problems (resolves b
o kern/93942   fs         [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D

20 problems total.

From citrin at citrin.ru Mon Oct 6 21:40:04 2008 From: citrin at citrin.ru (Anton Yuzhaninov) Date: Mon Oct 6 21:40:10 2008 Subject: kern/127213: [tmpfs] sendfile on tmpfs data corruption Message-ID: <200810062140.m96Le4uk088615@freefall.freebsd.org> The following reply was made to PR kern/127213; it has been noted by GNATS. From: Anton Yuzhaninov To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/127213: [tmpfs] sendfile on tmpfs data corruption Date: Tue, 07 Oct 2008 01:38:50 +0400 > Incidentally, to the initial reporter, what application do you have that > requires sendfile? We want to use tmpfs with our homegrown application, which can work only using sendfile(). Currently we use md+ufs, but with md data present in memory twice - in md and in VM cache. -- Anton Yuzhaninov From neldredge at math.ucsd.edu Mon Oct 6 22:30:04 2008 From: neldredge at math.ucsd.edu (Nate Eldredge) Date: Mon Oct 6 22:30:11 2008 Subject: kern/127213: [tmpfs] sendfile on tmpfs data corruption Message-ID: <200810062230.m96MU4lE091342@freefall.freebsd.org> The following reply was made to PR kern/127213; it has been noted by GNATS. From: Nate Eldredge To: Maxim Konovalov Cc: bug-followup@freebsd.org, JH Subject: Re: kern/127213: [tmpfs] sendfile on tmpfs data corruption Date: Mon, 6 Oct 2008 15:22:57 -0700 (PDT) On Mon, 6 Oct 2008, Maxim Konovalov wrote: > Hello, > > On Mon, 6 Oct 2008, 06:40-0000, Nate Eldredge wrote: > > [...] >> Incidentally, to the initial reporter, what application do you have >> that requires sendfile? In my experience, most things will fall >> back to a read/write loop if sendfile fails, since sendfile isn't >> available on all systems or under all circumstances. So if you >> apply the quick fix so that sendfile always fails, it might at >> least get your application working again. >> > As stated in the PR Andrey used nginx (ports/www/nginx). But I could > easily reproduce the bug with the stock ftpd(8) with the ftproot on > tmpfs. 

To simplify matters further, here is the testcase I used when testing
this, which uses sendfile to send some data over a unix domain socket.
Do:

./server /tmpfs/data mysocket &
./client mysocket >data.out
cmp /tmpfs/data data.out

If things work right, data and data.out should be identical. But if
data is a file on a tmpfs, data.out contains apparently random kernel
memory contents.

# This is a shell archive.  Save it in a file, remove anything before
# this line, and then unpack it by entering "sh file".  Note, it may
# create directories; files and directories will be owned by you and
# have default permissions.
#
# This archive contains:
#
#	Makefile
#	client.c
#	server.c
#	util.c
#	util.h
#
echo x - Makefile
sed 's/^X//' >Makefile << 'END-of-Makefile'
XCC = gcc
XCFLAGS = -Wall -W -g
X
Xall : server client
X
Xserver : server.o util.o
X	$(CC) -o $@ $>
X
Xclient : client.o util.o
X	$(CC) -o $@ $>
X
Xserver.o client.o util.o : util.h
X
Xclean :
X	rm -f server client *.o
END-of-Makefile
echo x - client.c
sed 's/^X//' >client.c << 'END-of-client.c'
X#include <stdio.h>
X#include <stdlib.h>
X#include "util.h"
X
Xint main(int argc, char *argv[]) {
X    int s;
X    if (argc < 2) {
X        fprintf(stderr, "Usage: %s socketname\n", argv[0]);
X        exit(1);
X    }
X    if ((s = connect_unix_socket(argv[1])) < 0) {
X        exit(1);
X    }
X    /* copy everything the server sends to stdout (fd 1) */
X    fake_sendfile(s, 1);
X    return 0;
X}
X
X
X
END-of-client.c
echo x - server.c
sed 's/^X//' >server.c << 'END-of-server.c'
X#include <fcntl.h>
X#include <stdio.h>
X#include <stdlib.h>
X#include "util.h"
X
Xint main(int argc, char *argv[]) {
X    int f, listener, connection;
X    if (argc < 3) {
X        fprintf(stderr, "Usage: %s filename socketname\n", argv[0]);
X        exit(1);
X    }
X    if ((f = open(argv[1], O_RDONLY)) < 0) {
X        perror(argv[1]);
X        exit(1);
X    }
X    if ((listener = listen_unix_socket(argv[2])) < 0) {
X        exit(1);
X    }
X    /* serve a single connection, then exit */
X    if ((connection = accept_unix_socket(listener)) >= 0) {
X        real_sendfile(f, connection);
X    }
X    return 0;
X}
X
X
X
END-of-server.c
echo x - util.c
sed 's/^X//' >util.c << 'END-of-util.c'
X/* send data from file to unix domain socket */
X
X#include <sys/types.h>
X#include <sys/socket.h>
X#include <sys/uio.h>
X#include <sys/un.h>
X#include <errno.h>
X#include <stdio.h>
X#include <stdlib.h>
X#include <string.h>
X#include <unistd.h>
X
Xint create_unix_socket(void) {
X    int fd;
X    if ((fd = socket(PF_LOCAL, SOCK_STREAM, 0)) < 0) {
X        perror("socket");
X        return -1;
X    }
X    return fd;
X}
X
Xint make_unix_sockaddr(const char *pathname, struct sockaddr_un *sa) {
X    memset(sa, 0, sizeof(*sa));
X    sa->sun_family = PF_LOCAL;
X    if (strlen(pathname) + 1 > sizeof(sa->sun_path)) {
X        fprintf(stderr, "%s: pathname too long (max %lu)\n",
X            pathname, (unsigned long)sizeof(sa->sun_path));
X        errno = ENAMETOOLONG;
X        return -1;
X    }
X    strcpy(sa->sun_path, pathname);
X    return 0;
X}
X
Xstatic char *sockname;
X/* atexit handler: remove the socket file created by bind() */
Xvoid delete_socket(void) {
X    unlink(sockname);
X}
X
Xint listen_unix_socket(const char *path) {
X    int fd;
X    struct sockaddr_un sa;
X    if (make_unix_sockaddr(path, &sa) < 0)
X        return -1;
X    if ((fd = create_unix_socket()) < 0)
X        return -1;
X    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
X        perror("bind");
X        close(fd);
X        return -1;
X    }
X    sockname = strdup(path);
X    atexit(delete_socket);
X
X    if (listen(fd, 5) < 0) {
X        perror("listen");
X        close(fd);
X        return -1;
X    }
X    return fd;
X}
X
Xint accept_unix_socket(int fd) {
X    int s;
X    if ((s = accept(fd, NULL, 0)) < 0) {
X        perror("accept");
X        return -1;
X    }
X    return s;
X}
X
Xint connect_unix_socket(const char *path) {
X    int fd;
X    struct sockaddr_un sa;
X    if (make_unix_sockaddr(path, &sa) < 0)
X        return -1;
X    if ((fd = create_unix_socket()) < 0)
X        return -1;
X    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
X        perror("connect");
X        close(fd);
X        return -1;
X    }
X    return fd;
X}
X
X#define BUFSIZE 65536
X
X/* read/write loop: copy from one descriptor to another until EOF */
Xint fake_sendfile(int from, int to) {
X    char buf[BUFSIZE];
X    int v;
X    int sent = 0;
X    while ((v = read(from, buf, BUFSIZE)) > 0) {
X        int d = 0;
X        /* write() may be partial; resume at the first unsent byte */
X        while (d < v) {
X            int w = write(to, buf + d, v - d);
X            if (w <= 0) {
X                perror("write");
X                return -1;
X            }
X            d += w;
X            sent += w;
X        }
X    }
X    if (v != 0) {
X        perror("read");
X        return -1;
X    }
X    return sent;
X}
X
X/* send the whole file (offset 0, nbytes 0 = until EOF) with sendfile(2) */
Xint real_sendfile(int from, int to) {
X    int v;
X    v = sendfile(from, to, 0, 0, NULL, NULL, 0);
X    if (v < 0) {
X        perror("sendfile");
X    }
X    return v;
X}
X
X
END-of-util.c
echo x - util.h
sed 's/^X//' >util.h << 'END-of-util.h'
X/* send data from file to unix domain socket */
X
X#include <sys/types.h>
X#include <sys/socket.h>
X#include <sys/un.h>
X
Xint create_unix_socket(void);
Xint make_unix_sockaddr(const char *pathname, struct sockaddr_un *sa);
Xint listen_unix_socket(const char *path);
Xint accept_unix_socket(int fd);
Xint connect_unix_socket(const char *path);
Xint fake_sendfile(int from, int to);
Xint real_sendfile(int from, int to);
X
X
END-of-util.h
exit

--
Nate Eldredge
neldredge@math.ucsd.edu

From maxim at macomnet.ru Tue Oct 7 03:50:05 2008
From: maxim at macomnet.ru (Maxim Konovalov)
Date: Tue Oct 7 03:50:11 2008
Subject: kern/127213: [tmpfs] sendfile on tmpfs data corruption
Message-ID: <200810070350.m973o411017984@freefall.freebsd.org>

The following reply was made to PR kern/127213; it has been noted by GNATS.

From: Maxim Konovalov
To: Nate Eldredge
Cc: bug-followup@freebsd.org, JH
Subject: Re: kern/127213: [tmpfs] sendfile on tmpfs data corruption
Date: Tue, 7 Oct 2008 07:43:17 +0400 (MSD)

On Mon, 6 Oct 2008, 15:22-0700, Nate Eldredge wrote:

> On Mon, 6 Oct 2008, Maxim Konovalov wrote:
>
> > Hello,
> >
> > On Mon, 6 Oct 2008, 06:40-0000, Nate Eldredge wrote:
> >
> > [...]
> > > Incidentally, to the initial reporter, what application do you have
> > > that requires sendfile? In my experience, most things will fall
> > > back to a read/write loop if sendfile fails, since sendfile isn't
> > > available on all systems or under all circumstances. So if you
> > > apply the quick fix so that sendfile always fails, it might at
> > > least get your application working again.
> > >
> > As stated in the PR Andrey used nginx (ports/www/nginx). But I could
> > easily reproduce the bug with the stock ftpd(8) with the ftproot on
> > tmpfs.
> >
> To simplify matters further, here is the testcase I used when
> testing this, which uses sendfile to send some data over a unix
> domain socket. Do:
>
> ./server /tmpfs/data mysocket &
> ./client mysocket >data.out
> cmp /tmpfs/data data.out
>
> If things work right, data and data.out should be identical. But if
> data is a file on a tmpfs, data.out contains apparently random
> kernel memory contents.
>
Hi Nate,

It'd be really nice if you extend
src/tools/regression/sockets/sendfile regression test for this bug.
Now it doesn't detect this case.

--
Maxim Konovalov

From jh at saunalahti.fi Tue Oct 7 15:40:05 2008
From: jh at saunalahti.fi (Jaakko Heinonen)
Date: Tue Oct 7 15:40:11 2008
Subject: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Message-ID: <200810071540.m97Fe4Oi012308@freefall.freebsd.org>

The following reply was made to PR kern/125149; it has been noted by GNATS.

From: Jaakko Heinonen
To: Volker Werth
Cc: Weldon Godfrey, bug-followup@FreeBSD.org
Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Date: Tue, 7 Oct 2008 18:36:30 +0300

Hi,

On 2008-10-02, Volker Werth wrote:
> > #8 0xffffffff804f06fa in vput (vp=0x0) at atomic.h:142
> > #9 0xffffffff8060670d in nfsrv_readdirplus (nfsd=0xffffff000584f100,
> > slp=0xffffff0005725900,
> > td=0xffffff00059a0340, mrq=0xffffffffdf761af0) at
> > /usr/src/sys/nfsserver/nfs_serv.c:3613
> > #10 0xffffffff80615a5d in nfssvc (td=Variable "td" is not available.
> > ) at /usr/src/sys/nfsserver/nfs_syscalls.c:461
> > #11 0xffffffff8072f377 in syscall (frame=0xffffffffdf761c70) at
> > /usr/src/sys/amd64/amd64/trap.c:852
> > #12 0xffffffff807158bb in Xfast_syscall () at
> > /usr/src/sys/amd64/amd64/exception.S:290
> > #13 0x000000080068746c in ?? ()
> > Previous frame inner to this frame (corrupt stack?)
>
> I think the problem is the NULL pointer to vput.
> A maintainer needs to
> check how nvp can get a NULL pointer (judging by assuming my fresh
> codebase is not too different from yours).

The bug is reproducible with nfs clients using readdirplus. FreeBSD
client doesn't use readdirplus by default but you can enable it with -l
mount option. Here are steps to reproduce the panic with FreeBSD nfs
client:

- nfs export a zfs file system
- on client mount the file system with -l mount option and list the
  zfs control directory

# mount_nfs -l x.x.x.x:/tank /mnt
# ls /mnt/.zfs

I see two bugs here:

1) nfsrv_readdirplus() doesn't check VFS_VGET() error status properly.
   It only checks for EOPNOTSUPP but other errors are ignored. This is
   the final reason for the panic and in theory it could happen for
   other file systems too. In this case VFS_VGET() returns EINVAL and
   results in a NULL nvp.
2) zfs VFS_VGET() returns EINVAL for .zfs control directory entries.
   Looking at zfs_vget() it tries to find the corresponding znode to
   fulfill the request. However control directory entries don't have
   backing znodes.

Here is a patch which fixes 1). The patch prevents the system from
panicking but a fix for 2) is needed to make readdirplus work with .zfs
directory.

%%%
Index: sys/nfsserver/nfs_serv.c
===================================================================
--- sys/nfsserver/nfs_serv.c	(revision 183511)
+++ sys/nfsserver/nfs_serv.c	(working copy)
@@ -3597,9 +3597,12 @@ again:
 	 * Probe one of the directory entries to see if the filesystem
 	 * supports VGET.
 	 */
-	if (VFS_VGET(vp->v_mount, dp->d_fileno, LK_EXCLUSIVE, &nvp) ==
-	    EOPNOTSUPP) {
-		error = NFSERR_NOTSUPP;
+	error = VFS_VGET(vp->v_mount, dp->d_fileno, LK_EXCLUSIVE, &nvp);
+	if (error) {
+		if (error == EOPNOTSUPP)
+			error = NFSERR_NOTSUPP;
+		else
+			error = NFSERR_SERVERFAULT;
 		vrele(vp);
 		vp = NULL;
 		free((caddr_t)cookies, M_TEMP);
%%%

And here's an attempt to add support for .zfs control directory entries
(bug 2)) in zfs_vget(). The patch is very experimental and it only works
for snapshots which are already active (mounted).

%%%
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(working copy)
@@ -759,9 +759,10 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 		VN_RELE(ZTOV(zp));
 		err = EINVAL;
 	}
-	if (err != 0)
-		*vpp = NULL;
-	else {
+	if (err != 0) {
+		/* try .zfs control directory */
+		err = zfsctl_vget(vfsp, ino, flags, vpp);
+	} else {
 		*vpp = ZTOV(zp);
 		vn_lock(*vpp, flags);
 	}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(working copy)
@@ -1047,6 +1047,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64
 	return (error);
 }
 
+int
+zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+	vnode_t *dvp, *vp;
+	zfsctl_snapdir_t *sdp;
+	zfsctl_node_t *zcp;
+	zfs_snapentry_t *sep;
+	int error;
+
+	*vpp = NULL;
+
+	ASSERT(zfsvfs->z_ctldir != NULL);
+	error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
+	    NULL, 0, NULL, kcred);
+	if (error != 0)
+		return (error);
+
+	if (nodeid == ZFSCTL_INO_ROOT || nodeid == ZFSCTL_INO_SNAPDIR) {
+		if (nodeid == ZFSCTL_INO_SNAPDIR)
+			*vpp = dvp;
+		else {
+			VN_RELE(dvp);
+			*vpp = zfsvfs->z_ctldir;
+			VN_HOLD(*vpp);
+		}
+		/* XXX: LK_RETRY? */
+		vn_lock(*vpp, flags | LK_RETRY);
+		return (0);
+	}
+
+	sdp = dvp->v_data;
+
+	mutex_enter(&sdp->sd_lock);
+	sep = avl_first(&sdp->sd_snaps);
+	while (sep != NULL) {
+		vp = sep->se_root;
+		zcp = vp->v_data;
+		if (zcp->zc_id == nodeid)
+			break;
+
+		sep = AVL_NEXT(&sdp->sd_snaps, sep);
+	}
+
+	if (sep != NULL) {
+		VN_HOLD(vp);
+		*vpp = vp;
+		vn_lock(*vpp, flags);
+	} else
+		error = EINVAL;
+
+	mutex_exit(&sdp->sd_lock);
+
+	VN_RELE(dvp);
+
+	return (error);
+}
 /*
  * Unmount any snapshots for the given filesystem. This is called from
  * zfs_umount() - if we have a ctldir, then go through and unmount all the
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(working copy)
@@ -60,6 +60,7 @@ int zfsctl_root_lookup(vnode_t *dvp, cha
     int flags, vnode_t *rdir, cred_t *cr);
 
 int zfsctl_lookup_objset(vfs_t *vfsp, uint64_t objsetid, zfsvfs_t **zfsvfsp);
+int zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp);
 
 #define	ZFSCTL_INO_ROOT		0x1
 #define	ZFSCTL_INO_SNAPDIR	0x2
%%%

--
Jaakko

From wgodfrey at ena.com Wed Oct 8 21:20:04 2008
From: wgodfrey at ena.com (Weldon Godfrey)
Date: Wed Oct 8 21:20:11 2008
Subject: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Message-ID: <200810082120.m98LK3fx090569@freefall.freebsd.org>

The following reply was made to PR kern/125149; it has been noted by GNATS.

From: "Weldon Godfrey"
To: "Jaakko Heinonen", "Volker Werth"
Cc:
Subject: RE: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Date: Wed, 8 Oct 2008 16:06:50 -0500

Thanks! I will apply these patches tomorrow.

Weldon

From wgodfrey at ena.com Thu Oct 9 13:20:04 2008
From: wgodfrey at ena.com (Weldon Godfrey)
Date: Thu Oct 9 13:20:10 2008
Subject: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Message-ID: <200810091320.m99DK4VU011916@freefall.freebsd.org>

The following reply was made to PR kern/125149; it has been noted by GNATS.

From: "Weldon Godfrey"
To: "Jaakko Heinonen", "Volker Werth"
Cc:
Subject: RE: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Date: Thu, 9 Oct 2008 08:19:38 -0500

I am rebuilding right now. FYI --- I modified the patch (corrected
number of lines)

-@@ -1047,6 +1047,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64
+@@ -1047,6 +1047,62 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64

Weldon

From wgodfrey at ena.com Thu Oct 9 16:30:04 2008
From: wgodfrey at ena.com (Weldon Godfrey)
Date: Thu Oct 9 16:30:11 2008
Subject: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Message-ID: <200810091630.m99GU3kF025823@freefall.freebsd.org>

The following reply was made to PR kern/125149; it has been noted by GNATS.

From: "Weldon Godfrey"
To: "Jaakko Heinonen", "Volker Werth"
Cc:
Subject: RE: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Date: Thu, 9 Oct 2008 11:23:12 -0500

Is this patch based on 8-CURRENT or 7-RELEASE? If 8-CURRENT, I don't
know if I can test as I would like to stick with 7-RELEASE for now.
However, I would like to move to ZFS11 so if there is a patch for 7 for
ZFS11 (assuming your patch is based in the v11 code), I would like to
apply that.

/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073:33: error: macro "vn_lock" requires 3 arguments, but only 2 given
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c: In function 'zfsctl_vget':
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073: error: 'vn_lock' undeclared (first use in this function)
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073: error: (Each undeclared identifier is reported only once
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073: error: for each function it appears in.)
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1093:22: error: macro "vn_lock" requires 3 arguments, but only 2 given

Weldon

From jh at saunalahti.fi Thu Oct 9 19:50:04 2008
From: jh at saunalahti.fi (Jaakko Heinonen)
Date: Thu Oct 9 19:50:10 2008
Subject: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Message-ID: <200810091950.m99Jo48P041883@freefall.freebsd.org>

The following reply was made to PR kern/125149; it has been noted by GNATS.

From: Jaakko Heinonen
To: Weldon Godfrey
Cc: bug-followup@freebsd.org
Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Date: Thu, 9 Oct 2008 22:44:38 +0300

On 2008-10-09, Weldon Godfrey wrote:
> Is this patch based on 8-CURRENT or 7-RELEASE? If 8-CURRENT, I don't
> know if I can test as I would like to stick with 7-RELEASE for now.

Patches are against head. Sorry that I didn't mention that. The nfs
patch applies against RELENG_7 with offset and here's the zfs patch
against RELENG_7.

(Disclaimer: compile tested only)

%%%
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(revision 183727)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(working copy)
@@ -759,9 +759,10 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 		VN_RELE(ZTOV(zp));
 		err = EINVAL;
 	}
-	if (err != 0)
-		*vpp = NULL;
-	else {
+	if (err != 0) {
+		/* try .zfs control directory */
+		err = zfsctl_vget(vfsp, ino, flags, vpp);
+	} else {
 		*vpp = ZTOV(zp);
 		vn_lock(*vpp, flags, curthread);
 	}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(revision 183727)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(working copy)
@@ -1044,6 +1044,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64
 	return (error);
 }
 
+int
+zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+	vnode_t *dvp, *vp;
+	zfsctl_snapdir_t *sdp;
+	zfsctl_node_t *zcp;
+	zfs_snapentry_t *sep;
+	int error;
+
+	*vpp = NULL;
+
+	ASSERT(zfsvfs->z_ctldir != NULL);
+	error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
+	    NULL, 0, NULL, kcred);
+	if (error != 0)
+		return (error);
+
+	if (nodeid == ZFSCTL_INO_ROOT || nodeid == ZFSCTL_INO_SNAPDIR) {
+		if (nodeid == ZFSCTL_INO_SNAPDIR)
+			*vpp = dvp;
+		else {
+			VN_RELE(dvp);
+			*vpp = zfsvfs->z_ctldir;
+			VN_HOLD(*vpp);
+		}
+		/* XXX: LK_RETRY? */
+		vn_lock(*vpp, flags | LK_RETRY, curthread);
+		return (0);
+	}
+
+	sdp = dvp->v_data;
+
+	mutex_enter(&sdp->sd_lock);
+	sep = avl_first(&sdp->sd_snaps);
+	while (sep != NULL) {
+		vp = sep->se_root;
+		zcp = vp->v_data;
+		if (zcp->zc_id == nodeid)
+			break;
+
+		sep = AVL_NEXT(&sdp->sd_snaps, sep);
+	}
+
+	if (sep != NULL) {
+		VN_HOLD(vp);
+		*vpp = vp;
+		vn_lock(*vpp, flags, curthread);
+	} else
+		error = EINVAL;
+
+	mutex_exit(&sdp->sd_lock);
+
+	VN_RELE(dvp);
+
+	return (error);
+}
 /*
  * Unmount any snapshots for the given filesystem. This is called from
  * zfs_umount() - if we have a ctldir, then go through and unmount all the
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(revision 183727)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(working copy)
@@ -60,6 +60,7 @@ int zfsctl_root_lookup(vnode_t *dvp, cha
     int flags, vnode_t *rdir, cred_t *cr);
 
 int zfsctl_lookup_objset(vfs_t *vfsp, uint64_t objsetid, zfsvfs_t **zfsvfsp);
+int zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp);
 
 #define	ZFSCTL_INO_ROOT		0x1
 #define	ZFSCTL_INO_SNAPDIR	0x2
%%%

--
Jaakko

From wgodfrey at ena.com Fri Oct 10 13:20:05 2008
From: wgodfrey at ena.com (Weldon Godfrey)
Date: Fri Oct 10 13:20:12 2008
Subject: kern/125149: [zfs][nfs] changing into .zfs dir from nfs client causes endless panic loop
Message-ID: <200810101320.m9ADK5g4063010@freefall.freebsd.org>

The following reply was made to PR kern/125149; it has been noted by GNATS.

From: "Weldon Godfrey" To: "Jaakko Heinonen" Cc: Subject: RE: kern/125149: [zfs][nfs] changing into .zfs dir from nfsclientcauses endless panic loop Date: Fri, 10 Oct 2008 08:11:17 -0500 That's okay, although I won't be able to help test since I am close to using the system in production. We can live without needing to go to the .zfs directory from a client. Also, I have set the nordirplus option on the clients now. Which, btw, could this also be the other issue I was seeing? When we tested rigorously from CentOS 3.x clients, after 2-3 hrs of testing, the system would panic. From the fbsd-fs list, it was noted from the backtrace that the vnode was becoming invalid. This seemed to be less of a case with CentOS 5.x clients (by a lot, although I did get 1 panic recently). I am rerunning the tests over this weekend. Thank you for helping! Weldon -----Original Message----- From: Jaakko Heinonen [mailto:jh@saunalahti.fi] Sent: Thursday, October 09, 2008 2:45 PM To: Weldon Godfrey Cc: bug-followup@freebsd.org Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfsclientcauses endless panic loop On 2008-10-09, Weldon Godfrey wrote: > Is this patch based on 8-CURRENT or 7-RELEASE? If 8-CURRENT, I don't > know if I can test as I would like to stick with 7-RELEASE for now. Patches are against head. Sorry that I didn't mention that. The nfs patch applies against RELENG_7 with offset and here's the zfs patch against RELENG_7.
(Disclaimer: compile tested only) %%% Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c =================================================================== --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c (revision 183727) +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c (working copy) @@ -759,9 +759,10 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla VN_RELE(ZTOV(zp)); err = EINVAL; } - if (err != 0) - *vpp = NULL; - else { + if (err != 0) { + /* try .zfs control directory */ + err = zfsctl_vget(vfsp, ino, flags, vpp); + } else { *vpp = ZTOV(zp); vn_lock(*vpp, flags, curthread); } Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c =================================================================== --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c (revision 183727) +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c (working copy) @@ -1044,6 +1044,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64 return (error); } +int +zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp) +{ + zfsvfs_t *zfsvfs = vfsp->vfs_data; + vnode_t *dvp, *vp; + zfsctl_snapdir_t *sdp; + zfsctl_node_t *zcp; + zfs_snapentry_t *sep; + int error; + + *vpp = NULL; + + ASSERT(zfsvfs->z_ctldir != NULL); + error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp, + NULL, 0, NULL, kcred); + if (error != 0) + return (error); + + if (nodeid == ZFSCTL_INO_ROOT || nodeid == ZFSCTL_INO_SNAPDIR) { + if (nodeid == ZFSCTL_INO_SNAPDIR) + *vpp = dvp; + else { + VN_RELE(dvp); + *vpp = zfsvfs->z_ctldir; + VN_HOLD(*vpp); + } + /* XXX: LK_RETRY?
*/ + vn_lock(*vpp, flags | LK_RETRY, curthread); + return (0); + } + + sdp = dvp->v_data; + + mutex_enter(&sdp->sd_lock); + sep = avl_first(&sdp->sd_snaps); + while (sep != NULL) { + vp = sep->se_root; + zcp = vp->v_data; + if (zcp->zc_id == nodeid) + break; + + sep = AVL_NEXT(&sdp->sd_snaps, sep); + } + + if (sep != NULL) { + VN_HOLD(vp); + *vpp = vp; + vn_lock(*vpp, flags, curthread); + } else + error = EINVAL; + + mutex_exit(&sdp->sd_lock); + + VN_RELE(dvp); + + return (error); +} /* * Unmount any snapshots for the given filesystem. This is called from * zfs_umount() - if we have a ctldir, then go through and unmount all the Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h =================================================================== --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h (revision 183727) +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h (working copy) @@ -60,6 +60,7 @@ int zfsctl_root_lookup(vnode_t *dvp, cha int flags, vnode_t *rdir, cred_t *cr); int zfsctl_lookup_objset(vfs_t *vfsp, uint64_t objsetid, zfsvfs_t **zfsvfsp); +int zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp); #define ZFSCTL_INO_ROOT 0x1 #define ZFSCTL_INO_SNAPDIR 0x2 %%% -- Jaakko From snb at moduli.net Mon Oct 13 10:14:20 2008 From: snb at moduli.net (Nick Barkas) Date: Mon Oct 13 10:14:27 2008 Subject: CFT: vm_lowmem event handler patch for dirhash Message-ID: Hello These past few months I worked with David Malone on a Google summer of code project for allocating memory for dirhash dynamically. The hope is that this will allow more memory to go to dirhashes on systems that have memory to spare, so that performance working with large directories can be improved.
We decided to actually keep the current memory allocation scheme for dirhash unchanged, but I added a vm_lowmem event handler (see EVENTHANDLER(9)) so that older dirhashes will be deleted when the kernel signals that it needs more memory. This should allow vfs.ufs.dirhash_maxmem to be increased quite a bit above its current 2MB default. I have patches against FreeBSD 7-STABLE (I've only tested this one on 7.0-RELEASE) here: http://www.nada.kth.se/~barkas/dirhash_lowmem_7-stable_2008-8-14.patch and 8-CURRENT here: http://www.nada.kth.se/~barkas/dirhash_lowmem_head_2008-10-12.patch Please try these out if you can, and let me know if you see any performance benefits! I've only tested this code with some benchmark scripts, so I am very interested to see how it does under real workloads. Note that the patches do not yet change the default vfs.ufs.dirhash_maxmem because I haven't figured out what would be a new reasonable default yet. To allow for more than 2MB of dirhashes you'll need to set that yourself. In most of my testing on a system with 1GB of memory, I set dirhash_maxmem to 64MB. This seems to be more than enough space to fit, for example, dirhashes for a million email messages in maildirs. There are some other parameters that one could tune and potentially achieve better performance gains. The new vfs.ufs.dirhash_reclaimage sysctl sets the number of seconds dirhashes can remain unused before a vm_lowmem event will unconditionally delete them. The default of 5s works reasonably well in all my tests, although it is somewhat workload dependent. If you change this value and see different performance under your workload, I would definitely like to hear about it. For more information and a bunch of graphs with results from my benchmarking, take a look at http://wiki.freebsd.org/DirhashDynamicMemory. Also, I'll be giving a talk about this project quite soon now at EuroBSDCon 2008. 
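The reclaim-by-age idea described above can be illustrated with a small userland sketch in C. This is only a toy model under stated assumptions, not the code from the patches: the names toy_dirhash, toy_cache, toy_cache_add and toy_cache_lowmem are hypothetical, the two limits merely stand in for vfs.ufs.dirhash_maxmem and vfs.ufs.dirhash_reclaimage, and the real handler may free memory by additional rules that are not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one directory's hash; not the kernel structure. */
struct toy_dirhash {
	size_t bytes;              /* memory held by this hash */
	long last_used;            /* time of last use, in seconds */
	struct toy_dirhash *next;
};

/* Hypothetical cache with the two limits discussed in the message. */
struct toy_cache {
	struct toy_dirhash *head;
	size_t bytes_used;
	size_t maxmem;             /* plays the role of dirhash_maxmem */
	long reclaimage;           /* plays the role of dirhash_reclaimage */
};

/* Add a hash of the given size; refuse (-1) if it would exceed maxmem. */
int
toy_cache_add(struct toy_cache *c, size_t bytes, long now)
{
	struct toy_dirhash *dh;

	assert(c->bytes_used <= c->maxmem);
	if (bytes > c->maxmem - c->bytes_used)
		return (-1);
	dh = malloc(sizeof(*dh));
	if (dh == NULL)
		return (-1);
	dh->bytes = bytes;
	dh->last_used = now;
	dh->next = c->head;
	c->head = dh;
	c->bytes_used += bytes;
	return (0);
}

/*
 * Model of a low-memory event: free every hash that has been unused
 * for at least reclaimage seconds.  Returns the number of bytes freed.
 */
size_t
toy_cache_lowmem(struct toy_cache *c, long now)
{
	struct toy_dirhash **pp = &c->head;
	size_t freed = 0;

	while (*pp != NULL) {
		struct toy_dirhash *dh = *pp;

		if (now - dh->last_used >= c->reclaimage) {
			*pp = dh->next;        /* unlink the stale entry */
			freed += dh->bytes;
			c->bytes_used -= dh->bytes;
			free(dh);
		} else {
			pp = &dh->next;        /* keep the recent entry */
		}
	}
	return (freed);
}
```

In this model a larger maxmem lets more hashes stay cached while memory is plentiful, and a low-memory event only costs the hashes that have sat idle for reclaimage seconds or longer, which is the trade-off the two tunables are meant to expose.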
Nick From bugmaster at FreeBSD.org Mon Oct 13 11:06:49 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Oct 13 11:07:44 2008 Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org Message-ID: <200810131106.m9DB6mEW029414@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/127420 fs [gjournal] [panic] Journal overflow on gmirrored gjour o kern/127213 fs [tmpfs] sendfile on tmpfs data corruption o kern/127029 fs [panic] mount(8): trying to mount a write protected zi o kern/126287 fs [ufs] [panic] Kernel panics while mounting an UFS file o kern/125536 fs [ext2fs] ext 2 mounts cleanly but fails on commands li o kern/125149 fs [nfs][panic] changing into .zfs dir from nfs client ca o kern/124621 fs [ext3] Cannot mount ext2fs partition o kern/122888 fs [zfs] zfs hang w/ prefetch on, zil off while running t o bin/122172 fs [fs]: amd(8) automount daemon dies on 6.3-STABLE i386, o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o bin/118249 fs mv(1): moving a directory changes its mtime o kern/116170 fs [panic] Kernel panic when mounting /tmp o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/93942 fs [vfs] [patch] 
panic: ufs_dirbad: bad dir (patch from D 20 problems total. From 000.fbsd at quip.cz Mon Oct 13 11:44:03 2008 From: 000.fbsd at quip.cz (Miroslav Lachman) Date: Mon Oct 13 11:44:10 2008 Subject: ZFS on backup fileserver - RAM usage Message-ID: <48F334A0.3080005@quip.cz> I am planning to install a new server for backups with 4x 1TB SATA II drives in RAIDZ. There will be about 20 separate filesystems from one zpool, a few jails with ssh (scp/sftp), rsync and maybe FTP daemons, no other services with huge RAM utilization. As FreeBSD 7.1(-BETA) amd64 still has some limits on kernel space memory, is there any benefit to putting more than 2GB or 3GB in this server? Will it be more stable or faster with, for example, 6GB of RAM? (I can buy it, RAM is really cheap these days, but will it make sense or is it a waste?) I am using this tuning on a testing machine (with 2GB RAM): vm.kmem_size="1024M" vm.kmem_size_max="1024M" vfs.zfs.prefetch_disable="1" vfs.zfs.arc_min="16M" vfs.zfs.arc_max="64M" kern.maxvnodes="400000" (recommendations from http://wiki.freebsd.org/ZFSTuningGuide) Has anybody had better results with other values? Miroslav Lachman From koitsu at FreeBSD.org Mon Oct 13 12:38:26 2008 From: koitsu at FreeBSD.org (Jeremy Chadwick) Date: Mon Oct 13 12:38:33 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <48F334A0.3080005@quip.cz> References: <48F334A0.3080005@quip.cz> Message-ID: <20081013123823.GA18738@icarus.home.lan> On Mon, Oct 13, 2008 at 01:44:32PM +0200, Miroslav Lachman wrote: > I am planning to install new server for backups with 4x 1TB SATA II > drives in RAIDZ. There will be about 20 separated filesystems from one > zpool, few jails with ssh (scp/sftp), rsync and maybe FTP daemons, no > other services with huge RAM utilization. As FreeBSD 7.1(-BETA) amd64 > still have some limits of kernel space memory, are there any benefits to > put more than 2GB or 3GB in this server? Will it be more stable or > faster with for example 6GB of RAM?
(I can buy it, RAM is really cheap > in these days, but will it have some sense or is it vaste?) Adding more RAM will work just fine for userland programs, meaning they will be able to make use of the additional RAM. The kernel, with regards to kmap and kmem, however, will not. If you need that functionality, you'll have to run CURRENT. > I am using this tuning on testing machine (with 2GB RAM): > vm.kmem_size="1024M" > vm.kmem_size_max="1024M" > vfs.zfs.prefetch_disable="1" > vfs.zfs.arc_min="16M" > vfs.zfs.arc_max="64M" > kern.maxvnodes="400000" > > (recommendations from http://wiki.freebsd.org/ZFSTuningGuide) > > Have somebody better results with another values? The values look fine, but keep in mind that you still may encounter crashing with that kind of load (you're sticking a lot of stuff on one single box, all of which utilises ZFS heavily). You'll simply need to tune these as those situations arise. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From ivoras at freebsd.org Mon Oct 13 13:05:19 2008 From: ivoras at freebsd.org (Ivan Voras) Date: Mon Oct 13 13:05:26 2008 Subject: CFT: vm_lowmem event handler patch for dirhash In-Reply-To: References: Message-ID: Nick Barkas wrote: > For more information and a bunch of graphs with results from my > benchmarking, take a look at > http://wiki.freebsd.org/DirhashDynamicMemory. Also, I'll be giving a > talk about this project quite soon now at EuroBSDCon 2008. It's interesting to see that the 2 MB cache is sometimes a little bit faster than the 64 MB one (e.g. kernel build, svn operations, mail). Can you point to an explanation? A bad hash function? Bucket count too low? Experimental inaccuracy? -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 252 bytes Desc: OpenPGP digital signature Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20081013/b85e2913/signature.pgp From jh at saunalahti.fi Mon Oct 13 15:20:10 2008 From: jh at saunalahti.fi (Jaakko Heinonen) Date: Mon Oct 13 15:20:16 2008 Subject: kern/125149: [zfs][nfs] changing into .zfs dir from nfsclientcauses endless panic loop Message-ID: <200810131520.m9DFK9lP054795@freefall.freebsd.org> The following reply was made to PR kern/125149; it has been noted by GNATS. From: Jaakko Heinonen To: Weldon Godfrey Cc: bug-followup@freebsd.org Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfsclientcauses endless panic loop Date: Mon, 13 Oct 2008 18:11:46 +0300 On 2008-10-10, Weldon Godfrey wrote: > Which, btw, could this also be the other issue I was seeing? When we > tested rigorously from CentOS 3.x clients, after 2-3 hrs of testing, > the system would panic. From the fbsd-fs list, it was noted from the > backtrace that the vnode was becoming invalid. Well, if you mean this message http://lists.freebsd.org/pipermail/freebsd-fs/2008-August/005120.html and Rick's analysis is correct, I am quite certain that they are different issues. -- Jaakko From matt at corp.spry.com Mon Oct 13 20:43:13 2008 From: matt at corp.spry.com (Matt Simerson) Date: Mon Oct 13 20:43:26 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <48F334A0.3080005@quip.cz> References: <48F334A0.3080005@quip.cz> Message-ID: <9AAEBB23-75E8-49B2-BA2F-0AF98F79280F@corp.spry.com> It all depends on your workload. If you work your backup servers hard (as I do, backing up thousands of OS instances), you'll have significant reliability problems using FreeBSD 7.1 and ZFS. After a crash that corrupted my file systems, I have moved to 8-head with Pawel's latest patch. My backup servers have between 16 and 24 disks each. The ones with 16GB of RAM crash far less frequently than my server that has only 2GB.
That one is getting upgraded soon. Matt On Oct 13, 2008, at 4:44 AM, Miroslav Lachman wrote: > I am planning to install new server for backups with 4x 1TB SATA II > drives in RAIDZ. There will be about 20 separated filesystems from > one zpool, few jails with ssh (scp/sftp), rsync and maybe FTP > daemons, no other services with huge RAM utilization. As FreeBSD > 7.1(-BETA) amd64 still have some limits of kernel space memory, are > there any benefits to put more then 2GB or 3GB in this server? Will > it be more stabel or faster with for example 6GB of RAM? (I can buy > it, RAM is really cheap in these days, but will it have some sense > or is it vaste?) > > I am using this tuning on testing machine (with 2GB RAM): > vm.kmem_size="1024M" > vm.kmem_size_max="1024M" > vfs.zfs.prefetch_disable="1" > vfs.zfs.arc_min="16M" > vfs.zfs.arc_max="64M" > kern.maxvnodes="400000" > > (recommendations from http://wiki.freebsd.org/ZFSTuningGuide) > > Have somebody better results with another values? > > Miroslav Lachman > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From 000.fbsd at quip.cz Mon Oct 13 21:08:09 2008 From: 000.fbsd at quip.cz (Miroslav Lachman) Date: Mon Oct 13 21:08:16 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <9AAEBB23-75E8-49B2-BA2F-0AF98F79280F@corp.spry.com> References: <48F334A0.3080005@quip.cz> <9AAEBB23-75E8-49B2-BA2F-0AF98F79280F@corp.spry.com> Message-ID: <48F3B8D6.6060309@quip.cz> Matt Simerson wrote: > > It all depends on your workload. If you work your backup serves hard > (as I do, backing up thousands of OS instances), you'll have > significant reliability problems using FreeBSD 7.1 and ZFS. After a > crash that corrupted my file systems, I have moved to 8-head with > Pawel's latest patch. > > My backup servers have between 16 and 24 disks each. 
The ones with 16GB > of RAM crash far less frequently than my server that has only 2GB. That > one is getting upgraded soon. > > Matt I am planning to back up about 10-15 servers (mainly webservers and a few mailservers) and am not expecting high load. Has 8-current with the latest ZFS patch fixed all the stability problems? Thanks to both of you for the suggestions. Miroslav Lachman From matt at corp.spry.com Mon Oct 13 21:21:54 2008 From: matt at corp.spry.com (Matt Simerson) Date: Mon Oct 13 21:22:01 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <48F3B8D6.6060309@quip.cz> References: <48F334A0.3080005@quip.cz> <9AAEBB23-75E8-49B2-BA2F-0AF98F79280F@corp.spry.com> <48F3B8D6.6060309@quip.cz> Message-ID: <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> On Oct 13, 2008, at 2:08 PM, Miroslav Lachman wrote: > Matt Simerson wrote: >> It all depends on your workload. If you work your backup servers >> hard (as I do, backing up thousands of OS instances), you'll have >> significant reliability problems using FreeBSD 7.1 and ZFS. After >> a crash that corrupted my file systems, I have moved to 8-head >> with Pawel's latest patch. >> My backup servers have between 16 and 24 disks each. The ones with >> 16GB of RAM crash far less frequently than my server that has only >> 2GB. That one is getting upgraded soon. >> Matt > > I am planning to backup about 10-15 servers (mainly webservers and > few mailservers) and not expecting high load. > Did 8-current with the latest ZFS patch fix all stability problems? > > Thanks for suggestions to both of you. > > Miroslav Lachman No, there are still stability issues under heavy load. They are just far less frequent under 8-current than under 7. I couldn't keep my systems up for more than 2 days before switching to 8. Running 8-head was better, but so far the best available configuration is 8-head with "the patch" applied.
Matt From yalur at mail.ru Mon Oct 13 21:55:28 2008 From: yalur at mail.ru (Ruslan Kovtun) Date: Mon Oct 13 21:55:34 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> References: <48F334A0.3080005@quip.cz> <48F3B8D6.6060309@quip.cz> <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> Message-ID: <200810140052.14718.yalur@mail.ru> > was better, but so far the best available configuration is 8-head with > "the patch" applied. Is this patch already applied after cvsup or I need apply it manualy? ______________________________ > On Oct 13, 2008, at 2:08 PM, Miroslav Lachman wrote: > > Matt Simerson wrote: > >> It all depends on your workload. If you work your backup serves > >> hard (as I do, backing up thousands of OS instances), you'll have > >> significant reliability problems using FreeBSD 7.1 and ZFS. After > >> a crash that corrupted my file systems, I have moved to 8-head > >> with Pawel's latest patch. > >> My backup servers have between 16 and 24 disks each. The ones with > >> 16GB of RAM crash far less frequently than my server that has only > >> 2GB. That one is getting upgraded soon. > >> Matt > > > > I am planning to backup about 10-15 servers (mainly webservers and > > few mailservers) and not expecting high load. > > Did 8-current with the latest ZFS patch fixed all stability problems? > > > > Thanks for suggestions to both of you. > > > > Miroslav Lachman > > No, there are still stability issues under heavy load. The are just > far less frequent under 8-current than under 7. I couldn't keep my > systems up for more than 2 days before switching to 8. Running 8-head > was better, but so far the best available configuration is 8-head with > "the patch" applied. 
> > Matt > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" -- ________________ Ruslan Kovtun mailto: yalur@mail.ru mob: +380503557878, +380919015095 ICQ: 277696182 From koitsu at FreeBSD.org Mon Oct 13 21:57:28 2008 From: koitsu at FreeBSD.org (Jeremy Chadwick) Date: Mon Oct 13 21:57:39 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <200810140052.14718.yalur@mail.ru> References: <48F334A0.3080005@quip.cz> <48F3B8D6.6060309@quip.cz> <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> <200810140052.14718.yalur@mail.ru> Message-ID: <20081013215722.GA29946@icarus.home.lan> On Tue, Oct 14, 2008 at 12:52:14AM +0300, Ruslan Kovtun wrote: > > was better, but so far the best available configuration is 8-head with > > "the patch" applied. > > Is this patch already applied after cvsup or I need apply it manualy? AFAIK, the ZFS patch in question *has not* been committed to HEAD; you will need to apply the patch manually. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From snb at moduli.net Tue Oct 14 16:53:53 2008 From: snb at moduli.net (Nick Barkas) Date: Tue Oct 14 16:54:00 2008 Subject: CFT: vm_lowmem event handler patch for dirhash In-Reply-To: References: Message-ID: On Mon, Oct 13, 2008 at 2:22 PM, Ivan Voras wrote: > Nick Barkas wrote: > >> For more information and a bunch of graphs with results from my >> benchmarking, take a look at >> http://wiki.freebsd.org/DirhashDynamicMemory. Also, I'll be giving a >> talk about this project quite soon now at EuroBSDCon 2008. > > It's interesting to see that the 2 MB cache is sometimes a little bit > faster than the 64 MB one (e.g. kernel build, svn operations, mail). 
Can > you point to an explanation? A bad hash function? Bucket count too low? > Experimental inaccuracy? Yes, some of the benchmark results have been a bit surprising to me. On 7.0, at least, the results seem pretty reasonable. The kernel build and svn operations tests were faster with 2MB than 64MB of memory without my vm_lowmem handler, or with the patch while using certain reclaim age values that apparently were not so good. This makes sense to me because, perhaps, these tasks can run faster when more memory is available for things other than dirhash. In both of these cases, using a 64MB limit for dirhash with the reclaim age at 5 seconds outperformed the default 2MB limit on an unpatched kernel. Mail creation is faster in all cases when there is a higher memory limit for dirhash, presumably because this is a task (inserting files into huge directories) that dirhash optimizes really well. On -CURRENT things seem to make less sense, though. Both the kernel build and svn operations are fastest when using 64MB of memory for dirhash, with no vm_lowmem handler. Mail creation is surprisingly fastest when using only a 2MB limit for dirhash, and slowest when using 64MB on an unpatched kernel. This is pretty much the opposite of what we see on 7.0. Using the kernel with the vm_lowmem handler results in performance that is usually somewhere between the results we get with the 2MB and 64MB unpatched kernel. I don't have a very good theory to explain these results right now. Most of the changes in the dirhash code between the 7 and 8 branches involve differences in locking. It would probably be necessary to do some profiling of the kernel and the benchmark processes both to get a better idea of what's going on. Before I do that, though, I was hoping to see what kind of results others may find using my code with a real world application. 
It is certainly possible that my results are strange simply because my tests are not so realistic :) Nick From yalur at mail.ru Tue Oct 14 21:00:14 2008 From: yalur at mail.ru (Ruslan Kovtun) Date: Tue Oct 14 21:00:23 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> References: <48F334A0.3080005@quip.cz> <48F3B8D6.6060309@quip.cz> <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> Message-ID: <200810142359.34263.yalur@mail.ru> I tried to apply this patch (zfs_20080727.patch) but I have found several errors (see below). Is this problem with patch or I need manualy apply these changes? Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h using Plan A... Hunk #11 failed at 347. Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h using Plan A... Hunk #11 failed at 347. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c using Plan A... Hunk #26 failed at 1053. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_replay.c using Plan A... Hunk #18 failed at 766. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c using Plan A... Hunk #82 failed at 3478. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c using Plan A... Hunk #6 failed at 136. Hunk #13 failed at 560. Hunk #18 failed at 759. Hunk #20 failed at 877. Hunk #26 failed at 1336. Patching file sys/kern/kern_jail.c using Plan A... Hunk #1 failed at 34. ____________________________________________________ > On Oct 13, 2008, at 2:08 PM, Miroslav Lachman wrote: > > Matt Simerson wrote: > >> It all depends on your workload. If you work your backup serves > >> hard (as I do, backing up thousands of OS instances), you'll have > >> significant reliability problems using FreeBSD 7.1 and ZFS. After > >> a crash that corrupted my file systems, I have moved to 8-head > >> with Pawel's latest patch. 
> >> My backup servers have between 16 and 24 disks each. The ones with > >> 16GB of RAM crash far less frequently than my server that has only > >> 2GB. That one is getting upgraded soon. > >> Matt > > > > I am planning to backup about 10-15 servers (mainly webservers and > > few mailservers) and not expecting high load. > > Did 8-current with the latest ZFS patch fixed all stability problems? > > > > Thanks for suggestions to both of you. > > > > Miroslav Lachman > > No, there are still stability issues under heavy load. The are just > far less frequent under 8-current than under 7. I couldn't keep my > systems up for more than 2 days before switching to 8. Running 8-head > was better, but so far the best available configuration is 8-head with > "the patch" applied. > > Matt > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" -- ________________ Ruslan Kovtun mailto: yalur@mail.ru mob: +380503557878, +380919015095 ICQ: 277696182 From ler at lerctr.org Tue Oct 14 22:27:22 2008 From: ler at lerctr.org (Larry Rosenman) Date: Tue Oct 14 22:27:29 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <200810142359.34263.yalur@mail.ru> References: <48F334A0.3080005@quip.cz> <48F3B8D6.6060309@quip.cz> <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> <200810142359.34263.yalur@mail.ru> Message-ID: <00a901c92e49$b480ccc0$1d826640$@org> This is a known issue. HEAD has diverged from the sources that the patch was generated against. I've been running with a HEAD from 2008-08-24 with the patch and upgraded ZFS pool/FS's and have no complaints. I'm just waiting patiently for pjd@FreeBSD.org to either update the patch or commit the updated bits to HEAD. I'm not going to update my system again till something is in svn/cvs. 
-- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 512-248-2683 E-Mail: ler@lerctr.org US Mail: 430 Valona Loop, Round Rock, TX 78681-3893 -----Original Message----- From: owner-freebsd-fs@freebsd.org [mailto:owner-freebsd-fs@freebsd.org] On Behalf Of Ruslan Kovtun Sent: Tuesday, October 14, 2008 4:00 PM To: freebsd-fs@freebsd.org Subject: Re: ZFS on backup fileserver - RAM usage I tried to apply this patch (zfs_20080727.patch) but I have found several errors (see below). Is this problem with patch or I need manualy apply these changes? Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h using Plan A... Hunk #11 failed at 347. Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h using Plan A... Hunk #11 failed at 347. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c using Plan A... Hunk #26 failed at 1053. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_replay.c using Plan A... Hunk #18 failed at 766. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c using Plan A... Hunk #82 failed at 3478. Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c using Plan A... Hunk #6 failed at 136. Hunk #13 failed at 560. Hunk #18 failed at 759. Hunk #20 failed at 877. Hunk #26 failed at 1336. Patching file sys/kern/kern_jail.c using Plan A... Hunk #1 failed at 34. ____________________________________________________ > On Oct 13, 2008, at 2:08 PM, Miroslav Lachman wrote: > > Matt Simerson wrote: > >> It all depends on your workload. If you work your backup serves > >> hard (as I do, backing up thousands of OS instances), you'll have > >> significant reliability problems using FreeBSD 7.1 and ZFS. After > >> a crash that corrupted my file systems, I have moved to 8-head > >> with Pawel's latest patch. > >> My backup servers have between 16 and 24 disks each. 
The ones with > >> 16GB of RAM crash far less frequently than my server that has only > >> 2GB. That one is getting upgraded soon. > >> Matt > > > > I am planning to backup about 10-15 servers (mainly webservers and > > few mailservers) and not expecting high load. > > Did 8-current with the latest ZFS patch fixed all stability problems? > > > > Thanks for suggestions to both of you. > > > > Miroslav Lachman > > No, there are still stability issues under heavy load. The are just > far less frequent under 8-current than under 7. I couldn't keep my > systems up for more than 2 days before switching to 8. Running 8-head > was better, but so far the best available configuration is 8-head with > "the patch" applied. > > Matt > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" -- ________________ Ruslan Kovtun mailto: yalur@mail.ru mob: +380503557878, +380919015095 ICQ: 277696182 _______________________________________________ freebsd-fs@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From matt at corp.spry.com Tue Oct 14 23:18:08 2008 From: matt at corp.spry.com (Matt Simerson) Date: Tue Oct 14 23:18:15 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <200810142359.34263.yalur@mail.ru> References: <48F334A0.3080005@quip.cz> <48F3B8D6.6060309@quip.cz> <16C9B293-7BBE-496D-BA0B-DC78299186ED@corp.spry.com> <200810142359.34263.yalur@mail.ru> Message-ID: <5B6268D5-3418-4BB1-A013-D04F299C68C1@corp.spry.com> As I mentioned earlier, you want to sync to -HEAD as of the date of the patch. Try something like this: $ more /usr/local/etc/cvsup-head *default host=cvsup8.FreeBSD.org *default base=/var/db *default prefix=/usr *default release=cvs tag=. 
*default delete use-rel-suffix *default date=2008.08.13.00.00.00 *default compress src-all On Oct 14, 2008, at 1:59 PM, Ruslan Kovtun wrote: > I tried to apply this patch (zfs_20080727.patch) but I have found > several > errors (see below). Is this problem with patch or I need manualy > apply these > changes? > > Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/ > zfs_context.h > using Plan A... > Hunk #11 failed at 347. > > Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/ > zfs_context.h > using Plan A... > Hunk #11 failed at 347. > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > zfs_ctldir.c > using Plan A... > Hunk #26 failed at 1053. > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > zfs_replay.c > using Plan A... > Hunk #18 failed at 766. > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > zfs_vnops.c using > Plan A... > Hunk #82 failed at 3478. > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > zfs_znode.c using > Plan A... > Hunk #6 failed at 136. > Hunk #13 failed at 560. > Hunk #18 failed at 759. > Hunk #20 failed at 877. > Hunk #26 failed at 1336. > > Patching file sys/kern/kern_jail.c using Plan A... > Hunk #1 failed at 34. > > > ____________________________________________________ >> On Oct 13, 2008, at 2:08 PM, Miroslav Lachman wrote: >>> Matt Simerson wrote: >>>> It all depends on your workload. If you work your backup serves >>>> hard (as I do, backing up thousands of OS instances), you'll have >>>> significant reliability problems using FreeBSD 7.1 and ZFS. After >>>> a crash that corrupted my file systems, I have moved to 8-head >>>> with Pawel's latest patch. >>>> My backup servers have between 16 and 24 disks each. The ones with >>>> 16GB of RAM crash far less frequently than my server that has only >>>> 2GB. That one is getting upgraded soon. 
>>>> Matt >>> >>> I am planning to backup about 10-15 servers (mainly webservers and >>> few mailservers) and not expecting high load. >>> Did 8-current with the latest ZFS patch fixed all stability >>> problems? >>> >>> Thanks for suggestions to both of you. >>> >>> Miroslav Lachman >> >> No, there are still stability issues under heavy load. The are just >> far less frequent under 8-current than under 7. I couldn't keep my >> systems up for more than 2 days before switching to 8. Running 8- >> head >> was better, but so far the best available configuration is 8-head >> with >> "the patch" applied. >> >> Matt >> _______________________________________________ >> freebsd-fs@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > > > > -- > ________________ > Ruslan Kovtun > mailto: yalur@mail.ru > mob: +380503557878, +380919015095 > ICQ: 277696182 From yalur at mail.ru Wed Oct 15 09:30:04 2008 From: yalur at mail.ru (Ruslan Kovtun) Date: Wed Oct 15 09:30:11 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <5B6268D5-3418-4BB1-A013-D04F299C68C1@corp.spry.com> References: <48F334A0.3080005@quip.cz> <200810142359.34263.yalur@mail.ru> <5B6268D5-3418-4BB1-A013-D04F299C68C1@corp.spry.com> Message-ID: <200810151229.54582.yalur@mail.ru> Thanks, Matt. I decide to correct these errors manually. Now rebuilding is in process. But I am afraid that I have missed something and it can corrupt my data in zfs pool. Lets check. :) __________________________________________ > As I mentioned earlier, you want to sync to -HEAD as of the date of > the patch. Try something like this: > > > $ more /usr/local/etc/cvsup-head > *default host=cvsup8.FreeBSD.org > *default base=/var/db > *default prefix=/usr > *default release=cvs tag=. 
> *default delete use-rel-suffix > *default date=2008.08.13.00.00.00 > *default compress > src-all > > On Oct 14, 2008, at 1:59 PM, Ruslan Kovtun wrote: > > I tried to apply this patch (zfs_20080727.patch) but I have found > > several > > errors (see below). Is this problem with patch or I need manualy > > apply these > > changes? > > > > Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/ > > zfs_context.h > > using Plan A... > > Hunk #11 failed at 347. > > > > Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/ > > zfs_context.h > > using Plan A... > > Hunk #11 failed at 347. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_ctldir.c > > using Plan A... > > Hunk #26 failed at 1053. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_replay.c > > using Plan A... > > Hunk #18 failed at 766. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_vnops.c using > > Plan A... > > Hunk #82 failed at 3478. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_znode.c using > > Plan A... > > Hunk #6 failed at 136. > > Hunk #13 failed at 560. > > Hunk #18 failed at 759. > > Hunk #20 failed at 877. > > Hunk #26 failed at 1336. > > > > Patching file sys/kern/kern_jail.c using Plan A... > > Hunk #1 failed at 34. > > > > > > ____________________________________________________ > > > >> On Oct 13, 2008, at 2:08 PM, Miroslav Lachman wrote: > >>> Matt Simerson wrote: > >>>> It all depends on your workload. If you work your backup serves > >>>> hard (as I do, backing up thousands of OS instances), you'll have > >>>> significant reliability problems using FreeBSD 7.1 and ZFS. After > >>>> a crash that corrupted my file systems, I have moved to 8-head > >>>> with Pawel's latest patch. > >>>> My backup servers have between 16 and 24 disks each. The ones with > >>>> 16GB of RAM crash far less frequently than my server that has only > >>>> 2GB. 
That one is getting upgraded soon. > >>>> Matt > >>> > >>> I am planning to backup about 10-15 servers (mainly webservers and > >>> few mailservers) and not expecting high load. > >>> Did 8-current with the latest ZFS patch fixed all stability > >>> problems? > >>> > >>> Thanks for suggestions to both of you. > >>> > >>> Miroslav Lachman > >> > >> No, there are still stability issues under heavy load. The are just > >> far less frequent under 8-current than under 7. I couldn't keep my > >> systems up for more than 2 days before switching to 8. Running 8- > >> head > >> was better, but so far the best available configuration is 8-head > >> with > >> "the patch" applied. > >> > >> Matt > >> _______________________________________________ > >> freebsd-fs@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs > >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > > > > -- > > ________________ > > Ruslan Kovtun > > mailto: yalur@mail.ru > > mob: +380503557878, +380919015095 > > ICQ: 277696182 > > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" -- ________________ Ruslan Kovtun mailto: yalur@mail.ru mob: +380503557878, +380919015095 ICQ: 277696182 From yalur at mail.ru Wed Oct 15 17:54:57 2008 From: yalur at mail.ru (Ruslan Kovtun) Date: Wed Oct 15 17:55:04 2008 Subject: ZFS on backup fileserver - RAM usage In-Reply-To: <5B6268D5-3418-4BB1-A013-D04F299C68C1@corp.spry.com> References: <48F334A0.3080005@quip.cz> <200810142359.34263.yalur@mail.ru> <5B6268D5-3418-4BB1-A013-D04F299C68C1@corp.spry.com> Message-ID: <200810152054.53638.yalur@mail.ru> I have downloaded snapshot ftp://ftp.freebsd.org/pub/FreeBSD/snapshots/200807/8.0-CURRENT-200807-amd64-disc1.iso and extract src folder from it. I applied patch zfs_20080727.patch without ay errors. 
Buildworld finished succesfull but make kernel failed (see below). How can I solve this problem? /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:178: error: expected declaration specifiers or '...' before string constant /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:178: error: expected declaration specifiers or '...' before '&' token /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:178: warning: data definition has no type or storage class /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:178: warning: type defaults to 'int' in declaration of 'TUNABLE_QUAD' /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:181: error: expected ')' before '(' token /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:183: error: expected ')' before '(' token /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:398: error: expected ')' before '(' token /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:400: error: expected ')' before '(' token *** Error code 1 Stop in /usr/src/sys/modules/zfs. *** Error code 1 Stop in /usr/src/sys/modules. *** Error code 1 Stop in /usr/obj/usr/src/sys/MYKERNEL. *** Error code 1 Stop in /usr/src. *** Error code 1 Stop in /usr/src. ___________________________________________ > As I mentioned earlier, you want to sync to -HEAD as of the date of > the patch. Try something like this: > > > $ more /usr/local/etc/cvsup-head > *default host=cvsup8.FreeBSD.org > *default base=/var/db > *default prefix=/usr > *default release=cvs tag=. > *default delete use-rel-suffix > *default date=2008.08.13.00.00.00 > *default compress > src-all > > On Oct 14, 2008, at 1:59 PM, Ruslan Kovtun wrote: > > I tried to apply this patch (zfs_20080727.patch) but I have found > > several > > errors (see below). 
Is this problem with patch or I need manualy > > apply these > > changes? > > > > Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/ > > zfs_context.h > > using Plan A... > > Hunk #11 failed at 347. > > > > Patching file cddl/contrib/opensolaris/lib/libzpool/common/sys/ > > zfs_context.h > > using Plan A... > > Hunk #11 failed at 347. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_ctldir.c > > using Plan A... > > Hunk #26 failed at 1053. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_replay.c > > using Plan A... > > Hunk #18 failed at 766. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_vnops.c using > > Plan A... > > Hunk #82 failed at 3478. > > > > Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ > > zfs_znode.c using > > Plan A... > > Hunk #6 failed at 136. > > Hunk #13 failed at 560. > > Hunk #18 failed at 759. > > Hunk #20 failed at 877. > > Hunk #26 failed at 1336. > > > > Patching file sys/kern/kern_jail.c using Plan A... > > Hunk #1 failed at 34. > > > > > > ____________________________________________________ > > > >> On Oct 13, 2008, at 2:08 PM, Miroslav Lachman wrote: > >>> Matt Simerson wrote: > >>>> It all depends on your workload. If you work your backup serves > >>>> hard (as I do, backing up thousands of OS instances), you'll have > >>>> significant reliability problems using FreeBSD 7.1 and ZFS. After > >>>> a crash that corrupted my file systems, I have moved to 8-head > >>>> with Pawel's latest patch. > >>>> My backup servers have between 16 and 24 disks each. The ones with > >>>> 16GB of RAM crash far less frequently than my server that has only > >>>> 2GB. That one is getting upgraded soon. > >>>> Matt > >>> > >>> I am planning to backup about 10-15 servers (mainly webservers and > >>> few mailservers) and not expecting high load. > >>> Did 8-current with the latest ZFS patch fixed all stability > >>> problems? 
> >>> > >>> Thanks for suggestions to both of you. > >>> > >>> Miroslav Lachman > >> > >> No, there are still stability issues under heavy load. The are just > >> far less frequent under 8-current than under 7. I couldn't keep my > >> systems up for more than 2 days before switching to 8. Running 8- > >> head > >> was better, but so far the best available configuration is 8-head > >> with > >> "the patch" applied. > >> > >> Matt > >> _______________________________________________ > >> freebsd-fs@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs > >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > > > > -- > > ________________ > > Ruslan Kovtun > > mailto: yalur@mail.ru > > mob: +380503557878, +380919015095 > > ICQ: 277696182 > > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" -- ________________ Ruslan Kovtun mailto: yalur@mail.ru mob: +380503557878, +380919015095 ICQ: 277696182 From a.smith at ukgrid.net Fri Oct 17 15:34:17 2008 From: a.smith at ukgrid.net (andys) Date: Fri Oct 17 15:34:24 2008 Subject: bsdlabel partiton c error message on new install Message-ID: Hi, on a newly installed FreeBSD 7.0 system on a dell 1950 server I see the following error from bsdlabel. Is there any known issues with this or is the only reasonable explanation that I have managed to mess it up without even knowing? :P And should I manually change the partition c to fix the prob? Is this safe to do? 
bsdlabel -A /dev/da0s1
# /dev/da0s1:
type: SCSI
disk: da0s1
label:
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 17750
sectors/unit: 285155328
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # milliseconds
track-to-track seek: 0  # milliseconds
drivedata: 0

8 partitions:
#         size    offset    fstype   [fsize bsize bps/cpg]
  a:  20971520         0    4.2BSD     2048 16384 28552
  b:  20971520  75497472      swap
  c: 285153687         0    unused        0     0         # "raw" part, don't edit
  d:  20971520  20971520    4.2BSD     2048 16384 28552
  e:  20971520  41943040    4.2BSD     2048 16384 28552
  f:  12582912  62914560    4.2BSD     2048 16384 28552
bsdlabel: partition c doesn't cover the whole unit!
bsdlabel: An incorrect partition c may cause problems for standard system utilities

thanks for any advice, I'm not really confident with the FreeBSD disk
management as I haven't used it much,

thanks Andy.

From remko at FreeBSD.org Sat Oct 18 08:14:12 2008
From: remko at FreeBSD.org (remko@FreeBSD.org)
Date: Sat Oct 18 08:14:24 2008
Subject: kern/128173: ext3fs: ls gives "Input/output error" on mounted ext3 filesystem
Message-ID: <200810180814.m9I8ECg2037826@freefall.freebsd.org>

Old Synopsis: ls gives "Input/output error" on mounted ext3 filesystem
New Synopsis: ext3fs: ls gives "Input/output error" on mounted ext3 filesystem

Responsible-Changed-From-To: freebsd-i386->freebsd-fs
Responsible-Changed-By: remko
Responsible-Changed-When: Sat Oct 18 08:13:50 UTC 2008
Responsible-Changed-Why:
Reassign to filesystem team.

http://www.freebsd.org/cgi/query-pr.cgi?pr=128173

From linimon at FreeBSD.org Sun Oct 19 13:16:26 2008
From: linimon at FreeBSD.org (linimon@FreeBSD.org)
Date: Sun Oct 19 13:16:38 2008
Subject: kern/119868: [zfs] [patch] 7.0 kernel panic during boot with ZFS and WD1600JS
Message-ID: <200810191316.m9JDGP37030230@freefall.freebsd.org>

Old Synopsis: [zfs] 7.0 kernel panic during boot with ZFS and WD1600JS
New Synopsis: [zfs] [patch] 7.0 kernel panic during boot with ZFS and WD1600JS

State-Changed-From-To: open->analyzed
State-Changed-By: linimon
State-Changed-When: Sun Oct 19 13:15:19 UTC 2008
State-Changed-Why:
Patch has been submitted and has been confirmed as fixing the problem.

Responsible-Changed-From-To: freebsd-bugs->freebsd-fs
Responsible-Changed-By: linimon
Responsible-Changed-When: Sun Oct 19 13:15:19 UTC 2008
Responsible-Changed-Why:

http://www.freebsd.org/cgi/query-pr.cgi?pr=119868

From bugmaster at FreeBSD.org Mon Oct 20 11:06:51 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Oct 20 11:07:45 2008
Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org
Message-ID: <200810201106.m9KB6o7P082647@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/128173  fs         [ext2fs] ls gives "Input/output error" on mounted ext3
o kern/127420  fs         [gjournal] [panic] Journal overflow on gmirrored gjour
o kern/127213  fs         [tmpfs] sendfile on tmpfs data corruption
o kern/127029  fs         [panic] mount(8): trying to mount a write protected zi
o kern/126287  fs         [ufs] [panic] Kernel panics while mounting an UFS file
o kern/125536  fs         [ext2fs] ext 2 mounts cleanly but fails on commands li
o kern/125149  fs         [nfs][panic] changing into .zfs dir from nfs client ca
o kern/124621  fs         [ext3] Cannot mount ext2fs partition
o kern/122888  fs         [zfs] zfs hang w/ prefetch on, zil off while running t
o bin/122172   fs         [fs]: amd(8) automount daemon dies on 6.3-STABLE i386,
o bin/121072   fs         [smbfs] mount_smbfs(8) cannot normally convert the cha
a kern/119868  fs         [zfs] [patch] 7.0 kernel panic during boot with ZFS an
o bin/118249   fs         mv(1): moving a directory changes its mtime
o kern/116170  fs         [panic] Kernel panic when mounting /tmp
o kern/114955  fs         [cd9660] [patch] [request] support for mask,dirmask,ui
o kern/114847  fs         [ntfs] [patch] [request] dirmask support for NTFS ala
o kern/114676  fs         [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o bin/114468   fs         [patch] [request] add -d option to umount(8) to detach
o bin/113838   fs         [patch] [request] mount(8): add support for relative p
o bin/113049   fs         [patch] [request] make quot(8) use getopt(3) and show
o kern/112658  fs         [smbfs] [patch] smbfs and caching problems (resolves b
o kern/93942   fs         [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D

22 problems total.

From lulf at stud.ntnu.no Tue Oct 21 08:36:18 2008
From: lulf at stud.ntnu.no (Ulf Lilleengen)
Date: Tue Oct 21 08:36:30 2008
Subject: bsdlabel partiton c error message on new install
In-Reply-To:
References:
Message-ID: <20081021083415.GA1571@carrot.studby.ntnu.no>

On Fri, Oct 17, 2008 at 04:16:14pm +0100, andys wrote:
> Hi,
>
> on a newly installed FreeBSD 7.0 system on a dell 1950 server I see the
> following error from bsdlabel. Is there any known issues with this or is the
> only reasonable explanation that I have managed to mess it up without even
> knowing? :P And should I manually change the partition c to fix the prob? Is
> this safe to do?
>
> bsdlabel -A /dev/da0s1
> # /dev/da0s1:
> type: SCSI
> disk: da0s1
> label:
> flags:
> bytes/sector: 512
> sectors/track: 63
> tracks/cylinder: 255
> sectors/cylinder: 16065
> cylinders: 17750
> sectors/unit: 285155328
> rpm: 3600
> interleave: 1
> trackskew: 0
> cylinderskew: 0
> headswitch: 0           # milliseconds
> track-to-track seek: 0  # milliseconds
> drivedata: 0
>
> 8 partitions:
> #         size    offset    fstype   [fsize bsize bps/cpg]
>   a:  20971520         0    4.2BSD     2048 16384 28552
>   b:  20971520  75497472      swap
>   c: 285153687         0    unused        0     0         # "raw" part, don't 285155328
> edit
>   d:  20971520  20971520    4.2BSD     2048 16384 28552
>   e:  20971520  41943040    4.2BSD     2048 16384 28552
>   f:  12582912  62914560    4.2BSD     2048 16384 28552
> bsdlabel: partition c doesn't cover the whole unit!
> bsdlabel: An incorrect partition c may cause problems for standard system
> utilities
>
>
> thanks for any advice, I'm not really confident with the FreeBSD disk
> management as I haven't used it much,

Hello,

This is completely ok. The reason that you might get warnings like this is
that fdisk tries to put the sector number on a cylinder boundary. If that
means that the partition is larger than the actual disklabel size, that is
ok. What would have been a problem is if the disklabel extended past the
partition size! (I think the installer makes sure this does not happen).

You do waste a few sectors because of this, but unless you are really
interested in getting them back, I would not start bothering with it. One way
to "fix" it is to do a bsdlabel -e and change
  c: 285153687         0    unused        0     0
to
  c: 285155328         0    unused        0     0

But again, not many sectors are currently wasted.

--
Ulf Lilleengen

From a.smith at ukgrid.net Tue Oct 21 09:42:08 2008
From: a.smith at ukgrid.net (andys)
Date: Tue Oct 21 09:42:15 2008
Subject: bsdlabel partiton c error message on new install
In-Reply-To: <20081021083415.GA1571@carrot.studby.ntnu.no>
References: <20081021083415.GA1571@carrot.studby.ntnu.no>
Message-ID:

Hi Ulf,

thanks a lot for your answer. Previously I'd asked this question on the
freebsd-questions list and someone suggested asking it here as they didn't
know the answer; however, I did get pretty much 2 responses telling me to
reinstall the OS!! :S

For example I had this answer:

http://lists.freebsd.org/pipermail/freebsd-questions/2008-October/184617.html

So I assume you would disagree with this and the other person who advised me
this was a serious error? And if this actually isn't a problem, does bsdlabel
need to be updated (and the man page) to reflect the fact this can be seen
on a healthy system?

thanks a lot! Andy.

>
> This is completely ok. The reason that you might get warnings like this is
> that fdisk tries to put the sector number on a cylinder boundary. If that
> means that the partition is larger than the actual disklabel size, that is
> ok. What would have been a problem is if the disklabel extended past the
> partition size! (I think the installer makes sure this does not happen).
>
> You do waste a few sectors because of this, but unless you are really
> interested in getting them back, I would not start bothering with it. One way
> to "fix" it is to do a bsdlabel -e and change
>   c: 285153687         0    unused        0     0
> to
>   c: 285155328         0    unused        0     0
>
> But again, not many sectors are currently wasted.
>
> --
> Ulf Lilleengen

From koitsu at FreeBSD.org Tue Oct 21 10:15:15 2008
From: koitsu at FreeBSD.org (Jeremy Chadwick)
Date: Tue Oct 21 10:15:22 2008
Subject: bsdlabel partiton c error message on new install
In-Reply-To:
References: <20081021083415.GA1571@carrot.studby.ntnu.no>
Message-ID: <20081021095913.GA26955@icarus.home.lan>

On Tue, Oct 21, 2008 at 10:42:06AM +0100, andys wrote:
> Hi Ulf,
>
> thanks a lot for your answer. Previously I'd asked this question on the
> freebsd-questions list and someone suggested asking it here as they didn't
> know the answer; however, I did get pretty much 2 responses telling me to
> reinstall the OS!! :S
>
> For example I had this answer:
>
> http://lists.freebsd.org/pipermail/freebsd-questions/2008-October/184617.html
>
> So I assume you would disagree with this and the other person who advised
> me this was a serious error? And if this actually isn't a problem, does
> bsdlabel need to be updated (and the man page) to reflect the fact this
> can be seen on a healthy system?

Part of the problem is that you're "tinkering with bsdlabel" when most
users simply create slices and partitions and don't bother to look at
the results -- they build it all, install, and don't worry about it.
I'm sure if I ran bsdlabel and saw what you did, I'd be concerned too, so
you did the right thing by asking.

All the systems I maintain have the c slice offset at zero, but Ulf's
explanation makes perfect sense. (I believe even Windows does something
similar to this, except it leaves the leftovers at the end of the
partition table for alignment.)

Comparatively, there's the silly "cylinder geometry" warning that
sysinstall spits out prior to launching into slice manipulation. It's
silly in the majority of cases, but apparently it's legitimate when it
comes to older/smaller disks, particularly SCSI.

That said, you should see the look on Linux users' faces when they see
it -- a look of fear, followed by someone saying "You can ignore that",
followed by "...then what the hell is the point of printing it?!!" :-)

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

From lulf at stud.ntnu.no Tue Oct 21 12:14:12 2008
From: lulf at stud.ntnu.no (Ulf Lilleengen)
Date: Tue Oct 21 12:14:19 2008
Subject: bsdlabel partiton c error message on new install
In-Reply-To:
References: <20081021083415.GA1571@carrot.studby.ntnu.no>
Message-ID: <20081021121332.GA2280@carrot.studby.ntnu.no>

On Tue, Oct 21, 2008 at 10:42:06am +0100, andys wrote:
> Hi Ulf,
>
> thanks a lot for your answer. Previously I'd asked this question on the
> freebsd-questions list and someone suggested asking it here as they didn't
> know the answer; however, I did get pretty much 2 responses telling me to
> reinstall the OS!! :S
>
> For example I had this answer:
>
> http://lists.freebsd.org/pipermail/freebsd-questions/2008-October/184617.html
>
> So I assume you would disagree with this and the other person who advised me
> this was a serious error? And if this actually isn't a problem, does bsdlabel
> need to be updated (and the man page) to reflect the fact this can be seen
> on a healthy system?
>

Well, this really depends on whether you did this on purpose or whether it
was created this way by fdisk. There are many factors which can influence
this, since it is not necessarily something done by the system itself (I
really doubt that the label size will change without the user being notified
if the user did not make any such request, but then again, perhaps it should
be checked in the utilities).

If the disklabels were shortened after file system creation and the
filesystem really expects a larger label, you might be in trouble (when you
require the filesystem to do something involving the provider size, which in
this case might be a slice which has changed size, you might have trouble).

You should perhaps do some testing with fsck and see if it complains. Or
perhaps find out the real size that UFS expects (I am not really sure how).
If it is the same as the current label size, you are safe.

--
Ulf Lilleengen

From a.smith at ukgrid.net Tue Oct 21 14:18:35 2008
From: a.smith at ukgrid.net (andys)
Date: Tue Oct 21 14:18:42 2008
Subject: bsdlabel partiton c error message on new install
In-Reply-To: <20081021121332.GA2280@carrot.studby.ntnu.no>
References: <20081021083415.GA1571@carrot.studby.ntnu.no>
	<20081021121332.GA2280@carrot.studby.ntnu.no>
Message-ID:

Hi,

ok, so I have attempted to proceed with my original task which was to
create a new UFS2 partition (using sysinstall). Having chosen "c" and then
"w" from the label section, I receive the following error:

Error mounting /dev/da0s1g on /export : No such file or directory

After exiting sysinstall, I can see from bsdlabel:

8 partitions:
#         size    offset    fstype   [fsize bsize bps/cpg]
  a:  20971520         0    4.2BSD        0     0     0
  b:  20971520  75497472      swap
  c: 285153687         0    unused        0     0         # "raw" part, don't edit
  d:  20971520  20971520    4.2BSD        0     0     0
  e:  20971520  41943040    4.2BSD        0     0     0
  f:  12582912  62914560    4.2BSD        0     0     0
  g: 146800640  96468992    4.2BSD        0     0     0
bsdlabel: partition c doesn't cover the whole unit!

"g" is my new partition. Under /dev however I don't see the device file:

ls /dev/da0*
/dev/da0        /dev/da0s1a     /dev/da0s1c     /dev/da0s1e
/dev/da0s1      /dev/da0s1b     /dev/da0s1d     /dev/da0s1f

Can anyone help :(

thanks a lot,
Andy.
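
The sizes quoted in this thread can be checked with a little arithmetic. The following is only a minimal sketch, not part of bsdlabel or sysinstall: the helper names `uncovered_sectors` and `layout_problems` are made up for illustration, and the figures are copied from the label output above under the assumption that every value is a count of 512-byte sectors.

```python
# Sizes are in 512-byte sectors, copied from the bsdlabel output in this thread.
SECTOR_BYTES = 512
UNIT_SECTORS = 285155328   # "sectors/unit" reported for da0s1
C_SECTORS = 285153687      # size of partition c in the label

# (offset, size) for each partition, taken from the label after "g" was added.
PARTITIONS = {
    "a": (0, 20971520),
    "b": (75497472, 20971520),
    "d": (20971520, 20971520),
    "e": (41943040, 20971520),
    "f": (62914560, 12582912),
    "g": (96468992, 146800640),
}


def uncovered_sectors(unit_sectors, c_sectors):
    """Sectors of the unit that partition c does not cover."""
    return unit_sectors - c_sectors


def layout_problems(partitions, c_sectors):
    """Return a list of problems: overlapping partitions or ones ending past c."""
    problems = []
    previous_name, previous_end = None, 0
    # Walk the partitions in order of their starting offset.
    for name, (offset, size) in sorted(partitions.items(), key=lambda item: item[1][0]):
        end = offset + size
        if previous_name is not None and offset < previous_end:
            problems.append(f"{name} overlaps {previous_name}")
        if end > c_sectors:
            problems.append(f"{name} ends past partition c")
        if end > previous_end:
            previous_name, previous_end = name, end
    return problems


if __name__ == "__main__":
    gap = uncovered_sectors(UNIT_SECTORS, C_SECTORS)
    print(f"uncovered: {gap} sectors ({gap * SECTOR_BYTES} bytes)")
    print("problems:", layout_problems(PARTITIONS, C_SECTORS) or "none")
```

With the numbers from the label this reports 1641 uncovered sectors (840192 bytes, roughly 820 kB) and no overlapping or out-of-range partitions, which agrees with the remark earlier in the thread that only a few sectors are wasted.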
From numisemis at yahoo.com Wed Oct 22 11:23:34 2008 From: numisemis at yahoo.com (Simun Mikecin) Date: Wed Oct 22 11:24:10 2008 Subject: (no subject) Message-ID: <207982.25549.qm@web36608.mail.mud.yahoo.com> > I will soon be installing an Areca ARC-1110 and 3x 1.5TB Seagate > Barracuda SATAs into a 3.2GHz Northwood P4 with 1GB of RAM, and I'm > wondering which would be the most stable filesystem to use. > I've read the bigdisk page [1] and the various information about ZFS on > the FreeBSD Wiki [2]. I'm aware of the tuning requirements that ZFS > requires, and upgrading to 4GB of RAM would be quite possible as it was > understood beforehand that ZFS requires a large quantity of it. > My questions are as follows. > 1. I'm aware of the fact that ZFS works better on 64-bit platforms, and > that alone has me thinking that it's not a good fit for this particular > machine. But apart from that, it seems that ZFS is not yet stable > enough for my environment (only about 25 users but in production > nonetheless). To me, [3] paints all sorts of ugly pictures, which can > be summarized as "count on ZFS-related panics and deadlocks happening > fairly regularly" and "disabling ZIL in the interest of stability will > put your data at risk." Comments about live systems using ZFS (on > 7.0-RELEASE or 7-STABLE) would be appreciated. I'm using 7.0-RELEASE/amd64 with ZFS on several machines without any stability problems. 
Here are the configs (prefetch is disabled for performance reasons):

- with 1GB RAM (probably with just 1GB RAM the system would be faster using
  UFS2 instead of ZFS):
  vfs.zfs.prefetch_disable=1
  vm.kmem_size="512M"
  vfs.zfs.arc_max="150M" (at first it was 200M, but lots of swapping made me
  reduce it)

- with 2GB RAM:
  vfs.zfs.prefetch_disable=1
  vm.kmem_size="950M" (could be higher, even 1536M, but then there is not
  much RAM left for your apps)

- with 8GB RAM:
  vfs.zfs.prefetch_disable=1
  vm.kmem_size="1536M" (this could probably be higher: up to 2047M, but I
  haven't tried it).

The general rule to make it stable is to make the difference between
vm.kmem_size and vfs.zfs.arc_max larger. vfs.zfs.arc_max is by default 3/4 of
vm.kmem_size. You can achieve this by making vm.kmem_size bigger (but this
leaves less memory for your applications) or by reducing vfs.zfs.arc_max (but
this reduces performance, since less memory will be available for caching).
The stability problem comes when kmem usage is at its peak. arc_max is only a
threshold: once the cache grows past it, some of it gets evicted. But in some
cases (high I/O activity) the cache grows faster than the thread that shrinks
it (back to a size less than arc_max) can evict.

> 2. [1] appears to be a bit dated. Nevertheless, I'm inclined to think
> that the status described there (as well as in various man pages) still
> applies to UFS2 on 7.0-RELEASE. Please correct me if I'm wrong or let
> me know if the state of affairs has improved significantly in 7-STABLE.
> 2a. Does the information contained in [1] apply to ZFS as well?

[1] is outdated. GEOM, GPT, UFS2 and ZFS are safe to use for many hundreds of
terabytes. What is limited is the MBR partitioning used by fdisk (2TB limit).

> 3. As the array will be for data only and not be booted, will it be
> possible to use fdisk to slice it up, or will I need to use gpt?

fdisk can be used for slicing disks that are up to 2TB. But I would recommend
using GPT (which doesn't have this limit) instead. There is no reason not to.
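
The 2TB figure for fdisk follows from the MBR format, where each slice entry stores its size as a 32-bit sector count. As a rough check, here is a minimal sketch assuming 512-byte sectors; the function names are illustrative only and do not correspond to any FreeBSD tool.

```python
SECTOR_BYTES = 512  # assumption: the disk exposes 512-byte sectors


def mbr_max_bytes(sector_bytes=SECTOR_BYTES):
    """Largest size an MBR slice entry can describe (32-bit sector count)."""
    return (2**32 - 1) * sector_bytes


def fits_in_mbr_slice(size_bytes, sector_bytes=SECTOR_BYTES):
    """True if a volume of size_bytes can be described by one MBR slice."""
    return size_bytes <= mbr_max_bytes(sector_bytes)


if __name__ == "__main__":
    print("MBR slice limit in bytes:", mbr_max_bytes())
    print("single 1.5TB disk fits:", fits_in_mbr_slice(1500 * 10**9))
    print("3TB array fits:", fits_in_mbr_slice(3 * 10**12))
```

Under that assumption a single 1.5TB disk still fits in one MBR slice, while the 3TB array from the question does not, which is why GPT is the safer choice here.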
> 4. My planned course of action will be to attempt to newfs the device > itself (da0, all 3TB of it) or 1 full-disk slice (da0s1). Failing that, > I will attempt to gconcat da0s1 and da0s2 (1.5TB each), although I > suspect that may not work since for one thing, growfs is not yet 64-bit > clean. In either case, I'm very interested in using gbde/geli to > encrypt the fs. If either of these paths are not possible or > recommended, are there any suggestions for alternate means of creating a > 3TB fs? If you will go for the ZFS route instead of UFS2 then don't make one logical array from your disks (da0) in your RAID controller, but instead make one logical array to be one phisical disk (so you have da0, da1 and da2) so you can use ZFS RAID functionality instead. If you are going to use UFS2 then you must use gjournal for disks of that size (or you will have background fscks that can last for ages and die a horrible death saying not enough memory). From a.smith at ukgrid.net Thu Oct 23 11:27:47 2008 From: a.smith at ukgrid.net (andys) Date: Thu Oct 23 11:27:54 2008 Subject: bsdlabel partiton c error message on new install Message-ID: Hi, the below was resolved by rebooting the server. After a reboot the device file /dev/da0s1g has been created, however this doesnt seem completely normal as sysinstall obviously expected to see the new device file immediately. Perhaps there is a prob with my system or is there just a problem with the expectations of sysinstall?? :S cheers Andy. andys writes: > Hi, > > ok, so I have attempted to proceed with my original task which was to > create a new UFS2 parition (using sysinstall). 
Having chosen "c" and then "w" from the label section, I receive the following error:
>
> Error mounting /dev/da0s1g on /export : No such file or directory
>
> After exiting sysinstall, I can see from bsdlabel:
>
> 8 partitions:
> #        size    offset   fstype  [fsize bsize bps/cpg]
>   a:  20971520         0   4.2BSD       0     0     0
>   b:  20971520  75497472     swap
>   c: 285153687         0   unused       0     0        # "raw" part, don't edit
>   d:  20971520  20971520   4.2BSD       0     0     0
>   e:  20971520  41943040   4.2BSD       0     0     0
>   f:  12582912  62914560   4.2BSD       0     0     0
>   g: 146800640  96468992   4.2BSD       0     0     0
> bsdlabel: partition c doesn't cover the whole unit!
>
> "g" is my new partition. Under /dev, however, I don't see the device file:
>
> ls /dev/da0*
> /dev/da0    /dev/da0s1a  /dev/da0s1c  /dev/da0s1e
> /dev/da0s1  /dev/da0s1b  /dev/da0s1d  /dev/da0s1f
>
> Can anyone help :(
>
> thanks a lot,
> Andy.

From thierry at herbelot.com Fri Oct 24 16:36:59 2008
From: thierry at herbelot.com (Thierry Herbelot)
Date: Fri Oct 24 16:37:06 2008
Subject: question about sb->st_blksize in src/sys/kern/vfs_vnops.c
Message-ID: <200810241818.37262.thierry@herbelot.com>

Hello,

the [SUBJ] file contains the following extract (around line 705):

 * Default to PAGE_SIZE after much discussion.
 * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct.
 */

sb->st_blksize = PAGE_SIZE;

which arrived around four years ago, with revision 1.211 (see http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)

The net effect of this change is to decrease the block buffer size used in libc/stdio from 16 kbytes (derived from the underlying ufs partition) to PAGE_SIZE == 4 kbytes (fixed value), and consequently the I/O bandwidth is lowered (this is on a slow Flash).

I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE, to revert to the block size previously used), and the kernel and world seem to be running fine.

Seeing the XXX comment above, I'm a bit worried about keeping this new st_blksize value.
Are there any drawbacks with running with this bigger buffer size value?

TfH

From glz at hidden-powers.com Sat Oct 25 08:35:41 2008
From: glz at hidden-powers.com (Goran Lowkrantz)
Date: Sat Oct 25 08:35:47 2008
Subject: Whatever happened to autofs?
Message-ID: <2B8727C356B2422602B9425C@[10.255.253.2]>

Found this:

Anyone know what happened with autofs?

/glz
--- Never attribute to malice what can adequately be explained by incompetence.

From thierry.herbelot at laposte.net Sat Oct 25 15:05:45 2008
From: thierry.herbelot at laposte.net (Thierry Herbelot)
Date: Sat Oct 25 15:05:58 2008
Subject: question about sb->st_blksize in src/sys/kern/vfs_vnops.c
In-Reply-To: <20081025203549.C76165@delplex.bde.org>
References: <200810241818.37262.thierry@herbelot.com> <20081025203549.C76165@delplex.bde.org>
Message-ID: <200810251639.17586.thierry.herbelot@laposte.net>

On Saturday 25 October 2008, Bruce Evans wrote:
> On Fri, 24 Oct 2008, Thierry Herbelot wrote:
> > the [SUBJ] file contains the following extract (around line 705) :
> >
> > * Default to PAGE_SIZE after much discussion.
> > * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct.
> > */
> >
> > sb->st_blksize = PAGE_SIZE;
> >
> > which arrived around four years ago, with revision 1.211 (see
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)
>
> Indeed, this was completely broken long ago (in 1.211). Before then, and
> after 1.128, some cases worked as intended if not perfectly:
> - regular files: file systems still set va_blksize to their idea of the
> best i/o size (normally to the file system block size, which is
> normally larger than PAGE_SIZE and probably better in all cases) and
> this was used here. However, for regular files, the fs block size
> and the application's i/o size are almost irrelevant in most cases
> due to vfs clustering.
Most large i/o's are done physically with > the cluster size (which due to a related bug suite ends up being > hard-coded to MAXPHYS (128K) at a minor cost when this is different > from the best size). > - disk files: non-broken device drivers set si_iosize_best to their idea > of the best i/o size (normally to the max i/o size, which is normally > better than PAGE_SIZE) and this was used here. The bogus default > of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it > was for the buffer cache implementation for block devices which no > longer exist and was too small for them anyway). > - non-disk character-special files: the default of PAGE_SIZE was used. > The comment about defaulting to PAGE_SIZE was added in 1.128 and is > mainly for this case. Now the comment is nonsense since the value is > fixed, not a default. > - other file types (fifos, pipes, sockets, ...): these got the default of > PAGE_SIZE too. > > In rev.1.1, st_blksize was set to va_blksize in all cases. So file systems > were supposed to set va_blksize reasonably in all cases, but this is not > easy and they did nothing good except for regular files. agreed, anyway the comment by phk about using ioctl(DIOCGSECTORSIZE) applies. > > Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS > (64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for > disks. This gave nonsense like 64K buffers for slow tty devices (keyboards) > and 2K buffers for fast disks. At least for programs that trust st_blksize > o be reasonable. Fortunately, st_blsize is rarely used... > > > the net effect of this change is to decrease the block buffer size used > > in libc/stdio from 16 kbytes (derived from the underlying ufs partition) > > to PAGE_SIZE ==4 kbytes (fixed value), and consequently the I/O bandwidth > > is lowered (this is on a slow Flash). > > ... except it is used by stdio. (Another mess here is that stdio mostly > doesn't use its own BUFSIZ. 
It trusts st_blksize if fstat() to determine This is indeed what I saw, meandering between the libc and the vfs part of the kernel. In fact, I was essentially wondering if st_blksize was used *elsewhere*, and bumping the value could break some memory allocation ... > st_blksize works. Of course, the existence of BUFSIZ is a related > historical mistake -- no fixed size can work best for all cases. But > when BUFSIZ is used, it is an even worse default than PAGE_SIZE.) (as it is even smaller ?) > > It's interesting that you can see the difference. Clustering is especially > good for hiding slowness on slow devices. Maybe you are using a > configuration that makes clustering ineffective. Mounting the file system > with -o sync or equivalently, doing a sync after every (too-small) write > would do it. Otherwise, writes are normally delated until the next cluster > boundary. My use case is for small (buffered) writes to a file between 4 kbytes and 16 16 kbytes. For example, writing a 16-kbyte file with a st_blksize of 4k is twice as slow as with 16k (220 ms compared to 110). The penalty is less for 8k-byte (105 ms vs 66). > > > I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE, > > to revert to the block size previoulsly used), and the kernel and world > > seem to be running fine. > > > > Seeing the XXX coment above, I'm a bit worried about keeping this new > > st_blksize value. > > > > are there any drawbacks with running with this bigger buffer size value ? > > Mostly it doesn't matter, since buffering (clustering) hides the > differences. (as seen before, mostly) > Without clustering, 16K is a much better default for disks > than 4K, though not as good as the non-default va_blksize for regular > files. Newer disks might prefer 32K or 64k, but then the fs block size > should also be increased from 16K. Otherwise, increasing the block size > usually reduces performance, by thrashing caches or increasing latencies. 
> With modern cache sizes and disk speeds, you won't see these effects for a > block size of 64K, so defaulting to 64K would be reasonable for disks. It > would be silly for keyboards, but with modern memory sizes you would notice > this even less than when it was that in old versions. OK, thanks for the answer : I will submit the change to more stress tests and hope to shake it all before putting it to production. TfH > > Bruce From brde at optusnet.com.au Sat Oct 25 19:46:24 2008 From: brde at optusnet.com.au (Bruce Evans) Date: Sat Oct 25 19:46:40 2008 Subject: question about sb->st_blksize in src/sys/kern/vfs_vnops.c In-Reply-To: <200810241818.37262.thierry@herbelot.com> References: <200810241818.37262.thierry@herbelot.com> Message-ID: <20081025203549.C76165@delplex.bde.org> On Fri, 24 Oct 2008, Thierry Herbelot wrote: > the [SUBJ] file contains the following extract (around line 705) : > > * Default to PAGE_SIZE after much discussion. > * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct. > */ > > sb->st_blksize = PAGE_SIZE; > > which arrived around four years ago, with revision 1.211 (see > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h) Indeed, this was completely broken long ago (in 1.211). Before then, and after 1.128, some cases worked as intended if not perfectly: - regular files: file systems still set va_blksize to their idea of the best i/o size (normally to the file system block size, which is normally larger than PAGE_SIZE and probably better in all cases) and this was used here. However, for regular files, the fs block size and the application's i/o size are almost irrelevant in most cases due to vfs clustering. Most large i/o's are done physically with the cluster size (which due to a related bug suite ends up being hard-coded to MAXPHYS (128K) at a minor cost when this is different from the best size). 
- disk files: non-broken device drivers set si_iosize_best to their idea of the best i/o size (normally to the max i/o size, which is normally better than PAGE_SIZE) and this was used here. The bogus default of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it was for the buffer cache implementation for block devices which no longer exist and was too small for them anyway). - non-disk character-special files: the default of PAGE_SIZE was used. The comment about defaulting to PAGE_SIZE was added in 1.128 and is mainly for this case. Now the comment is nonsense since the value is fixed, not a default. - other file types (fifos, pipes, sockets, ...): these got the default of PAGE_SIZE too. In rev.1.1, st_blksize was set to va_blksize in all cases. So file systems were supposed to set va_blksize reasonably in all cases, but this is not easy and they did nothing good except for regular files. Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS (64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for disks. This gave nonsense like 64K buffers for slow tty devices (keyboards) and 2K buffers for fast disks. At least for programs that trust st_blksize o be reasonable. Fortunately, st_blsize is rarely used... > the net effect of this change is to decrease the block buffer size used in > libc/stdio from 16 kbytes (derived from the underlying ufs partition) to > PAGE_SIZE ==4 kbytes (fixed value), and consequently the I/O bandwidth is > lowered (this is on a slow Flash). ... except it is used by stdio. (Another mess here is that stdio mostly doesn't use its own BUFSIZ. It trusts st_blksize if fstat() to determine st_blksize works. Of course, the existence of BUFSIZ is a related historical mistake -- no fixed size can work best for all cases. But when BUFSIZ is used, it is an even worse default than PAGE_SIZE.) It's interesting that you can see the difference. 
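For readers who want to see what their own system reports, here is a minimal sketch, written in Python rather than C for brevity. It reads st_blksize for a file and writes a file through a buffer of an explicitly chosen size, which is the userland analogue of overriding the size stdio would otherwise take from fstat(). The helper names are hypothetical, and it assumes a Unix-like system where os.stat() exposes st_blksize.

```python
import os
import tempfile


def preferred_io_size(path):
    """Return st_blksize for path: the kernel's hint for an efficient I/O size."""
    return os.stat(path).st_blksize


def write_with_buffer(path, payload, bufsize):
    """Write payload through a buffer of bufsize bytes; return the file size."""
    # buffering > 1 on a binary file sets the size of the userland buffer.
    with open(path, "wb", buffering=bufsize) as f:
        f.write(payload)
    return os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sample.bin")
        written = write_with_buffer(target, b"\0" * 16384, 16384)
        print("wrote", written, "bytes; st_blksize is", preferred_io_size(target))
```

Timing the write with different bufsize values on the slow device would be one way to reproduce the 4k versus 16k comparison without patching the kernel.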
Clustering is especially good for hiding slowness on slow devices. Maybe you are using a configuration that makes clustering ineffective. Mounting the file system with -o sync or equivalently, doing a sync after every (too-small) write would do it. Otherwise, writes are normally delated until the next cluster boundary. > I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE, to > revert to the block size previoulsly used), and the kernel and world seem to > be running fine. > > Seeing the XXX coment above, I'm a bit worried about keeping this new > st_blksize value. > > are there any drawbacks with running with this bigger buffer size value ? Mostly it doesn't matter, since buffering (clustering) hides the differences. Without clustering, 16K is a much better default for disks than 4K, though not as good as the non-default va_blksize for regular files. Newer disks might prefer 32K or 64k, but then the fs block size should also be increased from 16K. Otherwise, increasing the block size usually reduces performance, by thrashing caches or increasing latencies. With modern cache sizes and disk speeds, you won't see these effects for a block size of 64K, so defaulting to 64K would be reasonable for disks. It would be silly for keyboards, but with modern memory sizes you would notice this even less than when it was that in old versions. Bruce From rec at RCousins.com Sun Oct 26 19:04:20 2008 From: rec at RCousins.com (Robert Cousins) Date: Sun Oct 26 19:04:26 2008 Subject: ZFS extended attributes Message-ID: <4904B998.1030107@RCousins.com> I was trying to run a program over ZFS which runs under UFS. After some tracking, I found that the program was using extended file attributes under UFS2 and that these are not supported under ZFS. Is there a plan to support extended attributes under ZFS? 
From koitsu at FreeBSD.org Sun Oct 26 19:06:06 2008 From: koitsu at FreeBSD.org (Jeremy Chadwick) Date: Sun Oct 26 19:06:14 2008 Subject: ZFS extended attributes In-Reply-To: <4904B998.1030107@RCousins.com> References: <4904B998.1030107@RCousins.com> Message-ID: <20081026190604.GA1748@icarus.home.lan> On Sun, Oct 26, 2008 at 11:40:24AM -0700, Robert Cousins wrote: > I was trying to run a program over ZFS which runs under UFS. After some > tracking, I found that the program was using extended file attributes > under UFS2 and that these are not supported under ZFS. > > Is there a plan to support extended attributes under ZFS? http://wiki.freebsd.org/ZFS See "extattr" in the table near the bottom. Talk to pjd@ if you're interested in helping. :-) -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From gamato at users.sf.net Sun Oct 26 23:15:12 2008 From: gamato at users.sf.net (martinko) Date: Sun Oct 26 23:15:20 2008 Subject: journaling filesystem In-Reply-To: <6eb82e0808030743hc41c68bgd0c5121daba95d42@mail.gmail.com> References: <6eb82e0808030743hc41c68bgd0c5121daba95d42@mail.gmail.com> Message-ID: Rong-en Fan wrote: > In NetBSD, they now have metadata journaling support, see > > http://www.netbsd.org/changes/#wapbl > > I'm not a fs guru, I just want to know what are the status of > BluFFS and UFS journaling support which were mentioned > in recent years. > > Thanks, > Rong-En Fan This is very interesting! I can imagine this would be the way for systems where ZFS is not an option but SU are reaching their limits. 
From bugmaster at FreeBSD.org Mon Oct 27 11:07:12 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Oct 27 11:07:55 2008
Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org
Message-ID: <200810271107.m9RB7B8G001930@freefall.freebsd.org>

Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases.

S Tracker      Resp. Description
--------------------------------------------------------------------------------
o kern/128173  fs    [ext2fs] ls gives "Input/output error" on mounted ext3
o kern/127420  fs    [gjournal] [panic] Journal overflow on gmirrored gjour
o kern/127213  fs    [tmpfs] sendfile on tmpfs data corruption
o kern/127029  fs    [panic] mount(8): trying to mount a write protected zi
o kern/126287  fs    [ufs] [panic] Kernel panics while mounting an UFS file
o kern/125536  fs    [ext2fs] ext 2 mounts cleanly but fails on commands li
o kern/125149  fs    [nfs][panic] changing into .zfs dir from nfs client ca
o kern/124621  fs    [ext3] Cannot mount ext2fs partition
o kern/122888  fs    [zfs] zfs hang w/ prefetch on, zil off while running t
o bin/122172   fs    [fs]: amd(8) automount daemon dies on 6.3-STABLE i386,
o bin/121072   fs    [smbfs] mount_smbfs(8) cannot normally convert the cha
a kern/119868  fs    [zfs] [patch] 7.0 kernel panic during boot with ZFS an
o bin/118249   fs    mv(1): moving a directory changes its mtime
o kern/116170  fs    [panic] Kernel panic when mounting /tmp
o kern/114955  fs    [cd9660] [patch] [request] support for mask,dirmask,ui
o kern/114847  fs    [ntfs] [patch] [request] dirmask support for NTFS ala
o kern/114676  fs    [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o bin/114468   fs    [patch] [request] add -d option to umount(8) to detach
o bin/113838   fs    [patch] [request] mount(8): add support for relative p
o bin/113049   fs    [patch] [request] make quot(8) use getopt(3) and show
o kern/112658  fs    [smbfs] [patch] smbfs and caching problems (resolves b
o kern/93942   fs    [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D

22 problems total.

From olli at lurza.secnetix.de Mon Oct 27 13:43:46 2008
From: olli at lurza.secnetix.de (Oliver Fromme)
Date: Mon Oct 27 13:43:53 2008
Subject: journaling filesystem
In-Reply-To:
Message-ID: <200810271343.m9RDhfNQ013868@lurza.secnetix.de>

martinko wrote:
> Rong-en Fan wrote:
> > In NetBSD, they now have metadata journaling support, see
> >
> > http://www.netbsd.org/changes/#wapbl
> >
> > I'm not a fs guru, I just want to know what are the status of
> > BluFFS and UFS journaling support which were mentioned
> > in recent years.
>
> This is very interesting! I can imagine this would be the way for
> systems where ZFS is not an option but SU are reaching their limits.

Have you had a look at gjournal(8)?

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung: secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht München, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"C++ is to C as Lung Cancer is to Lung."
-- Thomas Funke From grafan at gmail.com Mon Oct 27 15:09:45 2008 From: grafan at gmail.com (Rong-en Fan) Date: Mon Oct 27 15:09:52 2008 Subject: journaling filesystem In-Reply-To: <200810271343.m9RDhfNQ013868@lurza.secnetix.de> References: <200810271343.m9RDhfNQ013868@lurza.secnetix.de> Message-ID: <6eb82e0810270745w4444b735sca369c0b1d724b48@mail.gmail.com> On Mon, Oct 27, 2008 at 9:43 PM, Oliver Fromme wrote: > martinko wrote: > > Rong-en Fan wrote: > > > In NetBSD, they now have metadata journaling support, see > > > > > > http://www.netbsd.org/changes/#wapbl > > > > > > I'm not a fs guru, I just want to know what are the status of > > > BluFFS and UFS journaling support which were mentioned > > > in recent years. > > > > This is very interesting! I can imagine this would be the way for > > systems where ZFS is not an option but SU are reaching their limits. > > Have you had a look at gjournal(8)? Actually, gjournal has the problem with fast write load. It panics if the journal overflows... I was told by pjd@ that if I have been played with the sysctls and it still overflows, then there is nothing we can do... Regards, Rong-En Fan From 000.fbsd at quip.cz Mon Oct 27 16:00:32 2008 From: 000.fbsd at quip.cz (Miroslav Lachman) Date: Mon Oct 27 16:00:39 2008 Subject: journaling filesystem In-Reply-To: <200810271343.m9RDhfNQ013868@lurza.secnetix.de> References: <200810271343.m9RDhfNQ013868@lurza.secnetix.de> Message-ID: <4905E5BD.2020809@quip.cz> Oliver Fromme wrote: > martinko wrote: > > Rong-en Fan wrote: > > > In NetBSD, they now have metadata journaling support, see > > > > > > http://www.netbsd.org/changes/#wapbl > > > > > > I'm not a fs guru, I just want to know what are the status of > > > BluFFS and UFS journaling support which were mentioned > > > in recent years. > > > > This is very interesting! I can imagine this would be the way for > > systems where ZFS is not an option but SU are reaching their limits. > > Have you had a look at gjournal(8)? 
I played with a gjournal, it is simple and working, but not usable where performance matters. A write performance is degraded to about half of a performance without gjournal (with a gjournal on top of a gmirror it is even more slower) Miroslav Lachman From numisemis at yahoo.com Mon Oct 27 16:15:18 2008 From: numisemis at yahoo.com (Simun Mikecin) Date: Mon Oct 27 16:58:52 2008 Subject: journaling filesystem Message-ID: <169608.99606.qm@web36601.mail.mud.yahoo.com> "Rong-en Fan" wrote: >Actually, gjournal has the problem with fast write load. It panics if the >journal overflows... I was told by pjd@ that if I have been played with the >sysctls and it still overflows, then there is nothing we can do... pjd@ should answer this, but AFAIK journal size should be at least 2 * switch_time * transfer_rate where transfer_rate is your HDD bandwidth in MB/s. So for a fictional HDD that has 150MB/s (most real HDDs are much lower than this) and the default switch_time (which is 10) that would be: 2 * 10 * 150 = 3000MB for the journal minimum. I'm sure that increasing journal even more (or reducing switch_time) should make it stable (if it already isn't). From lulf at stud.ntnu.no Mon Oct 27 17:03:46 2008 From: lulf at stud.ntnu.no (Ulf Lilleengen) Date: Mon Oct 27 17:03:54 2008 Subject: ZFS extended attributes In-Reply-To: <4904B998.1030107@RCousins.com> References: <4904B998.1030107@RCousins.com> Message-ID: <20081027170343.GA11687@nobby.lan> On Sun, Oct 26, 2008 at 11:40:24AM -0700, Robert Cousins wrote: > I was trying to run a program over ZFS which runs under UFS. After some > tracking, I found that the program was using extended file attributes > under UFS2 and that these are not supported under ZFS. > > Is there a plan to support extended attributes under ZFS? The perforce version that will commited sometime in the future supports extended attributes. 
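Returning to the gjournal sizing rule of thumb from Simun Mikecin's message above (journal size of at least 2 * switch_time * transfer_rate), the arithmetic can be sketched as follows. This only restates the rule as quoted in the thread; the function name is made up, and treating switch_time as seconds is an assumption.

```python
def min_journal_mb(transfer_rate_mb_s, switch_time=10):
    """Minimum journal size in MB according to the rule of thumb quoted above.

    transfer_rate_mb_s: sustained disk bandwidth in MB/s.
    switch_time: journal switch interval (default 10, as stated in the thread;
                 assumed here to be in seconds).
    """
    return 2 * switch_time * transfer_rate_mb_s


if __name__ == "__main__":
    # The fictional 150MB/s disk from the message above.
    print(min_journal_mb(150), "MB")
    # Halving switch_time halves the required journal size.
    print(min_journal_mb(150, switch_time=5), "MB")
```

As the message notes, this is a lower bound; a larger journal or a shorter switch_time leaves more headroom under bursty write load.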
-- Ulf Lilleengen From k0802647 at telus.net Thu Oct 30 07:32:57 2008 From: k0802647 at telus.net (Carl) Date: Thu Oct 30 07:33:03 2008 Subject: gmirror with only some partitions gjournal'd, autosync setting? Message-ID: <49095B24.6010007@telus.net> I've built a GEOM mirror on a single slice of a single disk and am currently inserting the second disk. Of the partitions in the mirror, I made only a few of them gjournal'd. From this thread, I understand that auto-synchronization is unnecessary for gjournal on top of gmirror: http://lists.freebsd.org/pipermail/freebsd-hackers/2007-January/019276.html I'm thinking my non-journaled partitions (the ones too small to be journaled) still need to be sync'd after crashes, so how can I enable auto-synch for those partitions and not the ones that are journaled? What is the correct thing to do here? Carl / K0802647 From koitsu at FreeBSD.org Thu Oct 30 20:32:10 2008 From: koitsu at FreeBSD.org (Jeremy Chadwick) Date: Thu Oct 30 20:32:17 2008 Subject: Areca vs. ZFS performance testing. In-Reply-To: <490A782F.9060406@dannysplace.net> References: <490A782F.9060406@dannysplace.net> Message-ID: <20081031033208.GA21220@icarus.home.lan> Cross-posting this to freebsd-fs, as I'm sure people there will have other recommendations. (This is one of those rare cross-posting situations.....) On Fri, Oct 31, 2008 at 01:14:55PM +1000, Danny Carroll wrote: > I've just become the proud new owner of an Areca 1231-ML which I plan to > use to set up an office server. > > I'm very curious as to how ZFS compares to a hardware solution so I plan > to run some tests before I put this thing to work. > > The purpose of this email is to find out if anyone would like to see > specific things tested as well as perhaps get some advice on how to get > the most information out of the tests. > > My setup: > Supermicro X7SBE board with 2Gb ram and an E6550 Core 2 Duo. > FreeBSD 7.0-Stable compiled with amd64 sources from mid August. 
> 1 x ST9120822AS 120gb disk (for the OS) > For the array(s) > 9 x ST31000340AS 1tb disks > 1 x ST31000333AS 1tb disk (trying to swap this for a ST31000340AS) > > My thoughts are to do the following tests with bonnie++: > 1 5 disk Areca Raid5 > 2 5 Disk ZFS RaidZ1 (Connected to Areca in JBOD mode) > 3 5 Disk ZFS RaidZ1 (Connected to ICH9 On board SATA controller) > 4 5 disk Areca Raid6 > 5 5 Disk ZFS RaidZ2 (Connected to Areca in JBOD mode) > 6 5 Disk ZFS RaidZ2 (Connected to ICH9 On board SATA controller) > 7 10 disk Areca Raid5 > 8 10 Disk ZFS RaidZ1 (Connected to Areca in JBOD mode) > 9 10 disk Areca Raid6 > 10 10 Disk ZFS RaidZ2 (Connected to Areca in JBOD mode) > > My aim is to see what sort of performance gain you get by buying an > Areca card for use in JBOD as well as seeing how ZFS compares to the > hardware solution which offers write caching etc. I'm really only > interested in testing ZFS's volume management performance, so for that > reason I will also put ZFS on the Areca Raid drives. Not sure if it's a > good idea to create 2 Raid drives and stripe them or simply use 1 large > disk and give it to ZFS. > > Any thoughts on this setup as well as advice on what options to give to > bonnie++ (or suggestions on another disk testing package) are very welcome. I think these sets of tests are good. There are some others I'd like to see, but they'd only be applicable if the 1231-ML has hardware cache. I can mention what those are if the card does have hardware caching. > I do have some concern about the size of the eventual array and ZFS' use > of system memory. Are there guidelines available that give advice on > how much memory a box should have with large ZFS arrays? The general concept is: "the more RAM the better". However, if you're using RELENG_7, then there's not much point (speaking solely about ZFS) to getting more than maybe 3 or 4GB; you're still limited to a 2GB kmap maximum. Regarding size of the array vs. 
memory usage: as long as you tune kmem and ZFS ARC, you shouldn't have much trouble. There have been some key people reporting lately that they run very large ZFS arrays without issue, with proper tuning. Also, just a reminder: do not pick a value of 2048M for kmem_size or kmem_size_max; the machine won't boot/work. You shouldn't go above something like 1536M, although some have tuned slightly above that with success. (You need to remember that there is more to kernel memory allocation than just this, so you don't want to exhaust it all assigning it to kmap. Hope that makes sense...) > Can an AMD64 kernel make use of memory above 2g? Only on CURRENT; 7.x cannot, and AFAIK, will never be able to, as the engineering efforts required to fix it are too great. I look forward to seeing your numbers. Someone here might be able to compile them into some graphs and other whatnots to make things easier for future readers. Thanks for doing all of this! -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From koitsu at FreeBSD.org Thu Oct 30 21:34:14 2008 From: koitsu at FreeBSD.org (Jeremy Chadwick) Date: Thu Oct 30 21:34:21 2008 Subject: Areca vs. ZFS performance testing. In-Reply-To: <490A849C.7030009@dannysplace.net> References: <490A782F.9060406@dannysplace.net> <20081031033208.GA21220@icarus.home.lan> <490A849C.7030009@dannysplace.net> Message-ID: <20081031043412.GA22289@icarus.home.lan> On Fri, Oct 31, 2008 at 02:07:56PM +1000, Danny Carroll wrote: > Jeremy Chadwick wrote: > > I think these sets of tests are good. There are some others I'd like to > > see, but they'd only be applicable if the 1231-ML has hardware cache. I > > can mention what those are if the card does have hardware caching. > > The card comes standard with 256Mb of cache. 
I'd like to see the performance difference between these scenarios: - Memory cache enabled on Areca, write caching enabled on disks - Memory cache enabled on Areca, write caching disabled on disks - Memory cache disabled on Areca, write caching enabled on disks - Memory cache disabled on Areca, write caching disabled on disks I don't know if the controller will let you disable use of memory cache, but I'm hoping it does. I'm pretty sure it lets you disable disk write caching in its BIOS or via the CLI utility. > >> I do have some concern about the size of the eventual array and ZFS' use > >> of system memory. Are there guidelines available that give advice on > >> how much memory a box should have with large ZFS arrays? > > > > The general concept is: "the more RAM the better". However, if you're > > using RELENG_7, then there's not much point (speaking solely about ZFS) > > to getting more than maybe 3 or 4GB; you're still limited to a 2GB kmap > > maximum. > > > > Regarding size of the array vs. memory usage: as long as you tune kmem > > and ZFS ARC, you shouldn't have much trouble. There have been some > > key people reporting lately that they run very large ZFS arrays without > > issue, with proper tuning. > > I followed the recommendations here: > http://wiki.freebsd.org/ZFSTuningGuide > > vm.kmem_size="1024M" > vm.kmem_size_max="1024M" > vfs.zfs.debug=1 > > And : kern.maxvnodes=400000 > > I have not added the following because they were listed in the i386 > section. (These values were quoted for a machine with 768Mb of ram) > vfs.zfs.arc_max="40M" > vfs.zfs.vdev.cache.size="5M" > > Am I right in assuming these do not apply to amd64? The article was not > specific. All of the tuning variables apply to i386 and amd64. You do not need the vfs.zfs.debug variable; I'm not sure why you enabled that. I imagine it will have some impact on performance. I do not know anything about kern.maxvnodes, or vfs.zfs.vdev.cache.size. 
The tuning variables I advocate for a system with 2GB of RAM or more, on RELENG_7, are:

vm.kmem_size="1536M"
vm.kmem_size_max="1536M"
vfs.zfs.arc_min="16M"
vfs.zfs.arc_max="64M"
vfs.zfs.prefetch_disable="1"

You can gradually increase arc_min and arc_max by ~16MB increments as you see fit; you should see general performance improvements as they get larger (more data being kept in the ARC), but don't get too crazy. I've tuned arc_max up to 128MB before with success, but I don't want to try anything larger without decreasing kmem_size_*.

> > Also, just a reminder: do not pick a value of 2048M for kmem_size or
> > kmem_size_max; the machine won't boot/work. You shouldn't go above
> > something like 1536M, although some have tuned slightly above that
> > with success. (You need to remember that there is more to kernel
> > memory allocation than just this, so you don't want to exhaust it all
> > assigning it to kmap. Hope that makes sense...)
>
> It makes sense. I'm using 1024 at the moment, but I've never really
> looked into what memory is actually being used.
>
> Tuning advice here would be well received :-)

The only reason you need to adjust kmem_size and kmem_size_max is to increase the amount of available kmap memory which ZFS relies heavily on. If the values are too low, under heavy I/O, the kernel will panic with kmem exhaustion messages (see the ZFS Wiki for what some look like, or my Wiki).

I would recommend you stick with a consistent set of loader.conf tuning variables, and focus entirely on comparing the performance of ZFS on the Areca controller vs. the ICH controller. You can perform a "ZFS tuning comparison" later. One step at a time; don't over-exert yourself quite yet. :-)

You can add raidz2 to this comparison list too if you feel it's worthwhile, but I think most people will be using raidz1.
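The rules of thumb in this thread (keep kmem_size below 2048M, treat 1536M as the practical ceiling, keep arc_max well under kmem_size, arc_max defaulting to 3/4 of kmem_size when unset) can be turned into a small sanity check. This is only a sketch of those stated rules, not an official tool; the function names and warning texts are made up for illustration.

```python
def parse_size_mb(value):
    """Convert a loader.conf size such as "1536M" or "2G" to megabytes."""
    value = value.strip().strip('"').upper()
    if value.endswith("G"):
        return int(value[:-1]) * 1024
    if value.endswith("M"):
        return int(value[:-1])
    raise ValueError("expected a size ending in M or G: %r" % value)


def check_tuning(conf):
    """Return a list of warnings for a dict of loader.conf settings."""
    warnings = []
    kmem = parse_size_mb(conf["vm.kmem_size"])
    if "vfs.zfs.arc_max" in conf:
        arc_max = parse_size_mb(conf["vfs.zfs.arc_max"])
    else:
        # Default stated earlier in the thread: 3/4 of vm.kmem_size.
        arc_max = kmem * 3 // 4
    if kmem >= 2048:
        warnings.append("vm.kmem_size of 2048M or more: machine may not boot")
    elif kmem > 1536:
        warnings.append("vm.kmem_size above 1536M: beyond the suggested ceiling")
    if arc_max >= kmem:
        warnings.append("vfs.zfs.arc_max is not smaller than vm.kmem_size")
    return warnings


if __name__ == "__main__":
    suggested = {"vm.kmem_size": "1536M", "vfs.zfs.arc_max": "64M"}
    print(check_tuning(suggested))
    print(check_tuning({"vm.kmem_size": "2048M"}))
```

The thresholds are taken directly from the messages above and apply to RELENG_7 as discussed here; they are not general limits.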
-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

From andrew at modulus.org  Thu Oct 30 21:46:12 2008
From: andrew at modulus.org (Andrew Snow)
Date: Thu Oct 30 21:46:21 2008
Subject: Areca vs. ZFS performance testing.
In-Reply-To: <20081031043412.GA22289@icarus.home.lan>
References: <490A782F.9060406@dannysplace.net>
	<20081031033208.GA21220@icarus.home.lan>
	<490A849C.7030009@dannysplace.net>
	<20081031043412.GA22289@icarus.home.lan>
Message-ID: <490A8D23.6030309@modulus.org>

Jeremy Chadwick wrote:
> I would recommend you stick with a consistent set of loader.conf
> tuning variables, and focus entirely on comparing the performance of
> ZFS on the Areca controller vs. the ICH controller.

It's probably worth playing with vfs.zfs.cache_flush_disable when using
the hardware RAID.

By default, ZFS will flush the entire hardware cache just to make sure
the ZFS Intent Log (ZIL) has been written.

This isn't so bad on a group of hard disks with small caches, but it is
bad if you have 256MB of controller write cache.

From fbsd at dannysplace.net  Thu Oct 30 21:47:49 2008
From: fbsd at dannysplace.net (Danny Carroll)
Date: Thu Oct 30 21:48:01 2008
Subject: Areca vs. ZFS performance testing.
In-Reply-To: <20081031043412.GA22289@icarus.home.lan>
References: <490A782F.9060406@dannysplace.net>
	<20081031033208.GA21220@icarus.home.lan>
	<490A849C.7030009@dannysplace.net>
	<20081031043412.GA22289@icarus.home.lan>
Message-ID: <490A8DFB.8030405@dannysplace.net>

Jeremy Chadwick wrote:
> On Fri, Oct 31, 2008 at 02:07:56PM +1000, Danny Carroll wrote:
> - Memory cache enabled on Areca, write caching enabled on disks
> - Memory cache enabled on Areca, write caching disabled on disks
> - Memory cache disabled on Areca, write caching enabled on disks
> - Memory cache disabled on Areca, write caching disabled on disks

Does it matter what type of disk we are talking about? What I mean is,
do you want to see this with both RAID5 and RAID6 arrays?

Also, I'm pretty sure that in JBOD mode the cache (on the card) will do
nothing. But I am not certain, so I'll do the tests there as well.

What about stripe sizes? I mainly use big files so I was going to
stripe accordingly. But the bonnie++ tests might give strange results
in that case.

> I don't know if the controller will let you disable use of memory cache,
> but I'm hoping it does. I'm pretty sure it lets you disable disk
> write caching in its BIOS or via the CLI utility.
>

It's been a while since I've had a hardware RAID card. I'll see what is
available.

> All of the tuning variables apply to i386 and amd64.
>
> You do not need the vfs.zfs.debug variable; I'm not sure why you enabled
> that. I imagine it will have some impact on performance.

Consider it gone.

> I do not know anything about kern.maxvnodes, or vfs.zfs.vdev.cache.size.
>

At the moment I am not hitting anywhere near the max vnodes setting. So
I think it is irrelevant.
> The tuning variables I advocate for a system with 2GB of RAM or more,
> on RELENG_7, are:
>
> vm.kmem_size="1536M"
> vm.kmem_size_max="1536M"
> vfs.zfs.arc_min="16M"
> vfs.zfs.arc_max="64M"
> vfs.zfs.prefetch_disable="1"
>
> You can gradually increase arc_min and arc_max by ~16MB increments as
> you see fit; you should see general performance improvements as they
> get larger (more data being kept in the ARC), but don't get too crazy.
> I've tuned arc_max up to 128MB before with success, but I don't want
> to try anything larger without decreasing kmem_size_*.

What is the ARC? Is it the ZFS file cache?

> The only reason you need to adjust kmem_size and kmem_size_max is to
> increase the amount of available kmap memory which ZFS relies heavily
> on. If the values are too low, under heavy I/O, the kernel will panic
> with kmem exhaustion messages (see the ZFS Wiki for what some look
> like, or my Wiki).
>
> I would recommend you stick with a consistent set of loader.conf
> tuning variables, and focus entirely on comparing the performance of
> ZFS on the Areca controller vs. the ICH controller.

Once I am settled on a 'starting point' I won't be altering it for the
tests.

> You can perform a "ZFS tuning comparison" later. One step at a time;
> don't over-exert yourself quite yet. :-)

Yeah, this is weekend stuff for me at the moment, so it will take me
some time to get things done. Firstly I need to figure out how I am
going to hook up 10 drives to my system. I don't have the drive-bay
space and I am not shelling out for a new case, so I am hunting around
for an ancient external disk cabinet.

> You can add raidz2 to this comparison list too if you feel it's
> worthwhile, but I think most people will be using raidz1.

I might as well do both.

-D

From fbsd at dannysplace.net  Thu Oct 30 21:48:38 2008
From: fbsd at dannysplace.net (Danny Carroll)
Date: Thu Oct 30 21:48:45 2008
Subject: Areca vs. ZFS performance testing.
In-Reply-To: <20081031033208.GA21220@icarus.home.lan>
References: <490A782F.9060406@dannysplace.net>
	<20081031033208.GA21220@icarus.home.lan>
Message-ID: <490A849C.7030009@dannysplace.net>

Jeremy Chadwick wrote:
> I think these sets of tests are good. There are some others I'd like to
> see, but they'd only be applicable if the 1231-ML has hardware cache. I
> can mention what those are if the card does have hardware caching.

The card comes standard with 256MB of cache.

>> I do have some concern about the size of the eventual array and ZFS' use
>> of system memory. Are there guidelines available that give advice on
>> how much memory a box should have with large ZFS arrays?
>
> The general concept is: "the more RAM the better". However, if you're
> using RELENG_7, then there's not much point (speaking solely about ZFS)
> to getting more than maybe 3 or 4GB; you're still limited to a 2GB kmap
> maximum.
>
> Regarding size of the array vs. memory usage: as long as you tune kmem
> and ZFS ARC, you shouldn't have much trouble. There have been some
> key people reporting lately that they run very large ZFS arrays without
> issue, with proper tuning.

I followed the recommendations here:
http://wiki.freebsd.org/ZFSTuningGuide

vm.kmem_size="1024M"
vm.kmem_size_max="1024M"
vfs.zfs.debug=1

And : kern.maxvnodes=400000

I have not added the following because they were listed in the i386
section. (These values were quoted for a machine with 768MB of RAM)
vfs.zfs.arc_max="40M"
vfs.zfs.vdev.cache.size="5M"

Am I right in assuming these do not apply to amd64? The article was not
specific.

>
> Also, just a reminder: do not pick a value of 2048M for kmem_size or
> kmem_size_max; the machine won't boot/work. You shouldn't go above
> something like 1536M, although some have tuned slightly above that
> with success. (You need to remember that there is more to kernel
> memory allocation than just this, so you don't want to exhaust it all
> assigning it to kmap. Hope that makes sense...)

It makes sense. I'm using 1024 at the moment, but I've never really
looked into what memory is actually being used.

Tuning advice here would be well received :-)

>> Can an AMD64 kernel make use of memory above 2GB?
>
> Only on CURRENT; 7.x cannot, and AFAIK, will never be able to, as the
> engineering efforts required to fix it are too great.
>
> I look forward to seeing your numbers. Someone here might be able to
> compile them into some graphs and other whatnots to make things easier
> for future readers.

Ahhh, well, that will eventually decide my upgrade path when RELENG_8
is released and stable.

> Thanks for doing all of this!

No worries, hopefully it will be useful information for future Google
searches :-P

-D

From fbsd at dannysplace.net  Thu Oct 30 21:50:04 2008
From: fbsd at dannysplace.net (Danny Carroll)
Date: Thu Oct 30 21:50:10 2008
Subject: Areca vs. ZFS performance testing.
In-Reply-To: <490A8D23.6030309@modulus.org>
References: <490A782F.9060406@dannysplace.net>
	<20081031033208.GA21220@icarus.home.lan>
	<490A849C.7030009@dannysplace.net>
	<20081031043412.GA22289@icarus.home.lan>
	<490A8D23.6030309@modulus.org>
Message-ID: <490A8E82.1080901@dannysplace.net>

Andrew Snow wrote:
> It's probably worth playing with vfs.zfs.cache_flush_disable when using
> the hardware RAID.
>
> By default, ZFS will flush the entire hardware cache just to make sure
> the ZFS Intent Log (ZIL) has been written.
>
> This isn't so bad on a group of hard disks with small caches, but bad if
> you have 256MB of controller write cache.

Ok.

From fbsd at dannysplace.net  Thu Oct 30 21:55:02 2008
From: fbsd at dannysplace.net (Danny Carroll)
Date: Thu Oct 30 21:55:08 2008
Subject: Areca vs. ZFS performance testing.
In-Reply-To: <20081031043412.GA22289@icarus.home.lan>
References: <490A782F.9060406@dannysplace.net>
	<20081031033208.GA21220@icarus.home.lan>
	<490A849C.7030009@dannysplace.net>
	<20081031043412.GA22289@icarus.home.lan>
Message-ID: <490A8FAD.8060009@dannysplace.net>

Jeremy Chadwick wrote:
>
> I'd like to see the performance difference between these scenarios:
>
> - Memory cache enabled on Areca, write caching enabled on disks
> - Memory cache enabled on Areca, write caching disabled on disks
> - Memory cache disabled on Areca, write caching enabled on disks
> - Memory cache disabled on Areca, write caching disabled on disks
>
> I don't know if the controller will let you disable use of memory cache,
> but I'm hoping it does. I'm pretty sure it lets you disable disk
> write caching in its BIOS or via the CLI utility.

The manual suggests that the write cache can be disabled. Perhaps there
is no read cache for this card.

-D

From numisemis at yahoo.com  Fri Oct 31 02:21:01 2008
From: numisemis at yahoo.com (Simun Mikecin)
Date: Fri Oct 31 04:23:37 2008
Subject: Areca vs. ZFS performance testing.
Message-ID: <880498.17704.qm@web36603.mail.mud.yahoo.com>

Jeremy Chadwick wrote:
> The tuning variables I advocate for a system with 2GB of RAM or more,
> on RELENG_7, are:
> vm.kmem_size="1536M"
> vm.kmem_size_max="1536M"

There is no point in setting vm.kmem_size_max. Setting vm.kmem_size is
enough. vm.kmem_size_max is only used for auto-tuning the kmem size,
which in this case is overridden by manually setting vm.kmem_size.

> vfs.zfs.arc_min="16M"
> vfs.zfs.arc_max="64M"
> vfs.zfs.prefetch_disable="1"
> You can gradually increase arc_min and arc_max by ~16MB increments as
> you see fit; you should see general performance improvements as they
> get larger (more data being kept in the ARC), but don't get too crazy.
> I've tuned arc_max up to 128MB before with success, but I don't want
> to try anything larger without decreasing kmem_size_*.

Can you explain why you would have to decrease kmem_size to use a larger
ARC? AFAIK it should be the opposite of what you are saying: when you use
a larger kmem_size you can also use a larger arc_max.

My suggestion, if you are using a kmem_size of 1536M, would be to not
tune arc_min and arc_max as long as your system isn't panicking. If it
does panic, try decreasing arc_max (from its default value) until it
doesn't.

From numisemis at yahoo.com  Fri Oct 31 04:57:19 2008
From: numisemis at yahoo.com (Simun Mikecin)
Date: Fri Oct 31 05:04:27 2008
Subject: Areca vs. ZFS performance testing.
Message-ID: <174490.95560.qm@web36607.mail.mud.yahoo.com>

Jeremy Chadwick wrote:
> Well, my understanding (which is probably wrong) is that the memory
> used for the ARC is somehow separate from that of the kmap. I was
> under the impression the kmap was used by ZFS for other things, and
> did not include ARC.

kmem is used by the ARC. You can check the total kmem used by ZFS with
'vmstat -m'; look at the line labeled 'solaris'.

> People have advocated increasing arc_min and arc_max in the past, citing
> large performance gains as arc_max gets larger; you might see people
> mentioning that they see great performance increases when increasing
> arc_max from 64M to 128M. My understanding is that increasing the ARC
> provides more actual cached data that ZFS can reference (vs. pulling it
> off disk). Again, if I'm incorrect, please state so.

You are correct about the benefits of increasing arc_max. I don't know
of any benefits of tuning arc_min. Maybe someone else can answer this.

By default on 7-STABLE, arc_max will be 3/4 of kmem_size. So if you are
using 1536M for kmem_size, arc_max will be 1152M by default. However,
some people may need to lower it to avoid panics during heavy I/O, since
in those scenarios the ARC size can briefly exceed arc_max and hit the
kmem limit.
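
The default arc_max arithmetic described above is easy to check. The snippet below is only an illustration of the 3/4 rule as stated in this message; the function name default_arc_max_mb is made up, and the rule is not verified against the kernel source.

```python
def default_arc_max_mb(kmem_size_mb):
    """Default arc_max in MB, assuming it is 3/4 of kmem_size (per this thread)."""
    return kmem_size_mb * 3 // 4


if __name__ == "__main__":
    for kmem in (1024, 1536):
        print(f"kmem_size={kmem}M -> default arc_max={default_arc_max_mb(kmem)}M")
```

For kmem_size=1536M this gives 1152M, the figure quoted above; for 1024M it gives 768M.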

From grarpamp at gmail.com  Fri Oct 31 09:36:32 2008
From: grarpamp at gmail.com (grarpamp)
Date: Fri Oct 31 09:36:44 2008
Subject: Benchmark tools: was Areca vs ZFS
Message-ID: 

Hi. Wanted to send an FYI for those who may not know about it.
Lots of folks seem to mention Bonnie. There is also Iozone.
It has been maintained fairly well and has/had useful features
before Bonnie did/may. When compiling iozone, change the library
flag -lpthread to the gcc option -pthread and you're done. There is
also the freebsd-performance list. Here are the current links for
anyone interested. Happy benching!

http://www.iozone.org/
ver : 3.311
port: 3.283 [old]

http://www.coker.com.au/bonnie++/
ver : 1.03d
port: none [???]

http://www.coker.com.au/bonnie++/experimental/
ver : 1.94
port: 1.93d [old]