From lists at thefrog.net Thu May 1 08:54:41 2008
From: lists at thefrog.net (Andrew Hill)
Date: Thu May 1 08:54:45 2008
Subject: ZFS docs / info
Message-ID:

Ivan Voras wrote:
> Do you know about http://wiki.freebsd.org/ZFS ?

Yes, that was my starting point as I learnt about ZFS. I simply wanted
to offer documentation aimed at a different level of user. I found that
the documentation on that wiki, and the docs it links to, tended to fit
into one of three categories:

1. It provided a very high-level listing of features of the whole
   system, without talking about specific components, what each one is
   responsible for and how they fit together (e.g. is the zpool or the
   zfs responsible for checksumming, compression, redundancy, etc.) -
   great for convincing people of the worth of ZFS.
2. It assumed the reader has full knowledge of how the ZFS pieces fit
   together (i.e. they know what they want to create and when) and was
   simply there to document the syntax of the zpool and zfs commands -
   a good quick-reference guide for those familiar with ZFS.
3. It provided very detailed information about commands, which must of
   course include how to use every single component available to ZFS,
   a lot of which is far beyond what a typical 'home' BSD user would
   want, and perhaps confusing due to the level of detail - but perfect
   for an engineer or administrator.

Obviously the right documentation for a specific user really depends on
their background knowledge, and I felt that the first category was great
for convincing someone to use ZFS, but if they knew nothing of how the
pieces fit together then 2 and 3 were a very deep pool to dive into.

So I've tried to summarise the info I found from all three into a
simpler document, aimed somewhere in between a high-level overview and
detailed man pages, containing what I found most useful from the
documentation available.

I don't imagine anyone who's actually bothered to sign up to freebsd-fs
will want documentation at the level I've written it (they'll be going
for #2 or #3 above), but I figured those trying to find out how it fits
together might stumble across the archives, or maybe someone involved in
documentation will see some utility (for new ZFS users) in what I've
written.

Andrew

From anderson at freebsd.org Fri May 2 20:58:40 2008
From: anderson at freebsd.org (Eric Anderson)
Date: Fri May 2 20:58:44 2008
Subject: Consistent inodes between distinct machines
In-Reply-To: <48070DCF.9090902@fsn.hu>
References: <48070DCF.9090902@fsn.hu>
Message-ID: <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org>

On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:

> Hello,
>
> I have several NFS servers, where the service must be available
> 0-24. The servers are mounted read only on the clients and I've
> solved the problem of maintaining consistent inodes between them by
> rsyncing an UFS image and mounting it via md on the NFS servers.
> The machines have a common IP address with CARP, so if one of them
> falls out, the other(s) can take over.
>
> This works nice, but rsyncing multi gigabyte files are becoming more
> and more annoying, so I've wondered whether it would be possible to
> get constant inodes between machines via alternative ways.

Why not avoid syncing multi-gigabyte files by splitting your huge FS
image into many smaller (say, 512MB) files, then use md and geom
concat/stripe/etc. to make them all one image that you mount?

Eric

From ticso at cicely12.cicely.de Sat May 3 12:51:05 2008
From: ticso at cicely12.cicely.de (Bernd Walter)
Date: Sat May 3 12:51:09 2008
Subject: Consistent inodes between distinct machines
In-Reply-To: <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org>
References: <48070DCF.9090902@fsn.hu>
    <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org>
Message-ID: <20080503125050.GG40730@cicely12.cicely.de>

On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
>
> >Hello,
> >
> >I have several NFS servers, where the service must be available
> >0-24. The servers are mounted read only on the clients and I've
> >solved the problem of maintaining consistent inodes between them by
> >rsyncing an UFS image and mounting it via md on the NFS servers.
> >The machines have a common IP address with CARP, so if one of them
> >falls out, the other(s) can take over.
> >
> >This works nice, but rsyncing multi gigabyte files are becoming more
> >and more annoying, so I've wondered whether it would be possible to
> >get constant inodes between machines via alternative ways.
>
>
> Why not avoid syncing multi-gigabyte files by splitting your huge FS
> image into many smaller say 512MB files, then use md and geom concat/
> stripe/etc to make them all one image that you mount?

What would be the positive effect of doing this?
FFS distributes data over the media, so all the small files change
in almost every case and you have to checksum-compare the whole virtual
disk anyway.
With multiple files the syncing is more complex. For example, a normal
rsync run can guarantee that you get a complete file synced or none
at all, but this doesn't work out of the box with multiple files, so
you risk half-updated data.

Nevertheless I think that the UFS/NFS combo is not very good for this
problem.
With ZFS send/receive, however, inode numbers are consistent.
Together with the differential stream creation it is quite efficient
to sync large volumes as well.

[75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test
receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03
received 126Mb stream in 28 seconds (4.50Mb/sec)
0.008u 5.046s 0:27.93 18.0% 53+2246k 0+0io 0pf+0w
[56]cicely14# ls -ali /usr/local/arm-elf/bin/
total 22585
147 drwxr-xr-x   2 root  wheel       20 Mar 25  2006 .
  3 drwxr-xr-x  11 root  wheel       11 Dec 25 04:58 ..
154 -rwxr-xr-x   1 root  wheel  1514107 Mar 25  2006 arm-elf-addr2line
150 -rwxr-xr-x   2 root  wheel  1495219 Mar 25  2006 arm-elf-ar
159 -rwxr-xr-x   2 root  wheel  2275463 Mar 25  2006 arm-elf-as
158 -rwxr-xr-x   1 root  wheel  1481234 Mar 25  2006 arm-elf-c++filt
163 -rwxr-xr-x   1 root  wheel   300233 Mar 25  2006 arm-elf-cpp
164 -rwxr-xr-x   2 root  wheel   296938 Mar 25  2006 arm-elf-gcc
164 -rwxr-xr-x   2 root  wheel   296938 Mar 25  2006 arm-elf-gcc-4.1.0
162 -rwxr-xr-x   1 root  wheel    15949 Mar 25  2006 arm-elf-gccbug
161 -rwxr-xr-x   1 root  wheel   126715 Mar 25  2006 arm-elf-gcov
160 -rwxr-xr-x   2 root  wheel  2162285 Mar 25  2006 arm-elf-ld
156 -rwxr-xr-x   2 root  wheel  1541809 Mar 25  2006 arm-elf-nm
153 -rwxr-xr-x   1 root  wheel  1871104 Mar 25  2006 arm-elf-objcopy
149 -rwxr-xr-x   2 root  wheel  2008424 Mar 25  2006 arm-elf-objdump
152 -rwxr-xr-x   2 root  wheel  1495214 Mar 25  2006 arm-elf-ranlib
155 -rwxr-xr-x   1 root  wheel   389000 Mar 25  2006 arm-elf-readelf
148 -rwxr-xr-x   1 root  wheel  1430608 Mar 25  2006 arm-elf-size
151 -rwxr-xr-x   1 root  wheel  1412788 Mar 25  2006 arm-elf-strings
157 -rwxr-xr-x   2 root  wheel  1871103 Mar 25  2006 arm-elf-strip
[57]cicely14# ls -ali /data/test/bin/
total 22585
147 drwxr-xr-x   2 root  wheel       20 Mar 25  2006 .
  3 drwxr-xr-x  11 root  wheel       11 Dec 25 04:58 ..
154 -rwxr-xr-x   1 root  wheel  1514107 Mar 25  2006 arm-elf-addr2line
150 -rwxr-xr-x   2 root  wheel  1495219 Mar 25  2006 arm-elf-ar
159 -rwxr-xr-x   2 root  wheel  2275463 Mar 25  2006 arm-elf-as
158 -rwxr-xr-x   1 root  wheel  1481234 Mar 25  2006 arm-elf-c++filt
163 -rwxr-xr-x   1 root  wheel   300233 Mar 25  2006 arm-elf-cpp
164 -rwxr-xr-x   2 root  wheel   296938 Mar 25  2006 arm-elf-gcc
164 -rwxr-xr-x   2 root  wheel   296938 Mar 25  2006 arm-elf-gcc-4.1.0
162 -rwxr-xr-x   1 root  wheel    15949 Mar 25  2006 arm-elf-gccbug
161 -rwxr-xr-x   1 root  wheel   126715 Mar 25  2006 arm-elf-gcov
160 -rwxr-xr-x   2 root  wheel  2162285 Mar 25  2006 arm-elf-ld
156 -rwxr-xr-x   2 root  wheel  1541809 Mar 25  2006 arm-elf-nm
153 -rwxr-xr-x   1 root  wheel  1871104 Mar 25  2006 arm-elf-objcopy
149 -rwxr-xr-x   2 root  wheel  2008424 Mar 25  2006 arm-elf-objdump
152 -rwxr-xr-x   2 root  wheel  1495214 Mar 25  2006 arm-elf-ranlib
155 -rwxr-xr-x   1 root  wheel   389000 Mar 25  2006 arm-elf-readelf
148 -rwxr-xr-x   1 root  wheel  1430608 Mar 25  2006 arm-elf-size
151 -rwxr-xr-x   1 root  wheel  1412788 Mar 25  2006 arm-elf-strings
157 -rwxr-xr-x   2 root  wheel  1871103 Mar 25  2006 arm-elf-strip

--
B.Walter http://www.bwct.de
Modbus/TCP Ethernet I/O modules, ARM-based FreeBSD computers and much more.

From yalur at mail.ru Sat May 3 15:55:46 2008
From: yalur at mail.ru (Ruslan Kovtun)
Date: Sat May 3 15:55:49 2008
Subject: Choppy performance.
In-Reply-To: <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com>
References: <24adbbc00804151529m2a74085ds468eaac55ba94a32@mail.gmail.com>
    <200804162212.32560.yalur@mail.ru>
    <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com>
Message-ID: <200805031855.43218.yalur@mail.ru>

Sorry, maybe I missed something.
What "memory allocation errors in rtorrent" do you mean?

> But if it isn't really using that much memory how come I get
> memory allocation errors in rtorrent if there's more memory
> avaliable?

One week ago I observed a problem with write speed on a ZFS pool with
the following configuration on i386:

vm.kmem_size_max="1073741824"
vm.kmem_size="1073741824"
KVA_PAGES=512

Write speed on 8 disks (raidz) is 40 Mb/sec and very choppy.
If I change to vm.kmem_size_max="999M", write speed increases fourfold
(160 Mb/sec). I think this is a bug.
What is your configuration?

From bra at fsn.hu Sat May 3 18:09:34 2008
From: bra at fsn.hu (Attila Nagy)
Date: Sat May 3 18:09:38 2008
Subject: Consistent inodes between distinct machines
In-Reply-To: <20080503125050.GG40730@cicely12.cicely.de>
References: <48070DCF.9090902@fsn.hu>
    <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org>
    <20080503125050.GG40730@cicely12.cicely.de>
Message-ID: <481CAA55.2030506@fsn.hu>

Hello,

On 2008.05.03. 14:50, Bernd Walter wrote:
> On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
>
>> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
>>
>>
>>> Hello,
>>>
>>> I have several NFS servers, where the service must be available
>>> 0-24. The servers are mounted read only on the clients and I've
>>> solved the problem of maintaining consistent inodes between them by
>>> rsyncing an UFS image and mounting it via md on the NFS servers.
>>> The machines have a common IP address with CARP, so if one of them
>>> falls out, the other(s) can take over.
>>>
>>> This works nice, but rsyncing multi gigabyte files are becoming more
>>> and more annoying, so I've wondered whether it would be possible to
>>> get constant inodes between machines via alternative ways.
>>>
>> Why not avoid syncing multi-gigabyte files by splitting your huge FS
>> image into many smaller say 512MB files, then use md and geom concat/
>> stripe/etc to make them all one image that you mount?
>>
>
> Where would be the positive effect by doing this?
> FFS distributes data over the media, so all the small files changes
> in almost every case and you have to checksum-compare the whole virtual
> disk anyway.
> With multiple files the syncing is more complex. For example a normal
> rsync run can garantie that you get a complete file synced or none
> at all, but this doesn't work out of the box with multiple files, so
> you risk half updated data.
>
I haven't got Eric's e-mail, but I agree with the above.

> Nevertheless I think that the UFS/NFS combo is not very good for this
> problem.
>
I don't think so. I need a stable system and UFS/NFS is in that state in
FreeBSD.

> With ZFS send/receive however inode numbers are consistent.
>
Yes, they are, but the filesystem IDs are not, so you cannot have CARP
failover for the NFS servers, because all clients will get ESTALE
errors on everything.
I've already tried that; see my e-mails about this topic in the archives
(it would be good if we could synchronize the filesystem IDs and
therefore the filehandles too).

> Together with the differential stream creation it is quite efficient
> to sync large volumes as well.
> [75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test
> receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03
> received 126Mb stream in 28 seconds (4.50Mb/sec)
> 0.008u 5.046s 0:27.93 18.0% 53+2246k 0+0io 0pf+0w
>
Yes, that's why I thought of this in the first place. But there is
another problem, which hits us today (with the loopbacked image mount)
as well: you have to unmount the image and restart the NFS server (it
can panic the machine otherwise), so we have to flip the active state
from one machine to the other during the sync.
The exact process looks like this:
- rsync the image to the inactive server
- when it's done, remount the image and restart the nfsd
- flip CARP (this is when the new content will go into production)
- sync the image to the now inactive, previously active server

This is a painful, slow (because of the rsync) and fragile process. And
if the active server crashes while the sync is going, you are left with
a possibly non-working state.

With ZFS, the sync time is much smaller, but you have to flip the active
state and restart nfsd as well.

Currently I'm experimenting with a silly kernel patch, which replaces
the following arc4random()s with a constant value:
./ffs/ffs_alloc.c:    ip->i_gen = arc4random() / 2 + 1;
./ffs/ffs_alloc.c:    prefcg = arc4random() % fs->fs_ncg;
./ffs/ffs_alloc.c:    dp2->di_gen = arc4random() / 2 + 1;
./ffs/ffs_vfsops.c:   ip->i_gen = arc4random() / 2 + 1;

It seems that this works when I don't use soft updates on the volumes.

So what I have now:
- all of the machines have the above arc4random()s removed
- all machines run the data file system in async mode (for speed and
  because soft updates seem to mess up the constant inodes)
- I have all the data in a subversion repository (better than a plain
  "master image", because it's versioned, logged, etc.)
- I do updates in this way on the machines:
  mount -o rw,async /data; svn up; mount -o ro /data

So far it seems to be OK, but I'm not yet finished with the testing.

From ticso at cicely12.cicely.de Sat May 3 18:52:07 2008
From: ticso at cicely12.cicely.de (Bernd Walter)
Date: Sat May 3 18:52:12 2008
Subject: Consistent inodes between distinct machines
In-Reply-To: <481CAA55.2030506@fsn.hu>
References: <48070DCF.9090902@fsn.hu>
    <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org>
    <20080503125050.GG40730@cicely12.cicely.de>
    <481CAA55.2030506@fsn.hu>
Message-ID: <20080503185155.GA44005@cicely12.cicely.de>

On Sat, May 03, 2008 at 08:09:25PM +0200, Attila Nagy wrote:
> Hello,
>
> On 2008.05.03. 14:50, Bernd Walter wrote:
> >On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
> >
> >>On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
> >Nevertheless I think that the UFS/NFS combo is not very good for this
> >problem.
> >
> I don't think so. I need a stable system and UFS/NFS is in that state in
> FreeBSD.

ZFS is pretty stable as well, although it has some points you need
to take care of and tune.

> >With ZFS send/receive however inode numbers are consistent.
> >
> Yes, they are, but the filesystem IDs are not, so you cannot have CARP
> failover for the NFS servers, because all clients will have ESTALE
> errors on everything.

Haven't thought about this.
Of course this is a real problem.
Have you tried the following:
Set up Server A with all required ZFS filesystems.
Replicate everything to Server B using dd.
Then the filesystem ID should be the same on both systems.
This will not work for newly created filesystems, however, and you may
need to take extra care not to accidentally swap disks between the
machines, since they have the same disk IDs as well.
I admit - not very perfect :(

> I've already tried that, see my e-mails about this topic in the archives
> (it would be good if we could synchronize the filesystem IDs and
> therefore the filehandles too).

> >Together with the differential stream creation it is quite efficient
> >to sync large volumes as well.
> >[75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test
> >receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03
> >received 126Mb stream in 28 seconds (4.50Mb/sec)
> >0.008u 5.046s 0:27.93 18.0% 53+2246k 0+0io 0pf+0w
> >
> Yes, that's why I thought of this in the first place. But there is
> another problem, which hits us today (with the loopbacked image mount)
> as well: you have to unmount the image and restart the NFS server (it
> can panic the machine otherwise), so we have to flip the active state
> from one machine to the other during the sync.

Of course you have to do this - a read-only mount means the filesystem
does not write, but it still caches metadata and does not expect the
underlying media to change contents, so to stay in sync you have to
remount.

> The exact process looks like this:
> - rsync the image to the inactive server
> - when it's done, remount the image and restart the nfsd

You also have to sync the image to a different file, since you can't
pollute the original file with new content while it is mounted.
But with proper (IIRC default) options rsync already writes a new
file and then exchanges it with the old one.

> - flip CARP (this is when the new content will go into production)
> - sync the image to the now inactive, previously active server
>
> This is a painful, slow (because of the rsync) and fragile process. And
> if the active server crashes while the sync is going, you are there with
> a possibly non-working state.
>
> With ZFS, the sync time is much smaller, but you have to flip the active
> state and restart nfsd as well.

Sounds plausible to me.

> Currently I'm experimenting with a silly kernel patch, which replaces
> the following arc4random()s with a constant value:
> ./ffs/ffs_alloc.c: ip->i_gen = arc4random() / 2 + 1;
> ./ffs/ffs_alloc.c: prefcg = arc4random() % fs->fs_ncg;
> ./ffs/ffs_alloc.c: dp2->di_gen = arc4random() / 2 + 1;
> ./ffs/ffs_vfsops.c: ip->i_gen = arc4random() / 2 + 1;
>
> It seems that this works when I don't use soft updates on the volumes.

But it is very fragile, and the randomness is there for a good reason:
namely to distribute the allocated inodes over the media. Since AFAIK
at least small files have their data allocated near the inode, you
influence data distribution as well.
This will very likely lead to lower speed after some usage.
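
[Editorial sketch] The write-a-new-file-then-exchange behaviour mentioned earlier in this reply can be illustrated in a few lines of sh. This is only a sketch of the general technique (copy to a temporary name, then rename over the target), not rsync's actual implementation; the function name and the temporary-file suffix are made up for illustration.

```shell
#!/bin/sh
# Sketch: replace a file by writing a complete copy next to it and then
# renaming it into place. Names are hypothetical; this is not rsync code.
replace_atomically() {
    src="$1"
    dst="$2"
    tmp="${dst}.new.$$"
    # Copy to a temporary name in the same directory, so that the rename
    # below stays within one file system.
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    # The rename swaps the directory entry in one step: a reader opening
    # the path sees either the old file or the new one, never a mix.
    mv -f "$tmp" "$dst"
}
```

A call such as `replace_atomically /tmp/incoming.img /images/data.img` (both paths hypothetical) would leave the old image untouched until the copy is complete. As noted in the reply, a file system already mounted from the old image still has to be remounted to pick up the new content.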

> So what I have now:
> - all of the machines have the above arc4random()s removed
> - all machines run the data file system in async mode (for speed and
> because soft updates seems to mess up the constant inodes)
> - I have all the data in a subversion repository (better than a plain
> "master image", because it's versioned, logged, etc)
> - I do updates in this way on the machines: mount -o rw,async /data; svn
> up; mount -o ro /data
>
> So far it seems to be OK, but I'm not yet finished with the testing.

Honestly - I wouldn't trust that very much.
Say you use two disk stations with fibre channel, which are connected to
two hosts.
Use the disk stations with different power supply rails.
Then use a solidly constructed single server and have the same machine
as cold or maybe already booted standby.
Use the disk stations to mirror - one half on each station.
If the host dies you can easily take over the service on the other
machine by just mounting the disks.
If you do this with ZFS it even takes care that the original host will
not automatically mount them, since the host-id for the pool has been
changed to that of the other host.
It is not a hot standby like your solution, but talking about service
failures I would assume this will outperform any hackish solution.
I see so many people trying to do freaky failover with additional
complexity and additional failure points, instead of just increasing
the quality of their hardware.

--
B.Walter http://www.bwct.de
Modbus/TCP Ethernet I/O modules, ARM-based FreeBSD computers and much more.
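
[Editorial sketch] Two ZFS operations come up in the exchange above: the "differential stream creation" for syncing, and importing a shared pool on the standby host. The dry-run helpers below only print the commands that might be used; nothing is executed. The snapshot name 2008-05-04 and the pool name tank are hypothetical placeholders, and the use of `zpool import -f` for the takeover is an assumption based on the host-id remark in the message, not something the poster spelled out.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) ZFS commands discussed in the thread.
# All dataset, snapshot and pool names are hypothetical placeholders.

# Incremental replication: send only the changes between two snapshots.
incremental_sync_cmd() {
    src="$1"    # source dataset, e.g. data/arm-elf
    prev="$2"   # snapshot name that both sides already have
    cur="$3"    # newer snapshot name to transfer
    dst="$4"    # destination dataset, e.g. data/test
    # "zfs send -i" produces a stream containing the differences between
    # the older and the newer snapshot.
    echo "zfs send -i ${src}@${prev} ${src}@${cur} | zfs receive -v ${dst}"
}

# Standby takeover: import a pool that was last in use by the other host.
takeover_cmds() {
    pool="$1"
    # -f forces the import even though another host used the pool last.
    echo "zpool import -f ${pool}"
    # Show pool health after the import.
    echo "zpool status ${pool}"
}

incremental_sync_cmd data/arm-elf 2008-05-03 2008-05-04 data/test
takeover_cmds tank
```

Printing the commands first makes it possible to review them before piping them to sh on a real system. As pointed out in the thread, replication of this kind keeps inode numbers consistent but not filesystem IDs, so it does not by itself make NFS file handles survive a CARP failover.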

From bra at fsn.hu Sat May 3 19:53:44 2008
From: bra at fsn.hu (Attila Nagy)
Date: Sat May 3 19:53:49 2008
Subject: Consistent inodes between distinct machines
In-Reply-To: <20080503185155.GA44005@cicely12.cicely.de>
References: <48070DCF.9090902@fsn.hu>
    <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org>
    <20080503125050.GG40730@cicely12.cicely.de>
    <481CAA55.2030506@fsn.hu>
    <20080503185155.GA44005@cicely12.cicely.de>
Message-ID: <481CC2B8.5080205@fsn.hu>

On 2008.05.03. 20:51, Bernd Walter wrote:
> On Sat, May 03, 2008 at 08:09:25PM +0200, Attila Nagy wrote:
>
>> Hello,
>>
>> On 2008.05.03. 14:50, Bernd Walter wrote:
>>
>>> On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
>>>
>>>
>>>> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
>>>>
>>> Nevertheless I think that the UFS/NFS combo is not very good for this
>>> problem.
>>>
>>>
>> I don't think so. I need a stable system and UFS/NFS is in that state in
>> FreeBSD.
>>
>
> ZFS is pretty stable as well, although it has some points you need
> to care and tune about.
>
I have (had, switched one back to UFS) two machines with ZFS. One i386
and one amd64. Both kept crashing or freezing, so I don't consider ZFS
pretty stable ATM. :(

> Havn't though about this.
> Of course this is a real problem.
> Have you tried the following:
> Setup Server A with all required ZFS filesystems.
> Replicate everything to Server B using dd.
> Then the filesystem ID should be the same on both systems.
> This will not work for newly created filesystems however and you may
> need to take extra care about not accidently change disks between the
> machines, since they have the same disk IDs as well.
> I admit - not very perfect :(
>
Haven't tried that -but thought of it-, because I would need a bunch of
new filesystems for snapshotting and synchronizing and I would like to
dd tens of gigabytes every time to all of the NFS servers over the
network.

>> Yes, that's why I thought of this in the first place. But there is
>> another problem, which hits us today (with the loopbacked image mount)
>> as well: you have to unmount the image and restart the NFS server (it
>> can panic the machine otherwise), so we have to flip the active state
>> from one machine to the other during the sync.
>>
>
> Of course you have to do this - readonly mounts mean not writing, but
> it doesn't mean not caching metadata and expecting the underlying media
> to change contents, so to stay in sync you have to remount.
>
I am very well aware of that. If it worked, I would choose a
geom_gate solution with one RW machine and many RO ones with a mirror
formed from them. Of course that's still not perfect, so ZFS's mirroring
would be a better fit (due to incremental updates).
But sadly, it's not possible (AFAIK with "standard" methods) to run
systems like that.

>
>> The exact process looks like this:
>> - rsync the image to the inactive server
>> - when it's done, remount the image and restart the nfsd
>>
>
> You also have to sync the image to a different file, since you can't
> pollute the original file with new content, while it is mounted.
>
I have been doing this for years without any ill effects. Of course I
don't access the filesystem while it's being synced. I'm just too lazy
to umount it, but you are right, that's the correct way.

> But with propper (IIRC default) options rsync already writes a new
> file and than exchanges it with the old one.
>
Yes, I use inplace syncing, because I don't have that much space
available.

>
>> Currently I'm experimenting with a silly kernel patch, which replaces
>> the following arc4random()s with a constant value:
>> ./ffs/ffs_alloc.c: ip->i_gen = arc4random() / 2 + 1;
>> ./ffs/ffs_alloc.c: prefcg = arc4random() % fs->fs_ncg;
>> ./ffs/ffs_alloc.c: dp2->di_gen = arc4random() / 2 + 1;
>> ./ffs/ffs_vfsops.c: ip->i_gen = arc4random() / 2 + 1;
>>
>> It seems that this works when I don't use soft updates on the volumes.
>>
>
> But it is very fragile and it is there for a good reason.
>
For a normal filesystem, yes.

> Namely to distribute the allocated inodes over the media and since
> AFAIK at leasy small files have their data allocated near the inode
> you influece data distribution as well.
> This will very likely lead to lower speed after some usage.
>
Because these are mostly RO (only RW while updating, which is a slow
process anyway) volumes, used for serving NFS clients, I don't think it
will matter that much. But I'll see. Currently this is the best I could
come up with.

> Honestly said - I wouldn't trust that very much.
> Say you use two disk stations with fibre channel, which are connetced to
> two hosts.
> Use the disk stations with different power supply rails.
> Then use a solid constructed single server and have the same machine
> as cold or maybe already booted standby.
> Use the disk stations to mirror - one half on each station.
> If the host dies you can easily take over the service to the other
> machine by just mounting the disks.
> If you do this with ZFS it even takes care that the original host will
> not automatically mount them, since the host-id for the pool has been
> changed to that of the other host.
> It is not a hot standby as your solution, but talking about service
> failures I would assume this will outperform any hackish solution.
> I see so many people trying to do freaky failover with additional
> complexity and additional failure points, instead of just to increase
> the quality of their hardware.
>
>
The above servers are providing NFS to FreeBSD and Linux netboot clients
(clients are at many sites, running the real services behind load
balancers, BGP anycast routing, whatever you like).
The NFS servers here have the function of rapid deployment (put some new
machines in the server pool X), centralised management (only have to
make the configuration and OS changes in one place), etc.

So I'm not trying to build a highly available general cluster (with
NFS), but a highly available NFS server for netbooted clients.

And commercial NASes aren't better at all (at least this is what I've
seen so far); most of them are not shared-nothing systems with
affordable, reliable multisite replication capabilities.

From engywook at gmail.com Sun May 4 09:03:30 2008
From: engywook at gmail.com (Daniel Andersson)
Date: Sun May 4 09:03:34 2008
Subject: Choppy performance.
In-Reply-To: <200805031855.43218.yalur@mail.ru>
References: <24adbbc00804151529m2a74085ds468eaac55ba94a32@mail.gmail.com>
    <200804162212.32560.yalur@mail.ru>
    <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com>
    <200805031855.43218.yalur@mail.ru>
Message-ID: <24adbbc00805040203nbe92a2u140586abbbeb4a73@mail.gmail.com>

> Sorry, maybe I miss something.
> What "memory allocation errors in rtorrent" do you mean?

"Storage error: [File chunk write error: Cannot allocate memory]"
"Storage error: [File chunk read error: Cannot allocate memory]"

It happens with big torrents, ~40GB+. It probably happens because I have:
send_buffer_size=1M
receive_buffer_size=1M

1MB send/receive buffers plus big torrent chunks eat the memory pretty
fast.
I have it at 1M to lessen the disk load. But maybe I don't need it that
high anymore with two zfs disks?

> >
> > But if it isn't really using that much memory how come I get
> > memory allocation errors in rtorrent if there's more memory
> > avaliable?
>
> One week ago was observed problem with write speed on ZFS pool with following
> configuration on i386:
>
> vm.kmem_size_max="1073741824"
> vm.kmem_size="1073741824"
> KVA_PAGES=512
> Write speed in 8 disks (raidz) is 40 Mb/sec and very choppy.
> If I change to vm.kmem_size_max="999M", write speed increase in 4 times
> (160Mb/sec). I think this is bug.
> What is yours configuration?
>

My loader.conf (AMD64):

vfs.zfs.prefetch_disable=1
vm.kmem_size_max="1073741824"
vm.kmem_size="1073741824"

I'll try setting it to 999M instead, thanks!

Cheers!
Daniel

From gary.jennejohn at freenet.de Sun May 4 14:08:05 2008
From: gary.jennejohn at freenet.de (Gary Jennejohn)
Date: Sun May 4 14:08:10 2008
Subject: Choppy performance.
In-Reply-To: <24adbbc00805040203nbe92a2u140586abbbeb4a73@mail.gmail.com>
References: <24adbbc00804151529m2a74085ds468eaac55ba94a32@mail.gmail.com>
    <200804162212.32560.yalur@mail.ru>
    <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com>
    <200805031855.43218.yalur@mail.ru>
    <24adbbc00805040203nbe92a2u140586abbbeb4a73@mail.gmail.com>
Message-ID: <20080504160802.745afcc2@peedub.jennejohn.org>

On Sun, 4 May 2008 11:03:30 +0200
"Daniel Andersson" wrote:

> > vm.kmem_size_max="1073741824"
> > vm.kmem_size="1073741824"
> > KVA_PAGES=512
> > Write speed in 8 disks (raidz) is 40 Mb/sec and very choppy.
> > If I change to vm.kmem_size_max="999M", write speed increase in 4 times
> > (160Mb/sec). I think this is bug.
> > What is yours configuration?
> >
> My loader.conf(AMD64):
>
> vfs.zfs.prefetch_disable=1
> vm.kmem_size_max="1073741824"
> vm.kmem_size="1073741824"
>
> I'll try setting it to 999M instead, thanks!
>

I just tried setting it to 999M for kicks (amd64) and saw no significant
speed up doing a ``make -j4 buildworld''. I only saw a time difference
of 20 seconds between 1024M and 999M, which is just in the noise.

---
Gary Jennejohn

From engywook at gmail.com Sun May 4 14:57:29 2008
From: engywook at gmail.com (Daniel Andersson)
Date: Sun May 4 14:57:32 2008
Subject: Choppy performance.
In-Reply-To: <20080504160802.745afcc2@peedub.jennejohn.org>
References: <24adbbc00804151529m2a74085ds468eaac55ba94a32@mail.gmail.com>
    <200804162212.32560.yalur@mail.ru>
    <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com>
    <200805031855.43218.yalur@mail.ru>
    <24adbbc00805040203nbe92a2u140586abbbeb4a73@mail.gmail.com>
    <20080504160802.745afcc2@peedub.jennejohn.org>
Message-ID: <24adbbc00805040757k32a1795fw8057bb7df0812452@mail.gmail.com>

> I just tried setting it to 999M for kicks (amd64) and saw no significant
> speed up doing a ``make -j4 buildworld''. I only saw a time difference
> of 20 seconds between 1024M and 999M, which is just in the noise.
>
> ---
> Gary Jennejohn

What kind of speeds do you get when/if you ftp stuff? I seem to have
trouble getting above 30MB/s.
What kind of setup do you have? How many disks? raidz?

Cheers,
Daniel

From yalur at mail.ru Sun May 4 15:11:14 2008
From: yalur at mail.ru (Ruslan Kovtun)
Date: Sun May 4 15:11:18 2008
Subject: Choppy performance.
In-Reply-To: <20080504160802.745afcc2@peedub.jennejohn.org>
References: <24adbbc00804151529m2a74085ds468eaac55ba94a32@mail.gmail.com>
    <24adbbc00805040203nbe92a2u140586abbbeb4a73@mail.gmail.com>
    <20080504160802.745afcc2@peedub.jennejohn.org>
Message-ID: <200805041811.11266.yalur@mail.ru>

On Sunday 04 May 2008 Gary Jennejohn wrote:
> On Sun, 4 May 2008 11:03:30 +0200
>
> "Daniel Andersson" wrote:
> > > vm.kmem_size_max="1073741824"
> > > vm.kmem_size="1073741824"
> > > KVA_PAGES=512
> > > Write speed in 8 disks (raidz) is 40 Mb/sec and very choppy.
> > > If I change to vm.kmem_size_max="999M", write speed increase in 4
> > > times (160Mb/sec). I think this is bug.
> > > What is yours configuration?
> >
> > My loader.conf(AMD64):
> >
> > vfs.zfs.prefetch_disable=1
> > vm.kmem_size_max="1073741824"
> > vm.kmem_size="1073741824"
> >
> > I'll try setting it to 999M instead, thanks!
>
> I just tried setting it to 999M for kicks (amd64) and saw no significant
> speed up doing a ``make -j4 buildworld''. I only saw a time difference
> of 20 seconds between 1024M and 999M, which is just in the noise.

So this bug is related to i386 only.

>
> ---
> Gary Jennejohn

From linimon at FreeBSD.org Mon May 5 01:39:36 2008
From: linimon at FreeBSD.org (linimon@FreeBSD.org)
Date: Mon May 5 01:39:38 2008
Subject: kern/122888: [zfs] zfs hang w/ prefetch on, zil off while running transmission-daemon
Message-ID: <200805050139.m451daRx098247@freefall.freebsd.org>

Synopsis: [zfs] zfs hang w/ prefetch on, zil off while running transmission-daemon

Responsible-Changed-From-To: freebsd-amd64->freebsd-fs
Responsible-Changed-By: linimon
Responsible-Changed-When: Mon May 5 01:39:18 UTC 2008
Responsible-Changed-Why:
Reclassify.

http://www.freebsd.org/cgi/query-pr.cgi?pr=122888

From bugmaster at FreeBSD.org Mon May 5 11:07:04 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon May 5 11:07:10 2008
Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org
Message-ID: <200805051107.m45B73XT070682@freefall.freebsd.org>

Current FreeBSD problem reports
Critical problems
Serious problems

S Tracker      Resp.  Description
--------------------------------------------------------------------------------
o kern/112658  fs     [smbfs] [patch] smbfs and caching problems (resolves b
o kern/114676  fs     [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o kern/116170  fs     [panic] Kernel panic when mounting /tmp
o bin/121072   fs     [smbfs] mount_smbfs(8) cannot normally convert the cha
o bin/122172   fs     [amd] [fs]: amd(8) automount daemon dies on 6.3-STABLE
o kern/122888  fs     [zfs] zfs hang w/ prefetch on, zil off while running t

6 problems total.

Non-critical problems

S Tracker      Resp.  Description
--------------------------------------------------------------------------------
o bin/113049   fs     [patch] [request] make quot(8) use getopt(3) and show
o bin/113838   fs     [patch] [request] mount(8): add support for relative p
o bin/114468   fs     [patch] [request] add -d option to umount(8) to detach
o kern/114847  fs     [ntfs] [patch] [request] dirmask support for NTFS ala
o kern/114955  fs     [cd9660] [patch] [request] support for mask,dirmask,ui
o bin/118249   fs     mv(1): moving a directory changes its mtime

6 problems total.

From anderson at freebsd.org Mon May 5 13:21:37 2008
From: anderson at freebsd.org (Eric Anderson)
Date: Mon May 5 13:21:42 2008
Subject: Consistent inodes between distinct machines
In-Reply-To: <20080503125050.GG40730@cicely12.cicely.de>
References: <48070DCF.9090902@fsn.hu>
    <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org>
    <20080503125050.GG40730@cicely12.cicely.de>
Message-ID:

On May 3, 2008, at 7:50 AM, Bernd Walter wrote:

> On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
>> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
>>
>>> Hello,
>>>
>>> I have several NFS servers, where the service must be available
>>> 0-24. The servers are mounted read only on the clients and I've
>>> solved the problem of maintaining consistent inodes between them by
>>> rsyncing an UFS image and mounting it via md on the NFS servers.
>>> The machines have a common IP address with CARP, so if one of them
>>> falls out, the other(s) can take over.
>>>
>>> This works nice, but rsyncing multi gigabyte files are becoming more
>>> and more annoying, so I've wondered whether it would be possible to
>>> get constant inodes between machines via alternative ways.
>>
>>
>> Why not avoid syncing multi-gigabyte files by splitting your huge FS
>> image into many smaller say 512MB files, then use md and geom concat/
>> stripe/etc to make them all one image that you mount?
>
> Where would be the positive effect by doing this?
> FFS distributes data over the media, so all the small files changes
> in almost every case and you have to checksum-compare the whole
> virtual
> disk anyway.
> With multiple files the syncing is more complex. For example a normal
> rsync run can garantie that you get a complete file synced or none
> at all, but this doesn't work out of the box with multiple files, so
> you risk half updated data.

The positive effect comes when each image file is smaller than the
cylinder group size, so that not every image file changes on an update.
The smaller the image files, the better the efficiency, but the more
difficult the concat becomes.

Another possibility is to mirror the devices over a ggate or iSCSI link.

Eric

From peter.schuller at infidyne.com Thu May 8 19:26:54 2008
From: peter.schuller at infidyne.com (Peter Schuller)
Date: Thu May 8 19:26:58 2008
Subject: zfs and vfs.zfs.prefetch_disable="1"
In-Reply-To: <9bbcef730804280145x6961c43ekab916ec289396361@mail.gmail.com>
References: <48142ABE.4050107@psg.com>
	<20080428083133.GA81628@eos.sc1.parodius.com>
	<9bbcef730804280145x6961c43ekab916ec289396361@mail.gmail.com>
Message-ID: <200805082111.12655.peter.schuller@infidyne.com>

> The change you reference is apparently about zil_disable, which wasn't
> removed, just moved. But the prefetch_disable setting was added and
> removed a couple of times to the page, the latest being that it was
> added since Pawel uses it in his post.

I don't know whether this has changed, but at least some people reported
that prefetch caused very poor streaming performance. I had the same
problem (e.g., more or less unable to play a movie) until I disabled it.

I have, however, not tried re-enabling it with newer versions (this was
from the earlier days of ZFS in current).
-- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller ' Key retrieval: Send an E-Mail to getpgpkey@scode.org E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part. Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20080508/4a54b3c4/attachment.pgp From mm at FreeBSD.org Sat May 10 05:18:00 2008 From: mm at FreeBSD.org (Martin Matuska) Date: Sat May 10 05:18:05 2008 Subject: ZFS lockup in "zfs" state Message-ID: <48252C89.8@FreeBSD.org> Hi, I just experienced the same lockup in zfs state as other people did (Ivan Voras, Peter Schuller) - UFS filesystems still intact. There was heavy backup tar/gzip activity on the filesystem (read-only, write was to NFS) and lots of reads via NFS, the server was doing this job without problems for 11 days. FreeBSD: 7-STABLE 2008-04-29 ARCH: amd64 RAM: 6 GB ZFS: prefetch disabled vm.kmem_size_max="1073741824" From bugmaster at FreeBSD.org Mon May 12 11:06:56 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon May 12 11:07:01 2008 Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org Message-ID: <200805121106.m4CB6uWN037986@freefall.freebsd.org> Current FreeBSD problem reports Critical problems Serious problems S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o kern/116170 fs [panic] Kernel panic when mounting /tmp o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o bin/122172 fs [amd] [fs]: amd(8) automount daemon dies on 6.3-STABLE o kern/122888 fs [zfs] zfs hang w/ prefetch on, zil off while running t 6 problems total. Non-critical problems S Tracker Resp. 
Description -------------------------------------------------------------------------------- o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o bin/118249 fs mv(1): moving a directory changes its mtime 6 problems total. From avg at icyb.net.ua Tue May 13 07:56:21 2008 From: avg at icyb.net.ua (Andriy Gapon) Date: Tue May 13 07:56:26 2008 Subject: handling of EGAIN from softdep_check_suspend (gjournal) Message-ID: <4829499B.6080007@icyb.net.ua> As being reported from time to time, sometimes there is a following error message produced by gjournal activity: kernel: fsync: giving up on dirty kernel: 0xc32b8bb0: tag devfs, type VCHR kernel: usecount 1, writecount 0, refcount 50 mountedhere 0xc323d200 kernel: flags () kernel: v_object 0xc10499b0 ref 0 pages 451 kernel: lock type devfs: EXCL (count 1) by thread 0xc3208000 (pid 39) kernel: dev ad4s1e.journal kernel: GEOM_JOURNAL: Cannot suspend file system /export (error=35).
errno 35 is EAGAIN/EWOULDBLOCK and it is returned from
vfs_write_suspend on FFS.
The only place where this return code is present in the whole FFS/UFS
code is softdep_check_suspend. The comment in the function says the
following (for the non-softupdates case):
	/*
	 * Reasons for needing more work before suspend:
	 * - Dirty buffers on devvp.
	 * - Secondary writes occurred after start of vnode sync loop
	 */

I wonder what the recommended handling of this return code is.
Maybe we should try 'AGAIN' instead of just giving up immediately?

-- 
Andriy Gapon

From fbsd-fs at mawer.org Tue May 13 22:36:11 2008
From: fbsd-fs at mawer.org (Antony Mawer)
Date: Tue May 13 22:36:14 2008
Subject: BLUFFS update?
Message-ID: <482A10AA.1030006@mawer.org>

Hi -fs/ups@,

Is there any word on the status of BLUFFS, first described here:

http://2007.asiabsdcon.org/papers/P11-slides.pdf

It was discussed as originally being ready for testing Q1-Q2 2007, about
12 months back... I was just hoping there may be a status update as I am
trying to decide what options to take to avoid the painful fsck times we
are presently dealing with.

Gjournal is one option, but seems more of a hack than a proper solution,
but if it's all that's likely to be available for the foreseeable future
it may be the only option...

Cheers
Antony

From stas at FreeBSD.org Wed May 14 17:25:19 2008
From: stas at FreeBSD.org (Stanislav Sedov)
Date: Wed May 14 17:25:48 2008
Subject: ZFS lockup in "zfs" state
In-Reply-To: <48252C89.8@FreeBSD.org>
References: <48252C89.8@FreeBSD.org>
Message-ID: <20080514205615.87d3f1b7.stas@FreeBSD.org>

On Sat, 10 May 2008 07:03:05 +0200
Martin Matuska mentioned:

> Hi,
>
> I just experienced the same lockup in zfs state as other people did
> (Ivan Voras, Peter Schuller) - UFS filesystems still intact.
> There was heavy backup tar/gzip activity on the filesystem (read-only,
> write was to NFS) and lots of reads via NFS, the server was doing this
> job without problems for 11 days.
> The following patch, published some time ago by pjd helped me: http://mbsd.msk.ru/dist/zfs_lockup.diff 100+ days of uptime of heavily loaded machines and no problems so far. Hope it would help. -- Stanislav Sedov ST4096-RIPE From lists at thefrog.net Sun May 18 07:11:47 2008 From: lists at thefrog.net (Andrew Hill) Date: Sun May 18 07:11:52 2008 Subject: ZFS lockup in "zfs" state Message-ID: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net> > The following patch, published some time ago by pjd helped me: > http://mbsd.msk.ru/dist/zfs_lockup.diff > > 100+ days of uptime of heavily loaded machines and no problems so far. > > Hope it would help. I applied this patch with some modifications to fix up the file names as they seem to have moved from - src/sys/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h - src/sys/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c - src/sys/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c to - src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h - src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c - src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (and pointed the kernel configuration file, MASSHOSTING_7_64, to my own kernel config) buildworld and buildkernel succeeded without error, but when i installed the new kernel and rebooted i got the following output (the important point being the failure to load zfs on the 8th line) May 17 17:02:06 <0.2> gutter kernel: Copyright (c) 1992-2008 The FreeBSD Project. May 17 17:02:06 <0.2> gutter kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 May 17 17:02:06 <0.2> gutter kernel: The Regents of the University of California. All rights reserved. May 17 17:02:06 <0.2> gutter kernel: FreeBSD is a registered trademark of The FreeBSD Foundation. 
May 17 17:02:06 <0.2> gutter kernel: FreeBSD 7.0-STABLE #6: Sat May 17 16:39:32 EST 2008 May 17 17:02:06 <0.2> gutter kernel: root@gutter.thefrog.net:/usr/obj/ usr/src/sys/GUTTER May 17 17:02:06 <0.2> gutter kernel: link_elf_obj: symbol kproc_exit undefined May 17 17:02:06 <0.2> gutter kernel: KLD file zfs.ko - could not finalize loading May 17 17:02:06 <0.2> gutter kernel: Timecounter "i8254" frequency 1193182 Hz quality 0 May 17 17:02:06 <0.2> gutter kernel: CPU: AMD Athlon(tm) 64 Processor 3200+ (2010.31-MHz K8-class CPU) May 17 17:02:06 <0.2> gutter kernel: Origin = "AuthenticAMD" Id = 0x10ff0 Stepping = 0 May 17 17:02:06 <0.2> gutter kernel: Features =0x78bfbff May 17 17:02:06 <0.2> gutter kernel: AMD Features=0xe2500800 May 17 17:02:06 <0.2> gutter kernel: AMD Features2=0x1 May 17 17:02:06 <0.2> gutter kernel: usable memory = 2137882624 (2038 MB) May 17 17:02:06 <0.2> gutter kernel: avail memory = 2060988416 (1965 MB) May 17 17:02:06 <0.2> gutter kernel: ACPI APIC Table: May 17 17:02:06 <0.2> gutter kernel: ioapic0 irqs 0-23 on motherboard May 17 17:02:06 <0.2> gutter kernel: ad0: 238475MB at ata0-master UDMA100 May 17 17:02:06 <0.2> gutter kernel: ad2: 238475MB at ata1-master UDMA100 May 17 17:02:06 <0.2> gutter kernel: ad3: 152627MB at ata1-slave UDMA100 May 17 17:02:06 <0.2> gutter kernel: ad4: 476940MB at ata2-master SATA300 May 17 17:02:06 <0.2> gutter kernel: ad6: 715404MB at ata3-master SATA300 May 17 17:02:06 <0.2> gutter kernel: ad8: 305245MB at ata4-master SATA300 May 17 17:02:06 <0.2> gutter kernel: ad10: 305245MB at ata5-master SATA300 May 17 17:02:06 <0.2> gutter kernel: ad12: 305245MB at ata6-master SATA150 May 17 17:02:06 <0.2> gutter kernel: Trying to mount root from zfs:tank/root May 17 17:02:06 <0.2> gutter kernel: May 17 17:02:06 <0.2> gutter kernel: Manual root filesystem specification: May 17 17:02:06 <0.2> gutter kernel: : Mount using filesystem May 17 17:02:06 <0.2> gutter kernel: eg. ufs:da0s1a May 17 17:02:06 <0.2> gutter kernel: ? 
List valid disk boot devices May 17 17:02:06 <0.2> gutter kernel: Abort manual input May 17 17:02:06 <0.2> gutter kernel: May 17 17:02:06 <0.2> gutter kernel: mountroot> at this point, since zfs has not been loaded, obviously i could not get it to mount root from zfs:tank/root, and resorted to a backup ufs root to put my old kernel back in place i'm not sure if there is more output available than just the "could not finalize loading", if so please let me know where to look and i'd love to re-test this patch if it'll provide more information right now, i'm getting uptimes in the order of days before everything locks up, i assume its related to this bug, though i'm also getting the following output when it locks up ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631 ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=234920650 ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=443427007 ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350174938 ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350494631 ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=234920650 ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=443427007 ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350174938 ad2: FAILURE - WRITE_DMA48 timed out LBA=350494631 ad0: FAILURE - WRITE_DMA timed out LBA=234920650 ad2: FAILURE - WRITE_DMA48 timed out LBA=443427007 ad0: FAILURE - WRITE_DMA48 timed out LBA=350174938 typically repeated for a number of different LBA values before the system panics. I don't know if this is more likely to be related to the cause of the lockups (e.g. faulty hardware/driver) or if its an effect of the lockup (e.g. waiting on a deadlocked thread)... from what i've found searching mailing lists, this kind of error seems to turn up with faulty hardware/drivers so i guess it could just be that zfs exposes the faults because its using the hardware differently to my previous ufs setup... 
in terms of my specific setup, i have 2gb ram, i'm running from
up-to-date -STABLE source (apart from my attempt to apply the
aforementioned patch), i'm running an amd64 kernel, and my
/boot/loader.conf looks like this:

vm.kmem_size_max="1610612736"
vm.kmem_size="1610612736"
zfs_load="YES"
vfs.root.mountfrom="zfs:tank/root"
vfs.zfs.prefetch_disable="1"
vfs.zfs.arc_max="838860800"

the last line was an attempt to reduce the amount of arc cache in the
kernel in case it was having trouble locating memory blocks for other
things (as the default value had it at 1.2gb) but adding that parameter
doesn't seem to have had any effect

anyway, any info toward resolving this would be greatly appreciated, and
otherwise let me know what further info i can provide to help track down
the problem

Andrew

From ighighi at gmail.com Sun May 18 12:33:59 2008
From: ighighi at gmail.com (Ighighi Ighighi)
Date: Sun May 18 12:34:24 2008
Subject: Incorrect handling of UF_IMMUTABLE & UF_APPEND flags on EXT2FS
Message-ID:

Almost 2 months have passed since I submitted this PR through GNATS
(which cc'd it to freebsd-bugs), so I thought that maybe I should
forward it to this list so it gets the attention it deserves:

http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/122047
http://lists.freebsd.org/pipermail/freebsd-bugs/2008-March/029740.html

Some errata notes:
The bug may be present in REISERFS, but there's no write support anyway.
Salutes, Igh From koitsu at FreeBSD.org Sun May 18 12:42:17 2008 From: koitsu at FreeBSD.org (Jeremy Chadwick) Date: Sun May 18 12:42:21 2008 Subject: ZFS lockup in "zfs" state In-Reply-To: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net> References: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net> Message-ID: <20080518124217.GA16222@eos.sc1.parodius.com> On Sun, May 18, 2008 at 05:11:37PM +1000, Andrew Hill wrote: > right now, i'm getting uptimes in the order of days before everything locks > up, i assume its related to this bug, though i'm also getting the following > output when it locks up > > ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631 > ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=234920650 > ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=443427007 > ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350174938 > ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350494631 > ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=234920650 > ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=443427007 > ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350174938 > ad2: FAILURE - WRITE_DMA48 timed out LBA=350494631 > ad0: FAILURE - WRITE_DMA timed out LBA=234920650 > ad2: FAILURE - WRITE_DMA48 timed out LBA=443427007 > ad0: FAILURE - WRITE_DMA48 timed out LBA=350174938 I've documented this fairly well, although I suppose I could write up a diagnosis method as an addendum. Anyway: http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues One thing: are the timeouts always on ad0 and ad2? > typically repeated for a number of different LBA values before the system > panics. I don't know if this is more likely to be related to the cause of > the lockups (e.g. faulty hardware/driver) or if its an effect of the lockup > (e.g. waiting on a deadlocked thread)... 
from what i've found searching > mailing lists, this kind of error seems to turn up with faulty > hardware/drivers so i guess it could just be that zfs exposes the faults > because its using the hardware differently to my previous ufs setup... It is possible you have some bad hardware, but there are many of us who have seen the above (with or without ZFS) on perfectly good hardware. For some, changing cables fixed the problem, while for others absolutely nothing fixed it (changed cables, changed controller brands, changed to new disks). If the DMA timeouts are easily reproducable, please get in touch with Scott Long , who is in the process of researching why these happen. Serial console access might be required. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From lists at thefrog.net Sun May 18 15:12:02 2008 From: lists at thefrog.net (Andrew Hill) Date: Sun May 18 15:12:06 2008 Subject: ZFS lockup in "zfs" state In-Reply-To: <20080518124217.GA16222@eos.sc1.parodius.com> References: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net> <20080518124217.GA16222@eos.sc1.parodius.com> Message-ID: <93F07874-8D5F-44AE-945F-803FFC3B9279@thefrog.net> On 18/05/2008, at 10:42 PM, Jeremy Chadwick wrote: > One thing: are the timeouts always on ad0 and ad2? 
firstly, some relevant output from my dmesg atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 6.0 on pci0 atapci1: port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xcc00-0xcc0f mem 0xf4005000-0xf4005fff irq 21 at device 7.0 on pci0 atapci2: port 0x9e0-0x9e7,0xbe0-0xbe3,0x960-0x967,0xb60-0xb63,0xe000-0xe00f mem 0xf4000000-0xf4000fff irq 22 at device 8.0 on pci0 atapci3: port 0x8400-0x8407,0x8800-0x8803,0x8c00-0x8c07,0x9000-0x9003,0x9400-0x940f mem 0xf1004000-0xf10043ff irq 17 at device 9.0 on pci1 ad0: 238475MB at ata0-master UDMA100 ad2: 238475MB at ata1-master UDMA100 ad3: 152627MB at ata1-slave UDMA100 ad4: 476940MB at ata2-master SATA300 ad6: 715404MB at ata3-master SATA300 ad8: 305245MB at ata4-master SATA300 ad10: 305245MB at ata5-master SATA300 ad12: 305245MB at ata6-master SATA150 and to answer the question, no, i get timeouts on ad0, 2, 4, 6, 8, 10 and 12, but when they occur its always 1 or 2 disks... for various reasons (primarily focusing on space and low-cost, not performance) i have a 7 disk raidz covering a 250GB slice on each of the above 7 disks, and i've made two more zpools from the remaining space on the drives - and yes, i realise this is a bit of a mess and anyone who's set up any kind of production raid would be appalled, but the aim was to make use of some old disks moreso than to have a fast/ clean setup ad0,2,3 are on the nvidia (southbridge) ata controller ad4,6,8,10 are on the nvidia (southbridge) sata controller ad12 is on the SiI 3114 controller so perhaps i can contribute something useful here because of my (odd) set up? my timeouts aren't limited to any one drive/controller/connector-type - i've had timeouts on all 7 of the drives in the raidz (i've yet to see a timeout on ad3 but that disk is rarely accessed so i'm not entirely surprised) i tend to find that the timeouts occur on one or two disks at once - e.g. ad0 and 2 will complain of timeouts, and the system locks up shortly thereafter... 
the pairs seem to be grouped by the ata controller... which is to say, i often get ad0 and 2 timeouts together, or two of ad4,6,8,10, or 12 on its own... i'm not 100% sure as i've not recorded the pairs each time, but it seems like there's a strong correlation between the drives giving timeouts and the controller they're running on. this might imply its a bug in the controller driver? or it might simply be an effect of the timing of the writes at some level... this correlation seems interesting though, and i've only just noticed it so i'll be keeping track of future timeouts to see if they consistently pair up within a controller there is the obvious power question (8 drives in a standard PC case... my initial guess was power) but i've hooked up a (Fluke 111) multimeter to log the 5 and 12V rails going to the drives, and its been a steady 5.4 and 12.3 V (including during a timeout and lockup) - these both varied by less than 0.1V over fairly long test periods - so i don't think its power, but i'm willing to keep testing anything... i've also run memtest86 on the ram fearing that might have been the cause... > It is possible you have some bad hardware, but there are many of us > who > have seen the above (with or without ZFS) on perfectly good hardware. > For some, changing cables fixed the problem, while for others > absolutely > nothing fixed it (changed cables, changed controller brands, changed > to > new disks). i'm inclined to think that the disks/cables themselves are good (given the timeouts aren't specific to one disk) and given the ram is okay (from the memtest at least), and the timeouts are occurring on multiple controllers, i think this suggests the controllers are probably okay... (i guess it could be in the northbridge or bus still...) > If the DMA timeouts are easily reproducable, please get in touch with > Scott Long , who is in the process of researching > why > these happen. Serial console access might be required. 
will do, thanks for the contacts/wiki page (: Andrew From bugmaster at FreeBSD.org Mon May 19 11:06:52 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon May 19 11:07:05 2008 Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org Message-ID: <200805191106.m4JB6pPA011555@freefall.freebsd.org> Current FreeBSD problem reports Critical problems Serious problems S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o kern/116170 fs [panic] Kernel panic when mounting /tmp o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o bin/122172 fs [amd] [fs]: amd(8) automount daemon dies on 6.3-STABLE o kern/122888 fs [zfs] zfs hang w/ prefetch on, zil off while running t 6 problems total. Non-critical problems S Tracker Resp. Description -------------------------------------------------------------------------------- o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o bin/118249 fs mv(1): moving a directory changes its mtime 6 problems total. From peter.schuller at infidyne.com Mon May 19 20:30:52 2008 From: peter.schuller at infidyne.com (Peter Schuller) Date: Mon May 19 20:31:01 2008 Subject: ZFS lockup in "zfs" state In-Reply-To: <48252C89.8@FreeBSD.org> References: <48252C89.8@FreeBSD.org> Message-ID: <200805192231.46561.peter.schuller@infidyne.com> > I just experienced the same lockup in zfs state as other people did > (Ivan Voras, Peter Schuller) - UFS filesystems still intact. 
> There was heavy backup tar/gzip activity on the filesystem (read-only, > write was to NFS) and lots of reads via NFS, the server was doing this > job without problems for 11 days. FWIW, I've seen it a few more times on two different machines. Both running semi-new FreeBSD (I still don't think I ever saw this on earlier CURRENT:s). In the case of both machines, the machine is only selectively hung. Possibly limited to the zfs file system - definitely not global to the pool. A remote reboot -q -n has been useful to recover without console access. In both of these cases, more or less all activity of any amount is on ZFS file systems. One of them has only ZFS except for swap on a UFS file system. The other has root on UFS, but no bulk operations whatsoever happening (beyond the usual periodics) except on ZFS. No NFS on either machine. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller ' Key retrieval: Send an E-Mail to getpgpkey@scode.org E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part. Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20080519/8e447d07/attachment.pgp From delphij at delphij.net Sat May 24 01:09:02 2008 From: delphij at delphij.net (Xin LI) Date: Sat May 24 01:09:04 2008 Subject: vfs.lookup_shared Message-ID: <48376AA3.6090205@delphij.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, Is there any reason behind we don't have vfs.lookup_shared enabled by default? Cheers, - -- ** Help China's quake relief at http://www.redcross.org.cn/ |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! 
-----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (FreeBSD) iEYEARECAAYFAkg3aqIACgkQi+vbBBjt66B+kgCeN+tTUiiLkPGLsAVNkrJtgSe2 SKwAoKS15lX6IvL+9ej+ys5H2XKz3GpB =gqHL -----END PGP SIGNATURE----- From jroberson at jroberson.net Sat May 24 03:38:33 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Sat May 24 03:38:37 2008 Subject: vfs.lookup_shared In-Reply-To: <48376AA3.6090205@delphij.net> References: <48376AA3.6090205@delphij.net> Message-ID: <20080523171509.K954@desktop> On Fri, 23 May 2008, Xin LI wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, > > Is there any reason behind we don't have vfs.lookup_shared enabled by > default? We have discussed enabling it by default once the ffs shared lookup support is complete. Unfortunately ffs is still not 100% reliable. I want to verify that it's an ffs problem and not a problem with the vfs generic code which would effect all filesystems before we enable it by default. Jeff > > Cheers, > - -- > ** Help China's quake relief at http://www.redcross.org.cn/ > |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Xin LI http://www.delphij.net/ > FreeBSD - The Power to Serve! > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.9 (FreeBSD) > > iEYEARECAAYFAkg3aqIACgkQi+vbBBjt66B+kgCeN+tTUiiLkPGLsAVNkrJtgSe2 > SKwAoKS15lX6IvL+9ej+ys5H2XKz3GpB > =gqHL > -----END PGP SIGNATURE----- > From kris at FreeBSD.org Sat May 24 08:52:24 2008 From: kris at FreeBSD.org (Kris Kennaway) Date: Sat May 24 08:52:26 2008 Subject: vfs.lookup_shared In-Reply-To: <20080523171509.K954@desktop> References: <48376AA3.6090205@delphij.net> <20080523171509.K954@desktop> Message-ID: <20080524085224.GL20868@hub.freebsd.org> On Fri, May 23, 2008 at 05:16:16PM -1000, Jeff Roberson wrote: > On Fri, 23 May 2008, Xin LI wrote: > > >-----BEGIN PGP SIGNED MESSAGE----- > >Hash: SHA1 > > > >Hi, > > > >Is there any reason behind we don't have vfs.lookup_shared enabled by > >default? 
> > We have discussed enabling it by default once the ffs shared lookup > support is complete. Unfortunately ffs is still not 100% reliable. I > want to verify that it's an ffs problem and not a problem with the vfs > generic code which would effect all filesystems before we enable it by > default. Also, until Attilio's recent lockmgr work, shared lockmgr locks were starving exclusive lockmgr lock requests, leading to performance problems on some workloads with the only filesystem that supported shared locking (NFS). This is now fixed though. Kris -- In God we Trust -- all others must submit an X.509 certificate. -- Charles Forsythe From pjd at FreeBSD.org Sun May 25 22:34:22 2008 From: pjd at FreeBSD.org (Pawel Jakub Dawidek) Date: Sun May 25 22:34:56 2008 Subject: vfs.lookup_shared In-Reply-To: <20080524085224.GL20868@hub.freebsd.org> References: <48376AA3.6090205@delphij.net> <20080523171509.K954@desktop> <20080524085224.GL20868@hub.freebsd.org> Message-ID: <20080525221444.GA8103@garage.freebsd.pl> On Sat, May 24, 2008 at 08:52:24AM +0000, Kris Kennaway wrote: > > On Fri, May 23, 2008 at 05:16:16PM -1000, Jeff Roberson wrote: > > On Fri, 23 May 2008, Xin LI wrote: > > > > >-----BEGIN PGP SIGNED MESSAGE----- > > >Hash: SHA1 > > > > > >Hi, > > > > > >Is there any reason behind we don't have vfs.lookup_shared enabled by > > >default? > > > > We have discussed enabling it by default once the ffs shared lookup > > support is complete. Unfortunately ffs is still not 100% reliable. I > > want to verify that it's an ffs problem and not a problem with the vfs > > generic code which would effect all filesystems before we enable it by > > default. > > Also, until Attilio's recent lockmgr work, shared lockmgr locks were > starving exclusive lockmgr lock requests, leading to performance > problems on some workloads with the only filesystem that supported > shared locking (NFS). This is now fixed though. ZFS also supports shared locking and shared lookups. 
I've vfs.lookup_shared=1 on my ZFS-only laptop for more than a year now and never had a problem. Although it is only a laptop... -- Pawel Jakub Dawidek http://www.wheel.pl pjd@FreeBSD.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20080525/dc01c457/attachment.pgp From bugmaster at FreeBSD.org Mon May 26 11:06:47 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon May 26 11:07:07 2008 Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org Message-ID: <200805261106.m4QB6lOH064876@freefall.freebsd.org> Current FreeBSD problem reports Critical problems Serious problems S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o kern/116170 fs [panic] Kernel panic when mounting /tmp o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o bin/122172 fs [amd] [fs]: amd(8) automount daemon dies on 6.3-STABLE o kern/122888 fs [zfs] zfs hang w/ prefetch on, zil off while running t 6 problems total. Non-critical problems S Tracker Resp. Description -------------------------------------------------------------------------------- o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o bin/118249 fs mv(1): moving a directory changes its mtime 6 problems total. 
From ighighi at gmail.com Thu May 29 06:46:36 2008
From: ighighi at gmail.com (Ighighi Ighighi)
Date: Thu May 29 06:46:39 2008
Subject: kern/122047: [ext2fs] incorrect handling of UF_IMMUTABLE / UF_APPEND flag on EXT2FS (maybe others)
Message-ID:

See attached patch.
-------------- next part --------------
#
# (!c) 2008 by Ighighi
#
# See http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/122047
#
# This patch adds a "vfs.e2fs.userflags" sysctl to permit regular users
# to set/clear the APPEND/IMMUTABLE filesystem flags on EXT2 filesystems.
# As a bonus, it also sets st_birthtime to zero.
#
# Built and tested on FreeBSD 6.3-STABLE (RELENG_6).
# Known to patch on -CURRENT
#
# To install, run as root:
#   /sbin/umount -v -t ext2fs -a
#   /sbin/kldunload -v ext2fs
#   /usr/bin/patch -d /usr < /path/to/ext2fs.patch
#   cd /sys/modules/ext2fs/
#   make clean obj depend && make && make install
#   /sbin/kldload -v ext2fs
#   /sbin/sysctl vfs.e2fs.userflags=1    # 0 is default
#   /sbin/mount -v -t ext2fs -a
#

--- src/sys/gnu/fs/ext2fs/ext2_inode_cnv.c.orig	2005-06-14 22:36:10.000000000 -0400
+++ src/sys/gnu/fs/ext2fs/ext2_inode_cnv.c	2008-05-28 15:15:27.527318854 -0430
@@ -30,11 +30,19 @@
 #include
 #include
 #include
+#include
+#include
 
 #include
 #include
 #include
 
+SYSCTL_DECL(_vfs_e2fs);
+
+static int userflags = 0;
+SYSCTL_INT(_vfs_e2fs, OID_AUTO, userflags, CTLFLAG_RW,
+    &userflags, 0, "Users may set/clear filesystem flags");
+
 void
 ext2_print_inode( in )
 	struct inode *in;
@@ -83,8 +91,17 @@ ext2_ei2i(ei, ip)
 	ip->i_mtime = ei->i_mtime;
 	ip->i_ctime = ei->i_ctime;
 	ip->i_flags = 0;
-	ip->i_flags |= (ei->i_flags & EXT2_APPEND_FL) ? APPEND : 0;
-	ip->i_flags |= (ei->i_flags & EXT2_IMMUTABLE_FL) ? IMMUTABLE : 0;
+	if (userflags) {
+		if (ei->i_flags & EXT2_APPEND_FL)
+			ip->i_flags |= UF_APPEND;
+		if (ei->i_flags & EXT2_IMMUTABLE_FL)
+			ip->i_flags |= UF_IMMUTABLE;
+	} else {
+		if (ei->i_flags & EXT2_APPEND_FL)
+			ip->i_flags |= APPEND;
+		if (ei->i_flags & EXT2_IMMUTABLE_FL)
+			ip->i_flags |= IMMUTABLE;
+	}
 	ip->i_blocks = ei->i_blocks;
 	ip->i_gen = ei->i_generation;
 	ip->i_uid = ei->i_uid;
--- src/sys/gnu/fs/ext2fs/ext2_lookup.c.orig	2006-01-04 15:32:00.000000000 -0400
+++ src/sys/gnu/fs/ext2fs/ext2_lookup.c	2008-05-28 13:35:16.841349269 -0430
@@ -66,7 +66,7 @@ static int dirchk = 1;
 static int dirchk = 0;
 #endif
 
-static SYSCTL_NODE(_vfs, OID_AUTO, e2fs, CTLFLAG_RD, 0, "EXT2FS filesystem");
+SYSCTL_NODE(_vfs, OID_AUTO, e2fs, CTLFLAG_RW, 0, "EXT2FS filesystem");
 SYSCTL_INT(_vfs_e2fs, OID_AUTO, dircheck, CTLFLAG_RW, &dirchk, 0, "");
 
 /*
--- src/sys/gnu/fs/ext2fs/ext2_vnops.c.orig	2006-02-19 20:53:14.000000000 -0400
+++ src/sys/gnu/fs/ext2fs/ext2_vnops.c	2008-05-28 07:58:02.189157441 -0430
@@ -358,6 +358,8 @@ ext2_getattr(ap)
 	vap->va_mtime.tv_nsec = ip->i_mtimensec;
 	vap->va_ctime.tv_sec = ip->i_ctime;
 	vap->va_ctime.tv_nsec = ip->i_ctimensec;
+	vap->va_birthtime.tv_sec = 0;
+	vap->va_birthtime.tv_nsec = 0;
 	vap->va_flags = ip->i_flags;
 	vap->va_gen = ip->i_gen;
 	vap->va_blocksize = vp->v_mount->mnt_stat.f_iosize;

From ighighi at gmail.com Thu May 29 06:56:37 2008
From: ighighi at gmail.com (Ighighi Ighighi)
Date: Thu May 29 06:56:45 2008
Subject: kern/122047: [ext2fs] incorrect handling of UF_IMMUTABLE / UF_APPEND flag on EXT2FS (maybe others)
Message-ID:

See attached patch.

From ighighi at gmail.com Thu May 29 07:37:07 2008
From: ighighi at gmail.com (Ighighi)
Date: Thu May 29 07:37:37 2008
Subject: kern/122047: [ext2fs] incorrect handling of UF_IMMUTABLE / UF_APPEND flag on EXT2FS (maybe others)
Message-ID: <483E5CFE.7030807@gmail.com>

See attached patch.
Gmail sucks at sending patches to GNATS =(

From pjd at FreeBSD.org Fri May 30 06:48:59 2008
From: pjd at FreeBSD.org (Pawel Jakub Dawidek)
Date: Fri May 30 06:49:03 2008
Subject: Analysis of disk file block with ZFS checksum error
In-Reply-To: <47B0A45C.4090909@skyrush.com>
References: <47ACD7D4.5050905@skyrush.com> <47ACDE82.1050100@skyrush.com>
	<20080208173517.rdtobnxqg4g004c4@www.wolves.k12.mo.us>
	<47ACF0AE.3040802@skyrush.com>
	<1202747953.27277.7.camel@buffy.york.ac.uk>
	<47B0A45C.4090909@skyrush.com>
Message-ID: <20080530064851.GA3596@garage.freebsd.pl>

On Mon, Feb 11, 2008 at 12:39:08PM -0700, Joe Peterson wrote:
> Gavin Atkinson wrote:
> > Are the datestamps (Thu Jan 24 23:20:58 2008) found within
the corrupt
> > block before or after the datestamp of the file it was found within?
> > i.e. was the corrupt block on the disk before or after the mp3 was
> > written there?
>
> Hi Gavin, those dates are later than the original copy (I do not have
> the file timestamps to prove this, but according to my email record, I
> am pretty sure of this). So the corrupt block is later than the
> original write.
>
> If this is the case, I assume that the block got written, by mistake,
> into the middle of the mp3 file. Someone else suggested that it could
> be caused by a bad transfer block number or bad drive command (corrupted
> on the way to the drive, since these are not checksummed in the
> hardware). If the block went to the wrong place, AND if it was a HW
> glitch, I suppose the best ZFS could then do is retry the write (if its
> failure was even detected - still not sure if ZFS does a re-check of the
> disk data checksum after the disk write), not knowing until the later
> scrub that the block had corrupted a file.

ZFS doesn't verify the checksum after a write; that would be pointless
for two reasons:

1. The read would most likely come from the disk cache and not from
   stable storage.
2. It would kill performance.

ZFS tests the checksum only on read. What you observe is either a
misdirected read/write (you asked to read/write sector X, but the data
was read from or written to sector Y) or a phantom write (you asked to
write, but the data never reached the disk, so you have old data there).

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
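The verify-on-read behaviour described above can be illustrated with a small, hedged sketch. This is not ZFS code: the in-memory `disk` array, `struct blkptr`, `toy_cksum`, `toy_write`, `toy_read` and the `enum fault` values are all hypothetical names made up for illustration, and the Fletcher-style checksum was chosen only for brevity. The one idea taken from the thread is that the expected checksum is kept apart from the data and compared only when the block is read back, so a phantom or misdirected write goes unnoticed until a later read or scrub.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define NBLOCKS    4

/* Toy "disk": a flat array of fixed-size sectors (assumption, not ZFS). */
static uint8_t disk[NBLOCKS][BLOCK_SIZE];

/* Hypothetical block pointer: target sector plus the expected checksum,
 * stored apart from the data it describes. */
struct blkptr {
	unsigned sector;
	uint32_t cksum;
};

/* Simple Fletcher-style checksum, used here only because it is short. */
static uint32_t
toy_cksum(const uint8_t *buf, size_t len)
{
	uint32_t a = 0, b = 0;

	for (size_t i = 0; i < len; i++) {
		a = (a + buf[i]) % 65535;
		b = (b + a) % 65535;
	}
	return ((b << 16) | a);
}

/* Simulated hardware faults. */
enum fault { FAULT_NONE, FAULT_PHANTOM, FAULT_MISDIRECTED };

/* Record the checksum in the block pointer and issue the write.
 * There is deliberately no read-back verification here. */
static void
toy_write(struct blkptr *bp, unsigned sector, const uint8_t *data,
    enum fault f)
{
	assert(sector < NBLOCKS);
	bp->sector = sector;
	bp->cksum = toy_cksum(data, BLOCK_SIZE);
	if (f == FAULT_PHANTOM)
		return;		/* write acknowledged, old data stays on disk */
	if (f == FAULT_MISDIRECTED)
		sector = (sector + 1) % NBLOCKS;	/* lands on wrong sector */
	memcpy(disk[sector], data, BLOCK_SIZE);
}

/* Read the block and verify it; returns 0 on match, -1 on mismatch. */
static int
toy_read(const struct blkptr *bp, uint8_t *out)
{
	memcpy(out, disk[bp->sector], BLOCK_SIZE);
	return (toy_cksum(out, BLOCK_SIZE) == bp->cksum ? 0 : -1);
}
```

In this sketch a misdirected write damages two blocks at once: the intended sector still holds stale data, and the sector that actually received the data no longer matches the checksum recorded for its rightful owner. That second effect mirrors the mp3 case discussed in the thread, where the damage was only found when the file's block was read during a scrub.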
From pjd at FreeBSD.org Fri May 30 06:54:44 2008
From: pjd at FreeBSD.org (Pawel Jakub Dawidek)
Date: Fri May 30 06:54:47 2008
Subject: ZFS panic when changing zfs dataset mountpoint
In-Reply-To: <2e77fc10802120304n32fd1c42m52e6bc617ba07c35@mail.gmail.com>
References: <2e77fc10802120304n32fd1c42m52e6bc617ba07c35@mail.gmail.com>
Message-ID: <20080530065439.GB3596@garage.freebsd.pl>

On Tue, Feb 12, 2008 at 01:04:38PM +0200, Niki Denev wrote:
> Hi,
>
> I got the following panic when trying to set/change the mountpoint
> property of a dataset.
>
> I did :
> # zfs set mountpoint=/usr/ports zfs2/ports
> and the machine crashed.
>
> The dataset had one snapshot taken.
>
> Here is what i was able to extract from the dump :
[...]
> #8 0xffffffff804800d5 in _sx_xlock (sx=0xa0, opts=0,
> file=0xffffffff80c697f0
> "/usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c",
> line=1069) at atomic.h:142
> #9 0xffffffff80c50b2a in zfsctl_umount_snapshots (vfsp=Variable
> "vfsp" is not available.
> ) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1069
> #10 0xffffffff80c57978 in zfs_umount (vfsp=0xffffff00014f5650,
> fflag=0, td=0xffffff001483b6a0)
> at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:692
[...]

I tried to reproduce your problem, but I can't. This is clearly related
to unmounting snapshots. Was your snapshot mounted at the time of
calling 'zfs set mountpoint='? I tried both scenarios (with the snapshot
mounted and unmounted) and got no panic. Is there anything else you did?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!