From bugmaster at FreeBSD.org Mon Nov 3 03:06:49 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Nov 3 03:07:19 2008
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200811031106.mA3B6mwv010833@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From trasz at FreeBSD.ORG Thu Nov 6 11:25:40 2008
From: trasz at FreeBSD.ORG (Edward Tomasz Napierala)
Date: Thu Nov 6 11:25:49 2008
Subject: Directory rename semantics.
In-Reply-To: <20081028161855.GA45129@zim.MIT.EDU>
References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU>
Message-ID: <20081106192829.GA98742@pin.if.uz.zgora.pl>

After discussion about this with rwatson and pjd, I decided to do
the opposite: change ZFS behaviour to match UFS. Reason is simple:
this is security, and we want to be conservative here. It's impossible
to make sure this change wouldn't cause security problems.

-- 
If you cut off my head, what would I say? Me and my head, or me and my body?

From ceri at submonkey.net Thu Nov 6 13:20:44 2008
From: ceri at submonkey.net (Ceri Davies)
Date: Thu Nov 6 13:20:51 2008
Subject: Directory rename semantics.
In-Reply-To: <20081106192829.GA98742@pin.if.uz.zgora.pl> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> Message-ID: <20081106195558.GG2281@submonkey.net> On Thu, Nov 06, 2008 at 08:28:29PM +0100, Edward Tomasz Napierala wrote: > After discussion about this with rwatson and pjd, I decided to do > the opposite: change ZFS behaviour to match UFS. Reason is simple: > this is security, and we want to be conservative here. It's impossible > to make sure this change wouldn't cause security problems. Perhaps it would have been better to either do nothing or create a zfs property that toggled this behaviour so that people who expect ZFS to behave a certain way get it. I'm not sure why we would want all filesystems to behave the same way, to be honest. Ceri -- That must be wonderful! I don't understand it at all. -- Moliere -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081106/a575c3e6/attachment.pgp From ivoras at freebsd.org Fri Nov 7 03:05:08 2008 From: ivoras at freebsd.org (Ivan Voras) Date: Fri Nov 7 03:05:15 2008 Subject: Directory rename semantics. In-Reply-To: <20081106195558.GG2281@submonkey.net> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> Message-ID: Ceri Davies wrote: > On Thu, Nov 06, 2008 at 08:28:29PM +0100, Edward Tomasz Napierala wrote: >> After discussion about this with rwatson and pjd, I decided to do >> the opposite: change ZFS behaviour to match UFS. Reason is simple: >> this is security, and we want to be conservative here. It's impossible >> to make sure this change wouldn't cause security problems. 
> > Perhaps it would have been better to either do nothing or create a zfs > property that toggled this behaviour so that people who expect ZFS to > behave a certain way get it. I'm not sure why we would want all > filesystems to behave the same way, to be honest. That would be desirable if we want file system semantics to be a property of the OS instead of individual file systems. (Though I don't know if there's ever been a conscious decision about this particular goal). If so, a knob that toggles between the behaviours should toggle it for all file systems. Having them behave differently can create problems in migration to and from ZFS. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 252 bytes Desc: OpenPGP digital signature Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081107/2ddee5d0/signature.pgp From ceri at submonkey.net Fri Nov 7 03:10:25 2008 From: ceri at submonkey.net (Ceri Davies) Date: Fri Nov 7 03:10:32 2008 Subject: Directory rename semantics. In-Reply-To: <20081106195558.GG2281@submonkey.net> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> Message-ID: <20081107111022.GB34757@submonkey.net> On Thu, Nov 06, 2008 at 07:55:58PM +0000, Ceri Davies wrote: > On Thu, Nov 06, 2008 at 08:28:29PM +0100, Edward Tomasz Napierala wrote: > > After discussion about this with rwatson and pjd, I decided to do > > the opposite: change ZFS behaviour to match UFS. Reason is simple: > > this is security, and we want to be conservative here. It's impossible > > to make sure this change wouldn't cause security problems. > > Perhaps it would have been better to either do nothing or create a zfs > property that toggled this behaviour so that people who expect ZFS to > behave a certain way get it. 
I'm not sure why we would want all > filesystems to behave the same way, to be honest. I'm essentially unhappy here that a change to UFS which is local to us was considered important enough to ask -arch about, while ZFS which exists on at least two other operating systems was deemed fine to go ahead and change without review. Ceri -- That must be wonderful! I don't understand it at all. -- Moliere -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081107/a4398441/attachment.pgp From ivoras at freebsd.org Fri Nov 7 03:50:27 2008 From: ivoras at freebsd.org (Ivan Voras) Date: Fri Nov 7 03:50:33 2008 Subject: Directory rename semantics. In-Reply-To: <20081107111022.GB34757@submonkey.net> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> <20081107111022.GB34757@submonkey.net> Message-ID: Ceri Davies wrote: > On Thu, Nov 06, 2008 at 07:55:58PM +0000, Ceri Davies wrote: >> On Thu, Nov 06, 2008 at 08:28:29PM +0100, Edward Tomasz Napierala wrote: >>> After discussion about this with rwatson and pjd, I decided to do >>> the opposite: change ZFS behaviour to match UFS. Reason is simple: >>> this is security, and we want to be conservative here. It's impossible >>> to make sure this change wouldn't cause security problems. >> Perhaps it would have been better to either do nothing or create a zfs >> property that toggled this behaviour so that people who expect ZFS to >> behave a certain way get it. I'm not sure why we would want all >> filesystems to behave the same way, to be honest. 
> > I'm essentially unhappy here that a change to UFS which is local to us > was considered important enough to ask -arch about, while ZFS which > exists on at least two other operating systems was deemed fine to go > ahead and change without review. I think it has something to do with the percentage of "our" users running UFS vs ZFS :) -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 252 bytes Desc: OpenPGP digital signature Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081107/3bc1a8a0/signature.pgp From ceri at submonkey.net Fri Nov 7 04:33:03 2008 From: ceri at submonkey.net (Ceri Davies) Date: Fri Nov 7 04:33:10 2008 Subject: Directory rename semantics. In-Reply-To: References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> Message-ID: <20081107123259.GC34757@submonkey.net> On Fri, Nov 07, 2008 at 11:44:27AM +0100, Ivan Voras wrote: > Ceri Davies wrote: > > On Thu, Nov 06, 2008 at 08:28:29PM +0100, Edward Tomasz Napierala wrote: > >> After discussion about this with rwatson and pjd, I decided to do > >> the opposite: change ZFS behaviour to match UFS. Reason is simple: > >> this is security, and we want to be conservative here. It's impossible > >> to make sure this change wouldn't cause security problems. > > > > Perhaps it would have been better to either do nothing or create a zfs > > property that toggled this behaviour so that people who expect ZFS to > > behave a certain way get it. I'm not sure why we would want all > > filesystems to behave the same way, to be honest. > > That would be desirable if we want file system semantics to be a > property of the OS instead of individual file systems. (Though I don't > know if there's ever been a conscious decision about this particular > goal). 
If so, a knob that toggles between the behaviours should toggle > it for all file systems. Having them behave differently can create > problems in migration to and from ZFS. That's essentially what has just happened, but without the knob. I'm not really sure whether you agree with the change that was made or not. Ceri -- That must be wonderful! I don't understand it at all. -- Moliere -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081107/efa57923/attachment.pgp From ivoras at freebsd.org Fri Nov 7 06:37:36 2008 From: ivoras at freebsd.org (Ivan Voras) Date: Fri Nov 7 06:37:45 2008 Subject: Directory rename semantics. In-Reply-To: <20081107123259.GC34757@submonkey.net> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> <20081107123259.GC34757@submonkey.net> Message-ID: Ceri Davies wrote: > On Fri, Nov 07, 2008 at 11:44:27AM +0100, Ivan Voras wrote: >> Ceri Davies wrote: >>> On Thu, Nov 06, 2008 at 08:28:29PM +0100, Edward Tomasz Napierala wrote: >>>> After discussion about this with rwatson and pjd, I decided to do >>>> the opposite: change ZFS behaviour to match UFS. Reason is simple: >>>> this is security, and we want to be conservative here. It's impossible >>>> to make sure this change wouldn't cause security problems. >>> Perhaps it would have been better to either do nothing or create a zfs >>> property that toggled this behaviour so that people who expect ZFS to >>> behave a certain way get it. I'm not sure why we would want all >>> filesystems to behave the same way, to be honest. >> That would be desirable if we want file system semantics to be a >> property of the OS instead of individual file systems. 
(Though I don't >> know if there's ever been a conscious decision about this particular >> goal). If so, a knob that toggles between the behaviours should toggle >> it for all file systems. Having them behave differently can create >> problems in migration to and from ZFS. > > That's essentially what has just happened, but without the knob. > > I'm not really sure whether you agree with the change that was made or > not. I agree with the aspect of the change that unified the semantics on UFS and ZFS. I hope somebody comes up with a knob that would toggle it for both systems at the same time, if the alternate behaviour is useful to people. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 252 bytes Desc: OpenPGP digital signature Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081107/541415a5/signature.pgp From trasz at FreeBSD.ORG Fri Nov 7 07:02:50 2008 From: trasz at FreeBSD.ORG (Edward Tomasz Napierala) Date: Fri Nov 7 07:02:56 2008 Subject: Directory rename semantics. In-Reply-To: <20081107111022.GB34757@submonkey.net> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> <20081107111022.GB34757@submonkey.net> Message-ID: <20081107150544.GA12290@pin.if.uz.zgora.pl> On 1107T1110, Ceri Davies wrote: > > > After discussion about this with rwatson and pjd, I decided to do > > > the opposite: change ZFS behaviour to match UFS. Reason is simple: > > > this is security, and we want to be conservative here. It's impossible > > > to make sure this change wouldn't cause security problems. > > > > Perhaps it would have been better to either do nothing or create a zfs > > property that toggled this behaviour so that people who expect ZFS to > > behave a certain way get it. 
I'm not sure why we would want all
> > filesystems to behave the same way, to be honest.

Because of consistency. Having different access rights behaviour in
different filesystems under the same operating system is confusing.

> I'm essentially unhappy here that a change to UFS which is local to us
> was considered important enough to ask -arch about, while ZFS which
> exists on at least two other operating systems was deemed fine to go
> ahead and change without review.

The change to UFS changes behaviour that 'was always there'. Also, it
makes the behaviour more permissive.

The change to ZFS, on the other hand, is just another fix to make its
semantics match ours. It is not the first one: our ZFS already behaves
differently from ZFS under SunOS in other places, e.g. newly created
files inherit their group from the parent directory. Also, the change
makes it more restrictive.

Sure, I can make it controllable via sysctl or a property. However,
that would increase complexity - and the risk of security problems -
even more, for very little in return (how many people actually _know_
about this check?).

Also, it _was_ reviewed. Just not here. ;-)

-- 
If you cut off my head, what would I say? Me and my head, or me and my body?

From das at FreeBSD.ORG Fri Nov 7 08:34:28 2008
From: das at FreeBSD.ORG (David Schultz)
Date: Fri Nov 7 08:34:34 2008
Subject: Directory rename semantics.
In-Reply-To:
References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net>
Message-ID: <20081107163910.GA7007@zim.MIT.EDU>

On Fri, Nov 07, 2008, Ivan Voras wrote:
> That would be desirable if we want file system semantics to be a
> property of the OS instead of individual file systems. (Though I don't
> know if there's ever been a conscious decision about this particular
> goal).

I don't agree with this. The access control rules are
fundamentally a property of the filesystem. Nobody expects msdosfs
or ntfs to have the same semantics as UFS, for instance.
Furthermore, even if you hacked up all the local filesystems to
support the "FreeBSD rules" (as a recent commit seems to have
done), you'd still get different semantics for remote NFS and AFS
mounts.

From ivoras at freebsd.org Fri Nov 7 10:34:54 2008
From: ivoras at freebsd.org (Ivan Voras)
Date: Fri Nov 7 10:35:01 2008
Subject: Directory rename semantics.
In-Reply-To: <20081107163910.GA7007@zim.MIT.EDU>
References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> <20081107163910.GA7007@zim.MIT.EDU>
Message-ID: <9bbcef730811071013q35c04dd4gb582a286a709f22d@mail.gmail.com>

2008/11/7 David Schultz:
> On Fri, Nov 07, 2008, Ivan Voras wrote:
>> That would be desirable if we want file system semantics to be a
>> property of the OS instead of individual file systems. (Though I don't
>> know if there's ever been a conscious decision about this particular
>> goal).
>
> I don't agree with this. The access control rules are
> fundamentally a property of the filesystem. Nobody expects msdosfs
> or ntfs to have the same semantics as UFS, for instance.
> Furthermore, even if you hacked up all the local filesystems to
> support the "FreeBSD rules" (as a recent commit seems to have
> done), you'd still get different semantics for remote NFS and AFS
> mounts.

There's a fundamental difference between the three groups of file
systems: UFS and ZFS are native local file systems created for Unix,
MSDOSfs is definitely an odd, foreign file system, while NFS and AFS
are network file systems nobody trusts anyway :)

From ceri at submonkey.net Fri Nov 7 13:12:31 2008
From: ceri at submonkey.net (Ceri Davies)
Date: Fri Nov 7 13:12:38 2008
Subject: Directory rename semantics.
In-Reply-To: <9bbcef730811071013q35c04dd4gb582a286a709f22d@mail.gmail.com> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> <20081107163910.GA7007@zim.MIT.EDU> <9bbcef730811071013q35c04dd4gb582a286a709f22d@mail.gmail.com> Message-ID: <20081107211228.GF34757@submonkey.net> On Fri, Nov 07, 2008 at 07:13:34PM +0100, Ivan Voras wrote: > 2008/11/7 David Schultz : > > On Fri, Nov 07, 2008, Ivan Voras wrote: > >> That would be desirable if we want file system semantics to be a > >> property of the OS instead of individual file systems. (Though I don't > >> know if there's ever been a conscious decision about this particular > >> goal). > > > > I don't agree with this. The access control rules are > > fundamentally a property of the filesystem. Nobody expects msdosfs > > or ntfs to have the same semantics as UFS, for instance. > > Furthermore, even if you hacked up all the local filesystems to > > support the "FreeBSD rules" (as a recent commit seems to have > > done), you'd still get different semantics for remote NFS and AFS > > mounts. > > There's a fundamental difference between the three groups of file > systems: UFS and ZFS are native local file systems created for Unix, > MSDOSfs is definitely an odd, foreign file system, while NFS and AFS > are network file systems nobody trusts anyway :) The point is that if you are concerned about these things then you should be checking what file system you are using anyway, and therefore there is no point in changing ZFS to match UFS. ZFS ACLs are completely disparate to UFS ones, for example, so what's the proposal to fix that difference? Ceri -- That must be wonderful! I don't understand it at all. -- Moliere -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081107/c60817db/attachment.pgp From ivoras at freebsd.org Fri Nov 7 13:27:05 2008 From: ivoras at freebsd.org (Ivan Voras) Date: Fri Nov 7 13:27:11 2008 Subject: Directory rename semantics. In-Reply-To: <20081107211228.GF34757@submonkey.net> References: <20081027193545.GA95872@pin.if.uz.zgora.pl> <20081028161855.GA45129@zim.MIT.EDU> <20081106192829.GA98742@pin.if.uz.zgora.pl> <20081106195558.GG2281@submonkey.net> <20081107163910.GA7007@zim.MIT.EDU> <9bbcef730811071013q35c04dd4gb582a286a709f22d@mail.gmail.com> <20081107211228.GF34757@submonkey.net> Message-ID: <9bbcef730811071327y494401eflfe7c7bcd64316a62@mail.gmail.com> 2008/11/7 Ceri Davies : > The point is that if you are concerned about these things then you > should be checking what file system you are using anyway, and therefore I agree - users should test before using, but... > there is no point in changing ZFS to match UFS. In an ideal world, I would like just the thing - obviously, within limits (things like ZFS snapshots and such are what makes ZFS - ZFS). A good goal would be to make file systems indistinguishable to non-administrative userland applications running on them. > ZFS ACLs are completely > disparate to UFS ones, for example, so what's the proposal to fix that > difference? Erm, exactly the thing that's supposed to be done, as described at: http://wiki.freebsd.org/ZFS :) "Currently ZFS on FreeBSD doesn't support ACLs. ZFS itself supports NFSv4-style ACLs, which is different than the existing POSIX.1e implementation in FreeBSD, which means that first thing to do is to add support for NFSv4-style ACLs to FreeBSD, which is being done as a GSoC project." I'm not going to argue this point just for the sake of arguing - I don't think I will convince you and I believe in my standpoint. 
Since I'm not able to work on either file system, I'll leave it to the people who will to decide :) From ed at 80386.nl Sun Nov 9 11:27:49 2008 From: ed at 80386.nl (Ed Schouten) Date: Sun Nov 9 11:27:56 2008 Subject: pipe(2) calling convention: why? Message-ID: <20081109192746.GO1165@hoeg.nl> Skipped content of type multipart/mixed-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081109/dcb3eaf4/attachment.pgp From alfred at freebsd.org Sun Nov 9 13:02:54 2008 From: alfred at freebsd.org (Alfred Perlstein) Date: Sun Nov 9 13:03:00 2008 Subject: pipe(2) calling convention: why? In-Reply-To: <20081109192746.GO1165@hoeg.nl> References: <20081109192746.GO1165@hoeg.nl> Message-ID: <20081109204342.GB53877@elvis.mu.org> Looks really good, simplifies and reduces code. * Ed Schouten [081109 11:27] wrote: > Hello all, > > After having a discussion on IRC with some friends of mine about system > call conventions, we couldn't exactly determine why pipe(2)'s calling > convention has to be different from the rest. Unlike most system calls, > pipe(2) has two return values. Instead of just copying out an array of > two elements, it uses two registers to store the file descriptor > numbers. > > It seems a lot of BSD-style system calls used to work that way, but > pipe(2) seems to be the only system call on FreeBSD that uses this > today. Some system calls only seem to set td_retval[1] to zero, which > makes little sense to me. Maybe those assignments can be removed. > > In my opinion there are a couple of disadvantages of having multiple > return values: > > - As documented in syscall(2), there is no way to obtain the second > return value if you use this functions. > > - Each of those system calls needs to have its own implementation > written in assembly for each architecture we support. 
Why can hundreds > of system calls be handled in a generic fashion, while interfaces like > pipe(2) can't? > > As a small experiment I've written a patch to allocate a new system call > (506) which uses a generic calling convention to implement pipe(2). It > seems Linux also uses this method, so I've removed linux_pipe() from the > Linuxolator as well, which seems to work. > > I could commit this if people think it makes sense. Any comments? > > -- > Ed Schouten > WWW: http://80386.nl/ -- - Alfred Perlstein From kostikbel at gmail.com Sun Nov 9 13:14:05 2008 From: kostikbel at gmail.com (Kostik Belousov) Date: Sun Nov 9 13:14:18 2008 Subject: pipe(2) calling convention: why? In-Reply-To: <20081109192746.GO1165@hoeg.nl> References: <20081109192746.GO1165@hoeg.nl> Message-ID: <20081109203848.GP18100@deviant.kiev.zoral.com.ua> On Sun, Nov 09, 2008 at 08:27:46PM +0100, Ed Schouten wrote: > Hello all, > > After having a discussion on IRC with some friends of mine about system > call conventions, we couldn't exactly determine why pipe(2)'s calling > convention has to be different from the rest. Unlike most system calls, > pipe(2) has two return values. Instead of just copying out an array of > two elements, it uses two registers to store the file descriptor > numbers. > > It seems a lot of BSD-style system calls used to work that way, but > pipe(2) seems to be the only system call on FreeBSD that uses this > today. Some system calls only seem to set td_retval[1] to zero, which > makes little sense to me. Maybe those assignments can be removed. > > In my opinion there are a couple of disadvantages of having multiple > return values: > > - As documented in syscall(2), there is no way to obtain the second > return value if you use this functions. > > - Each of those system calls needs to have its own implementation > written in assembly for each architecture we support. 
Why can hundreds
> of system calls be handled in a generic fashion, while interfaces like
> pipe(2) can't?
>
> As a small experiment I've written a patch to allocate a new system call
> (506) which uses a generic calling convention to implement pipe(2). It
> seems Linux also uses this method, so I've removed linux_pipe() from the
> Linuxolator as well, which seems to work.
>
> I could commit this if people think it makes sense. Any comments?
>

The convention of returning pipe descriptors in registers goes
back at least to the Sixth Edition. Check the Lions' book for the reference.
Amusingly, Solaris uses the same calling convention for pipe(2).

I do not see what we gain by the change. Right now we have one syscall and
some arch-dependent wrappers in libc. After the patch, we get rid
of the wrappers, but end up with two syscalls.

The only reason for doing this that I can imagine is to allow syscall(2) to
work for SYS_pipe from C code. Since we have not heard complaints about
this for ~15 years, we can live with it.

From kostikbel at gmail.com Sun Nov 9 13:14:06 2008
From: kostikbel at gmail.com (Kostik Belousov)
Date: Sun Nov 9 13:14:19 2008
Subject: pipe(2) calling convention: why?
In-Reply-To: <20081109203848.GP18100@deviant.kiev.zoral.com.ua> References: <20081109192746.GO1165@hoeg.nl> <20081109203848.GP18100@deviant.kiev.zoral.com.ua> Message-ID: <20081109204639.GQ18100@deviant.kiev.zoral.com.ua> On Sun, Nov 09, 2008 at 10:38:48PM +0200, Kostik Belousov wrote: > On Sun, Nov 09, 2008 at 08:27:46PM +0100, Ed Schouten wrote: > > Hello all, > > > > After having a discussion on IRC with some friends of mine about system > > call conventions, we couldn't exactly determine why pipe(2)'s calling > > convention has to be different from the rest. Unlike most system calls, > > pipe(2) has two return values. Instead of just copying out an array of > > two elements, it uses two registers to store the file descriptor > > numbers. > > > > It seems a lot of BSD-style system calls used to work that way, but > > pipe(2) seems to be the only system call on FreeBSD that uses this > > today. Some system calls only seem to set td_retval[1] to zero, which > > makes little sense to me. Maybe those assignments can be removed. > > > > In my opinion there are a couple of disadvantages of having multiple > > return values: > > > > - As documented in syscall(2), there is no way to obtain the second > > return value if you use this functions. > > > > - Each of those system calls needs to have its own implementation > > written in assembly for each architecture we support. Why can hundreds > > of system calls be handled in a generic fashion, while interfaces like > > pipe(2) can't? > > > > As a small experiment I've written a patch to allocate a new system call > > (506) which uses a generic calling convention to implement pipe(2). It > > seems Linux also uses this method, so I've removed linux_pipe() from the > > Linuxolator as well, which seems to work. > > > > I could commit this if people think it makes sense. Any comments? > > > > The convention of returning pipe descriptors in the registers comes > back at least to the Six Edition. 
Check the Lions' book for the reference.
> Amusingly, Solaris uses the same calling convention for pipe(2).
>
> I do not see what we gain by the change. Right now we have one syscall and
> some arch-dependent wrappers in libc. After the patch, we get rid
> of the wrappers, but end up with two syscalls.
>
> The only reason for doing this that I can imagine is to allow syscall(2) to
> work for SYS_pipe from C code. Since we have not heard complaints about
> this for ~15 years, we can live with it.

The part that updates the man page, introduces kern_pipe and simplifies
the Linuxolator has stand-alone value. I think that should be committed
in any case.

From phk at phk.freebsd.dk Sun Nov 9 14:31:24 2008
From: phk at phk.freebsd.dk (Poul-Henning Kamp)
Date: Sun Nov 9 14:31:30 2008
Subject: pipe(2) calling convention: why?
In-Reply-To: Your message of "Sun, 09 Nov 2008 20:27:46 +0100." <20081109192746.GO1165@hoeg.nl>
Message-ID: <71215.1226268745@critter.freebsd.dk>

In message <20081109192746.GO1165@hoeg.nl>, Ed Schouten writes:

>After having a discussion on IRC with some friends of mine about system
>call conventions, we couldn't exactly determine why pipe(2)'s calling
>convention has to be different from the rest.

It will take some time before we can remove the old syscall, but
I'd say it is worth it, just for getting more consistency and less
pointless magic.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From peter at wemm.org Sun Nov 9 15:04:01 2008
From: peter at wemm.org (Peter Wemm)
Date: Sun Nov 9 15:04:08 2008
Subject: pipe(2) calling convention: why?
In-Reply-To: <20081109203848.GP18100@deviant.kiev.zoral.com.ua> References: <20081109192746.GO1165@hoeg.nl> <20081109203848.GP18100@deviant.kiev.zoral.com.ua> Message-ID: On Sun, Nov 9, 2008 at 12:38 PM, Kostik Belousov wrote: > On Sun, Nov 09, 2008 at 08:27:46PM +0100, Ed Schouten wrote: >> Hello all, >> >> After having a discussion on IRC with some friends of mine about system >> call conventions, we couldn't exactly determine why pipe(2)'s calling >> convention has to be different from the rest. Unlike most system calls, >> pipe(2) has two return values. Instead of just copying out an array of >> two elements, it uses two registers to store the file descriptor >> numbers. >> >> It seems a lot of BSD-style system calls used to work that way, but >> pipe(2) seems to be the only system call on FreeBSD that uses this >> today. Some system calls only seem to set td_retval[1] to zero, which >> makes little sense to me. Maybe those assignments can be removed. >> >> In my opinion there are a couple of disadvantages of having multiple >> return values: >> >> - As documented in syscall(2), there is no way to obtain the second >> return value if you use this functions. >> >> - Each of those system calls needs to have its own implementation >> written in assembly for each architecture we support. Why can hundreds >> of system calls be handled in a generic fashion, while interfaces like >> pipe(2) can't? >> >> As a small experiment I've written a patch to allocate a new system call >> (506) which uses a generic calling convention to implement pipe(2). It >> seems Linux also uses this method, so I've removed linux_pipe() from the >> Linuxolator as well, which seems to work. >> >> I could commit this if people think it makes sense. Any comments? >> > > The convention of returning pipe descriptors in the registers comes > back at least to the Six Edition. Check the Lion' book for the reference. > Amusingly, Solaris uses the same calling convention for pipe(2). 
>
> I do not see what we gain by the change. Now, we have one syscall and
> some arch-dependend wrappers in the libc. After the patch, we get rid
> of the wrappers, but grow two syscalls.
>
> The only reason of doing this I can imagine is to allow syscall(2) to
> work for SYS_pipe from C code. Since we did not heard complaints about
> this for ~15 years, we can live with it.
>

The other side effect of the change is to remove one asm instruction
in the syscall handler and replace it with potentially hundreds of
instructions to do the copyout. Plus we gain another syscall, lose
backwards compatibility with kernel.old again, and so on. I really
don't see an overall benefit.

What I do see some use for is the kern_pipe() split (like in the
patch), which simplifies the Linux ABI wrappers (and other ABI
wrappers, not just Linux!). Just have our syscall return in retval[0]
and [1] like before, and we still get the benefit of simplifying a
bunch of wrappers.

The patch is incomplete anyway: it leaks fds if the copyout fails.
There is a comment about this in the patch.

Other historical notes..

Ancient unix systems used to implement syscalls by having userland do
a call (jsr) to a shared page. The trap handler would verify the entry
point, and if it was approved, it would then give privilege and
continue. The problem was that this severely limited the number of
syscalls, because we were talking tiny address spaces. Given that
syscall numbers were at a premium, it made sense to pack as much
functionality into syscalls as possible. eg: the getpid syscall could
return both pid and ppid, saving a kernel syscall entry point, and so
on.

This is also one of the reasons for SIGSYS. Calling an illegal kernel
entry point in a process that had run wild could be easily converted
into a signal. Wild processes could easily hit the kernel entry
points. Again, this doesn't really apply these days. It is somewhat
archaic by today's standards - Linux doesn't even bother with SIGSYS;
bad syscalls just return ENOSYS.

fork() currently uses both retval[0] and [1], in spite of it appearing
not to. See cpu_fork() for the other half.

We use both return values for 64 bit returns. eg: lseek().

Some places that set it to 0 are silly.

I really don't see td_retval[0] and td_retval[1] ever going away
entirely, at least not while we share the syscall vector between 32
and 64 bit systems.

I don't think that breaking kernel.old compatibility, replacing the
current syscall for pipe() with a slower one, and having to keep both
anyway is much of a win.

Splitting pipe() and kern_pipe() would help ABI wrappers.

I don't see value in adding a new way for pipe(2) to fail. (Right now,
pipe(2) causes a segfault if you pass a bad address. The new wrapper
causes it to return EFAULT instead, and NOT crash the app. The failure
mode has changed.)

As an aside.. I'm very very very painfully aware of the dual return
from syscalls. I've been fighting with this in valgrind for quite some
time now. We have some very interesting semantics on i386.

* syscalls preserve all registers except for %eax and %eflags. Even
  scratch registers.
* .. except for %edx sometimes, for 64 bit returns, or dual-returns.
  Otherwise %edx is preserved.
* libc depends on this in a couple of hand-written asm stubs, eg:
  brk()/sbrk(). Nothing else cares about this.
* some libc syscall wrappers trash the scratch registers though.
* in spite of syscalls not using C calling conventions, the kernel
  assumes you've done a C-style call to libc. It assumes the C return
  address was pushed onto the stack before the args.

In retrospect I wish it never had started out this way. But it did, it
still is, and I feel the costs of changing it are not worth it for
such little gain.
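
For readers following along in userland: whichever way the kernel hands back the two descriptors (in two return registers as today, or via copyout as in the proposed patch), a C caller only ever sees the libc pipe() wrapper filling in an int[2]. The sketch below is an illustration of that caller-side view, not part of the patch under discussion; `pipe_roundtrip` is a hypothetical helper name.

```c
#include <unistd.h>

/*
 * Hypothetical helper: push one byte through a freshly created pipe and
 * return it, or -1 on any failure.  fds[0] is the read end and fds[1]
 * the write end, exactly as the libc wrapper stores them.
 */
int
pipe_roundtrip(unsigned char byte)
{
	int fds[2];
	unsigned char out = 0;
	int result = -1;

	if (pipe(fds) == -1)
		return (-1);
	/* A single byte always fits in the pipe buffer, so this cannot block. */
	if (write(fds[1], &byte, 1) == 1 && read(fds[0], &out, 1) == 1)
		result = out;
	close(fds[0]);
	close(fds[1]);
	return (result);
}
```

Nothing in this caller changes with the calling convention; only the per-architecture stub inside libc would.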
-- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." -- Robert Sewell From bugmaster at FreeBSD.org Mon Nov 10 03:06:47 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Nov 10 03:07:28 2008 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200811101106.mAAB6loD049648@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From julian at elischer.org Mon Nov 10 12:39:32 2008 From: julian at elischer.org (Julian Elischer) Date: Mon Nov 10 12:39:38 2008 Subject: pipe(2) calling convention: why? In-Reply-To: <20081109203848.GP18100@deviant.kiev.zoral.com.ua> References: <20081109192746.GO1165@hoeg.nl> <20081109203848.GP18100@deviant.kiev.zoral.com.ua> Message-ID: <4918953F.7070006@elischer.org> Kostik Belousov wrote: > On Sun, Nov 09, 2008 at 08:27:46PM +0100, Ed Schouten wrote: >> Hello all, >> >> After having a discussion on IRC with some friends of mine about system >> call conventions, we couldn't exactly determine why pipe(2)'s calling >> convention has to be different from the rest. Unlike most system calls, >> pipe(2) has two return values. Instead of just copying out an array of >> two elements, it uses two registers to store the file descriptor >> numbers. 
>> >> It seems a lot of BSD-style system calls used to work that way, but >> pipe(2) seems to be the only system call on FreeBSD that uses this >> today. Some system calls only seem to set td_retval[1] to zero, which >> makes little sense to me. Maybe those assignments can be removed. don't forget that we will still need to support the old convention because we support running old binaries "forever" (generally). I occasionally run freebsd 1.0 binaries on -current (you should see how fast a "make world" is in a 1.0 chroot!) >> >> In my opinion there are a couple of disadvantages of having multiple >> return values: >> >> - As documented in syscall(2), there is no way to obtain the second >> return value if you use this functions. >> >> - Each of those system calls needs to have its own implementation >> written in assembly for each architecture we support. Why can hundreds >> of system calls be handled in a generic fashion, while interfaces like >> pipe(2) can't? >> >> As a small experiment I've written a patch to allocate a new system call >> (506) which uses a generic calling convention to implement pipe(2). It >> seems Linux also uses this method, so I've removed linux_pipe() from the >> Linuxolator as well, which seems to work. >> >> I could commit this if people think it makes sense. Any comments? >> > > The convention of returning pipe descriptors in the registers comes > back at least to the Six Edition. Check the Lion' book for the reference. > Amusingly, Solaris uses the same calling convention for pipe(2). > > I do not see what we gain by the change. Now, we have one syscall and > some arch-dependend wrappers in the libc. After the patch, we get rid > of the wrappers, but grow two syscalls. > > The only reason of doing this I can imagine is to allow syscall(2) to > work for SYS_pipe from C code. Since we did not heard complaints about > this for ~15 years, we can live with it. 
From ed at 80386.nl Tue Nov 11 07:00:40 2008
From: ed at 80386.nl (Ed Schouten)
Date: Tue Nov 11 07:00:47 2008
Subject: pipe(2) calling convention: why?
In-Reply-To: <20081109192746.GO1165@hoeg.nl>
References: <20081109192746.GO1165@hoeg.nl>
Message-ID: <20081111150039.GV1165@hoeg.nl>

Hello all,

It seems most people liked some things that were in the patch, while
others preferred to keep things as they were. I've just committed a
patch to SVN (r184849) which keeps pipe(2) as it is now, but does some
cleanups:

- I've added kern_pipe(), so we can make linux_pipe() and
  linux32_pipe() less ugly (discussed with rdivacky).

- I've also changed the manual page to not mention EFAULT, because we
  just get a segmentation fault if we pass an invalid address.

Thanks all for commenting on this topic!

-- 
 Ed Schouten
 WWW: http://80386.nl/

From pjd at FreeBSD.org Wed Nov 12 01:00:15 2008
From: pjd at FreeBSD.org (Pawel Jakub Dawidek)
Date: Wed Nov 12 01:00:27 2008
Subject: Directory rename semantics.
In-Reply-To: <20081107123259.GC34757@submonkey.net>
References: <20081027193545.GA95872@pin.if.uz.zgora.pl>
	<20081028161855.GA45129@zim.MIT.EDU>
	<20081106192829.GA98742@pin.if.uz.zgora.pl>
	<20081106195558.GG2281@submonkey.net>
	<20081107123259.GC34757@submonkey.net>
Message-ID: <20081112082658.GA2441@garage.freebsd.pl>

On Fri, Nov 07, 2008 at 12:32:59PM +0000, Ceri Davies wrote:
> On Fri, Nov 07, 2008 at 11:44:27AM +0100, Ivan Voras wrote:
> > Ceri Davies wrote:
> > > On Thu, Nov 06, 2008 at 08:28:29PM +0100, Edward Tomasz Napierala wrote:
> > >> After discussion about this with rwatson and pjd, I decided to do
> > >> the opposite: change ZFS behaviour to match UFS. Reason is simple:
> > >> this is security, and we want to be conservative here. It's impossible
> > >> to make sure this change wouldn't cause security problems.
> > >
> > > Perhaps it would have been better to either do nothing or create a zfs
> > > property that toggled this behaviour so that people who expect ZFS to
> > > behave a certain way get it. I'm not sure why we would want all
> > > filesystems to behave the same way, to be honest.
> >
> > That would be desirable if we want file system semantics to be a
> > property of the OS instead of individual file systems. (Though I don't
> > know if there's ever been a conscious decision about this particular
> > goal). If so, a knob that toggles between the behaviours should toggle
> > it for all file systems. Having them behave differently can create
> > problems in migration to and from ZFS.
>
> That's essentially what has just happened, but without the knob.
>
> I'm not really sure whether you agree with the change that was made or
> not.

From a user's perspective, if I want to migrate from UFS to ZFS, I
don't want to find out that there are differences in how the FS
behaves. I'm trying to make ZFS a fully functional replacement for
UFS, so that it supports chflags(2), extattrs, etc.

Of course we need to draw a line between what we really want to
support and what we may skip or support only partially. For example...

ZFS/FreeBSD on directory creation inherits group ownership from the
parent directory, and now it also obeys directory write permissions
when we want to move it to another directory.

We support FreeBSD's extattrs, not Solaris fsattrs.

We could support POSIX.1e ACLs easily (on top of extattrs), but we
want to move to NFSv4-like ACLs with UFS too, AFAIK.

We support chflags(2), but not all the flags. (I'm talking about the
perforce version.)

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
From bugmaster at FreeBSD.org Mon Nov 17 03:06:49 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Nov 17 03:07:31 2008
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200811171106.mAHB6kqV082454@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From lstewart at freebsd.org Wed Nov 19 05:51:32 2008
From: lstewart at freebsd.org (Lawrence Stewart)
Date: Wed Nov 19 05:51:42 2008
Subject: kthread_exit(9) unexpectedness
Message-ID: <492412E8.3060700@freebsd.org>

Hi all,

I tracked down a deadlock in some of my code today to some weird
behaviour in the kthread(9) KPI. The executive summary is that
kthread_exit() thread termination notification using wakeup() behaves as
expected intuitively in 8.x, but not in 7.x.

From sys/kern/kern_kthread.c

----------------------begin 8.x kthread_exit()--------------------------
void
kthread_exit(void)
{
	struct proc *p;

	/* A module may be waiting for us to exit. */
	wakeup(curthread);

	/*
	 * We could rely on thread_exit to call exit1() but
	 * there is extra work that needs to be done
	 */
	if (curthread->td_proc->p_numthreads == 1)
		kproc_exit(0);	/* never returns */

	p = curthread->td_proc;
	PROC_LOCK(p);
	PROC_SLOCK(p);
	thread_exit();
}
----------------------end 8.x kthread_exit()----------------------------

---------------------begin 7.x kthread_exit()---------------------------
void
kthread_exit(int ecode)
{
	struct thread *td;
	struct proc *p;

	td = curthread;
	p = td->td_proc;

	/*
	 * Reparent curthread from proc0 to init so that the zombie
	 * is harvested.
	 */
	sx_xlock(&proctree_lock);
	PROC_LOCK(p);
	proc_reparent(p, initproc);
	PROC_UNLOCK(p);
	sx_xunlock(&proctree_lock);

	/*
	 * Wakeup anyone waiting for us to exit.
	 */
	wakeup(p);

	/* Buh-bye! */
	exit1(td, W_EXITCODE(ecode, 0));
}
----------------------end 7.x kthread_exit()----------------------------

From the 7.x kthread(9) manpage:

"While exiting, the function exit1(9) will initiate a call to wakeup(9)
on the thread handle."

The 8.x kthread manpage has no mention of the wakeup behaviour whatsoever.

So from the code above, we can see that the 7.x kthread_exit() calls
wakeup() on the *proc instead of the *thread. In 8.x, kthread_exit()
calls wakeup() on the *thread and the newly added kproc_exit() function
will wakeup() anyone waiting on the *proc.

Looking at:
http://svn.freebsd.org/viewvc/base/head/sys/kern/kern_kthread.c?view=log
the confusion seems to have crept in around r173004 during the KPI
refactoring to support true kernel threads.

Historically it seems that kthread_exit() called wakeup on the *proc
(which to my mind seems counter intuitive, but whatever). Then in
r173052 we switch to the 8.x style of calling wakeup on the *thread,
which matches the function naming convention and 7.x man page comment.

At a minimum we need a better discussion of the differences in the man
page, but the behaviour change seems unnecessarily intrusive to me and
has nasty side effects i.e. deadlock.
Keeping consistent wakeup behaviour between 7.x and 8.x would I suspect be desirable and avoid this issue biting others. Thoughts? Cheers, Lawrence From julian at elischer.org Wed Nov 19 11:07:16 2008 From: julian at elischer.org (Julian Elischer) Date: Wed Nov 19 11:07:22 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <492412E8.3060700@freebsd.org> References: <492412E8.3060700@freebsd.org> Message-ID: <49245D3B.8050607@elischer.org> Lawrence Stewart wrote: > Hi all, > > I tracked down a deadlock in some of my code today to some weird > behaviour in the kthread(9) KPI. The executive summary is that > kthread_exit() thread termination notification using wakeup() behaves as > expected intuitively in 8.x, but not in 7.x. > > From sys/kern/kern_kthread.c > > ----------------------begin 8.x kthread_exit()-------------------------- > void > kthread_exit(void) > { > struct proc *p; > > /* A module may be waiting for us to exit. */ > wakeup(curthread); > > /* > * We could rely on thread_exit to call exit1() but > * there is extra work that needs to be done > */ > if (curthread->td_proc->p_numthreads == 1) > kproc_exit(0); /* never returns */ > > p = curthread->td_proc; > PROC_LOCK(p); > PROC_SLOCK(p); > thread_exit(); > } > ----------------------end 8.x kthread_exit()---------------------------- > > ---------------------begin 7.x kthread_exit()--------------------------- > void > kthread_exit(int ecode) > { > struct thread *td; > struct proc *p; > > td = curthread; > p = td->td_proc; > > /* > * Reparent curthread from proc0 to init so that the zombie > * is harvested. > */ > sx_xlock(&proctree_lock); > PROC_LOCK(p); > proc_reparent(p, initproc); > PROC_UNLOCK(p); > sx_xunlock(&proctree_lock); > > /* > * Wakeup anyone waiting for us to exit. > */ > wakeup(p); > > /* Buh-bye! 
*/ > exit1(td, W_EXITCODE(ecode, 0)); > } > ----------------------end 7.x kthread_exit()---------------------------- > > From the 7.x kthread(9) manpage: > > "While exiting, the function exit1(9) will initiate a call to wakeup(9) > on the thread handle." > > The 8.x kthread manpage has no mention of the wakeup behaviour whatsoever. > > So from the code above, we can see that the 7.x kthread_exit() calls > wakeup() on the *proc instead of the *thread. That was a bug. > In 8.x, kthread_exit() > calls wakeup() on the *thread and the newly added kproc_exit() function > will wakeup() anyone waiting on the *proc. more intuitive, no? That is what what was supposed to happen but we can't change a Kernel API in mid series. > > Looking at: > http://svn.freebsd.org/viewvc/base/head/sys/kern/kern_kthread.c?view=log > the confusion seems to have crept in around r173004 during the KPI > refactoring to support true kernel threads. > > Historically it seems that kthread_exit() called wakeup on the *proc > (which to my mind seems counter intuitive, but whatever). Then in > r173052 we switch to the 8.x style of calling wakeup on the *thread, > which matches the function naming convention and 7.x man page comment. this is because historically kthread_xxx actes on actual processes. so the proc was unique to each. the kthread man page became the kproc man page so that is where the info on wakeup might have gone. A new kthread man page was made... > > At a minimum we need a better discussion of the differences in the man > page, but the behaviour change seems unnecessarily intrusive to me and > has nasty side effects i.e. deadlock. Keeping consistent wakeup > behaviour between 7.x and 8.x would I suspect be desirable and avoid > this issue biting others. which one would you change? > > Thoughts? 
> > Cheers, > Lawrence > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From lstewart at freebsd.org Wed Nov 19 14:16:26 2008 From: lstewart at freebsd.org (Lawrence Stewart) Date: Wed Nov 19 14:16:33 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <49245D3B.8050607@elischer.org> References: <492412E8.3060700@freebsd.org> <49245D3B.8050607@elischer.org> Message-ID: <49248958.9060308@freebsd.org> Julian Elischer wrote: > Lawrence Stewart wrote: >> Hi all, >> >> I tracked down a deadlock in some of my code today to some weird >> behaviour in the kthread(9) KPI. The executive summary is that >> kthread_exit() thread termination notification using wakeup() behaves >> as expected intuitively in 8.x, but not in 7.x. >> >> From sys/kern/kern_kthread.c >> >> ----------------------begin 8.x kthread_exit()-------------------------- >> void >> kthread_exit(void) >> { >> struct proc *p; >> >> /* A module may be waiting for us to exit. */ >> wakeup(curthread); >> >> /* >> * We could rely on thread_exit to call exit1() but >> * there is extra work that needs to be done >> */ >> if (curthread->td_proc->p_numthreads == 1) >> kproc_exit(0); /* never returns */ >> >> p = curthread->td_proc; >> PROC_LOCK(p); >> PROC_SLOCK(p); >> thread_exit(); >> } >> ----------------------end 8.x kthread_exit()---------------------------- >> >> ---------------------begin 7.x kthread_exit()--------------------------- >> void >> kthread_exit(int ecode) >> { >> struct thread *td; >> struct proc *p; >> >> td = curthread; >> p = td->td_proc; >> >> /* >> * Reparent curthread from proc0 to init so that the zombie >> * is harvested. >> */ >> sx_xlock(&proctree_lock); >> PROC_LOCK(p); >> proc_reparent(p, initproc); >> PROC_UNLOCK(p); >> sx_xunlock(&proctree_lock); >> >> /* >> * Wakeup anyone waiting for us to exit. 
>> */ >> wakeup(p); >> >> /* Buh-bye! */ >> exit1(td, W_EXITCODE(ecode, 0)); >> } >> ----------------------end 7.x kthread_exit()---------------------------- >> >> From the 7.x kthread(9) manpage: >> >> "While exiting, the function exit1(9) will initiate a call to >> wakeup(9) on the thread handle." >> >> The 8.x kthread manpage has no mention of the wakeup behaviour >> whatsoever. >> >> So from the code above, we can see that the 7.x kthread_exit() calls >> wakeup() on the *proc instead of the *thread. > > > That was a bug. > >> In 8.x, kthread_exit() calls wakeup() on the *thread and the newly >> added kproc_exit() function will wakeup() anyone waiting on the *proc. > > more intuitive, no? That is what what was supposed to happen > but we can't change a Kernel API in mid series. Yes I agree the 8.x behaviour is more intuitive. I'm not sure I'm clear on why the wakeup behaviour is part of the KPI though. The documented behaviour in the 7.x kthread(9) man page is that it calls wakeup() on the "thread handle". So the documented behaviour is the intuitively correct one. The actual behaviour is "wrong", although historically consistent. > > >> >> Looking at: >> http://svn.freebsd.org/viewvc/base/head/sys/kern/kern_kthread.c?view=log >> the confusion seems to have crept in around r173004 during the KPI >> refactoring to support true kernel threads. >> >> Historically it seems that kthread_exit() called wakeup on the *proc >> (which to my mind seems counter intuitive, but whatever). Then in >> r173052 we switch to the 8.x style of calling wakeup on the *thread, >> which matches the function naming convention and 7.x man page comment. > > this is because historically kthread_xxx actes on actual processes. > so the proc was unique to each. Yep I understood that looking through the history. Even still I don't see why we didn't just call wakeup() on the *thread anyway (or if *thread wasn't meaningful previously, equate the *thread to the *proc). 
> > the kthread man page became the kproc man page so that is where the > info on wakeup might have gone. A new kthread man page was made... Ah ok, I had missed the rename of the man page. You are correct, kproc(9) mentions calling wakeup() on the *proc. > >> >> At a minimum we need a better discussion of the differences in the man >> page, but the behaviour change seems unnecessarily intrusive to me and >> has nasty side effects i.e. deadlock. Keeping consistent wakeup >> behaviour between 7.x and 8.x would I suspect be desirable and avoid >> this issue biting others. > > which one would you change? heh, good question. On the one hand we have the intuitively correct behaviour in 8.x, although the 8.x kthread_exit() behaviour with respect to wakeup() is not documented at all in the kthread(9) man page. On the other, we have the 7.x documented behaviour which is correct, but the actual behaviour of the code (which is historically consistent) is incorrect and at odds with the 8.x behaviour. I'm playing devil's advocate here as now I'm curious whether this issue is really considered part of the KPI or not. If the actual behaviour is what's important, then we obviously can't make the change in 7.x. If the documented behaviour is what we are supposed to be honoring, then technically the change could be made, no? Devil's advocate musings aside, my personal feelings are that we should be aiming for intuitive correctness in the KPI i.e. leaving the 8.x code as it is makes sense. Even though I feel the wakeup() behaviour is not technically part of the KPI in 7.x, I don't think we should change the code. Therefore I would propose some improvements to both the 7.x and 8.x kthread(9) man pages which clearly document the actual behaviour and subtle differences between 7.x and 8.x. I also suspect an entry in UPDATING should be added close to the existing 20071020 entry that retrospectively discusses the switch and the subtle difference in kthread_exit() behaviour. 
Finally, mentioning that the value of __FreeBSD_version can be checked against 800002 using an #ifdef test to conditionally detect which behaviour should be used would also be a good idea. The above changes should equip developers with all the info needed to maintain code that crosses the 7.x/8.x gap with minimal loss in hair. Cheers, Lawrence From julian at elischer.org Wed Nov 19 14:41:42 2008 From: julian at elischer.org (Julian Elischer) Date: Wed Nov 19 14:41:48 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <49248958.9060308@freebsd.org> References: <492412E8.3060700@freebsd.org> <49245D3B.8050607@elischer.org> <49248958.9060308@freebsd.org> Message-ID: <49249624.7020108@elischer.org> Lawrence Stewart wrote: > I'm not sure I'm clear on why the wakeup behaviour is part of the KPI > though. The documented behaviour in the 7.x kthread(9) man page is that > it calls wakeup() on the "thread handle". So the documented behaviour is > the intuitively correct one. The actual behaviour is "wrong", although > historically consistent. AH I thought you said the documentation for kthread didn't mention teh wakeup.. so, we should probably MFC the fix [...] >> >> which one would you change? > > heh, good question. > > On the one hand we have the intuitively correct behaviour in 8.x, > although the 8.x kthread_exit() behaviour with respect to wakeup() is > not documented at all in the kthread(9) man page. patch suggested :-) > > On the other, we have the 7.x documented behaviour which is correct, but > the actual behaviour of the code (which is historically consistent) is > incorrect and at odds with the 8.x behaviour. in 7.x nearly everything uses kproc... so we could probably safely change it now. > > I'm playing devil's advocate here as now I'm curious whether this issue > is really considered part of the KPI or not. If the actual behaviour is > what's important, then we obviously can't make the change in 7.x. 
If the > documented behaviour is what we are supposed to be honoring, then > technically the change could be made, no? > > Devil's advocate musings aside, my personal feelings are that we should > be aiming for intuitive correctness in the KPI i.e. leaving the 8.x code > as it is makes sense. Even though I feel the wakeup() behaviour is not > technically part of the KPI in 7.x, I don't think we should change the > code. > > Therefore I would propose some improvements to both the 7.x and 8.x > kthread(9) man pages which clearly document the actual behaviour and > subtle differences between 7.x and 8.x. > > I also suspect an entry in UPDATING should be added close to the > existing 20071020 entry that retrospectively discusses the switch and > the subtle difference in kthread_exit() behaviour. > > Finally, mentioning that the value of __FreeBSD_version can be checked > against 800002 using an #ifdef test to conditionally detect which > behaviour should be used would also be a good idea. > > The above changes should equip developers with all the info needed to > maintain code that crosses the 7.x/8.x gap with minimal loss in hair. from MEMORY the wakeup and sleep are both done inside our own functions and the user is not expected to do them himself. so as long as we fix it on both sides.... (I may be wrong on that, I haven't looked at the code but just memories..) > > Cheers, > Lawrence From lstewart at freebsd.org Wed Nov 19 16:04:48 2008 From: lstewart at freebsd.org (Lawrence Stewart) Date: Wed Nov 19 16:04:55 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <49249624.7020108@elischer.org> References: <492412E8.3060700@freebsd.org> <49245D3B.8050607@elischer.org> <49248958.9060308@freebsd.org> <49249624.7020108@elischer.org> Message-ID: <4924A6BE.8080308@freebsd.org> Julian Elischer wrote: > Lawrence Stewart wrote: > >> I'm not sure I'm clear on why the wakeup behaviour is part of the KPI >> though. 
The documented behaviour in the 7.x kthread(9) man page is >> that it calls wakeup() on the "thread handle". So the documented >> behaviour is the intuitively correct one. The actual behaviour is >> "wrong", although historically consistent. > AH > > I thought you said the documentation for kthread didn't mention teh > wakeup.. > so, we should probably MFC the fix > I said the 7.x kthread(9) man page mentioned the wakeup(), the 8.x kthread(9) man page does not. This definitely needs to be corrected. > > [...] > >>> >>> which one would you change? >> >> heh, good question. >> >> On the one hand we have the intuitively correct behaviour in 8.x, >> although the 8.x kthread_exit() behaviour with respect to wakeup() is >> not documented at all in the kthread(9) man page. > > patch suggested :-) Can do. > >> >> On the other, we have the 7.x documented behaviour which is correct, >> but the actual behaviour of the code (which is historically >> consistent) is incorrect and at odds with the 8.x behaviour. > > in 7.x nearly everything uses kproc... so we could probably safely > change it now. We could definitely change it for all in tree cases easily enough. My concern is out of tree code. This change would make any out of tree code that relies on the actually implemented wakeup() mechanism potentially deadlock, which is not nice. We could argue that the documented behaviour is correct and that we're correcting a bug with the fix... still, tis cold comfort for anyone who's working code now deadlocks. I do like the maintenance simplicity the change would bring moving forward. I'm still not sure the code change is the best idea. Does anyone else have thoughts on the matter? > >> >> I'm playing devil's advocate here as now I'm curious whether this >> issue is really considered part of the KPI or not. If the actual >> behaviour is what's important, then we obviously can't make the change >> in 7.x. 
If the documented behaviour is what we are supposed to be >> honoring, then technically the change could be made, no? >> >> Devil's advocate musings aside, my personal feelings are that we >> should be aiming for intuitive correctness in the KPI i.e. leaving the >> 8.x code as it is makes sense. Even though I feel the wakeup() >> behaviour is not technically part of the KPI in 7.x, I don't think we >> should change the code. >> >> Therefore I would propose some improvements to both the 7.x and 8.x >> kthread(9) man pages which clearly document the actual behaviour and >> subtle differences between 7.x and 8.x. >> >> I also suspect an entry in UPDATING should be added close to the >> existing 20071020 entry that retrospectively discusses the switch and >> the subtle difference in kthread_exit() behaviour. >> >> Finally, mentioning that the value of __FreeBSD_version can be checked >> against 800002 using an #ifdef test to conditionally detect which >> behaviour should be used would also be a good idea. >> >> The above changes should equip developers with all the info needed to >> maintain code that crosses the 7.x/8.x gap with minimal loss in hair. > > from MEMORY the wakeup and sleep are both done inside our own > functions and the user is not expected to do them himself. > so as long as we fix it on both sides.... > (I may be wrong on that, I haven't looked at the code but just memories..) Not sure what sleep you're referring to, but yes we say we don't require the user to do their own wakeup() as their thread dies. Though consumers of the API in 7.x and 8.x might have worked around the difference in behaviour by adding their own wakeup() (it's how I did it initially before digging a bit deeper). 
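
The __FreeBSD_version check suggested in the quoted text might look roughly like the sketch below. It is only an illustration under stated assumptions: the 800002 threshold is the value given in this thread; `KTHREAD_EXIT_WAKES_THREAD`, `struct worker_handles` and `worker_exit_channel()` are hypothetical names; the opaque `struct proc`/`struct thread` declarations stand in for the kernel headers so the fragment compiles in userland; and the fallback #define exists only so it builds where <sys/param.h> has not supplied the real value.

```c
#include <stddef.h>

/*
 * In a kernel module this value comes from <sys/param.h>.  The fallback
 * is an assumption made so the sketch also compiles elsewhere.
 */
#ifndef __FreeBSD_version
#define __FreeBSD_version 800002
#endif

/* Opaque stand-ins for the kernel types; only pointers are used here. */
struct proc;
struct thread;

/* Hypothetical bookkeeping a module might keep for its worker. */
struct worker_handles {
	struct proc *proc;	/* what kthread_exit() wakes in 7.x */
	struct thread *thread;	/* what kthread_exit() wakes in 8.x */
};

#if __FreeBSD_version >= 800002
#define KTHREAD_EXIT_WAKES_THREAD 1
#else
#define KTHREAD_EXIT_WAKES_THREAD 0
#endif

/* Return the address a waiter should sleep on to see the worker exit. */
const void *
worker_exit_channel(const struct worker_handles *w)
{
#if KTHREAD_EXIT_WAKES_THREAD
	return (w->thread);
#else
	return (w->proc);
#endif
}
```

A waiter would then pass the returned address as the wait channel to tsleep(9) or msleep(9) after asking the worker to exit, instead of hard-coding either pointer.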
Cheers, Lawrence From lstewart at freebsd.org Wed Nov 19 16:38:36 2008 From: lstewart at freebsd.org (Lawrence Stewart) Date: Wed Nov 19 16:38:43 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <4924A6BE.8080308@freebsd.org> References: <492412E8.3060700@freebsd.org> <49245D3B.8050607@elischer.org> <49248958.9060308@freebsd.org> <49249624.7020108@elischer.org> <4924A6BE.8080308@freebsd.org> Message-ID: <4924B18A.4060100@freebsd.org> Lawrence Stewart wrote: > Julian Elischer wrote: >> Lawrence Stewart wrote: >> [snip] > >> >>> >>> On the other, we have the 7.x documented behaviour which is correct, >>> but the actual behaviour of the code (which is historically >>> consistent) is incorrect and at odds with the 8.x behaviour. >> >> in 7.x nearly everything uses kproc... so we could probably safely >> change it now. > > We could definitely change it for all in tree cases easily enough. My > concern is out of tree code. This change would make any out of tree code > that relies on the actually implemented wakeup() mechanism potentially > deadlock, which is not nice. We could argue that the documented > behaviour is correct and that we're correcting a bug with the fix... > still, tis cold comfort for anyone who's working code now deadlocks. > > I do like the maintenance simplicity the change would bring moving forward. > > I'm still not sure the code change is the best idea. Does anyone else > have thoughts on the matter? > [snip] *slaps forehead* So, it just occurred to me through my mid-morning mental haze after a chat with attilio@ that we should just call wakeup() on both the *proc _and_ *thread in 7.x and be done with it. It doesn't hurt anyone, maintains the current behaviour, ensures we're living up to our documented KPI of delivering a wakeup on the thread handle and resolves the compatibility issues between 7.x and 8.x in this respect - no potential deadlocks is good++. I'll have a go at a patch for code + man pages shortly. 
Cheers, Lawrence From jhb at freebsd.org Thu Nov 20 12:37:05 2008 From: jhb at freebsd.org (John Baldwin) Date: Thu Nov 20 12:37:12 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <492412E8.3060700@freebsd.org> References: <492412E8.3060700@freebsd.org> Message-ID: <200811201502.23943.jhb@freebsd.org> On Wednesday 19 November 2008 08:21:44 am Lawrence Stewart wrote: > Hi all, > > I tracked down a deadlock in some of my code today to some weird > behaviour in the kthread(9) KPI. The executive summary is that > kthread_exit() thread termination notification using wakeup() behaves as > expected intuitively in 8.x, but not in 7.x. In 5.x/6.x/7.x kthreads are still processes and it has always been a wakeup on the proc pointer. kthread_create() in 7.x returns a proc pointer, not a thread pointer for example. In 8.x kthreads are actual threads and kthread_add() and kproc_kthread_add() both return thread pointers. Hence in 8.x kthread_exit() is used for exiting kernel threads and wakes up the thread pointer, but in 7.x kthread_exit() is used for exiting kernel processes and wakes up the proc pointer. I think what is probably needed is to simply document that arrangement as such. Note that the sleeping on proc pointer has been the documented way to synchronize with kthread_exit() since 5.0. -- John Baldwin From lstewart at freebsd.org Thu Nov 20 14:22:06 2008 From: lstewart at freebsd.org (Lawrence Stewart) Date: Thu Nov 20 14:22:21 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <200811201502.23943.jhb@freebsd.org> References: <492412E8.3060700@freebsd.org> <200811201502.23943.jhb@freebsd.org> Message-ID: <4925E30B.8010709@freebsd.org> John Baldwin wrote: > On Wednesday 19 November 2008 08:21:44 am Lawrence Stewart wrote: >> Hi all, >> >> I tracked down a deadlock in some of my code today to some weird >> behaviour in the kthread(9) KPI. 
The executive summary is that >> kthread_exit() thread termination notification using wakeup() behaves as >> expected intuitively in 8.x, but not in 7.x. > > In 5.x/6.x/7.x kthreads are still processes and it has always been a wakeup on > the proc pointer. kthread_create() in 7.x returns a proc pointer, not a > thread pointer for example. In 8.x kthreads are actual threads and Yep, but the processes have a *thread in them right? The API naming is obviously slightly misleading, but it essentially creates a new single threaded process prior to 8.x. > kthread_add() and kproc_kthread_add() both return thread pointers. Hence in Yup. > 8.x kthread_exit() is used for exiting kernel threads and wakes up the thread > pointer, but in 7.x kthread_exit() is used for exiting kernel processes and > wakes up the proc pointer. I think what is probably needed is to simply In the code, yes. Our documented behaviour as far as I can tell is different though, unless we equate a "thread handle" to "proc handle" prior to 8.x, which I don't think is the case - they are still different. > document that arrangement as such. Note that the sleeping on proc pointer I agree that the arrangement needs to be better documented. The change in 8.x is subtle enough that reading the kthread man page in 7.x and 8.x doesn't immediately make it obvious what's going on. > has been the documented way to synchronize with kthread_exit() since 5.0. > Could you please point me at this documentation? I've missed it in my poking around thus far. Cheers, Lawrence From julian at elischer.org Fri Nov 21 01:25:08 2008 From: julian at elischer.org (Julian Elischer) Date: Fri Nov 21 01:25:15 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <4925E30B.8010709@freebsd.org> References: <492412E8.3060700@freebsd.org> <200811201502.23943.jhb@freebsd.org> <4925E30B.8010709@freebsd.org> Message-ID: <492677AE.7060607@elischer.org> Lawrence Stewart wrote: > > I agree that the arrangement needs to be better documented. 
The change > in 8.x is subtle enough that reading the kthread man page in 7.x and 8.x > doesn't immediately make it obvious what's going on. > I tried to walk the fine line between being completely incompatible and letting people fall into a trap. The arguments and prototypes are different so code should fail to compile unless people have made changes to their code.. From jhb at freebsd.org Fri Nov 21 11:59:09 2008 From: jhb at freebsd.org (John Baldwin) Date: Fri Nov 21 11:59:15 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <4925E30B.8010709@freebsd.org> References: <492412E8.3060700@freebsd.org> <200811201502.23943.jhb@freebsd.org> <4925E30B.8010709@freebsd.org> Message-ID: <200811211348.41536.jhb@freebsd.org> On Thursday 20 November 2008 05:22:03 pm Lawrence Stewart wrote: > John Baldwin wrote: > > On Wednesday 19 November 2008 08:21:44 am Lawrence Stewart wrote: > >> Hi all, > >> > >> I tracked down a deadlock in some of my code today to some weird > >> behaviour in the kthread(9) KPI. The executive summary is that > >> kthread_exit() thread termination notification using wakeup() behaves as > >> expected intuitively in 8.x, but not in 7.x. > > > > In 5.x/6.x/7.x kthreads are still processes and it has always been a wakeup on > > the proc pointer. kthread_create() in 7.x returns a proc pointer, not a > > thread pointer for example. In 8.x kthreads are actual threads and > > Yep, but the processes have a *thread in them right? The API naming is > obviously slightly misleading, but it essentially creates a new single > threaded process prior to 8.x. Yes, but you have to go explicitly use FIRST_THREAD_IN_PROC(). Most of the kernel modules I've written that use kthread's in < 8 do this: static struct proc *foo_thread; /* Called for MOD_LOAD. */ static void load(...) { error = kthread_create(..., &foo_thread); } static void unload(...) { /* set flag */ msleep(foo_thread, ...); } And never actually use the thread at all. 
However, if you write the code for 8.x, now you _do_ get a kthread and sleep on the thread so it becomes: static struct thread *foo_thread; static void load(...) { error = kproc_kthread_add(..., proc0, &foo_thread); } static void unload(...) { /* set flag */ msleep(foo_thread, ...); } > > kthread_add() and kproc_kthread_add() both return thread pointers. Hence in > > Yup. > > > 8.x kthread_exit() is used for exiting kernel threads and wakes up the thread > > pointer, but in 7.x kthread_exit() is used for exiting kernel processes and > > wakes up the proc pointer. I think what is probably needed is to simply > > In the code, yes. Our documented behaviour as far as I can tell is > different though, unless we equate a "thread handle" to "proc handle" > prior to 8.x, which I don't think is the case - they are still different. It has always been the case in < 8 that you sleep on the proc handle (what kthread_create() actually returns in < 8). And in fact, you even have to dig around in the proc you get from kthread_create() to even find the thread pointer as opposed to having the API hand it to you. > > document that arrangement as such. Note that the sleeping on proc pointer > > I agree that the arrangement needs to be better documented. The change > in 8.x is subtle enough that reading the kthread man page in 7.x and 8.x > doesn't immediately make it obvious what's going on. > > > has been the documented way to synchronize with kthread_exit() since 5.0. > > > > Could you please point me at this documentation? I've missed it in my > poking around thus far. It is probably only documented in numerous threads in the mail archives sadly, but there have been several of them and there have been several fixes to get this right (the randomdev thread and fdc(4) thread come to mind). 
-- John Baldwin From attilio at freebsd.org Sun Nov 23 05:22:33 2008 From: attilio at freebsd.org (Attilio Rao) Date: Sun Nov 23 05:22:40 2008 Subject: [PATCH] pmcannotate tool Message-ID: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> pmcannotate is a tool that prints out the sources of a program (in C or assembly) with inlined profiling information retrieved by a prior pmcstat analysis. Compared with things like callgraph generation, it prints out profiling on a per-instance basis and this can be useful to find, for example, badly handled caches, too high latency instructions, etc. The tool usage is pretty simple: pmcannotate [-a] [-h] [-k path] [-l level] samples.out binaryobj where samples.out is a pmcstat raw output and binaryobj is the binary object that has been profiled and is accessible for (ELF) symbol retrieval. The options are better described in the manpages but briefly: -a: performs analysis on the assembly rather than the C source -h: usage and information -k: specify a path for the kernel in order to locate the correct objects for it -l: specify a lower boundary (in total percentage time) below which functions will no longer be displayed. A typical use of pmcannotate is kernel annotation. For example, you can follow the steps below: 1) Generate a pmc raw output of system samples: # pmcstat -S ipm-unhalted-core-cycles -O samples.out 2) Copy the samples into the kernel build dir and cd there # cp samples.out /usr/src/sys/i386/compile/GENERIC/ ; cd /usr/src/sys/i386/compile/GENERIC/ 3) Run pmcannotate # pmcannotate -k . samples.out kernel.debug > kernel.ann In the example above please note that kernel.debug has to be used in order to produce a C annotated source. This happens because in order to get the binary sources we rely on the "objdump -S" command which wants a binary compiled with debugging options. If no debugging options are present, assembly analysis is still possible, but no C-backed one will be available.
objdump is not the only tool on which pmcannotate relies. In fact, in order to have it working, pmcstat needs to be present too because we need to retrieve, from the pmcstat raw output, information about the sampled PCs (in particular the name of the function they live within, and its start and end addresses). Since pmcstat currently doesn't return that information, a new option has been added to the tool (-m) which can extract (from a raw pmcstat output) all sampled PCs, the names of the functions and the symbol boundaries they live within. Also please note that pmcannotate suffers from 2 limitations. Firstly, since it relies on objdump to dump the C source, with heavy optimization levels and lots of inlines the code gets difficult to read. Secondly, in particular on x86 but I guess it is not the only case, the sample is always attributed to the instruction directly following the one that was interrupted. So in a C source view some samples may be attributed to the line below the one you're interested in. It's also important to keep in mind that if a line is a jump target or the start of a function the sample really belongs elsewhere. The patch can be found here: http://www.freebsd.org/~attilio/pmcannotate.diff/ where the pmcannotate/ dir contains the code and needs to go under /usr/src/usr.sbin/ and the patch has diffs against pmcstat and Makefile. This work has been developed on behalf of Nokia with important feedback and direction from Jeff Roberson. Testing and feedback (before it hits the tree) are welcome. Thanks, Attilio -- Peace can only be achieved by understanding - A.
Einstein From joseph.koshy at gmail.com Sun Nov 23 06:11:37 2008 From: joseph.koshy at gmail.com (Joseph Koshy) Date: Sun Nov 23 06:11:43 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> Message-ID: <84dead720811230544l625a565w158ccc1833b73799@mail.gmail.com> > pmcannotate is a tool that prints out sources of a tool (in C or > assembly) with inlined profiling informations retrieved by a prior > pmcstat analysis. [snip] > This work has been developed on the behalf of Nokia with important > feedbacks and directions from Jeff Roberson. Are you "rookie" on IRC, who used to catch me with tons of questions whenever I logged on? > Testing and feedbacks (before it hits the tree) are welcome. The pmcstat changes seem fine. Koshy From nwhitehorn at freebsd.org Sun Nov 23 09:39:16 2008 From: nwhitehorn at freebsd.org (Nathan Whitehorn) Date: Sun Nov 23 09:39:50 2008 Subject: Enumerable I2C busses Message-ID: <4929877B.6060307@freebsd.org> On Apple's PowerPC systems, the firmware device tree helpfully enumerates the system's I2C busses. Marco Trillo has recently written a driver for one of the system's I2C controllers in order to support the attached audio codecs, and I'm trying to figure out the best way to import it. The current I2C bus mechanism does not support the bus adding its own children and instead relies on hints or other out-of-band information for device attachment. It would be nice to do something like what the firmware-assisted PCI bus drivers do (ofw_pci, for instance): hijack child enumeration from the MI layer and attach information from the firmware. However, since all current I2C drivers' probe() routines return 0, I can't simply add the firmware devices, because as soon as the probe() methods of the existing drivers are called, they will take over all the devices on the bus. What is the best way to handle this? 
-Nathan From des at des.no Sun Nov 23 10:34:49 2008 From: des at des.no (Dag-Erling Smørgrav) Date: Sun Nov 23 10:34:57 2008 Subject: Enumerable I2C busses In-Reply-To: <4929877B.6060307@freebsd.org> (Nathan Whitehorn's message of "Sun, 23 Nov 2008 10:40:27 -0600") References: <4929877B.6060307@freebsd.org> Message-ID: <86myfq9uha.fsf@ds4.des.no> Nathan Whitehorn writes: > The current I2C bus mechanism does not support the bus adding its own > children [...] That's because the I2C protocol does not support device enumeration or identification. You have to know in advance what kind of devices are attached and at what address. Even worse, it is not uncommon for similar but not entirely compatible devices to use the same I2C address (for instance, every I2C-capable RTC chip uses the same address, even though they have different feature sets). DES -- Dag-Erling Smørgrav - des@des.no From xcllnt at mac.com Sun Nov 23 11:28:01 2008 From: xcllnt at mac.com (Marcel Moolenaar) Date: Sun Nov 23 11:28:07 2008 Subject: Enumerable I2C busses In-Reply-To: <4929877B.6060307@freebsd.org> References: <4929877B.6060307@freebsd.org> Message-ID: On Nov 23, 2008, at 8:40 AM, Nathan Whitehorn wrote: > On Apple's PowerPC systems, the firmware device tree helpfully > enumerates the system's I2C busses. Marco Trillo has recently > written a driver for one of the system's I2C controllers in order to > support the attached audio codecs, and I'm trying to figure out the > best way to import it. > > The current I2C bus mechanism does not support the bus adding its > own children and instead relies on hints or other out-of-band > information for device attachment. It would be nice to do something > like what the firmware-assisted PCI bus drivers do (ofw_pci, for > instance): hijack child enumeration from the MI layer and attach > information from the firmware.
> > However, since all current I2C drivers' probe() routines return 0, I > can't simply add the firmware devices, because as soon as the > probe() methods of the existing drivers are called, they will take > over all the devices on the bus. > > What is the best way to handle this? I think the best approach is to have the probe() method actually test for something. This is the standard thing to do when the bus does not know a priori which driver to instantiate. I believe that the bus should never create a child of a specific devclass, unless instructed by the user (read: pre-binding directives). For ocpbus(4) (see sys/powerpc/mpc85xx/ocpbus.c and sys/powerpc/include/ocpbus.h), we created simple defines to identify the hardware and use that in the probe() method to bind the driver to the right hardware, as in: In uart(4), for example we have: ... error = BUS_READ_IVAR(parent, dev, OCPBUS_IVAR_DEVTYPE, &devtype); if (error) return (error); if (devtype != OCPBUS_DEVTYPE_UART) return (ENXIO); ... I'm not saying to copy it, but it does demonstrate that you can trivially implement something that eliminates the assumption that the bus instantiates children of a particular devclass and thus that the probe() method can always return 0. What we do need is a generic way (such as the OF device tree) to describe the hardware, its resource needs and how it's all wired together (think interrupt routing). 
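To make the idea concrete outside any particular bus, here is a small userland model of a probe routine that declines devices it does not recognise instead of returning 0 unconditionally. Everything in it (the toy_device struct, the TOY_DEVTYPE_* values and toy_uart_probe()) is invented for illustration and is not the ocpbus(4) or newbus API; only ENXIO is the real errno value. Treat it as a sketch of the pattern, nothing more.

```c
#include <errno.h>

/* Hypothetical device-type tags that a parent bus might publish per child. */
enum toy_devtype {
	TOY_DEVTYPE_UART = 1,
	TOY_DEVTYPE_I2C = 2
};

/* Hypothetical child device as seen by a driver's probe routine. */
struct toy_device {
	enum toy_devtype devtype;
};

/*
 * Claim only UART children. Returning ENXIO for anything else leaves
 * the device free for another driver to probe.
 */
static int
toy_uart_probe(const struct toy_device *dev)
{
	if (dev->devtype != TOY_DEVTYPE_UART)
		return (ENXIO);
	return (0);
}
```

With probes written this way, the bus can add children without knowing which driver will end up owning each one.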
-- Marcel Moolenaar xcllnt@mac.com From Alexander at Leidinger.net Sun Nov 23 12:11:41 2008 From: Alexander at Leidinger.net (Alexander Leidinger) Date: Sun Nov 23 12:11:53 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> Message-ID: <20081123205603.17752y578er4bcqo@webmail.leidinger.net> Quoting Attilio Rao (from Sun, 23 Nov 2008 14:02:22 +0100): > pmcannotate is a tool that prints out sources of a tool (in C or > assembly) with inlined profiling informations retrieved by a prior > pmcstat analysis. > If compared with things like callgraph generation, it prints out > profiling on a per-instance basis and this can be useful to find, for > example, badly handled caches, too high latency instructions, etc. Can this also be used to do some code coverage analysis? What I'm interested in is to enable something, run some tests in userland, disable this something, and then run a tool which tells me which parts of specific functions where run or not. At first I hoped I can use dtrace for this... I had a dtrace training and seen the userland probes in action, where you can trace every ASM instruction, but unfortunately you can not do this with kernel probes. I tried with fbt and syscall on a Solaris 10 machine. I haven't tested with FreeBSD-dtrace yet, but I doubt it is more advanced in this regard than the Solaris dtrace. So I'm still searching. Bye, Alexander. -- We should keep the Panama Canal. After all, we stole it fair and square. -- S. I. 
Hayakawa http://www.Leidinger.net Alexander @ Leidinger.net: PGP ID = B0063FE7 http://www.FreeBSD.org netchild @ FreeBSD.org : PGP ID = 72077137 From raj at semihalf.com Sun Nov 23 12:30:06 2008 From: raj at semihalf.com (Rafał Jaworowski) Date: Sun Nov 23 12:30:28 2008 Subject: Enumerable I2C busses In-Reply-To: <86myfq9uha.fsf@ds4.des.no> References: <4929877B.6060307@freebsd.org> <86myfq9uha.fsf@ds4.des.no> Message-ID: On 2008-11-23, at 19:18, Dag-Erling Smørgrav wrote: > Nathan Whitehorn writes: >> The current I2C bus mechanism does not support the bus adding its own >> children [...] > > That's because the I2C protocol does not support device enumeration or > identification. You have to know in advance what kind of devices are > attached and at what address. Even worse, it is not uncommon for > similar but not entirely compatible devices to use the same I2C address > (for instance, every I2C-capable RTC chip uses the same address, even > though they have different feature sets) Well, hard-coded addresses and conflicting assignments between vendors do not technically prevent scanning the bus; actually, our current iicbus code can do bus scanning when compiled with a diag define. The problem, however, is that some slave devices are not well-behaved, and they don't like to be read or written other than in a very specific scenario: if polled during a bus scan, strange effects occur, e.g. they disappear from the bus, or do not react to consecutive requests etc. Nathan, not sure if this helps you, but I have a nice i2c diagnostic tool, which among other features lets the user scan the I2C bus for present slave devices. This is done from userland, so doing a similar thing in-kernel wouldn't be a problem. I was planning to post this for review this coming week, so you can have a look.
Rafal From raykinsella78 at gmail.com Sun Nov 23 12:49:55 2008 From: raykinsella78 at gmail.com (Ray Kinsella) Date: Sun Nov 23 12:50:01 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> Message-ID: <584ec6bb0811231226x7d1b53ccgc5e28bfc297009a2@mail.gmail.com> I know I am going to really show my FreeBSD ignorance here, but this is a patch of FreeBSD 8.0 Current isn't it ? Thanks Ray Kinsella On Sun, Nov 23, 2008 at 1:02 PM, Attilio Rao wrote: > pmcannotate is a tool that prints out sources of a tool (in C or > assembly) with inlined profiling informations retrieved by a prior > pmcstat analysis. > If compared with things like callgraph generation, it prints out > profiling on a per-instance basis and this can be useful to find, for > example, badly handled caches, too high latency instructions, etc. > > The tool usage is pretty simple: > pmcannotate [-a] [-h] [-k path] [-l level] samples.out binaryobj > > where samples.out is a pmcstat raw output and binaryobj is the binary > object that has been profiled and is accessible for (ELF) symbols > retrieving. > The options are better described in manpages but briefly: > - a: performs analysis on the assembly rather than the C source > - h: usage and informations > - k: specify a path for the kernel in order to locate correct objects for > it > - l: specify a lower boundary (in total percentage time) after which > functions will be displayed nomore. > > A typical usage of pmcannotate can be some way of kernel annotation. > For example, you can follow the steps below: > 1) Generate a pmc raw output of system samples: > # pmcstat -S ipm-unhalted-core-cycles -O samples.out > 2) Copy the samples in the kernel building dir and cd there > # cp samples.out /usr/src/sys/i386/compile/GENERIC/ ; cd > /usr/src/sys/i386/compile/GENERIC/ > 3) Run pmcannotate > # pmcannotate -k . 
samples.out kernel.debug > kernel.ann > > In the example above please note that kernel.debug has to be used in > order to produce a C annotated source. This happens because in order > to get the binary sources we rely on the "objdump -S" command which > wants binary compiled with debugging options. > If not debugging options are present assembly analynsis is still > possible, but no C-backed one will be available. > objdump is not the only one tool on which pmcannotare rely. Infact, in > order to have it working, pmcstat needs to be present too because we > need to retrieve, from the pmcstat raw output, informations about the > sampled PCs (in particular the name of the function they live within, > its start and ending addresses). As long as currently pmcstat doesn't > return those informations, a new option has been added to the tool > (-m) which can extract (from a raw pmcstat output) all pc sampled, > name of the functions and symbol bundaries they live within. > > Also please note that pmcannotate suffers of 2 limitations. > Firstly, relying on objdump to dump the C source, with heavy > optimization levels and lots of inlines the code gets difficult to > read. Secondly, in particular on x86 but I guess it is not the only > one case, the sample is always attributed to the instruction directly > following the one that was interrupted. So in a C source view some > samples may be attributed to the line below the one you're interested > in. It's also important to keep in mind that if a line is a jump > target or the start of a function the sample really belongs elsewhere. > > The patch can be found here: > http://www.freebsd.org/~attilio/pmcannotate.diff/ > > where pmcannotate/ dir contains the code and needs to go under > /usr/src/usr.sbin/ and the patch has diffs against pmcstat and > Makefile. > > This work has been developed on the behalf of Nokia with important > feedbacks and directions from Jeff Roberson. 
> > Testing and feedbacks (before it hits the tree) are welcome. > > Thanks, > Attilio > > > -- > Peace can only be achieved by understanding - A. Einstein > _______________________________________________ > freebsd-performance@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-performance > To unsubscribe, send any mail to " > freebsd-performance-unsubscribe@freebsd.org" > From nwhitehorn at freebsd.org Sun Nov 23 13:09:38 2008 From: nwhitehorn at freebsd.org (Nathan Whitehorn) Date: Sun Nov 23 13:09:43 2008 Subject: Enumerable I2C busses In-Reply-To: References: <4929877B.6060307@freebsd.org> <86myfq9uha.fsf@ds4.des.no> Message-ID: <4929C6D8.7090305@freebsd.org> Rafał Jaworowski wrote: > > On 2008-11-23, at 19:18, Dag-Erling Smørgrav wrote: > >> Nathan Whitehorn writes: >>> The current I2C bus mechanism does not support the bus adding its own >>> children [...] >> >> That's because the I2C protocol does not support device enumeration or >> identification. You have to know in advance what kind of devices are >> attached and at what address. Even worse, it is not uncommon for >> similar but not entirely compatible devices to use the same I2C address >> (for instance, every I2C-capable RTC chip uses the same address, even >> though they have different feature sets) > > Well, hard-coded addresses and conflicting assignments between vendors > do not technically prevent from scanning the bus; actually, our current > iicbus code can do bus scanning when compiled with a diag define. The > problem however is some slave devices are not well-behaved, and they > don't like to be read/written to other than in very specific scenario: > if polled during bus scan strange effects occur e.g. they disappear from > the bus, or do not react to consecutive requests etc. All of this is true, but perhaps my question was badly worded.
What I am trying to figure out is how to shove information from an out-of-band source (Open Firmware, in this case) into newbus without disrupting existing code. In that way, my question is not I2C specific -- we run into the same issue with the Open Firmware nexus node and pseudo-devices like cryptosoft that attach themselves. What I want to do is to have the I2C bus add the children that the firmware says it has. What the firmware cannot tell in advance, however, is which FreeBSD driver is responsible for those devices, and so the I2C bus driver can't know that without a translation table that I would prefer not to hack in to the bus driver. It seems reasonable to allow devices to use a real probe routine to look at the firmware's name and compatible properties, like we allow on other Open Firmware busses. The trouble is that existing drivers don't do this, because they expect to be attached with hints, so they will attach to all devices. I'm trying to figure out how to avoid this. My basic question comes down to whether there is a good way in newbus to handle busses that may be wholly or partially enumerated by firmware or some other method, and may also have devices that can only attach themselves if told to by hints. > Nathan, not sure if this helps you, but I have a nice i2c diagnostic > tool, which among other features lets the user scan the I2C bus for > present slave devices. This is done from userland, so doing similar > thing in-kernel wouldn't be a problem. I was planning to post this for > review this coming week, so you can have a look. It's not directly useful, no, but that's a very useful tool that will be handy to have. 
-Nathan From attilio at freebsd.org Sun Nov 23 15:46:32 2008 From: attilio at freebsd.org (Attilio Rao) Date: Sun Nov 23 15:46:43 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <20081123205603.17752y578er4bcqo@webmail.leidinger.net> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> <20081123205603.17752y578er4bcqo@webmail.leidinger.net> Message-ID: <3bbf2fe10811231546r44bd2aafqa3d714a4955f52ad@mail.gmail.com> 2008/11/23, Alexander Leidinger : > Quoting Attilio Rao (from Sun, 23 Nov 2008 14:02:22 +0100): > > > > pmcannotate is a tool that prints out sources of a tool (in C or > > assembly) with inlined profiling informations retrieved by a prior > > pmcstat analysis. > > If compared with things like callgraph generation, it prints out > > profiling on a per-instance basis and this can be useful to find, for > > example, badly handled caches, too high latency instructions, etc. > > > > Can this also be used to do some code coverage analysis? What I'm > interested in is to enable something, run some tests in userland, disable > this something, and then run a tool which tells me which parts of specific > functions were run or not. Yes, this is exactly what it does. You can see traces for any sampled PC and so get a profiling analysis on a per-instance basis. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From attilio at freebsd.org Sun Nov 23 15:47:01 2008 From: attilio at freebsd.org (Attilio Rao) Date: Sun Nov 23 15:47:07 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <584ec6bb0811231226x7d1b53ccgc5e28bfc297009a2@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> <584ec6bb0811231226x7d1b53ccgc5e28bfc297009a2@mail.gmail.com> Message-ID: <3bbf2fe10811231547w651dea65h211b82ae4dcef005@mail.gmail.com> 2008/11/23, Ray Kinsella : > I know I am going to really show my FreeBSD ignorance here, but this is a > patch of FreeBSD 8.0 Current isn't it ? Yes, it is for 8.0.
Is it giving you problems? Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From imp at bsdimp.com Sun Nov 23 16:02:44 2008 From: imp at bsdimp.com (M. Warner Losh) Date: Sun Nov 23 16:02:51 2008 Subject: Enumerable I2C busses In-Reply-To: References: <4929877B.6060307@freebsd.org> Message-ID: <20081123.170334.660270635.imp@bsdimp.com> In message: Marcel Moolenaar writes: : : On Nov 23, 2008, at 8:40 AM, Nathan Whitehorn wrote: : : > On Apple's PowerPC systems, the firmware device tree helpfully : > enumerates the system's I2C busses. Marco Trillo has recently : > written a driver for one of the system's I2C controllers in order to : > support the attached audio codecs, and I'm trying to figure out the : > best way to import it. : > : > The current I2C bus mechanism does not support the bus adding its : > own children and instead relies on hints or other out-of-band : > information for device attachment. It would be nice to do something : > like what the firmware-assisted PCI bus drivers do (ofw_pci, for : > instance): hijack child enumeration from the MI layer and attach : > information from the firmware. The current i2c could easily override the hints things that I added there for platforms that didn't support a nice OF tree or the like. : > However, since all current I2C drivers' probe() routines return 0, I : > can't simply add the firmware devices, because as soon as the : > probe() methods of the existing drivers are called, they will take : > over all the devices on the bus. : > : > What is the best way to handle this? : : I think the best approach is to have the probe() method : actually test for something. This is the standard thing : to do when the bus does not know a priori which driver : to instantiate. I believe that the bus should never : create a child of a specific devclass, unless instructed : by the user (read: pre-binding directives). 
: : For ocpbus(4) (see sys/powerpc/mpc85xx/ocpbus.c and : sys/powerpc/include/ocpbus.h), we created simple defines : to identify the hardware and use that in the probe() : method to bind the driver to the right hardware, as in: : : In uart(4), for example we have: : : ... : error = BUS_READ_IVAR(parent, dev, OCPBUS_IVAR_DEVTYPE, : &devtype); : if (error) : return (error); : if (devtype != OCPBUS_DEVTYPE_UART) : return (ENXIO); : ... : : I'm not saying to copy it, but it does demonstrate that : you can trivially implement something that eliminates : the assumption that the bus instantiates children of : a particular devclass and thus that the probe() method : can always return 0.: I'm not saying copy it either. I don't like it for a variety of reasons... I don't like the fake devtypes, for one. It solves one problem, but introduces a number of others that have been talked about here. : What we do need is a generic way (such as the OF device : tree) to describe the hardware, its resource needs and : how it's all wired together (think interrupt routing). I think this is really the right way to go. Linux currently has a flattened device tree that has information about what the device is, and what it is compatible with. The probe routines then match on the compat field to see if they should attach or not. This is what should be done for the Mac PPC i2c information in the OF tree. For the moment, we likely need a subclass of i2c to do this properly, but in the future, I'd love to move to using something like this to replace hints. Warner From imp at bsdimp.com Sun Nov 23 16:08:47 2008 From: imp at bsdimp.com (M. Warner Losh) Date: Sun Nov 23 16:08:53 2008 Subject: Enumerable I2C busses In-Reply-To: <4929C6D8.7090305@freebsd.org> References: <86myfq9uha.fsf@ds4.des.no> <4929C6D8.7090305@freebsd.org> Message-ID: <20081123.170854.-626772149.imp@bsdimp.com> In message: <4929C6D8.7090305@freebsd.org> Nathan Whitehorn writes: : Rafał
Jaworowski wrote: : > : > On 2008-11-23, at 19:18, Dag-Erling Smørgrav wrote: : > : >> Nathan Whitehorn writes: : >>> The current I2C bus mechanism does not support the bus adding its own : >>> children [...] : >> : >> That's because the I2C protocol does not support device enumeration or : >> identification. You have to know in advance what kind of devices are : >> attached and at what address. Even worse, it is not uncommon for : >> similar but not entirely compatible devices to use the same I2C address : >> (for instance, every I2C-capable RTC chip uses the same address, even : >> though they have different feature sets) : > : > Well, hard-coded addresses and conflicting assignments between vendors : > do not technically prevent from scanning the bus; actually, our current : > iicbus code can do bus scanning when compiled with a diag define. The : > problem however is some slave devices are not well-behaved, and they : > don't like to be read/written to other than in very specific scenario: : > if polled during bus scan strange effects occur e.g. they disappear from : > the bus, or do not react to consecutive requests etc. : : All of this is true, but perhaps my question was badly worded. What I am : trying to figure out is how to shove information from an out-of-band : source (Open Firmware, in this case) into newbus without disrupting : existing code. In that way, my question is not I2C specific -- we run : into the same issue with the Open Firmware nexus node and pseudo-devices : like cryptosoft that attach themselves. You are looking at the problem incorrectly. In newbus, in a case like this the i2c bus should be a subclass (say i2c_of) that is derived from the normal i2c bus stuff, but replaces the hints insertion of devices with OF enumeration of devices. The OF higher levels will already know to attach this kind of i2c bus to certain i2c controllers, or always on certain platforms.
: What I want to do is to have the I2C bus add the children that the : firmware says it has. What the firmware cannot tell in advance, however, : is which FreeBSD driver is responsible for those devices, and so the I2C : bus driver can't know that without a translation table that I would : prefer not to hack in to the bus driver. This is the bigger problem. Today, we are stuck with a lame table that will translate the OF provided properties into FreeBSD driver names. : It seems reasonable to allow devices to use a real probe routine to look : at the firmware's name and compatible properties, like we allow on other : Open Firmware busses. The trouble is that existing drivers don't do : this, because they expect to be attached with hints, so they will attach : to all devices. I'm trying to figure out how to avoid this. : : My basic question comes down to whether there is a good way in newbus to : handle busses that may be wholly or partially enumerated by firmware or : some other method, and may also have devices that can only attach : themselves if told to by hints. Yes. This is a bit of a problem. The problem is that the existing hints mechanism combines device tree and driver tree into one, and in such a scenario, we wind up with the problem that you have. One could make the probe routines return BUS_PROBE_GENERIC, and that would help somewhat. One could also have the probe routine check to see if a specific driver is assigned to the device or not. That would also help, but does mean changing all the i2c bus attached drivers in the tree. 
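A minimal userland sketch of the two probe ideas above (match on the firmware's compatible property, otherwise only claim a device that hints assigned to this driver by name). This is not newbus code: struct fw_child, toy_probe() and the two priority constants are hypothetical names made up for illustration, and the constants merely stand in for the idea of a specific versus a generic match.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a child device as the bus sees it. */
struct fw_child {
	const char *compat;	/* firmware "compatible" property, or NULL */
	const char *hinted;	/* driver name assigned by hints, or NULL */
};

/* Illustrative priorities only; not taken from any kernel header. */
#define TOY_PROBE_SPECIFIC	0
#define TOY_PROBE_GENERIC	(-100)

/*
 * Toy probe routine: claim a firmware-enumerated device only when its
 * compatible string matches, and claim a hinted device only when the
 * hint names this driver.  Everything else is refused with ENXIO, so
 * one driver no longer takes over every device on the bus.
 */
static int
toy_probe(const struct fw_child *dev, const char *drvname,
    const char *drvcompat)
{
	assert(dev != NULL && drvname != NULL && drvcompat != NULL);

	if (dev->compat != NULL)
		return (strcmp(dev->compat, drvcompat) == 0 ?
		    TOY_PROBE_SPECIFIC : ENXIO);
	if (dev->hinted != NULL && strcmp(dev->hinted, drvname) == 0)
		return (TOY_PROBE_GENERIC);
	return (ENXIO);
}
```

With a check like this in every i2c-attached driver, firmware-enumerated and hint-attached children could share one bus, at the cost of touching each driver as noted above.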
Warner From jroberson at jroberson.net Sun Nov 23 16:17:15 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Sun Nov 23 16:17:28 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <3bbf2fe10811231546r44bd2aafqa3d714a4955f52ad@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> <20081123205603.17752y578er4bcqo@webmail.leidinger.net> <3bbf2fe10811231546r44bd2aafqa3d714a4955f52ad@mail.gmail.com> Message-ID: <20081123135009.I971@desktop> On Mon, 24 Nov 2008, Attilio Rao wrote: > 2008/11/23, Alexander Leidinger : >> Quoting Attilio Rao (from Sun, 23 Nov 2008 14:02:22 >> +0100): >> >> >>> pmcannotate is a tool that prints out sources of a tool (in C or >>> assembly) with inlined profiling information retrieved by a prior >>> pmcstat analysis. >>> If compared with things like callgraph generation, it prints out >>> profiling on a per-instance basis and this can be useful to find, for >>> example, badly handled caches, too high latency instructions, etc. >>> >> >> Can this also be used to do some code coverage analysis? What I'm >> interested in is to enable something, run some tests in userland, disable >> this something, and then run a tool which tells me which parts of specific >> functions were run or not. > > Yes, this is exactly what it does. > You can see traces for any sampled PC and so get a profiling analysis > on a per-instance basis. I would add that it is only sampled so you don't see every instruction executed. You can use gcov for that however. That's precisely what it's for. Thanks, Jeff > > Thanks, > Attilio > > > -- > Peace can only be achieved by understanding - A.
Einstein > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From lstewart at freebsd.org Sun Nov 23 17:15:01 2008 From: lstewart at freebsd.org (Lawrence Stewart) Date: Sun Nov 23 17:15:07 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <200811211348.41536.jhb@freebsd.org> References: <492412E8.3060700@freebsd.org> <200811201502.23943.jhb@freebsd.org> <4925E30B.8010709@freebsd.org> <200811211348.41536.jhb@freebsd.org> Message-ID: <4929F90B.1040502@freebsd.org> John Baldwin wrote: > On Thursday 20 November 2008 05:22:03 pm Lawrence Stewart wrote: >> John Baldwin wrote: >>> On Wednesday 19 November 2008 08:21:44 am Lawrence Stewart wrote: >>>> Hi all, >>>> >>>> I tracked down a deadlock in some of my code today to some weird >>>> behaviour in the kthread(9) KPI. The executive summary is that >>>> kthread_exit() thread termination notification using wakeup() behaves as >>>> expected intuitively in 8.x, but not in 7.x. >>> In 5.x/6.x/7.x kthreads are still processes and it has always been a > wakeup on >>> the proc pointer. kthread_create() in 7.x returns a proc pointer, not a >>> thread pointer for example. In 8.x kthreads are actual threads and >> Yep, but the processes have a *thread in them right? The API naming is >> obviously slightly misleading, but it essentially creates a new single >> threaded process prior to 8.x. > > Yes, but you have to go explicitly use FIRST_THREAD_IN_PROC(). Most of the > kernel modules I've written that use kthread's in < 8 do this: > > static struct proc *foo_thread; > > /* Called for MOD_LOAD. */ > static void > load(...) > { > > error = kthread_create(..., &foo_thread); > } > > static void > unload(...) > { > > /* set flag */ > msleep(foo_thread, ...); > } > > And never actually use the thread at all. 
However, if you write the code for > 8.x, now you _do_ get a kthread and sleep on the thread so it becomes: > > static struct thread *foo_thread; > > static void > load(...) > { > > error = kproc_kthread_add(..., proc0, &foo_thread); > } > > static void > unload(...) > { > > /* set flag */ > msleep(foo_thread, ...); > } > Sure, but to write the code in this way means you are exercising undocumented knowledge of the KPI. I suspect the average developer completely unfamiliar with the KPI would (and should!) use the man page to learn about the functionality it provides. With that basis in mind, it seems unreasonable to expect the developer to come to the conclusion that "...will initiate a call to wakeup(9) on the thread handle." refers to sleeping on the *proc passed in to kthread_create. Perhaps I'm not as switched on as the average developer, but when I read it I certainly did not understand that the KPI created processes and that the man page used the term thread to really mean a single threaded process. I also did not equate "thread handle" with the *proc returned by kthread_create. It seems to me that the kthread(9) man page is somewhat unclear with respect to what the KPI actually achieves, making references to thread related activities all over the place when in reality it's manipulating single threaded processes (which for all intents and purposes can be used like threads but are structurally different). I guess this is the cause for the underlying confusion. While we're on the topic, I'm also having trouble understanding the reasoning for renaming kthread_create to kthread_add, when in reality 8.x kthread_add is doing a *real* kthread creation and the originally named kthread_create seems to have been a bit of a misnomer (corrected in 8.x by moving the functionality into the kproc_* KPI). The changing of the **proc to a **thread argument is enough to ensure code from 7.x won't compile without a tweak on 8.x, so why the rename to add further confusion?
With the same line of reasoning, kproc_kthread_add should probably be kproc_kthread_create? >>> kthread_add() and kproc_kthread_add() both return thread pointers. Hence > in >> Yup. >> >>> 8.x kthread_exit() is used for exiting kernel threads and wakes up the > thread >>> pointer, but in 7.x kthread_exit() is used for exiting kernel processes > and >>> wakes up the proc pointer. I think what is probably needed is to simply >> In the code, yes. Our documented behaviour as far as I can tell is >> different though, unless we equate a "thread handle" to "proc handle" >> prior to 8.x, which I don't think is the case - they are still different. > > It has always been the case in < 8 that you sleep on the proc handle (what > kthread_create() actually returns in < 8). And in fact, you even have to dig > around in the proc you get from kthread_create() to even find the thread > pointer as opposed to having the API hand it to you. > >>> document that arrangement as such. Note that the sleeping on proc pointer >> I agree that the arrangement needs to be better documented. The change >> in 8.x is subtle enough that reading the kthread man page in 7.x and 8.x >> doesn't immediately make it obvious what's going on. >> >>> has been the documented way to synchronize with kthread_exit() since 5.0. >>> >> Could you please point me at this documentation? I've missed it in my >> poking around thus far. > > It is probably only documented in numerous threads in the mail archives sadly, > but there have been several of them and there have been several fixes to get > this right (the randomdev thread and fdc(4) thread come to mind). If we had no man page, mail archives would be the next best thing. In this instance, we have a misleading man page and I think it would be more beneficial to align the documentation/implementation rather than leaving people confused. 
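The mismatch is easy to see in a toy userland model of sleep channels. The sketch below is not kernel code: struct toy_sleeper and toy_wakeup() are hypothetical names, and the only property modelled is that a wakeup on one address wakes just the sleepers waiting on that same address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one sleeping waiter. */
struct toy_sleeper {
	const void *chan;	/* address the waiter went to sleep on */
	bool awake;		/* set once a matching wakeup arrives */
};

/*
 * Model of a wakeup on a channel: only waiters whose channel is
 * exactly this address are woken; everyone else keeps sleeping.
 */
static void
toy_wakeup(struct toy_sleeper *s, size_t n, const void *chan)
{
	assert(s != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (s[i].chan == chan)
			s[i].awake = true;
	}
}
```

If the exiting kthread only calls the wakeup on the proc address, a waiter that slept on the thread address stays asleep forever, which is the deadlock I hit; issuing the wakeup on both addresses wakes either kind of waiter.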
Apart from the discussion thus far, you haven't actually commented yet on my proposed single line change to kthread_exit() in 7.x to call wakeup on the *thread as well as the *proc. Do you have any specific thoughts on or objection to that idea? Cheers, Lawrence From julian at elischer.org Sun Nov 23 22:39:56 2008 From: julian at elischer.org (Julian Elischer) Date: Sun Nov 23 22:40:02 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <4929F90B.1040502@freebsd.org> References: <492412E8.3060700@freebsd.org> <200811201502.23943.jhb@freebsd.org> <4925E30B.8010709@freebsd.org> <200811211348.41536.jhb@freebsd.org> <4929F90B.1040502@freebsd.org> Message-ID: <492A48C8.9080302@elischer.org> Lawrence Stewart wrote: > John Baldwin wrote: >> On Thursday 20 November 2008 05:22:03 pm Lawrence Stewart wrote: >>> John Baldwin wrote: >>>> On Wednesday 19 November 2008 08:21:44 am Lawrence Stewart wrote: >>>>> Hi all, >>>>> >>>>> I tracked down a deadlock in some of my code today to some weird >>>>> behaviour in the kthread(9) KPI. The executive summary is that >>>>> kthread_exit() thread termination notification using wakeup() >>>>> behaves as expected intuitively in 8.x, but not in 7.x. >>>> In 5.x/6.x/7.x kthreads are still processes and it has always been a >> wakeup on >>>> the proc pointer. kthread_create() in 7.x returns a proc pointer, >>>> not a thread pointer for example. In 8.x kthreads are actual >>>> threads and >>> Yep, but the processes have a *thread in them right? The API naming >>> is obviously slightly misleading, but it essentially creates a new >>> single threaded process prior to 8.x. >> >> Yes, but you have to go explicitly use FIRST_THREAD_IN_PROC(). Most >> of the kernel modules I've written that use kthread's in < 8 do this: >> >> static struct proc *foo_thread; >> >> /* Called for MOD_LOAD. */ >> static void >> load(...) >> { >> >> error = kthread_create(..., &foo_thread); >> } >> >> static void >> unload(...) 
>> { >> >> /* set flag */ >> msleep(foo_thread, ...); >> } >> >> And never actually use the thread at all. However, if you write the >> code for 8.x, now you _do_ get a kthread and sleep on the thread so it >> becomes: >> >> static struct thread *foo_thread; >> >> static void >> load(...) >> { >> >> error = kproc_kthread_add(..., proc0, &foo_thread); >> } >> >> static void >> unload(...) >> { >> >> /* set flag */ >> msleep(foo_thread, ...); >> } >> > > > Sure, but to write the code in this way means you are exercising > undocumented knowledge of the KPI. I suspect the average developer > completely unfamiliar with the KPI would (and should!) use the man page > to learn about the functionality it provides. > > With that basis in mind, it seems unreasonable to expect the developer > to come to the conclusion that "...will initiate a call to wakeup(9) on > the thread handle." refers to sleeping on the *proc passed in to > kthread_create. Perhaps I'm not as switched on as the average developer, > but when I read it I certainly did not understand that the KPI created > processes and that the man page used the term thread to really mean a > single threaded process. I also did not equate "thread handle" with the > *proc returned by kthread_create. > > It seems to me that the kthread(9) man page is somewhat unclear with > respect to what the KPI actually achieves, making references to thread > related activities all over the place when in reality it's manipulating > single threaded processes (which for all intents and purposes can be used > like threads but are structurally different). I guess this is the cause > for the underlying confusion.
> > While we're on the topic, I'm also having trouble understanding the > reasoning for renaming kthread_create to kthread_add, when in reality > 8.x kthread_add is doing a *real* kthread creation and the originally > named kthread_create seems to have been a bit of a misnomer (corrected > in 8.x by moving the functionality into the kproc_* KPI). kthread_add was named that way to indicate that it is adding a thread to an existing process in addition to the thread already running the code. I wanted anyone linking in a binary module to get a link failure, even if they managed to sneak past the other safeguards. > > The changing of the **proc to a **thread argument is enough to ensure > code from 7.x won't compile without a tweak on 8.x, so why the rename to > add further confusion? With the same line of reasoning, > kproc_kthread_add should probably be kproc_kthread_create? kproc_kthread_add will add a thread to the process if it already exists, but if it doesn't it will make a new process. Hey, it's just how I think I guess. > >>>> kthread_add() and kproc_kthread_add() both return thread pointers. >>>> Hence >> in >>> Yup. >>> >>>> 8.x kthread_exit() is used for exiting kernel threads and wakes up the >> thread >>>> pointer, but in 7.x kthread_exit() is used for exiting kernel processes >> and >>>> wakes up the proc pointer. I think what is probably needed is to >>>> simply >>> In the code, yes. Our documented behaviour as far as I can tell is >>> different though, unless we equate a "thread handle" to "proc handle" >>> prior to 8.x, which I don't think is the case - they are still >>> different. >> >> It has always been the case in < 8 that you sleep on the proc handle >> (what kthread_create() actually returns in < 8). And in fact, you >> even have to dig around in the proc you get from kthread_create() to >> even find the thread pointer as opposed to having the API hand it to you. >> >>>> document that arrangement as such.
Note that the sleeping on proc >>>> pointer >>> I agree that the arrangement needs to be better documented. The >>> change in 8.x is subtle enough that reading the kthread man page in >>> 7.x and 8.x doesn't immediately make it obvious what's going on. >>> >>>> has been the documented way to synchronize with kthread_exit() since >>>> 5.0. >>>> >>> Could you please point me at this documentation? I've missed it in my >>> poking around thus far. >> >> It is probably only documented in numerous threads in the mail >> archives sadly, but there have been several of them and there have >> been several fixes to get this right (the randomdev thread and fdc(4) >> thread come to mind). > > If we had no man page, mail archives would be the next best thing. In > this instance, we have a misleading man page and I think it would be > more beneficial to align the documentation/implementation rather than > leaving people confused. > > Apart from the discussion thus far, you haven't actually commented yet > on my proposed single line change to kthread_exit() in 7.x to call > wakeup on the *thread as well as the *proc. Do you have any specific > thoughts on or objection to that idea? I really haven't got this stuff in my head at the moment so I can't comment. 
> > Cheers, > Lawrence From Alexander at Leidinger.net Sun Nov 23 23:39:30 2008 From: Alexander at Leidinger.net (Alexander Leidinger) Date: Sun Nov 23 23:39:42 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <3bbf2fe10811231546r44bd2aafqa3d714a4955f52ad@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> <20081123205603.17752y578er4bcqo@webmail.leidinger.net> <3bbf2fe10811231546r44bd2aafqa3d714a4955f52ad@mail.gmail.com> Message-ID: <20081124083920.16126d6j9o1q9mw4@webmail.leidinger.net> Quoting Attilio Rao (from Mon, 24 Nov 2008 00:46:29 +0100): > 2008/11/23, Alexander Leidinger : >> Quoting Attilio Rao (from Sun, 23 Nov 2008 14:02:22 >> +0100): >> >> >> > pmcannotate is a tool that prints out sources of a tool (in C or >> > assembly) with inlined profiling information retrieved by a prior >> > pmcstat analysis. >> > If compared with things like callgraph generation, it prints out >> > profiling on a per-instance basis and this can be useful to find, for >> > example, badly handled caches, too high latency instructions, etc. >> > >> >> Can this also be used to do some code coverage analysis? What I'm >> interested in is to enable something, run some tests in userland, disable >> this something, and then run a tool which tells me which parts of specific >> functions were run or not. > > Yes, this is exactly what it does. > You can see traces for any sampled PC and so get a profiling analysis > on a per-instance basis. Cool. Would be great if you could provide some example in the man page or as a script which shows how to do this for kernel code. This would immediately show us how good our regression tests are for their specific areas of test. Bye, Alexander.
-- In a family recipe you just discovered in an old book, the most vital measurement will be illegible. http://www.Leidinger.net Alexander @ Leidinger.net: PGP ID = B0063FE7 http://www.FreeBSD.org netchild @ FreeBSD.org : PGP ID = 72077137 From Alexander at Leidinger.net Sun Nov 23 23:40:27 2008 From: Alexander at Leidinger.net (Alexander Leidinger) Date: Sun Nov 23 23:40:33 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <20081123135009.I971@desktop> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> <20081123205603.17752y578er4bcqo@webmail.leidinger.net> <3bbf2fe10811231546r44bd2aafqa3d714a4955f52ad@mail.gmail.com> <20081123135009.I971@desktop> Message-ID: <20081124084015.84153bmq6411va68@webmail.leidinger.net> Quoting Jeff Roberson (from Sun, 23 Nov 2008 13:50:35 -1000 (HST)): > > On Mon, 24 Nov 2008, Attilio Rao wrote: > >> 2008/11/23, Alexander Leidinger : >>> Quoting Attilio Rao (from Sun, 23 Nov 2008 14:02:22 >>> +0100): >>> >>> >>>> pmcannotate is a tool that prints out sources of a tool (in C or >>>> assembly) with inlined profiling information retrieved by a prior >>>> pmcstat analysis. >>>> If compared with things like callgraph generation, it prints out >>>> profiling on a per-instance basis and this can be useful to find, for >>>> example, badly handled caches, too high latency instructions, etc. >>>> >>> >>> Can this also be used to do some code coverage analysis? What I'm >>> interested in is to enable something, run some tests in userland, disable >>> this something, and then run a tool which tells me which parts of specific >>> functions were run or not. >> >> Yes, this is exactly what it does. >> You can see traces for any sampled PC and so get a profiling analysis >> on a per-instance basis. > > I would add that it is only sampled so you don't see every > instruction executed. You can use gcov for that however. That's > precisely what it's for. How to use gcov for the kernel? Bye, Alexander.
-- If only you knew she loved you, you could face the uncertainty of whether you love her. http://www.Leidinger.net Alexander @ Leidinger.net: PGP ID = B0063FE7 http://www.FreeBSD.org netchild @ FreeBSD.org : PGP ID = 72077137 From jroberson at jroberson.net Sun Nov 23 23:48:48 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Sun Nov 23 23:48:54 2008 Subject: Limiting mbuf memory. Message-ID: <20081123213232.A971@desktop> I'm developing a patch for an alternate memory layout for mbuf clusters that relies on contigmalloc. Since this can fail, we'll still have to retain the capability of allocating traditional clusters. I'll report details on that later. I'm writing this email to address the issue of resource accounting in mbufs. Presently we use a set of limits on individual zones or sizes of mbufs. Standard mbufs, clusters, page size jumbos, 9k jumbos, and 16k jumbos. Each is administered separately. I think this is getting a bit unwieldy. In the future, we may have even more sizes. This also introduces problems because I will have two cluster zones: do they each get their own limit? I would like to consolidate this into a single limit on the number of pages in total allocated to networking. With perhaps some fractional reservation for standard mbufs and clusters to make sure they aren't overwhelmed by the larger buffers. This would be implemented by overriding the uma zone page allocator for each of the mbuf zones with one that counts pages for all. Should we reach the limit we'll block depending on the wait settings of the requestor. The limit and sleep will probably be protected by a global lock which won't be an issue because trips to the back end allocator are infrequent and protected by their own global lock anyhow. How do people feel about this? To be clear this would eliminate: nmbclusters, nmbjumbop, nmbjumbo9, nmbjumbo16 and related config settings and sysctls. They would be replaced by something like 'maxmbufbytes'.
Presently we place no limit on small mbufs. I could go either way on this. It could be added to the limit or not. Thanks, Jeff From raykinsella78 at gmail.com Mon Nov 24 00:48:10 2008 From: raykinsella78 at gmail.com (Ray Kinsella) Date: Mon Nov 24 00:48:21 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <3bbf2fe10811231547w651dea65h211b82ae4dcef005@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> <584ec6bb0811231226x7d1b53ccgc5e28bfc297009a2@mail.gmail.com> <3bbf2fe10811231547w651dea65h211b82ae4dcef005@mail.gmail.com> Message-ID: <584ec6bb0811240048x2304a759j2ac6fdea83a47773@mail.gmail.com> No, not at all. I am using 6.2 and 7.0 at the moment, I will build another disk with FreeBSD 8.0-CURRENT. Thanks Ray Kinsella On Sun, Nov 23, 2008 at 11:47 PM, Attilio Rao wrote: > 2008/11/23, Ray Kinsella : > > I know I am going to really show my FreeBSD ignorance here, but this is a > > patch of FreeBSD 8.0 Current isn't it ? > > Yes, it is for 8.0. > Is it giving you problems? > > Thanks, > Attilio > > > -- > Peace can only be achieved by understanding - A. Einstein > From des at des.no Mon Nov 24 00:52:38 2008 From: des at des.no (Dag-Erling Smørgrav) Date: Mon Nov 24 00:52:44 2008 Subject: Enumerable I2C busses In-Reply-To: ("Rafał Jaworowski"'s message of "Sun, 23 Nov 2008 21:18:54 +0100") References: <4929877B.6060307@freebsd.org> <86myfq9uha.fsf@ds4.des.no> Message-ID: <86od054iaz.fsf@ds4.des.no> Rafał Jaworowski writes: > Well, hard-coded addresses and conflicting assignments between vendors > do not technically prevent from scanning the bus; actually, our > current iicbus code can do bus scanning when compiled with a diag > define. I haven't looked at how that is implemented, but - I2C version 3 describes device enumeration and identification, but AFAIK very few devices actually support it. This is really stupid, BTW.
Philips could easily have required the first few bytes of the device address space to contain a vendor / device ID, along the lines of PCI and USB, from the start. I wonder: did it not occur to them, or did they intentionally leave it out to save a few microcents per chip? DES -- Dag-Erling Smørgrav - des@des.no From alfred at freebsd.org Mon Nov 24 01:07:59 2008 From: alfred at freebsd.org (Alfred Perlstein) Date: Mon Nov 24 01:08:05 2008 Subject: Limiting mbuf memory. In-Reply-To: <20081123213232.A971@desktop> References: <20081123213232.A971@desktop> Message-ID: <20081124085223.GY28578@elvis.mu.org> * Jeff Roberson [081123 23:48] wrote: > I'm developing a patch for an alternate memory layout for mbuf clusters > that relies on contigmalloc. Since this can fail, we'll still have to > retain the capability of allocating traditional clusters. I'll report > details on that later. I'm writing this email to address the issue of > resource accounting in mbufs. > > Presently we use a set of limits on individual zones or sizes of mbufs. > Standard mbufs, clusters, page size jumbos, 9k jumbos, and 16k jumbos. > Each is administered separately. I think this is getting a bit unwieldy. > In the future, we may have even more sizes. This also introduces problems > because I will have two cluster zones: do they each get their own limit? > > I would like to consolidate this into a single limit on the number of > pages in total allocated to networking. With perhaps some fractional > reservation for standard mbufs and clusters to make sure they aren't > overwhelmed by the larger buffers. > > This would be implemented by overriding the uma zone page allocator for > each of the mbuf zones with one that counts pages for all. Should we > reach the limit we'll block depending on the wait settings of the > requestor.
The limit and sleep will probably be protected by a global > lock which won't be an issue because trips to the back end allocator are > infrequent and protected by their own global lock anyhow. > > How do people feel about this? To be clear this would eliminate: > > nmbclusters, nmbjumbop, nmbjumbo9, nmbjumbo16 and related config settings > and sysctls. They would be replaced by something like 'maxmbufbytes'. > Presently we place no limit on small mbufs. I could go either way on > this. It could be added to the limit or not. This sounds good but please take into consideration the possibility of deadlock due to resource allocation to a single pool that can happen. It might make sense to keep the small and large mbuf limits separate or something like that. Might also make sense to retain the limits but set them all to "unlimited" (within the global limit) unless configured otherwise for various custom setups. I don't feel too strongly about this, just some points to consider. -- - Alfred Perlstein From jroberson at jroberson.net Mon Nov 24 02:57:31 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Mon Nov 24 02:57:37 2008 Subject: Limiting mbuf memory. In-Reply-To: <20081124085223.GY28578@elvis.mu.org> References: <20081123213232.A971@desktop> <20081124085223.GY28578@elvis.mu.org> Message-ID: <20081124005404.H971@desktop> On Mon, 24 Nov 2008, Alfred Perlstein wrote: > * Jeff Roberson [081123 23:48] wrote: >> I'm developing a patch for an alternate memory layout for mbuf clusters >> that relies on contigmalloc. Since this can fail, we'll still have to >> retain the capability of allocating traditional clusters. I'll report >> details on that later. I'm writing this email to address the issue of >> resource accounting in mbufs. >> >> Presently we use a set of limits on individual zones or sizes of mbufs. >> Standard mbufs, clusters, page size jumbos, 9k jumbos, and 16k jumbos. >> Each is administered separately. I think this is getting a bit unwieldy.
>> In the future, we may have even more sizes. This also introduces problems >> because I will have two cluster zones: do they each get their own limit? >> >> I would like to consolidate this into a single limit on the number of >> pages in total allocated to networking. With perhaps some fractional >> reservation for standard mbufs and clusters to make sure they aren't >> overwhelmed by the larger buffers. >> >> This would be implemented by overriding the uma zone page allocator for >> each of the mbuf zones with one that counts pages for all. Should we >> reach the limit we'll block depending on the wait settings of the >> requestor. The limit and sleep will probably be protected by a global >> lock which won't be an issue because trips to the back end allocator are >> infrequent and protected by their own global lock anyhow. >> >> How do people feel about this? To be clear this would eliminate: >> >> nmbclusters, nmbjumbop, nmbjumbo9, nmbjumbo16 and related config settings >> and sysctls. They would be replaced by something like 'maxmbufbytes'. >> Presently we place no limit on small mbufs. I could go either way on >> this. It could be added to the limit or not. > > This sounds good but please take into consideration the possibility > of deadlock due to resource allocation to a single pool that can > happen. > > It might make sense to keep the small and large mbuf limits separate > or something like that. This is what I meant in the third paragraph. > > Might also make sense to retain the limits but set them all to > "unlimited" (within the global limit) unless configured otherwise > for various custom setups. I think this is a good idea. > > I don't feel too strongly about this, just some points to consider. I appreciate the feedback.
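A rough userland sketch of the accounting discussed above: one page limit shared by all zones, with a per-zone reservation that the other zones cannot eat into. This is only a model under assumptions, not the patch: struct toy_pool, the zone names, and the functions are hypothetical, locking is omitted, and where the real allocator would sleep or fail according to the requestor's wait setting this version simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zones sharing one page budget. */
enum toy_zone { ZONE_MBUF, ZONE_CLUSTER, ZONE_JUMBO, ZONE_COUNT };

struct toy_pool {
	size_t limit;			/* total pages for all zones */
	size_t used;			/* pages currently allocated */
	size_t reserved[ZONE_COUNT];	/* pages held back per zone */
	size_t zused[ZONE_COUNT];	/* pages allocated per zone */
};

/* Pages still promised to zones other than z. */
static size_t
toy_unclaimed_reserve(const struct toy_pool *p, enum toy_zone z)
{
	size_t held = 0;

	for (int i = 0; i < ZONE_COUNT; i++) {
		if (i != (int)z && p->zused[i] < p->reserved[i])
			held += p->reserved[i] - p->zused[i];
	}
	return (held);
}

/* Charge npages to zone z, or refuse if the shared limit is hit. */
static bool
toy_page_alloc(struct toy_pool *p, enum toy_zone z, size_t npages)
{
	assert(p != NULL && z < ZONE_COUNT);

	if (p->used + toy_unclaimed_reserve(p, z) + npages > p->limit)
		return (false);	/* real code would block or fail here */
	p->used += npages;
	p->zused[z] += npages;
	return (true);
}

/* Return npages from zone z to the shared budget. */
static void
toy_page_free(struct toy_pool *p, enum toy_zone z, size_t npages)
{
	assert(p != NULL && z < ZONE_COUNT && p->zused[z] >= npages);

	p->used -= npages;
	p->zused[z] -= npages;
}
```

Setting every reservation to zero gives the plain single-limit behaviour, which roughly corresponds to the "unlimited within the global limit" default suggested above.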
Jeff

>
> -- 
> - Alfred Perlstein
>

From bugmaster at FreeBSD.org Mon Nov 24 03:07:08 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Nov 24 03:07:28 2008
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200811241107.mAOB77kj019831@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From nwhitehorn at freebsd.org Mon Nov 24 07:30:31 2008
From: nwhitehorn at freebsd.org (Nathan Whitehorn)
Date: Mon Nov 24 07:30:39 2008
Subject: Enumerable I2C busses
In-Reply-To: <20081123.170854.-626772149.imp@bsdimp.com>
References: <86myfq9uha.fsf@ds4.des.no> <4929C6D8.7090305@freebsd.org> <20081123.170854.-626772149.imp@bsdimp.com>
Message-ID: <492AC8DE.6050902@freebsd.org>

M. Warner Losh wrote:
> In message: <4929C6D8.7090305@freebsd.org>
> Nathan Whitehorn writes:
> : Rafał Jaworowski wrote:
> : >
> : > On 2008-11-23, at 19:18, Dag-Erling Smørgrav wrote:
> : >
> : >> Nathan Whitehorn writes:
> : >>> The current I2C bus mechanism does not support the bus adding its own
> : >>> children [...]
> : >>
> : >> That's because the I2C protocol does not support device enumeration or
> : >> identification. You have to know in advance what kind of devices are
> : >> attached and at what address.
Even worse, it is not uncommon for
> : >> similar but not entirely compatible devices to use the same I2C address
> : >> (for instance, every I2C-capable RTC chip uses the same address, even
> : >> though they have different feature sets)
> : >
> : > Well, hard-coded addresses and conflicting assignments between vendors
> : > do not technically prevent from scanning the bus; actually, our current
> : > iicbus code can do bus scanning when compiled with a diag define. The
> : > problem however is some slave devices are not well-behaved, and they
> : > don't like to be read/written to other than in very specific scenario:
> : > if polled during bus scan strange effects occur e.g. they disappear from
> : > the bus, or do not react to consecutive requests etc.
> :
> : All of this is true, but perhaps my question was badly worded. What I am
> : trying to figure out is how to shove information from an out-of-band
> : source (Open Firmware, in this case) into newbus without disrupting
> : existing code. In that way, my question is not I2C specific -- we run
> : into the same issue with the Open Firmware nexus node and pseudo-devices
> : like cryptosoft that attach themselves.
>
> You are looking at the problem incorrectly. In newbus, in a case like
> this the i2c bus should be a subclass (say i2c_of) that is derived
> from the normal i2c bus stuff, but replaces the hints insertion of
> devices with OF enumeration of devices. The OF higher levels will
> already know to attach this kind of i2c bus to certain i2c
> controllers, or always on certain platforms.

Yes, this is exactly what I wanted to do, like how ofw_pci works.

> : What I want to do is to have the I2C bus add the children that the
> : firmware says it has. What the firmware cannot tell in advance, however,
> : is which FreeBSD driver is responsible for those devices, and so the I2C
> : bus driver can't know that without a translation table that I would
> : prefer not to hack in to the bus driver.
>
> This is the bigger problem. Today, we are stuck with a lame table
> that will translate the OF provided properties into FreeBSD driver
> names.

At the moment, I don't believe Apple uses any of the current very small
number of I2C device drivers in tree. So I may skip the table for the
time being, assuming the hack below is OK. In future, this may change,
since G5 systems require software thermal control. But that will be the
subject of another mail to this list...

> : It seems reasonable to allow devices to use a real probe routine to look
> : at the firmware's name and compatible properties, like we allow on other
> : Open Firmware busses. The trouble is that existing drivers don't do
> : this, because they expect to be attached with hints, so they will attach
> : to all devices. I'm trying to figure out how to avoid this.
> :
> : My basic question comes down to whether there is a good way in newbus to
> : handle busses that may be wholly or partially enumerated by firmware or
> : some other method, and may also have devices that can only attach
> : themselves if told to by hints.
>
> Yes. This is a bit of a problem. The problem is that the existing
> hints mechanism combines device tree and driver tree into one, and in
> such a scenario, we wind up with the problem that you have.
>
> One could make the probe routines return BUS_PROBE_GENERIC, and that
> would help somewhat. One could also have the probe routine check to
> see if a specific driver is assigned to the device or not. That would
> also help, but does mean changing all the i2c bus attached drivers in
> the tree.

I think changing existing I2C drivers may be unavoidable. Would there be
any objection to changing the MI iicbus drivers to return
BUS_PROBE_NOWILDCARD in their probe routines? It seems to have been
introduced (by you) to solve more or less exactly this problem.
By my count, the relevant files are:
dev/iicbus/ds133x.c
dev/iicbus/icee.c
dev/iicbus/ad7418.c
dev/iicbus/iicsmb.c
dev/iicbus/ds1672.c
dev/iicbus/if_ic.c
dev/iicbus/iic.c

I would also like to change iicbus_probe to return -1000 like
dev/pci/pci.c to allow it to be overridden by a subclass. Please let me
know if this is a terrible idea or if I have forgotten any I2C device
drivers.

-Nathan

From imp at bsdimp.com Mon Nov 24 09:57:09 2008
From: imp at bsdimp.com (M. Warner Losh)
Date: Mon Nov 24 09:57:15 2008
Subject: Enumerable I2C busses
In-Reply-To: <492AC8DE.6050902@freebsd.org>
References: <4929C6D8.7090305@freebsd.org> <20081123.170854.-626772149.imp@bsdimp.com> <492AC8DE.6050902@freebsd.org>
Message-ID: <20081124.105800.-267230932.imp@bsdimp.com>

In message: <492AC8DE.6050902@freebsd.org>
Nathan Whitehorn writes:
: M. Warner Losh wrote:
: > In message: <4929C6D8.7090305@freebsd.org>
: > Nathan Whitehorn writes:
: > : Rafał Jaworowski wrote:
: > : >
: > : > On 2008-11-23, at 19:18, Dag-Erling Smørgrav wrote:
: > : >
: > : >> Nathan Whitehorn writes:
: > : >>> The current I2C bus mechanism does not support the bus adding its own
: > : >>> children [...]
: > : >>
: > : >> That's because the I2C protocol does not support device enumeration or
: > : >> identification. You have to know in advance what kind of devices are
: > : >> attached and at what address. Even worse, it is not uncommon for
: > : >> similar but not entirely compatible devices to use the same I2C address
: > : >> (for instance, every I2C-capable RTC chip uses the same address, even
: > : >> though they have different feature sets)
: > : >
: > : > Well, hard-coded addresses and conflicting assignments between vendors
: > : > do not technically prevent from scanning the bus; actually, our current
: > : > iicbus code can do bus scanning when compiled with a diag define.
The : > : > problem however is some slave devices are not well-behaved, and they : > : > don't like to be read/written to other than in very specific scenario: : > : > if polled during bus scan strange effects occur e.g. they disappear from : > : > the bus, or do not react to consecutive requests etc. : > : : > : All of this is true, but perhaps my question was badly worded. What I am : > : trying to figure out is how to shove information from an out-of-band : > : source (Open Firmware, in this case) into newbus without disrupting : > : existing code. In that way, my question is not I2C specific -- we run : > : into the same issue with the Open Firmware nexus node and pseudo-devices : > : like cryptosoft that attach themselves. : > : > You are looking at the problem incorrectly. In newbus, a case like : > this the i2c bus should be a subclass (say i2c_of) that is derived : > from the normal i2c bus stuff, but replaces the hints insertion of : > devices with OF enumeration of devices. The OF higher levels will : > already know to attach this kind of i2c bus to certain i2c : > controllers, or always on certain platforms. : : Yes, this is exactly what I wanted to do, like how ofw_pci works. : : > : What I want to do is to have the I2C bus add the children that the : > : firmware says it has. What the firmware cannot tell in advance, however, : > : is which FreeBSD driver is responsible for those devices, and so the I2C : > : bus driver can't know that without a translation table that I would : > : prefer not to hack in to the bus driver. : > : > This is the bigger problem. Today, we are stuck with a lame table : > that will translate the OF provided properties into FreeBSD driver : > names. : : At the moment, I don't believe Apple uses any of the current very small : number of I2C device drivers in tree. So I may skip the table for the : time being, assuming the hack below is OK. In future, this may change, : since G5 systems require software thermal control. 
But that will be the : subject of another mail to this list... : : > : It seems reasonable to allow devices to use a real probe routine to look : > : at the firmware's name and compatible properties, like we allow on other : > : Open Firmware busses. The trouble is that existing drivers don't do : > : this, because they expect to be attached with hints, so they will attach : > : to all devices. I'm trying to figure out how to avoid this. : > : : > : My basic question comes down to whether there is a good way in newbus to : > : handle busses that may be wholly or partially enumerated by firmware or : > : some other method, and may also have devices that can only attach : > : themselves if told to by hints. : > : > Yes. This is a bit of a problem. The problem is that the existing : > hints mechanism combines device tree and driver tree into one, and in : > such a scenario, we wind up with the problem that you have. : > : > One could make the probe routines return BUS_PROBE_GENERIC, and that : > would help somewhat. One could also have the probe routine check to : > see if a specific driver is assigned to the device or not. That would : > also help, but does mean changing all the i2c bus attached drivers in : > the tree. : : I think changing existing I2C drivers may be unavoidable. Would there be : any objection to changing the MI iicbus drivers to return : BUS_PROBE_NOWILDCARD in their probe routines? It seems to have been : introduced (by you) to solve more or less exactly this problem. By my : count, the relevant files are: : dev/iicbus/ds133x.c : dev/iicbus/icee.c : dev/iicbus/ad7418.c : dev/iicbus/iicsmb.c : dev/iicbus/ds1672.c : dev/iicbus/if_ic.c : dev/iicbus/iic.c : : I would also like to change iicbus_probe to return -1000 like : dev/pci/pci.c to allow it to be overridden by a subclass. Please let me : know if this is a terrible idea or if I have forgotten any I2C device : drivers. Short term, this is the right fix. 
There was an objection, I think by Marcel, to this approach. However, his objections were part of a larger set of objections and I think that we're working to solve those. Warner From gnn at freebsd.org Mon Nov 24 16:09:13 2008 From: gnn at freebsd.org (gnn@freebsd.org) Date: Mon Nov 24 16:09:19 2008 Subject: Limiting mbuf memory. In-Reply-To: <20081123213232.A971@desktop> References: <20081123213232.A971@desktop> Message-ID: At Sun, 23 Nov 2008 21:46:08 -1000 (HST), Jeff Roberson wrote: > > I'm developing a patch for an alternate memory layout for mbuf clusters > that relies on contigmalloc. Since this can fail, we'll still have to > retain the capability of allocating traditional clusters. I'll report > details on that later. I'm writing this email to address the issue of > resource accounting in mbufs. > > Presently we use a set of limits on individual zones or sizes of mbufs. > Standard mbufs, clusters, page size jumbos, 9k jumbos, and 16k jumbos. > Each is administered sperately. I think this is getting a bit unwieldy. > In the future, we may have even more sizes. This also introduces problems > because I will have two cluster zones do they each get their own limit? > > I would like to consolidate this into a single limit on the number of > pages in total allocated to networking. With perhaps some fractional > reservation for standard mbufs and clusters to make sure they aren't > overwhelmed by the larger buffers. > > This would be implemented by overriding the uma zone page allocator for > each of the mbuf zones with one that counts pages for all. Should we > reach the limit we'll block depending on the wait settings of the > requestor. The limit and sleep will probably be protected by a global > lock which won't be an issue because trips to the back end allocator are > infrequent and protected by their own global lock anyhow. > > How do people feel about this? 
To be clear this would eliminate: > > nmbclusters, nmbjumbop, nmbjumbo9, nmbjumbo16 and related config settings > and sysctls. They would be replaced by something like 'maxmbufbytes'. > Presently we place no limit on small mbufs. I could go either way on > this. It could be added to the limit or not. > I think this is a good idea with the caveat that I prefer the idea in paragraph 3 about reserving a bit of head room so we don't deadlock. A very common bug in the past was to run out of mbufs when using a lot of small UDP packets. Best, George From jroberson at jroberson.net Mon Nov 24 18:06:12 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Mon Nov 24 18:06:19 2008 Subject: Limiting mbuf memory. In-Reply-To: References: <20081123213232.A971@desktop> Message-ID: <20081124160316.I971@desktop> On Mon, 24 Nov 2008, gnn@freebsd.org wrote: > At Sun, 23 Nov 2008 21:46:08 -1000 (HST), > Jeff Roberson wrote: >> >> I'm developing a patch for an alternate memory layout for mbuf clusters >> that relies on contigmalloc. Since this can fail, we'll still have to >> retain the capability of allocating traditional clusters. I'll report >> details on that later. I'm writing this email to address the issue of >> resource accounting in mbufs. >> >> Presently we use a set of limits on individual zones or sizes of mbufs. >> Standard mbufs, clusters, page size jumbos, 9k jumbos, and 16k jumbos. >> Each is administered sperately. I think this is getting a bit unwieldy. >> In the future, we may have even more sizes. This also introduces problems >> because I will have two cluster zones do they each get their own limit? >> >> I would like to consolidate this into a single limit on the number of >> pages in total allocated to networking. With perhaps some fractional >> reservation for standard mbufs and clusters to make sure they aren't >> overwhelmed by the larger buffers. 
>>
>> This would be implemented by overriding the uma zone page allocator for
>> each of the mbuf zones with one that counts pages for all. Should we
>> reach the limit we'll block depending on the wait settings of the
>> requestor. The limit and sleep will probably be protected by a global
>> lock which won't be an issue because trips to the back end allocator are
>> infrequent and protected by their own global lock anyhow.
>>
>> How do people feel about this? To be clear this would eliminate:
>>
>> nmbclusters, nmbjumbop, nmbjumbo9, nmbjumbo16 and related config settings
>> and sysctls. They would be replaced by something like 'maxmbufbytes'.
>> Presently we place no limit on small mbufs. I could go either way on
>> this. It could be added to the limit or not.
>>
>
> I think this is a good idea with the caveat that I prefer the idea in
> paragraph 3 about reserving a bit of head room so we don't deadlock.
> A very common bug in the past was to run out of mbufs when using a lot
> of small UDP packets.

Ok, I believe the existing per-type limits will facilitate this.

Thanks,
Jeff

>
> Best,
> George
>

From lulf at freebsd.org Tue Nov 25 08:40:55 2008
From: lulf at freebsd.org (Ulf Lilleengen)
Date: Tue Nov 25 08:41:08 2008
Subject: HEADSUP: CVS/Mirror mode for csup to be merged soon
Message-ID: <20081125154040.GA12632@nobby.lan>

Hello,

After some feedback on previous patches and some adjustments, I think the
CVSMode for csup project has come to a place where a wider testing audience
is needed, and I would like to make this a call for review and a HEADSUP to
allow willing reviewers and eventual protesters to give their opinion before
merging this to HEAD.
A few things about the current state of CVSMode:

- Complete CVS mode (mirror mode) is supported, allowing the whole CVS
  repository to be fetched by csup.
- rsync fetch is used unless the user explicitly disables it or the server
  does not support it.
- The status file can be used to speed up detailing of files. This means no
  significant impact for files that are up to date.

As for the state of the code itself, I have gone over it a couple of times in
the last couple of days, fixing style issues and a few differences between
cvsup and csup. One important thing to note is that the impact on the existing
csup operation is _minimal_, so the risk of introducing bugs to the normal
csup operation is very small, and because of this I see no problems with
committing the current version. If you find any issues, please e-mail me, and
I will look at them.

So, for those of you wanting to test, please do so now. If people are okay
with this, I would like to merge it by the end of the week/early next week.

A patch can be found here: http://people.freebsd.org/~lulf/csup_cvsmode.diff
or you can just do a checkout of projects/csup_cvsmode

-- 
Ulf Lilleengen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081125/59bd925a/attachment.pgp From marck at rinet.ru Tue Nov 25 12:03:08 2008 From: marck at rinet.ru (Dmitry Morozovsky) Date: Tue Nov 25 12:03:21 2008 Subject: HEADSUP: CVS/Mirror mode for csup to be merged soon In-Reply-To: <20081125154040.GA12632@nobby.lan> References: <20081125154040.GA12632@nobby.lan> Message-ID: On Tue, 25 Nov 2008, Ulf Lilleengen wrote: UL> Hello, UL> UL> After some feedback on previous patches and some adjustments, I think the UL> CVSMode for csup project have come to a place where a wider testing audience UL> is needed, and I would like to make this a call for review and a HEADSUP to UL> allow willing reviewers and eventual protesters to give their opinion before UL> merging this to HEAD. A few things about the current state of CVSMode: UL> UL> - Complete CVS mode (mirror mode) is supported, allowing the whole CVS UL> repository to be fetched by csup. UL> - rsync fetch supported if not explicitly not wanted by user or not supported UL> by server. UL> - Support using the status file to speed up detailing of files. This means no UL> bigger inpact on files that are up to date. UL> UL> For the state of the code itself, I have went over it a couple of times the UL> last couple of days, fixing style issues and a few differences between cvsup UL> and csup. One important thing to note is that the impact on the existing csup UL> operation is _minimal_, so that the risk of introducing bugs to the normal UL> csup operation is very small, and because of this I see no problems with UL> committing the current version. If you find any issues, please e-mail me, and UL> I will look at it. UL> UL> So, for those of you wanting to test, please do so now. If people are okay UL> with this, I would like to merge it by the end of the week/early next week. 
UL> UL> A patch can be found here: http://people.freebsd.org/~lulf/csup_cvsmode.diff UL> or you can just do a checkout of projects/csup_cvsmode Just to make sure it does not get lost: After creating RELENG_7_1 branch csupping is broken with: Updating collection src-all/cvs Edit src/bin/chio/chio.c,v /home/ncvs/src/bin/chio/chio.c,v: Checksum mismatch -- will transfer entire file Edit src/contrib/bind9/CHANGES,v /home/ncvs/src/contrib/bind9/CHANGES,v: Checksum mismatch -- will transfer entire file Edit src/contrib/bind9/COPYRIGHT,v ... [a lot of them] .. and finally /home/ncvs/src/sys/i386/conf/NOTES,v: Checksum mismatch -- will transfer entire file Edit src/sys/i386/conf/PAE,v Error applying diff: -1 Updater failed: Protocol error Error is 'Detailer failed: Premature EOF from server' I use RELENG_7 csup with your previous patch version Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN] [ FreeBSD committer: marck@FreeBSD.org ] ------------------------------------------------------------------------ *** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru *** ------------------------------------------------------------------------ From lulf at freebsd.org Tue Nov 25 13:05:05 2008 From: lulf at freebsd.org (Ulf Lilleengen) Date: Tue Nov 25 13:05:17 2008 Subject: HEADSUP: CVS/Mirror mode for csup to be merged soon In-Reply-To: References: <20081125154040.GA12632@nobby.lan> Message-ID: <20081125200451.GA3160@nobby.lan> On Tue, Nov 25, 2008 at 10:52:03PM +0300, Dmitry Morozovsky wrote: > On Tue, 25 Nov 2008, Ulf Lilleengen wrote: > > UL> Hello, > UL> > UL> After some feedback on previous patches and some adjustments, I think the > UL> CVSMode for csup project have come to a place where a wider testing audience > UL> is needed, and I would like to make this a call for review and a HEADSUP to > UL> allow willing reviewers and eventual protesters to give their opinion before > UL> merging this to HEAD. 
A few things about the current state of CVSMode: > UL> > UL> - Complete CVS mode (mirror mode) is supported, allowing the whole CVS > UL> repository to be fetched by csup. > UL> - rsync fetch supported if not explicitly not wanted by user or not supported > UL> by server. > UL> - Support using the status file to speed up detailing of files. This means no > UL> bigger inpact on files that are up to date. > UL> > UL> For the state of the code itself, I have went over it a couple of times the > UL> last couple of days, fixing style issues and a few differences between cvsup > UL> and csup. One important thing to note is that the impact on the existing csup > UL> operation is _minimal_, so that the risk of introducing bugs to the normal > UL> csup operation is very small, and because of this I see no problems with > UL> committing the current version. If you find any issues, please e-mail me, and > UL> I will look at it. > UL> > UL> So, for those of you wanting to test, please do so now. If people are okay > UL> with this, I would like to merge it by the end of the week/early next week. > UL> > UL> A patch can be found here: http://people.freebsd.org/~lulf/csup_cvsmode.diff > UL> or you can just do a checkout of projects/csup_cvsmode > > Just to make sure it does not get lost: > > After creating RELENG_7_1 branch csupping is broken with: > > > Updating collection src-all/cvs > Edit src/bin/chio/chio.c,v > /home/ncvs/src/bin/chio/chio.c,v: Checksum mismatch -- will transfer entire > file > Edit src/contrib/bind9/CHANGES,v > /home/ncvs/src/contrib/bind9/CHANGES,v: Checksum mismatch -- will transfer > entire file > Edit src/contrib/bind9/COPYRIGHT,v > > ... [a lot of them] .. 
and finally > > /home/ncvs/src/sys/i386/conf/NOTES,v: Checksum mismatch -- will transfer entire > file > Edit src/sys/i386/conf/PAE,v > Error applying diff: -1 > Updater failed: Protocol error > Error is 'Detailer failed: Premature EOF from server' > > I use RELENG_7 csup with your previous patch version > The previous patch will break, and the issue should be fixed in the latest version. -- Ulf Lilleengen -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081125/b02a160c/attachment.pgp From gnn at freebsd.org Tue Nov 25 15:00:52 2008 From: gnn at freebsd.org (gnn@freebsd.org) Date: Tue Nov 25 15:00:58 2008 Subject: [PATCH] pmcannotate tool In-Reply-To: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> References: <3bbf2fe10811230502t3cc52809i6ac91082f780b730@mail.gmail.com> Message-ID: At Sun, 23 Nov 2008 14:02:22 +0100, Attilio Rao wrote: > > pmcannotate is a tool that prints out sources of a tool (in C or > assembly) with inlined profiling informations retrieved by a prior > pmcstat analysis. > If compared with things like callgraph generation, it prints out > profiling on a per-instance basis and this can be useful to find, for > example, badly handled caches, too high latency instructions, etc. > > The tool usage is pretty simple: > pmcannotate [-a] [-h] [-k path] [-l level] samples.out binaryobj > > where samples.out is a pmcstat raw output and binaryobj is the binary > object that has been profiled and is accessible for (ELF) symbols > retrieving. 
> The options are better described in manpages but briefly: > - a: performs analysis on the assembly rather than the C source > - h: usage and informations > - k: specify a path for the kernel in order to locate correct objects for it > - l: specify a lower boundary (in total percentage time) after which > functions will be displayed nomore. > > A typical usage of pmcannotate can be some way of kernel annotation. > For example, you can follow the steps below: > 1) Generate a pmc raw output of system samples: > # pmcstat -S ipm-unhalted-core-cycles -O samples.out > 2) Copy the samples in the kernel building dir and cd there > # cp samples.out /usr/src/sys/i386/compile/GENERIC/ ; cd > /usr/src/sys/i386/compile/GENERIC/ > 3) Run pmcannotate > # pmcannotate -k . samples.out kernel.debug > kernel.ann > > In the example above please note that kernel.debug has to be used in > order to produce a C annotated source. This happens because in order > to get the binary sources we rely on the "objdump -S" command which > wants binary compiled with debugging options. > If not debugging options are present assembly analynsis is still > possible, but no C-backed one will be available. > objdump is not the only one tool on which pmcannotare rely. Infact, in > order to have it working, pmcstat needs to be present too because we > need to retrieve, from the pmcstat raw output, informations about the > sampled PCs (in particular the name of the function they live within, > its start and ending addresses). As long as currently pmcstat doesn't > return those informations, a new option has been added to the tool > (-m) which can extract (from a raw pmcstat output) all pc sampled, > name of the functions and symbol bundaries they live within. > > Also please note that pmcannotate suffers of 2 limitations. > Firstly, relying on objdump to dump the C source, with heavy > optimization levels and lots of inlines the code gets difficult to > read. 
Secondly, in particular on x86 but I guess it is not the only
> one case, the sample is always attributed to the instruction directly
> following the one that was interrupted. So in a C source view some
> samples may be attributed to the line below the one you're interested
> in. It's also important to keep in mind that if a line is a jump
> target or the start of a function the sample really belongs elsewhere.
>
> The patch can be found here:
> http://www.freebsd.org/~attilio/pmcannotate.diff/
>
> where pmcannotate/ dir contains the code and needs to go under
> /usr/src/usr.sbin/ and the patch has diffs against pmcstat and
> Makefile.
>
> This work has been developed on the behalf of Nokia with important
> feedback and directions from Jeff Roberson.
>
> Testing and feedback (before it hits the tree) are welcome.
>

Hi,

First of all, this is excellent work. As soon as this and some other
changes in PMC hit 7.x I'll be rolling this out to all the developers
I work with. I've tested this on amd64 on HEAD, and with the changes
we have talked about privately (%jx vs. %x) it works quite well.

Secondly, I would like to request a feature. I would like to be able
to get output in a more easily parsable format so I can write some
Emacs code to highlight C code with the output. I'd like something
along the lines of:

path:function:line:percentage

Keep up the good work!

Later,
George

From peter at wemm.org Fri Nov 28 20:28:45 2008
From: peter at wemm.org (Peter Wemm)
Date: Fri Nov 28 20:28:51 2008
Subject: RFC: making gpart default
In-Reply-To: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com>
References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com>
Message-ID:

On Thu, Sep 25, 2008 at 9:59 AM, Marcel Moolenaar wrote:
> All,
>
> I'd like to switch all architectures to gpart for the reasons given
> below. All current partitioning schemes are supported by gpart and
> work on all platforms. On top of that, ia64 and powerpc are using
> gpart exclusively already.
[..]
> In short: gpart is the first step towards a unified set of
> tools and interfaces and provides the basis for extending
> file system related tools by allowing us to attach real
> meaning to partition types. With the commit and undo feature,
> gpart is ready for use by next generation installers that
> allow us to use any partitioning scheme on any platforms.
>
> Thoughts?

oh my god. I just tried to use gpart. This needs some SERIOUS help.

First, the 'gpart create' man page doesn't say what "scheme" is.
After guessing, I tried:

overcee# gpart create -s gpt /dev/twed1
gpart: 22 scheme 'gpt'

What does that mean? It turns out that I didn't have GEOM_PART_GPT compiled in.
After continuing the guessing game:

overcee# gpart create -s gpt /dev/twed1
gpart: 22 provider '/dev/twed1'

That was useful. Our other tools generally allow /dev prefixes to be optional.

overcee# gpart create -s gpt twed1
twed1 created

Now what? Boot code.. there's no example of this either. I tried:

overcee# gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 twed1
gpart: /dev/twed1p1: Invalid argument

I suppose that beats "22".

This works though:
overcee# gpart bootcode -b /boot/pmbr twed1

This doesn't:
overcee# gpart bootcode -p /boot/gptboot -i 1 twed1
gpart: /dev/twed1p1: Invalid argument
I haven't figured this out yet. I'm guessing this is because
/boot/gptboot isn't a multiple of 512 bytes. The error message is
obviously giving no help here.

Let's try padding it:
overcee# dd if=/boot/gptboot of=/tmp/gptboot conv=sync
14+1 records in
15+0 records out
7680 bytes transferred in 0.000098 secs (78375316 bytes/sec)
overcee# gpart bootcode -p /tmp/gptboot -i 1 twed1
overcee#

Yep, that worked. Now for a partition...

overcee# gpart add -b 512 -s 512m -t freebsd-ufs twed1
gpart: 22 size '512m'
Huh? "22"?

overcee# gpart add -b 512 -s 1048576 -t freebsd-ufs twed1
twed1p2 added

But at least I think I'm making some progress:
overcee# gpart show twed1
=>       34  976771053  twed1  GPT  (500.1GB)
         34        478      1  freebsd-boot  (244.7KB)
        512    1048576      2  freebsd-ufs  (536.9MB)
    1049088  975721999         - free -  (499.6GB)

So I continue.. I figure gpart would pick the first free space:
overcee# gpart add -s 4058062 -t freebsd-ufs twed1
gpart: Option 'b' not specified.

Apparently not...
overcee# gpart add -b 1049088 -s 4058062 -t freebsd-ufs twed1
twed1p3 added

Now one has to do a gpart show ; add ; show ; add loop to get the
start address.

This is really, really raw and unfriendly stuff.
-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell

From imp at bsdimp.com Fri Nov 28 20:41:59 2008
From: imp at bsdimp.com (M. Warner Losh)
Date: Fri Nov 28 20:42:05 2008
Subject: Enumerable I2C busses
In-Reply-To: <86od054iaz.fsf@ds4.des.no>
References: <86myfq9uha.fsf@ds4.des.no> <86od054iaz.fsf@ds4.des.no>
Message-ID: <20081128.213925.-1597343424.imp@bsdimp.com>

In message: <86od054iaz.fsf@ds4.des.no>
Dag-Erling Smørgrav writes:
: Rafał Jaworowski writes:
: > Well, hard-coded addresses and conflicting assignments between vendors
: > do not technically prevent from scanning the bus; actually, our
: > current iicbus code can do bus scanning when compiled with a diag
: > define.
:
: I haven't looked at how that is implemented, but - I2C version 3
: describes device enumeration and identification, but AFAIK very few
: devices actually support it.
:
: This is really stupid, BTW. Philips could easily have required the
: first few bytes of the device address space to contain a vendor / device
: ID, along the lines of PCI and USB, from the start.
: I wonder: did it
: not occur to them, or did they intentionally leave it out to save a few
: microcents per chip?

EEPROMs typically don't have registers at all. They are just memory.
Forcing them to have set content would break all kinds of applications.

Warner

From rpaulo at fnop.net Sat Nov 29 06:07:49 2008
From: rpaulo at fnop.net (Rui Paulo)
Date: Sat Nov 29 06:08:20 2008
Subject: RFC: making gpart default
In-Reply-To:
References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com>
Message-ID: <28C48AAC-CF27-4434-819C-A2688D6485CC@fnop.net>

On 29 Nov 2008, at 04:07, Peter Wemm wrote:

> On Thu, Sep 25, 2008 at 9:59 AM, Marcel Moolenaar
> wrote:
>> All,
>>
>> I'd like to switch all architectures to gpart for the reasons given
>> below. All current partitioning schemes are supported by gpart and
>> work on all platforms. On top of that, ia64 and powerpc are using
>> gpart exclusively already.
> [..]
>> In short: gpart is the first step towards a unified set of
>> tools and interfaces and provides the basis for extending
>> file system related tools by allowing us to attach real
>> meaning to partition types. With the commit and undo feature,
>> gpart is ready for use by next generation installers that
>> allow us to use any partitioning scheme on any platforms.
>>
>> Thoughts?
>
> oh my god. I just tried to use gpart. This needs some SERIOUS help.

I agree.

>
>
> First, the 'gpart create' man page doesn't say what "scheme" is.

Right. Also, the gpart man page should have many more examples.

> After guessing, I tried:
>
> overcee# gpart create -s gpt /dev/twed1
> gpart: 22 scheme 'gpt'
>
> What does that mean? It turns out that I didn't have GEOM_PART_GPT
> compiled in.

Yes, these should probably go into sys/conf instead of GENERIC.

> After continuing the guessing game:
>
> overcee# gpart create -s gpt /dev/twed1
> gpart: 22 provider '/dev/twed1'
>
> That was useful. Our other tools generally allow /dev prefixes to
> be optional.

In this case, I think '/dev/' should be stripped and then sent to GEOM.
Or maybe I don't know what I'm talking about.

> overcee# gpart create -s gpt twed1
> twed1 created
>
> Now what? Boot code.. there's no example of this either. I tried:
>
> overcee# gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 twed1
> gpart: /dev/twed1p1: Invalid argument
>
> I suppose that beats "22".
>
> This works though:
> overcee# gpart bootcode -b /boot/pmbr twed1
>
> This doesn't:
> overcee# gpart bootcode -p /boot/gptboot -i 1 twed1
> gpart: /dev/twed1p1: Invalid argument
> I haven't figured this out yet. I'm guessing this is because
> /boot/gptboot isn't a multiple of 512 bytes. The error message is
> obviously giving no help here.
>
> Let's try padding it:
> overcee# dd if=/boot/gptboot of=/tmp/gptboot conv=sync
> 14+1 records in
> 15+0 records out
> 7680 bytes transferred in 0.000098 secs (78375316 bytes/sec)
> overcee# gpart bootcode -p /tmp/gptboot -i 1 twed1
> overcee#
>
> Yep, that worked. Now for a partition...
>
> overcee# gpart add -b 512 -s 512m -t freebsd-ufs twed1
> gpart: 22 size '512m'
> Huh? "22"?
>
> overcee# gpart add -b 512 -s 1048576 -t freebsd-ufs twed1
> twed1p2 added

Size prefixes should be supported, yes.

> But at least I think I'm getting some progress:
> overcee# gpart show twed1
> =>       34  976771053  twed1  GPT  (500.1GB)
>          34        478      1  freebsd-boot  (244.7KB)
>         512    1048576      2  freebsd-ufs  (536.9MB)
>     1049088  975721999         - free -  (499.6GB)
>
> So I continue.. I figure gpart would pick the first free space:
> overcee# gpart add -s 4058062 -t freebsd-ufs twed1
> gpart: Option 'b' not specified.

gpt did this with no problems... :-/

>
>
> Apparently not...
> overcee# gpart add -b 1049088 -s 4058062 -t freebsd-ufs twed1
> twed1p3 added
>
> Now one has to do a gpart show ; add ; show ; add loop to get the
> start address.
>
>
> This is really, really raw and unfriendly stuff.

Yes. This looks like fdisk, something we really want to avoid.

I'll see what I can do to help.

Regards,
-- 
Rui Paulo

From joao.barros at gmail.com Sat Nov 29 06:29:52 2008
From: joao.barros at gmail.com (Joao Barros)
Date: Sat Nov 29 06:29:59 2008
Subject: RFC: making gpart default
In-Reply-To:
References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com>
Message-ID: <70e8236f0811290609h2539ede7jc01778edac9c1d5@mail.gmail.com>

On Sat, Nov 29, 2008 at 4:07 AM, Peter Wemm wrote:
> On Thu, Sep 25, 2008 at 9:59 AM, Marcel Moolenaar wrote:
>> All,
>>
>> I'd like to switch all architectures to gpart for the reasons given
>> below. All current partitioning schemes are supported by gpart and
>> work on all platforms. On top of that, ia64 and powerpc are using
>> gpart exclusively already.
> [..]
>> In short: gpart is the first step towards a unified set of
>> tools and interfaces and provides the basis for extending
>> file system related tools by allowing us to attach real
>> meaning to partition types. With the commit and undo feature,
>> gpart is ready for use by next generation installers that
>> allow us to use any partitioning scheme on any platforms.
>>
>> Thoughts?
>
> oh my god. I just tried to use gpart. This needs some SERIOUS help.
>
> First, the 'gpart create' man page doesn't say what "scheme" is.

True.

> After guessing, I tried:
>
> overcee# gpart create -s gpt /dev/twed1
> gpart: 22 scheme 'gpt'
>
> What does that mean? It turns out that I didn't have GEOM_PART_GPT compiled in.

A recent CURRENT has it by default.

>
> After continuing the guessing game:
>
> overcee# gpart create -s gpt /dev/twed1
> gpart: 22 provider '/dev/twed1'
>
> That was useful. Our other tools generally allow /dev prefixes to be optional.
>
> overcee# gpart create -s gpt twed1
> twed1 created
>
> Now what? Boot code.. there's no example of this either. I tried:
>
> overcee# gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 twed1
> gpart: /dev/twed1p1: Invalid argument
>
> I suppose that beats "22".
>
> This works though:
> overcee# gpart bootcode -b /boot/pmbr twed1
>
> This doesn't:
> overcee# gpart bootcode -p /boot/gptboot -i 1 twed1
> gpart: /dev/twed1p1: Invalid argument
> I haven't figured this out yet. I'm guessing this is because
> /boot/gptboot isn't a multiple of 512 bytes. The error message is
> obviously giving no help here.
>
> Let's try padding it:
> overcee# dd if=/boot/gptboot of=/tmp/gptboot conv=sync
> 14+1 records in
> 15+0 records out
> 7680 bytes transferred in 0.000098 secs (78375316 bytes/sec)
> overcee# gpart bootcode -p /tmp/gptboot -i 1 twed1
> overcee#
>
> Yep, that worked. Now for a partition...

Fixed on Nov 18:
http://svn.freebsd.org/viewvc/base?view=revision&revision=185038

>
> overcee# gpart add -b 512 -s 512m -t freebsd-ufs twed1
> gpart: 22 size '512m'
> Huh? "22"?

This would be nice...

>
> overcee# gpart add -b 512 -s 1048576 -t freebsd-ufs twed1
> twed1p2 added
>
> But at least I think I'm getting some progress:
> overcee# gpart show twed1
> =>       34  976771053  twed1  GPT  (500.1GB)
>          34        478      1  freebsd-boot  (244.7KB)
>         512    1048576      2  freebsd-ufs  (536.9MB)
>     1049088  975721999         - free -  (499.6GB)
>
> So I continue.. I figure gpart would pick the first free space:
> overcee# gpart add -s 4058062 -t freebsd-ufs twed1
> gpart: Option 'b' not specified.

This would be nice too...

>
> Apparently not...
> overcee# gpart add -b 1049088 -s 4058062 -t freebsd-ufs twed1
> twed1p3 added
>
> Now one has to do a gpart show ; add ; show ; add loop to get the start address.
>

Yes, it could be smarter and more helpful.

>
> This is really, really raw and unfriendly stuff.

-- 
Joao Barros

From peter at wemm.org Sat Nov 29 13:10:00 2008
From: peter at wemm.org (Peter Wemm)
Date: Sat Nov 29 13:10:06 2008
Subject: RFC: making gpart default
In-Reply-To: <70e8236f0811290609h2539ede7jc01778edac9c1d5@mail.gmail.com>
References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com> <70e8236f0811290609h2539ede7jc01778edac9c1d5@mail.gmail.com>
Message-ID:

On Sat, Nov 29, 2008 at 6:09 AM, Joao Barros wrote:
> On Sat, Nov 29, 2008 at 4:07 AM, Peter Wemm wrote:
>> On Thu, Sep 25, 2008 at 9:59 AM, Marcel Moolenaar wrote:
>>> All,
>>>
>>> I'd like to switch all architectures to gpart for the reasons given
>>> below. All current partitioning schemes are supported by gpart and
>>> work on all platforms. On top of that, ia64 and powerpc are using
>>> gpart exclusively already.
>> [..]
>>> In short: gpart is the first step towards a unified set of
>>> tools and interfaces and provides the basis for extending
>>> file system related tools by allowing us to attach real
>>> meaning to partition types. With the commit and undo feature,
>>> gpart is ready for use by next generation installers that
>>> allow us to use any partitioning scheme on any platforms.
>>>
>>> Thoughts?
>>
>> oh my god. I just tried to use gpart. This needs some SERIOUS help.
>>
>> First, the 'gpart create' man page doesn't say what "scheme" is.
>
> True.

My gripe was that it just said "this is how you specify the scheme".
What is a scheme? Is this where you type "guid" or "gpt" or "mbr"?
Is it lowercase or uppercase? Is it a filename to the backing .so name?
Is it the numerical index in some array inside the g_part kernel module?

>> After guessing, I tried:
>>
>> overcee# gpart create -s gpt /dev/twed1
>> gpart: 22 scheme 'gpt'
>>
>> What does that mean? It turns out that I didn't have GEOM_PART_GPT compiled in.
>
> A recent CURRENT has it by default.

Only in GENERIC. That didn't help my machine which has been running
and upgraded all the way from freebsd-3.2.
I used to have it turned on at one point. The point was that gpart
gave me no clues about diagnosing the problem.

However, conf/DEFAULTS still has GEOM_BSD and GEOM_MBR in it. Recall
that this thread is about having g_part take over mbr and bsd, which
means switching tools. Having tools that give an error message of "22"
isn't going to cut it.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell

From peter at wemm.org Sat Nov 29 13:56:07 2008
From: peter at wemm.org (Peter Wemm)
Date: Sat Nov 29 13:56:13 2008
Subject: RFC: making gpart default
In-Reply-To: <0F1745AA-611F-40B2-85F3-32FD78BC4B58@mac.com>
References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com> <0F1745AA-611F-40B2-85F3-32FD78BC4B58@mac.com>
Message-ID:

On Sat, Nov 29, 2008 at 1:32 PM, Marcel Moolenaar wrote:
> On Nov 28, 2008, at 8:07 PM, Peter Wemm wrote:
>
>> First, the 'gpart create' man page doesn't say what "scheme" is.
>
> Yes, the manpage needs some work.
>
>> After guessing, I tried:
>>
>> overcee# gpart create -s gpt /dev/twed1
>> gpart: 22 scheme 'gpt'
>>
>> What does that mean? It turns out that I didn't have GEOM_PART_GPT
>> compiled in.
>
> Oops, forgot about parsing the error string...
> I just fixed it (rev 185454).
>
> The background: geom(8) simply prints the error that
> the kernel creates. This then requires that the kernel
> creates a user-friendly error message. This is not a
> good idea, because it side-steps things like i18n
> completely.
>
> So, for gpart I chose a different approach. The kernel
> uses a certain form for the error messages, which is:
> [ 'value']
> The intent is to have user-space interpret what is
> meant and print something the user understands. This
> I forgot to do.
>
> For now, the translation is straight-forward, but it
> should not be too hard to improve upon it further.
>
> In general, geom(8) needs to do a bit more pre- and post-
> processing. It mostly just converts the command line into
> a gctl request and as such forces the user to do all the
> work. I'll work on that...

For the record, I like the idea of having a consistent control
interface. I switched my machine to use GEOM_PART_MBR/BSD as well.

A couple of thoughts..

* Can gpart enumerate the list of schemes that the kernel supports?
If so, it could avoid the problem of having to interpret the kernel's
reaction to avoidable errors.

* The same goes for the list of 'geoms'. 'geom disk list' (among
other things) can find the providers list. gpart could avoid passing
bogus provider names into the kernel in the first place.

* A couple of DWIM concessions would go a long way. Humanized number
suffixes, the ability to search for start addresses automatically,
find next free partition index etc.

I'd like this sort of thing to work:
# gpart add -s 4g -t freebsd-ufs twed1
created partition 3 from 1049088 to 8388608 on twed1.
# gpart add -t freebsd-ufs twed1
created partition 7 from 30409216 to 946361871 on twed1

* There should be some guidance or hints on laying out disks. For
example, a gpart create -s gpt on a raid volume ends up with a start
sector of 34 for the free space. There should be a documentation hint
to round up start sectors to a power of 2 and/or block size on a raid.
eg: if you have a raid with 64K stripes, then move the start sector
from 34 to 128. Otherwise we end up with file systems issuing
transactions that can split across multiple raid stripes. FWIW, I
conveniently filled this hole with boot code.

The last issue isn't specific to gpart. There was one device at work
where the fdisk free space starts at sector 63. (31.5K).
When creating 16K ufs blocks, the particular raid controller generated
*two* operations for every single file system read/write we did. UFS
helpfully did it all in what it thought were 16K transactions, (or
clustered 64K transactions) but were actually unaligned thanks to the
default mbr layout.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell

From xcllnt at mac.com Sat Nov 29 14:32:32 2008
From: xcllnt at mac.com (Marcel Moolenaar)
Date: Sat Nov 29 14:32:38 2008
Subject: RFC: making gpart default
In-Reply-To:
References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com>
Message-ID: <0F1745AA-611F-40B2-85F3-32FD78BC4B58@mac.com>

On Nov 28, 2008, at 8:07 PM, Peter Wemm wrote:

> First, the 'gpart create' man page doesn't say what "scheme" is.

Yes, the manpage needs some work.

> After guessing, I tried:
>
> overcee# gpart create -s gpt /dev/twed1
> gpart: 22 scheme 'gpt'
>
> What does that mean? It turns out that I didn't have GEOM_PART_GPT
> compiled in.

Oops, forgot about parsing the error string...
I just fixed it (rev 185454).

The background: geom(8) simply prints the error that
the kernel creates. This then requires that the kernel
creates a user-friendly error message. This is not a
good idea, because it side-steps things like i18n
completely.

So, for gpart I chose a different approach. The kernel
uses a certain form for the error messages, which is:
[ 'value']
The intent is to have user-space interpret what is
meant and print something the user understands. This
I forgot to do.

For now, the translation is straight-forward, but it
should not be too hard to improve upon it further.

In general, geom(8) needs to do a bit more pre- and post-
processing. It mostly just converts the command line into
a gctl request and as such forces the user to do all the
work.
I'll work on that...

-- 
Marcel Moolenaar
xcllnt@mac.com

From xcllnt at mac.com Sat Nov 29 19:30:50 2008
From: xcllnt at mac.com (Marcel Moolenaar)
Date: Sat Nov 29 19:30:56 2008
Subject: RFC: making gpart default
In-Reply-To:
References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com> <0F1745AA-611F-40B2-85F3-32FD78BC4B58@mac.com>
Message-ID: <68B9D78C-C0CF-4D64-AF53-C3736EEC8D23@mac.com>

On Nov 29, 2008, at 1:56 PM, Peter Wemm wrote:

> On Sat, Nov 29, 2008 at 1:32 PM, Marcel Moolenaar
> wrote:
>> On Nov 28, 2008, at 8:07 PM, Peter Wemm wrote:
>>
>>> First, the 'gpart create' man page doesn't say what "scheme" is.
>>
>> Yes, the manpage needs some work.
>>
>>> After guessing, I tried:
>>>
>>> overcee# gpart create -s gpt /dev/twed1
>>> gpart: 22 scheme 'gpt'
>>>
>>> What does that mean? It turns out that I didn't have GEOM_PART_GPT
>>> compiled in.
>>
>> Oops, forgot about parsing the error string...
>> I just fixed it (rev 185454).
>>
>> The background: geom(8) simply prints the error that
>> the kernel creates. This then requires that the kernel
>> creates a user-friendly error message. This is not a
>> good idea, because it side-steps things like i18n
>> completely.
>>
>> So, for gpart I chose a different approach. The kernel
>> uses a certain form for the error messages, which is:
>> [ 'value']
>> The intent is to have user-space interpret what is
>> meant and print something the user understands. This
>> I forgot to do.
>>
>> For now, the translation is straight-forward, but it
>> should not be too hard to improve upon it further.
>>
>> In general, geom(8) needs to do a bit more pre- and post-
>> processing. It mostly just converts the command line into
>> a gctl request and as such forces the user to do all the
>> work. I'll work on that...
>
> For the record, I like the idea of having a consistent control
> interface. I switched my machine to use GEOM_PART_MBR/BSD as well.
>
> A couple of thoughts..
>
> * Can gpart enumerate the list of schemes that the kernel supports?
> If so, it could avoid the problem of having to interpret the kernel's
> reaction to avoidable errors.

Not yet, but can be added fairly easily. The kernel has
a list of supported schemes, which it just needs to
export in the XML. In that case, and with each scheme a
kernel module, gpart(8) can also try to load the kernel
module on demand...

> * The same goes for the list of 'geoms'. 'geom disk list' (among
> other things) can find the providers list. gpart could avoid passing
> bogus provider names into the kernel in the first place.

True. The list of GEOMs is already in the XML, so it
can be checked up-front by gpart(8). This also aligns
well with stripping /dev/ from the provider and geom
name.

> * A couple of DWIM concessions would go a long way. Humanized number
> suffixes, the ability to search for start addresses automatically,
> find next free partition index etc.

Yup. I considered this already and just haven't gotten
around to work on it.

> * There should be some guidance or hints on laying out disks. For
> example, a gpart create -s gpt on a raid volume ends up with a start
> sector of 34 for the free space. There should be a documentation hint
> to round up start sectors to a power of 2 and/or block size on a raid.
> eg: if you have a raid with 64K stripes, then move the start sector
> from 34 to 128. Otherwise we end up with file systems issuing
> transactions that can split across multiple raid stripes. FWIW, I
> conveniently filled this hole with boot code.

Hmmm... gpart(8) typically can't store this kind of
information on-disk, but other than that it supports
alignment/padding already. We just need a way to tell
gpart about it. Maybe this should come from the provider
(i.e. underlying geom)...

-- 
Marcel Moolenaar
xcllnt@mac.com
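
The alignment hint discussed in this thread (move the start sector from
34 to 128 on a raid with 64K stripes) comes down to rounding a start
sector up to the next multiple of the stripe size. A minimal sketch in
POSIX sh, assuming 512-byte sectors; align_up is a hypothetical helper
for illustration and is not part of gpart(8):

```sh
#!/bin/sh
# Hypothetical helper (not a gpart feature): round a start sector up
# to the next multiple of the stripe size. Both values are counted in
# 512-byte sectors, which is an assumption of this sketch.
align_up() {
    start=$1
    stripe=$2
    echo $(( (start + stripe - 1) / stripe * stripe ))
}

# A 64K stripe is 128 sectors of 512 bytes.
align_up 34 128        # first free GPT sector 34 -> 128
# A 16K UFS block is 32 sectors of 512 bytes.
align_up 63 32         # default MBR start 63 -> 64
align_up 1049088 128   # already aligned, stays 1049088
```

The result could then be passed by hand as the -b argument to
"gpart add"; whether gpart should do this rounding itself is the open
question in the message above.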