From bugmaster at FreeBSD.org Mon Dec 1 03:06:52 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Dec 1 03:07:26 2008 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200812011106.mB1B6pFH052479@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From des at des.no Mon Dec 1 03:59:52 2008 From: des at des.no (=?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?=) Date: Mon Dec 1 03:59:59 2008 Subject: RFC: making gpart default In-Reply-To: (Peter Wemm's message of "Fri, 28 Nov 2008 20:07:50 -0800") References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com> Message-ID: <86vdu4qg2m.fsf@ds4.des.no> "Peter Wemm" writes: > After continuing the guessing game: > > overcee# gpart create -s gpt /dev/twed1 > gpart: 22 provider '/dev/twed1' > > That was useful. Out other tools generally allow /dev prefixes to be optional. In this case, twed1 is the name of a GEOM provider, not a device. I'm not sure all GEOM providers necessarily have device nodes - at least, the GEOM framework doesn't enforce it - and for those that do, the device name might not match the GEOM name. 
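The "/dev prefix is optional" convenience Peter asks for could be sketched roughly as below. This is only an illustration, not gpart's actual code: `provider_name` and `DEV_PREFIX` are hypothetical names, and, as noted above, stripping the prefix only helps when the GEOM provider name happens to match the device node name.

```c
#include <assert.h>
#include <string.h>

/* Assumed device-directory prefix (hypothetical constant for this sketch). */
#define DEV_PREFIX "/dev/"

/*
 * Return the argument with an optional leading "/dev/" removed.
 * The returned pointer aliases the input; nothing is allocated.
 */
const char *
provider_name(const char *arg)
{
	size_t n = strlen(DEV_PREFIX);

	assert(arg != NULL);
	if (strncmp(arg, DEV_PREFIX, n) == 0)
		return (arg + n);
	return (arg);
}
```

With this, both "/dev/twed1" and "twed1" resolve to the name "twed1" before any provider lookup is attempted.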
DES -- Dag-Erling Smørgrav - des@des.no From peter at wemm.org Mon Dec 1 16:05:40 2008 From: peter at wemm.org (Peter Wemm) Date: Mon Dec 1 16:05:46 2008 Subject: RFC: making gpart default In-Reply-To: <68B9D78C-C0CF-4D64-AF53-C3736EEC8D23@mac.com> References: <57809A37-B81C-4F50-A418-CD9303F06B72@mac.com> <0F1745AA-611F-40B2-85F3-32FD78BC4B58@mac.com> <68B9D78C-C0CF-4D64-AF53-C3736EEC8D23@mac.com> Message-ID: On Sat, Nov 29, 2008 at 7:30 PM, Marcel Moolenaar wrote: > > On Nov 29, 2008, at 1:56 PM, Peter Wemm wrote: [..] >> * There should be some guidance or hints on laying out disks. For >> example, a gpart create -s gpt on a raid volume ends up with a start >> sector of 34 for the free space. There should be a documentation hint >> to round up start sectors to a power of 2 and/or block size on a raid. >> eg: if you have a raid with 64K stripes, then move the start sector >> from 34 to 128. Otherwise we end up with file systems issuing >> transactions that can split across multiple raid stripes. FWIW, I >> conveniently filled this hole with boot code. > > Hmmm... gpart(8) typically can't store this kind > of information on-disk, but other than that it > supports alignment/padding already. We just need > a way to tell gpart about it. Maybe this should > come from the provider (i.e. underlying geom)... I was thinking more of a man page note to warn of the issue. Also, in the GPT case, it might make sense to round up the initial size to a power of 2. Right now we lose 34 sectors from the beginning. Rounding it to 64 total at least gets us to an even power of 2. UFS's frequent block size of 16K shouldn't cross any underlying stripe boundaries in the usual case. -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." 
-- Robert Sewell From jhb at freebsd.org Tue Dec 2 14:32:57 2008 From: jhb at freebsd.org (John Baldwin) Date: Tue Dec 2 14:33:02 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <4929F90B.1040502@freebsd.org> References: <492412E8.3060700@freebsd.org> <200811211348.41536.jhb@freebsd.org> <4929F90B.1040502@freebsd.org> Message-ID: <200812021707.41545.jhb@freebsd.org> On Sunday 23 November 2008 07:44:59 pm Lawrence Stewart wrote: > John Baldwin wrote: > > On Thursday 20 November 2008 05:22:03 pm Lawrence Stewart wrote: > >> John Baldwin wrote: > >>> On Wednesday 19 November 2008 08:21:44 am Lawrence Stewart wrote: > >>>> Hi all, > >>>> > >>>> I tracked down a deadlock in some of my code today to some weird > >>>> behaviour in the kthread(9) KPI. The executive summary is that > >>>> kthread_exit() thread termination notification using wakeup() behaves as > >>>> expected intuitively in 8.x, but not in 7.x. > >>> In 5.x/6.x/7.x kthreads are still processes and it has always been a > > wakeup on > >>> the proc pointer. kthread_create() in 7.x returns a proc pointer, not a > >>> thread pointer for example. In 8.x kthreads are actual threads and > >> Yep, but the processes have a *thread in them right? The API naming is > >> obviously slightly misleading, but it essentially creates a new single > >> threaded process prior to 8.x. > > > > Yes, but you have to go explicitly use FIRST_THREAD_IN_PROC(). Most of the > > kernel modules I've written that use kthread's in < 8 do this: > > > > static struct proc *foo_thread; > > > > /* Called for MOD_LOAD. */ > > static void > > load(...) > > { > > > > error = kthread_create(..., &foo_thread); > > } > > > > static void > > unload(...) > > { > > > > /* set flag */ > > msleep(foo_thread, ...); > > } > > > > And never actually use the thread at all. 
However, if you write the code for > > 8.x, now you _do_ get a kthread and sleep on the thread so it becomes: > > > > static struct thread *foo_thread; > > > > static void > > load(...) > > { > > > > error = kproc_kthread_add(..., proc0, &foo_thread); > > } > > > > static void > > unload(...) > > { > > > > /* set flag */ > > msleep(foo_thread, ...); > > } > > > > > Sure, but to write the code in this way means you are exercising > undocumented knowledge of the KPI. I suspect the average developer > completely unfamiliar with the KPI would (and should!) use the man page > to learn about the functionality it provides. > > With that basis in mind, it seems unreasonable to expect the developer > to come to the conclusion that "...will initiate a call to wakeup(9) on > the thread handle." refers to sleeping on the *proc passed in to > kthread_create. Perhaps I'm not as switched on as the average developer, > but when I read it I certainly did not understand that the KPI created > processes and that the man page used the term thread to really mean a > single threaded process. I also did no equate "thread handle" with the > *proc returned by kthread_create. This is why the API in 8 is better, it is far less confusing. > Apart from the discussion thus far, you haven't actually commented yet > on my proposed single line change to kthread_exit() in 7.x to call > wakeup on the *thread as well as the *proc. Do you have any specific > thoughts on or objection to that idea? I would rather fix the docs first and not encourage folks to use FIRST_THREAD_IN_PROC (a). My only worry about the additional wakeup is other places that may be sleeping on the thread pointer for another reason. It might even be better to add a dedicated condvar to 'struct thread' in 8.x that is used for the wakeup and do the wakeup on that rather than the thread pointer to be honest. 
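The dedicated-condvar idea can be illustrated with a userland analogy. The sketch below uses POSIX threads rather than the kernel's cv_wait()/cv_broadcast(), so it is not kernel code and not a proposed patch; `struct worker`, `worker_main` and `worker_run_and_wait` are hypothetical names. The point it shows is that the waiter sleeps on a channel that exists only to announce "this thread has exited", so an unrelated wakeup on a shared pointer cannot be confused with it.

```c
#include <assert.h>
#include <pthread.h>

struct worker {
	pthread_mutex_t	lock;
	pthread_cond_t	exited_cv;	/* dedicated channel: "worker has exited" */
	int		exited;		/* predicate protected by lock */
	pthread_t	tid;
};

/* Thread body: record the exit and wake anyone waiting on the condvar. */
static void *
worker_main(void *arg)
{
	struct worker *w = arg;

	assert(w != NULL);
	pthread_mutex_lock(&w->lock);
	w->exited = 1;
	pthread_cond_broadcast(&w->exited_cv);
	pthread_mutex_unlock(&w->lock);
	return (NULL);
}

/* Start a worker and block until it announces its exit; returns 1 on success. */
int
worker_run_and_wait(void)
{
	struct worker w;
	int exited;

	pthread_mutex_init(&w.lock, NULL);
	pthread_cond_init(&w.exited_cv, NULL);
	w.exited = 0;
	if (pthread_create(&w.tid, NULL, worker_main, &w) != 0)
		return (-1);

	pthread_mutex_lock(&w.lock);
	while (!w.exited)		/* loop guards against spurious wakeups */
		pthread_cond_wait(&w.exited_cv, &w.lock);
	exited = w.exited;
	pthread_mutex_unlock(&w.lock);

	pthread_join(w.tid, NULL);
	pthread_cond_destroy(&w.exited_cv);
	pthread_mutex_destroy(&w.lock);
	return (exited);
}
```

Because the predicate is checked under the lock, the waiter does not miss the notification even if the worker finishes before the wait begins.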
-- John Baldwin From lstewart at freebsd.org Tue Dec 2 18:02:34 2008 From: lstewart at freebsd.org (Lawrence Stewart) Date: Tue Dec 2 18:02:40 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <200812021707.41545.jhb@freebsd.org> References: <492412E8.3060700@freebsd.org> <200811211348.41536.jhb@freebsd.org> <4929F90B.1040502@freebsd.org> <200812021707.41545.jhb@freebsd.org> Message-ID: <4935E4B5.6090204@freebsd.org> John Baldwin wrote: > On Sunday 23 November 2008 07:44:59 pm Lawrence Stewart wrote: >> John Baldwin wrote: >>> On Thursday 20 November 2008 05:22:03 pm Lawrence Stewart wrote: >>>> John Baldwin wrote: >>>>> On Wednesday 19 November 2008 08:21:44 am Lawrence Stewart wrote: >>>>>> Hi all, >>>>>> >>>>>> I tracked down a deadlock in some of my code today to some weird >>>>>> behaviour in the kthread(9) KPI. The executive summary is that >>>>>> kthread_exit() thread termination notification using wakeup() behaves > as >>>>>> expected intuitively in 8.x, but not in 7.x. >>>>> In 5.x/6.x/7.x kthreads are still processes and it has always been a >>> wakeup on >>>>> the proc pointer. kthread_create() in 7.x returns a proc pointer, not a >>>>> thread pointer for example. In 8.x kthreads are actual threads and >>>> Yep, but the processes have a *thread in them right? The API naming is >>>> obviously slightly misleading, but it essentially creates a new single >>>> threaded process prior to 8.x. >>> Yes, but you have to go explicitly use FIRST_THREAD_IN_PROC(). Most of > the >>> kernel modules I've written that use kthread's in < 8 do this: >>> >>> static struct proc *foo_thread; >>> >>> /* Called for MOD_LOAD. */ >>> static void >>> load(...) >>> { >>> >>> error = kthread_create(..., &foo_thread); >>> } >>> >>> static void >>> unload(...) >>> { >>> >>> /* set flag */ >>> msleep(foo_thread, ...); >>> } >>> >>> And never actually use the thread at all. 
However, if you write the code > for >>> 8.x, now you _do_ get a kthread and sleep on the thread so it becomes: >>> >>> static struct thread *foo_thread; >>> >>> static void >>> load(...) >>> { >>> >>> error = kproc_kthread_add(..., proc0, &foo_thread); >>> } >>> >>> static void >>> unload(...) >>> { >>> >>> /* set flag */ >>> msleep(foo_thread, ...); >>> } >>> >> >> Sure, but to write the code in this way means you are exercising >> undocumented knowledge of the KPI. I suspect the average developer >> completely unfamiliar with the KPI would (and should!) use the man page >> to learn about the functionality it provides. >> >> With that basis in mind, it seems unreasonable to expect the developer >> to come to the conclusion that "...will initiate a call to wakeup(9) on >> the thread handle." refers to sleeping on the *proc passed in to >> kthread_create. Perhaps I'm not as switched on as the average developer, >> but when I read it I certainly did not understand that the KPI created >> processes and that the man page used the term thread to really mean a >> single threaded process. I also did no equate "thread handle" with the >> *proc returned by kthread_create. > > This is why the API in 8 is better, it is far less confusing. > >> Apart from the discussion thus far, you haven't actually commented yet >> on my proposed single line change to kthread_exit() in 7.x to call >> wakeup on the *thread as well as the *proc. Do you have any specific >> thoughts on or objection to that idea? > > I would rather fix the docs first and not encourage folks to use > FIRST_THREAD_IN_PROC (a). My only worry about the additional wakeup is other > places that may be sleeping on the thread pointer for another reason. It Ok, I'll start by having a go at rejigging the 7.x and 8.x kthread(9) man pages to be more informational and draw out the subtle differences we've been discussing in this thread. I'll post man page patches for review when they're ready. 
> might even be better to add a dedicated condvar to 'struct thread' in 8.x > that is used for the wakeup and do the wakeup on that rather than the thread > pointer to be honest. > What are the pros/cons of using mtx_sleep/wakeup vs cv_wait/cv_broadcast? Cheers, Lawrence From jhb at freebsd.org Wed Dec 3 14:52:33 2008 From: jhb at freebsd.org (John Baldwin) Date: Wed Dec 3 14:52:40 2008 Subject: kthread_exit(9) unexpectedness In-Reply-To: <4935E4B5.6090204@freebsd.org> References: <492412E8.3060700@freebsd.org> <200812021707.41545.jhb@freebsd.org> <4935E4B5.6090204@freebsd.org> Message-ID: <200812031502.42218.jhb@freebsd.org> On Tuesday 02 December 2008 08:45:25 pm Lawrence Stewart wrote: > > might even be better to add a dedicated condvar to 'struct thread' in 8.x > > that is used for the wakeup and do the wakeup on that rather than the thread > > pointer to be honest. > > > > What are the pros/cons of using mtx_sleep/wakeup vs cv_wait/cv_broadcast? Forces you to use explicit wait channels as opposed to some of the problems we have now with 3-4 places sleeping on proc pointers for example. -- John Baldwin From imp at bsdimp.com Wed Dec 3 18:39:07 2008 From: imp at bsdimp.com (M. Warner Losh) Date: Wed Dec 3 18:39:14 2008 Subject: RFC: making gpart default In-Reply-To: References: <68B9D78C-C0CF-4D64-AF53-C3736EEC8D23@mac.com> Message-ID: <20081203.193714.693830802.imp@bsdimp.com> In message: "Peter Wemm" writes: : On Sat, Nov 29, 2008 at 7:30 PM, Marcel Moolenaar wrote: : > : > On Nov 29, 2008, at 1:56 PM, Peter Wemm wrote: : [..] : >> * There should be some guidance or hints on laying out disks. For : >> example, a gpart create -s gpt on a raid volume ends up with a start : >> sector of 34 for the free space. There should be a documentation hint : >> to round up start sectors to a power of 2 and/or block size on a raid. : >> eg: if you have a raid with 64K stripes, then move the start sector : >> from 34 to 128. 
Otherwise we end up with file systems issuing : >> transactions that can split across multiple raid stripes. FWIW, I : >> conveniently filled this hole with boot code. : > : > Hmmm... gpart(8) typically can't store this kind : > of information on-disk, but other than that it : > supports alignment/padding already. We just need : > a way to tell gpart about it. Maybe this should : > come from the provider (i.e. underlying geom)... : : I was more thinking of a man page note to warn of the issue. : : Also, in the gpt case, it might make sense in gpt partition table case : to round up the initial size to a power of 2. Right now we lose 34 : sectors from the beginning. Rounding it to 64 total at least gets us : to an even power of 2. UFS's frequent block size of 16K shouldn't : cross any underlying stripe boundaries in the usual case. This is likely a holdover from the MBR code that puts the first partition at one cylinder offset from the beginning to conform with the MBR conventions of (some?) BIOSes that use that to get the parameters for the disk... Warner From flo at kasimir.com Thu Dec 4 10:24:10 2008 From: flo at kasimir.com (Florian Smeets) Date: Thu Dec 4 10:24:16 2008 Subject: Adding strndup(3) to libc viable/useful? Message-ID: <49381DD4.2000506@kasimir.com> Hi, first of all, I hope arch is the correct place to discuss this. While porting an application to FreeBSD I found that FreeBSD's libc does not have strndup. NetBSD added this about 2 years ago. Porting it to FreeBSD was very easy. There are 13 ports in the ports tree right now that patch in strndup via a patch in the files/ dir; actually, 12 bring their own version of strndup and one replaces it with a call to malloc/strncpy. Would it make sense to add this to our libc? A patch which does this is available here at http://webmail.solomo.de/~flo/strndup.patch I don't know if there is such a thing as a minimum number of ports that must require a function before it can be added to the base system... 
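For readers who have not met the function: a compatibility shim of the kind those ports carry might look like the sketch below. It is not the patch linked above (whose contents are not shown here) and not NetBSD's implementation; the name `compat_strndup` is hypothetical, chosen so it cannot clash with a libc that already provides strndup().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate at most maxlen bytes of s into freshly allocated memory and
 * always NUL-terminate the copy.  Returns NULL if malloc() fails; the
 * caller owns the result and must free() it.
 */
char *
compat_strndup(const char *s, size_t maxlen)
{
	size_t len = 0;
	char *copy;

	assert(s != NULL);
	/* Bounded length scan: never reads past s[maxlen - 1]. */
	while (len < maxlen && s[len] != '\0')
		len++;

	copy = malloc(len + 1);
	if (copy == NULL)
		return (NULL);
	memcpy(copy, s, len);
	copy[len] = '\0';
	return (copy);
}
```

Unlike a bare malloc/strncpy replacement, the copy is terminated explicitly, which strncpy does not guarantee when the source is longer than the limit.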
Any feedback appreciated. Cheers, Florian From delphij at delphij.net Thu Dec 4 10:44:37 2008 From: delphij at delphij.net (Xin LI) Date: Thu Dec 4 10:44:45 2008 Subject: Adding strndup(3) to libc viable/useful? In-Reply-To: <49381DD4.2000506@kasimir.com> References: <49381DD4.2000506@kasimir.com> Message-ID: <49382502.1040403@delphij.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, Florian, Florian Smeets wrote: > Hi, > > first of all i hope arch is the correct place to discuss this. > > While porting an application to FreeBSD i found that FreeBSDs libc does > not have strndup. NetBSD added this about 2 years ago. A port of this to > FreeBSD was very easy. > > There are 13 ports in the ports tree right now that patch in strndup via > a patch in the files/ dir, well actually 12 bring there own version of > strndup and one replaces it with a call to malloc/strncpy. > > Would it make sense to add this to our libc? A patch which does this is > available here at http://webmail.solomo.de/~flo/strndup.patch > > I don't know if there is such a thing as minimum number of ports to > require a function so that it can be added to the base system... > > Any feedback appreciated. I think whether or not to add it really depends on how popular it is :) We included strdup() because it is a very common extension. Your patch looks fine but perhaps it would be a good idea to explicitly mention that this is not a commonly implemented GNU extension (inheritedly, this could reduce portability). Cheers, - -- Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (FreeBSD) iEYEARECAAYFAkk4JQEACgkQi+vbBBjt66BNKwCgtLMQccFG6VvPv2xLjsmZr7I5 VuEAnjiCRzKtuhmMHHzuwrBibKGluidX =Y4XX -----END PGP SIGNATURE----- From des at des.no Thu Dec 4 10:46:54 2008 From: des at des.no (=?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?=) Date: Thu Dec 4 10:47:00 2008 Subject: Adding strndup(3) to libc viable/useful? 
In-Reply-To: <49381DD4.2000506@kasimir.com> (Florian Smeets's message of "Thu, 04 Dec 2008 19:13:40 +0100") References: <49381DD4.2000506@kasimir.com> Message-ID: <86r64nwzfn.fsf@ds4.des.no> Florian Smeets writes: > There are 13 ports in the ports tree right now that patch in strndup > via a patch in the files/ dir, well actually 12 bring there own > version of strndup and one replaces it with a call to malloc/strncpy. > > Would it make sense to add this to our libc? A patch which does this > is available here at http://webmail.solomo.de/~flo/strndup.patch Not a bad idea, but those ports patches would still be required for compatibility with existing releases. DES -- Dag-Erling Smørgrav - des@des.no From peter at wemm.org Thu Dec 4 11:27:03 2008 From: peter at wemm.org (Peter Wemm) Date: Thu Dec 4 11:27:10 2008 Subject: Adding strndup(3) to libc viable/useful? In-Reply-To: <49382502.1040403@delphij.net> References: <49381DD4.2000506@kasimir.com> <49382502.1040403@delphij.net> Message-ID: On Thu, Dec 4, 2008 at 10:44 AM, Xin LI wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, Florian, > > Florian Smeets wrote: >> Hi, >> >> first of all i hope arch is the correct place to discuss this. >> >> While porting an application to FreeBSD i found that FreeBSDs libc does >> not have strndup. NetBSD added this about 2 years ago. A port of this to >> FreeBSD was very easy. >> >> There are 13 ports in the ports tree right now that patch in strndup via >> a patch in the files/ dir, well actually 12 bring there own version of >> strndup and one replaces it with a call to malloc/strncpy. >> >> Would it make sense to add this to our libc? A patch which does this is >> available here at http://webmail.solomo.de/~flo/strndup.patch >> >> I don't know if there is such a thing as minimum number of ports to >> require a function so that it can be added to the base system... >> >> Any feedback appreciated. 
> > I think whether or not to add it really depends on how popular it is :) > We included strdup() because it is a very common extension. > > Your patch looks fine but perhaps it would be a good idea to explicitly > mention that this is not a commonly implemented GNU extension > (inheritedly, this could reduce portability). glibc has had this for a long time, and the function seems to be gaining ground elsewhere. I think Solaris is the last remaining major holdout. The str*() namespace belongs to the implementation (us). There are lots of places where I've seen strndup() implemented as a compatibility shim, everything from Asterisk to Varnish. I wouldn't be surprised if there were ports that had this knowledge hard-coded. FWIW, there are a bunch of other useful utility str*() and mem*() functions that glibc has that we do not. I've run into the lack of fmemopen() in the past. I found an implementation from rwatson. -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." -- Robert Sewell From max at love2party.net Thu Dec 4 11:54:43 2008 From: max at love2party.net (Max Laier) Date: Thu Dec 4 11:54:49 2008 Subject: Adding strndup(3) to libc viable/useful? In-Reply-To: References: <49381DD4.2000506@kasimir.com> <49382502.1040403@delphij.net> Message-ID: <200812042042.06181.max@love2party.net> On Thursday 04 December 2008 20:27:01 Peter Wemm wrote: > On Thu, Dec 4, 2008 at 10:44 AM, Xin LI wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > > Hi, Florian, > > > > Florian Smeets wrote: > >> Hi, > >> > >> first of all i hope arch is the correct place to discuss this. > >> > >> While porting an application to FreeBSD i found that FreeBSDs libc does > >> not have strndup. NetBSD added this about 2 years ago. A port of this to > >> FreeBSD was very easy. 
> >> > >> There are 13 ports in the ports tree right now that patch in strndup via > >> a patch in the files/ dir, well actually 12 bring there own version of > >> strndup and one replaces it with a call to malloc/strncpy. > >> > >> Would it make sense to add this to our libc? A patch which does this is > >> available here at http://webmail.solomo.de/~flo/strndup.patch > >> > >> I don't know if there is such a thing as minimum number of ports to > >> require a function so that it can be added to the base system... > >> > >> Any feedback appreciated. > > > > I think whether or not to add it really depends on how popular it is :) > > We included strdup() because it is a very common extension. > > > > Your patch looks fine but perhaps it would be a good idea to explicitly > > mention that this is not a commonly implemented GNU extension > > (inheritedly, this could reduce portability). > > glibc has had this for a long time and the trend for this function > seems to be gaining ground. I think solaris is the last remaining > major holdout. > > str*() namespace belongs to the implementation (us). > > There are lots of places where I've seen strndup() implemented as > compatability shims. Everything from Asterisk to Varnish. I wouldn't > be suprised if there were ports that had this knowledge hard coded. > > FWIW, there are a bunch of other useful utility str*() and mem*() > functions that glibc has that we do not. strnvis! (from OpenBSD) > I've run into the lack of fmemopen() in the past. I found an > implementation from rwatson. 
-- /"\ Best regards, | mlaier@freebsd.org \ / Max Laier | ICQ #67774661 X http://pf4freebsd.love2party.net/ | mlaier@EFnet / \ ASCII Ribbon Campaign | Against HTML Mail and News From jhb at freebsd.org Thu Dec 4 13:24:38 2008 From: jhb at freebsd.org (John Baldwin) Date: Thu Dec 4 13:24:50 2008 Subject: RFC: making gpart default In-Reply-To: <20081203.193714.693830802.imp@bsdimp.com> References: <20081203.193714.693830802.imp@bsdimp.com> Message-ID: <200812041313.34565.jhb@freebsd.org> On Wednesday 03 December 2008 09:37:14 pm M. Warner Losh wrote: > In message: > "Peter Wemm" writes: > : On Sat, Nov 29, 2008 at 7:30 PM, Marcel Moolenaar wrote: > : > > : > On Nov 29, 2008, at 1:56 PM, Peter Wemm wrote: > : [..] > : >> * There should be some guidance or hints on laying out disks. For > : >> example, a gpart create -s gpt on a raid volume ends up with a start > : >> sector of 34 for the free space. There should be a documentation hint > : >> to round up start sectors to a power of 2 and/or block size on a raid. > : >> eg: if you have a raid with 64K stripes, then move the start sector > : >> from 34 to 128. Otherwise we end up with file systems issuing > : >> transactions that can split across multiple raid stripes. FWIW, I > : >> conveniently filled this hole with boot code. > : > > : > Hmmm... gpart(8) typically can't store this kind > : > of information on-disk, but other than that it > : > supports alignment/padding already. We just need > : > a way to tell gpart about it. Maybe this should > : > come from the provider (i.e. underlying geom)... > : > : I was more thinking of a man page note to warn of the issue. > : > : Also, in the gpt case, it might make sense in gpt partition table case > : to round up the initial size to a power of 2. Right now we lose 34 > : sectors from the beginning. Rounding it to 64 total at least gets us > : to an even power of 2. UFS's frequent block size of 16K shouldn't > : cross any underlying stripe boundaries in the usual case. 
> > This likely is a hang over from the MBR code that puts the first > partition at one cylendar offset from the beginning to conform with > the MBR conventions of (some?) Bioses that use that to get the > parameters for the disk... No, the way GPT works, you have a PMBR at sector 0, then immediately following that you have the Primary partition table in the next N sectors (the first sector in the table has a header that contains the size of the table). Then you have a backup Secondary partition table in the last N sectors of the disk as well. At least with the old gpt(8) tool you could actually tell it how big of a table to make when you created a GPT, and I imagine gpart probably can do the same. -- John Baldwin From xcllnt at mac.com Thu Dec 4 15:08:21 2008 From: xcllnt at mac.com (Marcel Moolenaar) Date: Thu Dec 4 15:08:27 2008 Subject: RFC: making gpart default In-Reply-To: <200812041313.34565.jhb@freebsd.org> References: <20081203.193714.693830802.imp@bsdimp.com> <200812041313.34565.jhb@freebsd.org> Message-ID: <5783CEB0-6163-429E-8B28-2F9D6FBCF4A8@mac.com> On Dec 4, 2008, at 10:13 AM, John Baldwin wrote: > No, the way GPT works, you have a PMBR at sector 0, then immediately > following > that you have the Primary partition table in the next N sectors (the > first > sector in the table has a header that contains the size of the > table). 
Then > you have a backup Secondary partition table in the last N sectors of > the disk > as well. At least with the old gpt(8) tool you could actually tell > it how > big of a table to make when you created a GPT, and I imagine gpart > probably > can do the same. Yes. For schemes that support it, you can specify how many entries to allocate. The 34 corresponds to 128 entries for GPT (4 entries per sector)... -- Marcel Moolenaar xcllnt@mac.com From wollman at hergotha.csail.mit.edu Thu Dec 4 15:51:59 2008 From: wollman at hergotha.csail.mit.edu (Garrett Wollman) Date: Thu Dec 4 15:52:05 2008 Subject: Adding strndup(3) to libc viable/useful? In-Reply-To: References: <49381DD4.2000506@kasimir.com> <49382502.1040403@delphij.net> Message-ID: <200812042313.mB4NDA6j045334@hergotha.csail.mit.edu> In article , Peter Wemm writes: >glibc has had this for a long time and the trend for this function >seems to be gaining ground. I think solaris is the last remaining >major holdout. Many formerly glibc-only functions have been adopted in the current (IEEE Std.1003.1-2008, ISO/IEC 9945:2009) POSIX revision. >FWIW, there are a bunch of other useful utility str*() and mem*() >functions that glibc has that we do not. Any of the functions that POSIX has adopted should definitely be added. >I've run into the lack of fmemopen() in the past. I found an >implementation from rwatson. fmemopen() is one of them. -GAWollman From delphij at delphij.net Sat Dec 6 13:29:45 2008 From: delphij at delphij.net (Xin LI) Date: Sat Dec 6 13:29:51 2008 Subject: Adding strndup(3) to libc viable/useful? 
In-Reply-To: <200812042313.mB4NDA6j045334@hergotha.csail.mit.edu> References: <49381DD4.2000506@kasimir.com> <49382502.1040403@delphij.net> <200812042313.mB4NDA6j045334@hergotha.csail.mit.edu> Message-ID: <493AEEBD.9050101@delphij.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Garrett Wollman wrote: > In article , > Peter Wemm writes: > >> glibc has had this for a long time and the trend for this function >> seems to be gaining ground. I think solaris is the last remaining >> major holdout. > > Many formerly glibc-only functions have been adopted in the current > (IEEE Std.1003.1-2008, ISO/IEC 9945:2009) POSIX revision. Is there a public available version on the Internet? Google turns out only an approval notice on Austin group... >> FWIW, there are a bunch of other useful utility str*() and mem*() >> functions that glibc has that we do not. > > Any of the functions that POSIX has adopted should definitely be > added. > >> I've run into the lack of fmemopen() in the past. I found an >> implementation from rwatson. > > fmemopen() is one of them. > > -GAWollman > > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" - -- Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (FreeBSD) iEYEARECAAYFAkk67r0ACgkQi+vbBBjt66CNxQCfVT7AqvP5wzYcZ11HS1icuDno XBYAnj4l/AGiuxUf+PhzRaH3Skt9XEzc =n9ZK -----END PGP SIGNATURE----- From wollman at bimajority.org Sat Dec 6 14:10:40 2008 From: wollman at bimajority.org (Garrett Wollman) Date: Sat Dec 6 14:10:47 2008 Subject: Adding strndup(3) to libc viable/useful? 
In-Reply-To: <493AEEBD.9050101@delphij.net> References: <49381DD4.2000506@kasimir.com> <49382502.1040403@delphij.net> <200812042313.mB4NDA6j045334@hergotha.csail.mit.edu> <493AEEBD.9050101@delphij.net> Message-ID: <18746.63456.941134.552539@hergotha.csail.mit.edu> < said: > Is there a public available version on the Internet? Google turns out > only an approval notice on Austin group... The public HTML version is being worked on right now and should be available soon. The official PDF is available to Austin Group members and, I think, for sale from the Open Group catalogue. (Note that the ISO/IEC version of the standard has not yet been approved; I believe it's still being balloted. When it is approved, the PDF with updated front matter will be available from the ISO bookstore.) -GAWollman From peter at wemm.org Sat Dec 6 15:27:23 2008 From: peter at wemm.org (Peter Wemm) Date: Sat Dec 6 15:27:29 2008 Subject: RFC: making gpart default In-Reply-To: <20081203.193714.693830802.imp@bsdimp.com> References: <68B9D78C-C0CF-4D64-AF53-C3736EEC8D23@mac.com> <20081203.193714.693830802.imp@bsdimp.com> Message-ID: On Wed, Dec 3, 2008 at 6:37 PM, M. Warner Losh wrote: > In message: > "Peter Wemm" writes: > : On Sat, Nov 29, 2008 at 7:30 PM, Marcel Moolenaar wrote: > : > > : > On Nov 29, 2008, at 1:56 PM, Peter Wemm wrote: > : [..] > : >> * There should be some guidance or hints on laying out disks. For > : >> example, a gpart create -s gpt on a raid volume ends up with a start > : >> sector of 34 for the free space. There should be a documentation hint > : >> to round up start sectors to a power of 2 and/or block size on a raid. > : >> eg: if you have a raid with 64K stripes, then move the start sector > : >> from 34 to 128. Otherwise we end up with file systems issuing > : >> transactions that can split across multiple raid stripes. FWIW, I > : >> conveniently filled this hole with boot code. > : > > : > Hmmm... 
> : > gpart(8) typically can't store this kind
> : > of information on-disk, but other than that it
> : > supports alignment/padding already. We just need
> : > a way to tell gpart about it. Maybe this should
> : > come from the provider (i.e. underlying geom)...
> :
> : I was more thinking of a man page note to warn of the issue.
> :
> : Also, in the gpt partition table case, it might make sense
> : to round up the initial size to a power of 2. Right now we lose 34
> : sectors from the beginning. Rounding it to 64 total at least gets us
> : to an even power of 2. UFS's frequent block size of 16K shouldn't
> : cross any underlying stripe boundaries in the usual case.
>
> This likely is a holdover from the MBR code that puts the first
> partition at one cylinder offset from the beginning to conform with
> the MBR conventions of (some?) BIOSes that use that to get the
> parameters for the disk...

I don't recall what happens for "small" partitions, but I know from
experience that BIOSes care far more about the end sector geometry
than the start.

eg: typically we used to set start: lba = 63, cyl 0, head 1, sector 1.
All that implies is that there are 63 sectors per track. It gives no
hints to the BIOS about number of heads.

When we set end: lba = 1234567, cyl 1023, head 254, sector 63 then
that implies number of heads and number of sectors. Cyl = 1023 means
"rest of disk".

No BIOSes that I've come across in the last 8 years gave a damn about
the start info, just the end info. It was effectively required that
you "end" on a cylinder boundary.

We have not been "start"ing on a cylinder boundary - just a track/head
boundary. Linux (and Windows, I think) do start on full cylinder
boundaries for partitions >= 8GB.

head boundary start: cyl 0, head 1, sec 1
cylinder boundary start: cyl 1, head 0, sec 1.

Anyway.. the important thing is that we do actually have a choice.
Typically we'd have:

sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 63, size 160086465 (78167 Meg), flag 80 (active)
        beg: cyl 0/ head 1/ sector 1;
        end: cyl 1023/ head 254/ sector 63

But this is faster:

sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 512, size 285458473 (139384 Meg), flag 80 (active)
        beg: cyl 0/ head 8/ sector 9;
        end: cyl 1023/ head 254/ sector 63

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell

From peter at wemm.org Sat Dec 6 15:31:50 2008
From: peter at wemm.org (Peter Wemm)
Date: Sat Dec 6 15:31:56 2008
Subject: RFC: making gpart default
In-Reply-To: <5783CEB0-6163-429E-8B28-2F9D6FBCF4A8@mac.com>
References: <20081203.193714.693830802.imp@bsdimp.com>
 <200812041313.34565.jhb@freebsd.org>
 <5783CEB0-6163-429E-8B28-2F9D6FBCF4A8@mac.com>
Message-ID:

On Thu, Dec 4, 2008 at 3:08 PM, Marcel Moolenaar wrote:
>
> On Dec 4, 2008, at 10:13 AM, John Baldwin wrote:
>
>> No, the way GPT works, you have a PMBR at sector 0, then immediately
>> following
>> that you have the Primary partition table in the next N sectors (the first
>> sector in the table has a header that contains the size of the table).
>> Then
>> you have a backup Secondary partition table in the last N sectors of the
>> disk
>> as well. At least with the old gpt(8) tool you could actually tell it how
>> big of a table to make when you created a GPT, and I imagine gpart
>> probably
>> can do the same.
>
> Yes. For schemes that support it, you can specify how many entries
> to allocate. The 34 corresponds to 128 entries for GPT (4 entries
> per sector)...

Yes.
   1 sector (pmbr)
   1 sector (header)
  32 sectors (128 partitions)
=  34 sectors.

Or we could have
   1 sector (pmbr)
   1 sector (header)
  62 sectors (248 partitions)
=  64 sectors.

At least it is a power of two, even if only 32K.
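
A minimal sketch of the sector arithmetic above, assuming 512-byte
sectors and 128-byte GPT entries (which matches the "4 entries per
sector" figure quoted earlier). The function names are hypothetical
and are not part of gpart(8); this only illustrates the sums and the
rounding to a stripe boundary.

```python
SECTOR_SIZE = 512      # bytes per sector (assumption)
GPT_ENTRY_SIZE = 128   # bytes per GPT partition entry (assumption)


def gpt_reserved_sectors(entries, sector_size=SECTOR_SIZE):
    """Sectors used at the start of the disk: PMBR + header + entry table."""
    table_bytes = entries * GPT_ENTRY_SIZE
    table_sectors = -(-table_bytes // sector_size)  # ceiling division
    return 1 + 1 + table_sectors


def round_up(sector, alignment):
    """Round a start sector up to the next multiple of alignment (in sectors)."""
    return -(-sector // alignment) * alignment


if __name__ == "__main__":
    first_free = gpt_reserved_sectors(128)
    stripe_sectors = 64 * 1024 // SECTOR_SIZE       # a 64K stripe, in sectors
    print(first_free, round_up(first_free, stripe_sectors))
```

With 128 entries the first free sector is 34; rounding that up to a
64K stripe gives sector 128, which is the start address discussed in
this thread.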
I'd love it if the man page told users to reserve another 32K for
"boot code", so that the start address becomes sector 128, or 64K.
This is a commonly used stripe size.

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell

From marcus at FreeBSD.org Sun Dec 7 08:26:09 2008
From: marcus at FreeBSD.org (Joe Marcus Clarke)
Date: Sun Dec 7 08:26:18 2008
Subject: RFC: New VOP to translate vnode to its component name
Message-ID: <1228667168.69753.16.camel@shumai.marcuscom.com>

Background:

Procstat (i.e. kinfo_file) was a great addition which allows userland
processes to get a list of open files for a process without the need for
elevated privileges (e.g. kmem access). This feature uses the VFS cache
to find component names from vnodes in a process' file descriptor table.
Because of its ease of use, I quickly deployed it into libgtop so that
it could provide an lsof-like feature for FreeBSD.

Another need arose that seemed perfect for procstat: the ability to find
out what process had the various mouse devices open. This was needed
for X.Org's HAL integration. Unfortunately, due to the fact that devfs
did not make use of the VFS cache, this was impossible to do without
bringing in a lot of kvm code from fstat, or simply exec'ing fstat
periodically. I chose the latter. The consequence is easier-to-read
code, but a performance hit with default HAL configurations.

Robert Watson suggested I teach the VFS cache lookup function to query
file systems directly when cache lookups fail. After a few false
starts, and with the help of kib, I think I have a committable
implementation.

Solution:

Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP
translates a vnode to its component name.
It is currently called from vn_fullpath1() to traverse a vnode hierarchy when cache lookups for those vnodes fail. I have currently implemented VOP_CNP for devfs and pseudofs. Kostik has thoroughly reviewed the devfs implementation. I only recently did the pseudofs implementation at his request. Additionally, the devfs implementation has gone through a Peter Holm stress test, and survives (the pseudofs implementation survives WITNESS and VFS lock debugging). I would like to commit this work with a possible MFC to RELENG_7 to come later. http://www.marcuscom.com/downloads/vop_cnp_10.diff http://www.marcuscom.com/downloads/VOP_CNP.9 Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/aefcb4cd/attachment.pgp From ed at 80386.nl Sun Dec 7 09:03:55 2008 From: ed at 80386.nl (Ed Schouten) Date: Sun Dec 7 09:04:02 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228667168.69753.16.camel@shumai.marcuscom.com> References: <1228667168.69753.16.camel@shumai.marcuscom.com> Message-ID: <20081207170354.GI18652@hoeg.nl> Hello Joe, * Joe Marcus Clarke wrote: > Here is a patch to HEAD, along with a man page, for VOP_CNP. Maybe this should be called VOP_COMPONENTNAME? I know, it's not as short as VOP_CNP, but is probably less cryptic to people who are trying to figure out how the VFS works. Yours, -- Ed Schouten WWW: http://80386.nl/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/0245c076/attachment.pgp From kabaev at gmail.com Sun Dec 7 09:16:35 2008 From: kabaev at gmail.com (Alexander Kabaev) Date: Sun Dec 7 09:16:41 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228667168.69753.16.camel@shumai.marcuscom.com> References: <1228667168.69753.16.camel@shumai.marcuscom.com> Message-ID: <20081207114938.44255b35@kan.dnsalias.net> On Sun, 07 Dec 2008 11:26:08 -0500 Joe Marcus Clarke wrote: > Background: > > Procstat (i.e. kinfo_file) was a great addition which allows userland > processes to get a list of open files for a process without the need > for elevated privileges (e.g. kmem access). This feature uses the > VFS cache to find component names from vnodes in a process' file > descriptor table. Because of its ease of use, I quickly deployed it > into libgtop so that it could provide an lsof-like feature for > FreeBSD. > > Another need arose that seemed perfect for procstat: the ability to > find out what process had the various mouse devices open. This was > needed for X.Org's HAL integration. Unfortunately, due to the fact > that devfs did not make use of the VFS cache, this was impossible to > do without bringing it a lot of kvm code from fstat, or simply > exec'ing fstat periodically. I chose the latter. The consequence is > easier-to-read code, but a performance hit with default HAL > configurations. > > Robert Watson suggested I teach the VFS cache lookup function to query > file systems directly when cache lookups fail. After a few false > starts, and with the help of kib, I think I have a committable > implementation. > > Solution: > > Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP > translates a vnode to its component name. 
It is currently called from > vn_fullpath1() to traverse a vnode hierarchy when cache lookups for > those vnodes fail. I have currently implemented VOP_CNP for devfs and > pseudofs. Kostik has thoroughly reviewed the devfs implementation. I > only recently did the pseudofs implementation at his request. > Additionally, the devfs implementation has gone through a Peter Holm > stress test, and survives (the pseudofs implementation survives > WITNESS and VFS lock debugging). > > I would like to commit this work with a possible MFC to RELENG_7 to > come later. > > http://www.marcuscom.com/downloads/vop_cnp_10.diff > http://www.marcuscom.com/downloads/VOP_CNP.9 > > Joe > In general, the relationship between vnode and componentnames is not 1:1, so I do not see how this VOP can possibly be made a permanent part of our VFS interface, as its definition is bogus by design. -- Alexander Kabaev -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/686efc70/signature.pgp From marcus at FreeBSD.org Sun Dec 7 09:16:44 2008 From: marcus at FreeBSD.org (Joe Marcus Clarke) Date: Sun Dec 7 09:16:51 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <20081207170354.GI18652@hoeg.nl> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> Message-ID: <1228670197.69753.24.camel@shumai.marcuscom.com> On Sun, 2008-12-07 at 18:03 +0100, Ed Schouten wrote: > Hello Joe, > > * Joe Marcus Clarke wrote: > > Here is a patch to HEAD, along with a man page, for VOP_CNP. > > Maybe this should be called VOP_COMPONENTNAME? I know, it's not as short > as VOP_CNP, but is probably less cryptic to people who are trying to > figure out how the VFS works. I'm open to a new name, but VOP_COMPONENTNAME does seem a bit unwieldy. 
What about VOP_VPTONAME (in the same vein as VOP_VPTOFH)? Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/07e7bf0c/attachment.pgp From kostikbel at gmail.com Sun Dec 7 10:12:34 2008 From: kostikbel at gmail.com (Kostik Belousov) Date: Sun Dec 7 10:12:41 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <20081207114938.44255b35@kan.dnsalias.net> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207114938.44255b35@kan.dnsalias.net> Message-ID: <20081207173755.GN2038@deviant.kiev.zoral.com.ua> On Sun, Dec 07, 2008 at 11:49:38AM -0500, Alexander Kabaev wrote: > On Sun, 07 Dec 2008 11:26:08 -0500 > Joe Marcus Clarke wrote: > > > Background: > > > > Procstat (i.e. kinfo_file) was a great addition which allows userland > > processes to get a list of open files for a process without the need > > for elevated privileges (e.g. kmem access). This feature uses the > > VFS cache to find component names from vnodes in a process' file > > descriptor table. Because of its ease of use, I quickly deployed it > > into libgtop so that it could provide an lsof-like feature for > > FreeBSD. > > > > Another need arose that seemed perfect for procstat: the ability to > > find out what process had the various mouse devices open. This was > > needed for X.Org's HAL integration. Unfortunately, due to the fact > > that devfs did not make use of the VFS cache, this was impossible to > > do without bringing it a lot of kvm code from fstat, or simply > > exec'ing fstat periodically. I chose the latter. The consequence is > > easier-to-read code, but a performance hit with default HAL > > configurations. 
> >
> > Robert Watson suggested I teach the VFS cache lookup function to query
> > file systems directly when cache lookups fail. After a few false
> > starts, and with the help of kib, I think I have a committable
> > implementation.
> >
> > Solution:
> >
> > Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP
> > translates a vnode to its component name. It is currently called from
> > vn_fullpath1() to traverse a vnode hierarchy when cache lookups for
> > those vnodes fail. I have currently implemented VOP_CNP for devfs and
> > pseudofs. Kostik has thoroughly reviewed the devfs implementation. I
> > only recently did the pseudofs implementation at his request.
> > Additionally, the devfs implementation has gone through a Peter Holm
> > stress test, and survives (the pseudofs implementation survives
> > WITNESS and VFS lock debugging).
> >
> > I would like to commit this work with a possible MFC to RELENG_7 to
> > come later.
> >
> > http://www.marcuscom.com/downloads/vop_cnp_10.diff
> > http://www.marcuscom.com/downloads/VOP_CNP.9
> >
> > Joe
> >
> In general, the relationship between vnode and componentnames is not
> 1:1, so I do not see how this VOP can possibly be made a permanent part
> of our VFS interface, as its definition is bogus by design.

In what sense is its definition bogus? The vop should try to give a
component name and a parent directory, if possible.

It is perfectly acceptable to have several names, and return whatever
is considered most suitable. An implementation may choose to always
return the third element of some internal list; imagine any weird
variant. The devfs implementation falls into this category.

Or, it is possible to always return ENOENT, as is done in the default
implementation.

I already discussed the possibility of adding a helper function that
would do the usual readdir("..") to find the vnode name for VDIR
vnodes, with Peter Wemm.
Using it as the default implementation of vop_cnp should improve the
operation of vn_fullpath in general, and esp. on NFS.

Personally, for me, it would improve the accuracy of the still-alive
patch that adds $ORIGIN support to rtld :).

Please state your objections more explicitly.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/da63b498/attachment.pgp

From marcus at FreeBSD.org Sun Dec 7 11:00:15 2008
From: marcus at FreeBSD.org (Joe Marcus Clarke)
Date: Sun Dec 7 11:00:21 2008
Subject: RFC: New VOP to translate vnode to its component name
In-Reply-To: <20081207114938.44255b35@kan.dnsalias.net>
References: <1228667168.69753.16.camel@shumai.marcuscom.com>
 <20081207114938.44255b35@kan.dnsalias.net>
Message-ID: <1228676405.69753.30.camel@shumai.marcuscom.com>

On Sun, 2008-12-07 at 11:49 -0500, Alexander Kabaev wrote:
> On Sun, 07 Dec 2008 11:26:08 -0500
> Joe Marcus Clarke wrote:
>
> > Background:
> >
> > Procstat (i.e. kinfo_file) was a great addition which allows userland
> > processes to get a list of open files for a process without the need
> > for elevated privileges (e.g. kmem access). This feature uses the
> > VFS cache to find component names from vnodes in a process' file
> > descriptor table. Because of its ease of use, I quickly deployed it
> > into libgtop so that it could provide an lsof-like feature for
> > FreeBSD.
> >
> > Another need arose that seemed perfect for procstat: the ability to
> > find out what process had the various mouse devices open. This was
> > needed for X.Org's HAL integration. Unfortunately, due to the fact
> > that devfs did not make use of the VFS cache, this was impossible to
> > do without bringing it a lot of kvm code from fstat, or simply
> > exec'ing fstat periodically. I chose the latter.
The consequence is > > easier-to-read code, but a performance hit with default HAL > > configurations. > > > > Robert Watson suggested I teach the VFS cache lookup function to query > > file systems directly when cache lookups fail. After a few false > > starts, and with the help of kib, I think I have a committable > > implementation. > > > > Solution: > > > > Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP > > translates a vnode to its component name. It is currently called from > > vn_fullpath1() to traverse a vnode hierarchy when cache lookups for > > those vnodes fail. I have currently implemented VOP_CNP for devfs and > > pseudofs. Kostik has thoroughly reviewed the devfs implementation. I > > only recently did the pseudofs implementation at his request. > > Additionally, the devfs implementation has gone through a Peter Holm > > stress test, and survives (the pseudofs implementation survives > > WITNESS and VFS lock debugging). > > > > I would like to commit this work with a possible MFC to RELENG_7 to > > come later. > > > > http://www.marcuscom.com/downloads/vop_cnp_10.diff > > http://www.marcuscom.com/downloads/VOP_CNP.9 > > > > Joe > > > In general, the relationship between vnode and componentnames is not > 1:1, so I do not see how this VOP can possibly be made a permanent part > of our VFS interface, as its definition is bogus by design. VOP_CNP is not a replacement for VFS cache lookups. It tries to supplement the lookup when the lookup fails to find a hit. VOP_CNP itself may not succeed, or may not be appropriate for every file system (its default implementation simply returns ENOENT). However, it does fit well with devfs and pseudofs, and provides accurate enough results in the general cases. Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/7e41dfe9/attachment.pgp From marcus at FreeBSD.org Sun Dec 7 11:00:35 2008 From: marcus at FreeBSD.org (Joe Marcus Clarke) Date: Sun Dec 7 11:00:41 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <20081207173755.GN2038@deviant.kiev.zoral.com.ua> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207114938.44255b35@kan.dnsalias.net> <20081207173755.GN2038@deviant.kiev.zoral.com.ua> Message-ID: <1228676434.69753.33.camel@shumai.marcuscom.com> On Sun, 2008-12-07 at 19:37 +0200, Kostik Belousov wrote: > On Sun, Dec 07, 2008 at 11:49:38AM -0500, Alexander Kabaev wrote: > > On Sun, 07 Dec 2008 11:26:08 -0500 > > Joe Marcus Clarke wrote: > > > > > Background: > > > > > > Procstat (i.e. kinfo_file) was a great addition which allows userland > > > processes to get a list of open files for a process without the need > > > for elevated privileges (e.g. kmem access). This feature uses the > > > VFS cache to find component names from vnodes in a process' file > > > descriptor table. Because of its ease of use, I quickly deployed it > > > into libgtop so that it could provide an lsof-like feature for > > > FreeBSD. > > > > > > Another need arose that seemed perfect for procstat: the ability to > > > find out what process had the various mouse devices open. This was > > > needed for X.Org's HAL integration. Unfortunately, due to the fact > > > that devfs did not make use of the VFS cache, this was impossible to > > > do without bringing it a lot of kvm code from fstat, or simply > > > exec'ing fstat periodically. I chose the latter. The consequence is > > > easier-to-read code, but a performance hit with default HAL > > > configurations. 
> > > > > > Robert Watson suggested I teach the VFS cache lookup function to query > > > file systems directly when cache lookups fail. After a few false > > > starts, and with the help of kib, I think I have a committable > > > implementation. > > > > > > Solution: > > > > > > Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP > > > translates a vnode to its component name. It is currently called from > > > vn_fullpath1() to traverse a vnode hierarchy when cache lookups for > > > those vnodes fail. I have currently implemented VOP_CNP for devfs and > > > pseudofs. Kostik has thoroughly reviewed the devfs implementation. I > > > only recently did the pseudofs implementation at his request. > > > Additionally, the devfs implementation has gone through a Peter Holm > > > stress test, and survives (the pseudofs implementation survives > > > WITNESS and VFS lock debugging). > > > > > > I would like to commit this work with a possible MFC to RELENG_7 to > > > come later. > > > > > > http://www.marcuscom.com/downloads/vop_cnp_10.diff > > > http://www.marcuscom.com/downloads/VOP_CNP.9 > > > > > > Joe > > > > > In general, the relationship between vnode and componentnames is not > > 1:1, so I do not see how this VOP can possibly be made a permanent part > > of our VFS interface, as its definition is bogus by design. > > In what sence its definition is bogus ? The vop should try to give a > component name and a parent directory, if possible. > > It is perfectly acceptable to have several names, and return whatever > is considered most suitable. Implementation may choose to always return > a third element in some internal list, imagine any weird variant. Devfs > implementation falls into this category. As does the pseudofs implementation which handles the case of a directory on a procfs file system specially. Even though the directory has a pn_name of "pid," the VOP_CNP returns the PID as the name instead. 
> Or, it is possible to always return ENOENT, as is done in default > implementation. > > I already discussed a possibility to add helper function that would > do the usual readdir("..") to find vnode name for VDIR vnodes, with > Peter Wemm. Using it as default implementation of vop_cnp should improve > operation of vn_fullpath in general, and esp. on NFS. Yes, Peter briefly mentioned this to me as well. Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/a3a61804/attachment.pgp From alfred at freebsd.org Sun Dec 7 11:39:56 2008 From: alfred at freebsd.org (Alfred Perlstein) Date: Sun Dec 7 11:40:01 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228670197.69753.24.camel@shumai.marcuscom.com> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> <1228670197.69753.24.camel@shumai.marcuscom.com> Message-ID: <20081207193953.GJ27096@elvis.mu.org> * Joe Marcus Clarke [081207 09:17] wrote: > On Sun, 2008-12-07 at 18:03 +0100, Ed Schouten wrote: > > Hello Joe, > > > > * Joe Marcus Clarke wrote: > > > Here is a patch to HEAD, along with a man page, for VOP_CNP. > > > > Maybe this should be called VOP_COMPONENTNAME? I know, it's not as short > > as VOP_CNP, but is probably less cryptic to people who are trying to > > figure out how the VFS works. > > I'm open to a new name, but VOP_COMPONENTNAME does seem a bit unwieldy. > What about VOP_VPTONAME (in the same vein as VOP_VPTOFH)? 
either VOP_VPTONAME or VOP_VPTOCNP (this is cool) -Alfred From peter at wemm.org Sun Dec 7 11:43:02 2008 From: peter at wemm.org (Peter Wemm) Date: Sun Dec 7 11:43:09 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228670197.69753.24.camel@shumai.marcuscom.com> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> <1228670197.69753.24.camel@shumai.marcuscom.com> Message-ID: On Sun, Dec 7, 2008 at 9:16 AM, Joe Marcus Clarke wrote: > On Sun, 2008-12-07 at 18:03 +0100, Ed Schouten wrote: >> Hello Joe, >> >> * Joe Marcus Clarke wrote: >> > Here is a patch to HEAD, along with a man page, for VOP_CNP. >> >> Maybe this should be called VOP_COMPONENTNAME? I know, it's not as short >> as VOP_CNP, but is probably less cryptic to people who are trying to >> figure out how the VFS works. > > I'm open to a new name, but VOP_COMPONENTNAME does seem a bit unwieldy. > What about VOP_VPTONAME (in the same vein as VOP_VPTOFH)? > > Joe Well, you already know I love the idea. Valgrind "knows" that you can obtain the pathname from a fd or mmap address reliably at any time because of procfs on linux, so its code is structured with that assumption. Using procfs (on 4.x and 6.x) or the kinfo stuff on 7.x+ sort of works, but it quickly becomes unusable if any paths involve NFS. nfs times out its cached items very quickly. Anyway, I see this as a good first step. I very much want to see a real vop_default implementation that does the readdir("..") method to fill in the gaps. It isn't particularly important to me if vn_fullpath() recovers the original pathname or not, so long as it can find *a* valid pathname that will work. As for names.. VOP_CNP doesn't tell me what it does at a glance. Ideas: VOP_VPTOCNP (vnode to component name, or VOP_VNTOCNP) VOP_RLOOKUP (reverse lookup) We have precedent for the first form. VOP_FHTOVP(). 
I don't think VOP_VPTOCNP() is too unwieldy and I think it would be a little more intuitive to a casual observer. I don't want to get trapped in a bikeshed sized To:/CC: list over it though. I'd rather see it committed to head and get some progress. BTW: at work we do extensive open-by-filehandle stuff on NFS. For the vast majority of vnodes on those machines, there never was a name cache entry. It would be priceless if the vop_default readdir(..) method could discover those names in procstat etc. -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." -- Robert Sewell From kabaev at gmail.com Sun Dec 7 11:48:32 2008 From: kabaev at gmail.com (Alexander Kabaev) Date: Sun Dec 7 11:48:38 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <20081207173755.GN2038@deviant.kiev.zoral.com.ua> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207114938.44255b35@kan.dnsalias.net> <20081207173755.GN2038@deviant.kiev.zoral.com.ua> Message-ID: <20081207144822.7ff16504@kan.dnsalias.net> On Sun, 7 Dec 2008 19:37:55 +0200 Kostik Belousov wrote: > On Sun, Dec 07, 2008 at 11:49:38AM -0500, Alexander Kabaev wrote: > > On Sun, 07 Dec 2008 11:26:08 -0500 > > Joe Marcus Clarke wrote: > > > > > Background: > > > > > > Procstat (i.e. kinfo_file) was a great addition which allows > > > userland processes to get a list of open files for a process > > > without the need for elevated privileges (e.g. kmem access). > > > This feature uses the VFS cache to find component names from > > > vnodes in a process' file descriptor table. Because of its ease > > > of use, I quickly deployed it into libgtop so that it could > > > provide an lsof-like feature for FreeBSD. 
> > > > > > Another need arose that seemed perfect for procstat: the ability > > > to find out what process had the various mouse devices open. > > > This was needed for X.Org's HAL integration. Unfortunately, due > > > to the fact that devfs did not make use of the VFS cache, this > > > was impossible to do without bringing it a lot of kvm code from > > > fstat, or simply exec'ing fstat periodically. I chose the > > > latter. The consequence is easier-to-read code, but a > > > performance hit with default HAL configurations. > > > > > > Robert Watson suggested I teach the VFS cache lookup function to > > > query file systems directly when cache lookups fail. After a few > > > false starts, and with the help of kib, I think I have a > > > committable implementation. > > > > > > Solution: > > > > > > Here is a patch to HEAD, along with a man page, for VOP_CNP. > > > VOP_CNP translates a vnode to its component name. It is > > > currently called from vn_fullpath1() to traverse a vnode > > > hierarchy when cache lookups for those vnodes fail. I have > > > currently implemented VOP_CNP for devfs and pseudofs. Kostik has > > > thoroughly reviewed the devfs implementation. I only recently > > > did the pseudofs implementation at his request. Additionally, the > > > devfs implementation has gone through a Peter Holm stress test, > > > and survives (the pseudofs implementation survives WITNESS and > > > VFS lock debugging). > > > > > > I would like to commit this work with a possible MFC to RELENG_7 > > > to come later. > > > > > > http://www.marcuscom.com/downloads/vop_cnp_10.diff > > > http://www.marcuscom.com/downloads/VOP_CNP.9 > > > > > > Joe > > > > > In general, the relationship between vnode and componentnames is not > > 1:1, so I do not see how this VOP can possibly be made a permanent > > part of our VFS interface, as its definition is bogus by design. > > In what sence its definition is bogus ? 
> The vop should try to give a
> component name and a parent directory, if possible.
>
Which one from possible multiple names should that be and what makes one
name more equal than others?

> It is perfectly acceptable to have several names, and return whatever
> is considered most suitable.

Decides who? This is _generic_ VFS interface we are speaking about,
not procfs or devfs kludge. VOP_CNP is precisely that - a kludge.

> Implementation may choose to always
> return a third element in some internal list, imagine any weird
> variant. Devfs implementation falls into this category.
> Or, it is possible to always return ENOENT, as is done in default
> implementation.
>
> I already discussed a possibility to add helper function that would
> do the usual readdir("..") to find vnode name for VDIR vnodes, with
> Peter Wemm. Using it as default implementation of vop_cnp should
> improve operation of vn_fullpath in general, and esp. on NFS.

Then it does belong in vn_fullpath and not as a VNODE operation.

> Personally for me, it would improve the accuracy of still alive patch
> that adds $ORIGIN support to rtld :).
>
> Please state your objections more explicitly.

I believe I did.

--
Alexander Kabaev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081207/77143351/signature.pgp From peter at wemm.org Sun Dec 7 12:07:29 2008 From: peter at wemm.org (Peter Wemm) Date: Sun Dec 7 12:07:35 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <20081207144822.7ff16504@kan.dnsalias.net> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207114938.44255b35@kan.dnsalias.net> <20081207173755.GN2038@deviant.kiev.zoral.com.ua> <20081207144822.7ff16504@kan.dnsalias.net> Message-ID: On Sun, Dec 7, 2008 at 11:48 AM, Alexander Kabaev wrote: > On Sun, 7 Dec 2008 19:37:55 +0200 > Kostik Belousov wrote: > >> On Sun, Dec 07, 2008 at 11:49:38AM -0500, Alexander Kabaev wrote: >> > On Sun, 07 Dec 2008 11:26:08 -0500 >> > Joe Marcus Clarke wrote: >> > >> > > Background: >> > > >> > > Procstat (i.e. kinfo_file) was a great addition which allows >> > > userland processes to get a list of open files for a process >> > > without the need for elevated privileges (e.g. kmem access). >> > > This feature uses the VFS cache to find component names from >> > > vnodes in a process' file descriptor table. Because of its ease >> > > of use, I quickly deployed it into libgtop so that it could >> > > provide an lsof-like feature for FreeBSD. >> > > >> > > Another need arose that seemed perfect for procstat: the ability >> > > to find out what process had the various mouse devices open. >> > > This was needed for X.Org's HAL integration. Unfortunately, due >> > > to the fact that devfs did not make use of the VFS cache, this >> > > was impossible to do without bringing it a lot of kvm code from >> > > fstat, or simply exec'ing fstat periodically. I chose the >> > > latter. The consequence is easier-to-read code, but a >> > > performance hit with default HAL configurations. 
>> > > >> > > Robert Watson suggested I teach the VFS cache lookup function to >> > > query file systems directly when cache lookups fail. After a few >> > > false starts, and with the help of kib, I think I have a >> > > committable implementation. >> > > >> > > Solution: >> > > >> > > Here is a patch to HEAD, along with a man page, for VOP_CNP. >> > > VOP_CNP translates a vnode to its component name. It is >> > > currently called from vn_fullpath1() to traverse a vnode >> > > hierarchy when cache lookups for those vnodes fail. I have >> > > currently implemented VOP_CNP for devfs and pseudofs. Kostik has >> > > thoroughly reviewed the devfs implementation. I only recently >> > > did the pseudofs implementation at his request. Additionally, the >> > > devfs implementation has gone through a Peter Holm stress test, >> > > and survives (the pseudofs implementation survives WITNESS and >> > > VFS lock debugging). >> > > >> > > I would like to commit this work with a possible MFC to RELENG_7 >> > > to come later. >> > > >> > > http://www.marcuscom.com/downloads/vop_cnp_10.diff >> > > http://www.marcuscom.com/downloads/VOP_CNP.9 >> > > >> > > Joe >> > > >> > In general, the relationship between vnode and componentnames is not >> > 1:1, so I do not see how this VOP can possibly be made a permanent >> > part of our VFS interface, as its definition is bogus by design. >> >> In what sence its definition is bogus ? The vop should try to give a >> component name and a parent directory, if possible. >> > > Which one from possible multiple names should that be and what makes one > name more equal than others? > >> It is perfectly acceptable to have several names, and return whatever >> is considered most suitable. > > > Decides who? This is _generic_ VFS interface we are speaking about, > not procfs or devfs kludge. VOP_CNP is precisely that - a kludge. vn_fullpath() is already this way. 
It is NOT guaranteed to give you the exact path that was used, but rather *a* working path. It is already using the *first* match it finds in the cache. I see nothing wrong with a generic VOP that asks "tell me A name and parent directory". This is strictly "best effort" only. If you want to determine the actual path, then you're going to need to modify the filedesc and vm_map_* structures to cache the actual pathname used. Of course, that is useless when you start renaming parent directory components, or files get moved, or whatever. Do you have a use in mind that would justify the complexity of changing the VOP_CNP() from returning a single path/parent to instead return a list of path/parent pairs? I don't see this vop needing to spread further than devfs, pseudofs and a 'readdir("..")' default method. -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." -- Robert Sewell From des at des.no Mon Dec 8 02:20:25 2008 From: des at des.no (=?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?=) Date: Mon Dec 8 02:20:34 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228667168.69753.16.camel@shumai.marcuscom.com> (Joe Marcus Clarke's message of "Sun, 07 Dec 2008 11:26:08 -0500") References: <1228667168.69753.16.camel@shumai.marcuscom.com> Message-ID: <86tz9fynmf.fsf@ds4.des.no> Joe Marcus Clarke writes: > Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP > translates a vnode to its component name. It is currently called from > vn_fullpath1() to traverse a vnode hierarchy when cache lookups for > those vnodes fail. I have currently implemented VOP_CNP for devfs and > pseudofs. Kostik has thoroughly reviewed the devfs implementation. I > only recently did the pseudofs implementation at his request. 
I would prefer pidbuf[PFS_NAMLEN] to pidbuf[11], and you can avoid two
strlen()s by storing the return value from snprintf(). Also, defining
pidbuf at the start of the block instead of the start of the function is
a style(9) violation. Other than that, the pseudofs part of the patch
has my approval.

BTW, snprintf(buf, buflen, "%d", i) is so common in the kernel that we
should consider adding some sort of itoa(9) to avoid the overhead of
snprintf(9).

DES
--
Dag-Erling Smørgrav - des@des.no

From bugmaster at FreeBSD.org  Mon Dec  8 03:06:53 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Dec  8 03:07:24 2008
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200812081106.mB8B6q6q014192@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From marcus at FreeBSD.org  Mon Dec  8 09:28:04 2008
From: marcus at FreeBSD.org (Joe Marcus Clarke)
Date: Mon Dec  8 09:28:10 2008
Subject: RFC: New VOP to translate vnode to its component name
In-Reply-To: <86tz9fynmf.fsf@ds4.des.no>
References: <1228667168.69753.16.camel@shumai.marcuscom.com>
	<86tz9fynmf.fsf@ds4.des.no>
Message-ID: <1228757283.69132.14.camel@shumai.marcuscom.com>

On Mon, 2008-12-08 at 11:20 +0100, Dag-Erling Smørgrav wrote:
> Joe Marcus Clarke writes:
> > Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP
> > translates a vnode to its component name. It is currently called from
> > vn_fullpath1() to traverse a vnode hierarchy when cache lookups for
> > those vnodes fail.
I have currently implemented VOP_CNP for devfs and > > pseudofs. Kostik has thoroughly reviewed the devfs implementation. I > > only recently did the pseudofs implementation at his request. > > I would prefer pidbuf[PFS_NAMLEN] to pidbuf[11], and you can avoid two > strlen()s by storing the return value from snprintf(). Also, defining > pidbuf at the start of the block instead of the start of the function is > a style(9) violation. Other than that, the pseudofs part of the patch > has my approval. Thanks for the feedback. This was a section of the pfs code I especially wanted some comments on. I'll take care of your suggestions. Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081208/58036e8e/attachment.pgp From ed at 80386.nl Mon Dec 8 09:53:28 2008 From: ed at 80386.nl (Ed Schouten) Date: Mon Dec 8 09:53:34 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228670197.69753.24.camel@shumai.marcuscom.com> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> <1228670197.69753.24.camel@shumai.marcuscom.com> Message-ID: <20081208175327.GA7013@hoeg.nl> * Joe Marcus Clarke wrote: > On Sun, 2008-12-07 at 18:03 +0100, Ed Schouten wrote: > > Maybe this should be called VOP_COMPONENTNAME? I know, it's not as short > > as VOP_CNP, but is probably less cryptic to people who are trying to > > figure out how the VFS works. > > I'm open to a new name, but VOP_COMPONENTNAME does seem a bit unwieldy. > What about VOP_VPTONAME (in the same vein as VOP_VPTOFH)? Sounds good. 
:-) -- Ed Schouten WWW: http://80386.nl/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081208/8169a502/attachment.pgp From rwatson at FreeBSD.org Mon Dec 8 10:08:14 2008 From: rwatson at FreeBSD.org (Robert Watson) Date: Mon Dec 8 10:08:26 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> <1228670197.69753.24.camel@shumai.marcuscom.com> Message-ID: On Sun, 7 Dec 2008, Peter Wemm wrote: > Well, you already know I love the idea. Valgrind "knows" that you can > obtain the pathname from a fd or mmap address reliably at any time because > of procfs on linux, so its code is structured with that assumption. Just to give a general vote of "we need to do something here, whether the details are exactly these or not" -- having better object->path resolution is quite important for audit, as well as if we want to adopt a file system notification services along the lines of Apple's fsevents (which is path-centric and operates from close() events rather than open() events). I don't think we should run in the Linux 'dentry' direction, but having a more robust translation service would be quite valuable. One comment: I think we should continue to have a KPI which does a sleep-free translation to call, but with weaker semantics than one that is sleepable and can query for more robust reverse lookup. Robert N M Watson Computer Laboratory University of Cambridge > > Using procfs (on 4.x and 6.x) or the kinfo stuff on 7.x+ sort of > works, but it quickly becomes unusable if any paths involve NFS. nfs > times out its cached items very quickly. > > Anyway, I see this as a good first step. 
I very much want to see a > real vop_default implementation that does the readdir("..") method to > fill in the gaps. It isn't particularly important to me if > vn_fullpath() recovers the original pathname or not, so long as it can > find *a* valid pathname that will work. > > As for names.. VOP_CNP doesn't tell me what it does at a glance. Ideas: > VOP_VPTOCNP (vnode to component name, or VOP_VNTOCNP) > VOP_RLOOKUP (reverse lookup) > > We have precedent for the first form. VOP_FHTOVP(). > > I don't think VOP_VPTOCNP() is too unwieldy and I think it would be a > little more intuitive to a casual observer. I don't want to get > trapped in a bikeshed sized To:/CC: list over it though. I'd rather > see it committed to head and get some progress. > > BTW: at work we do extensive open-by-filehandle stuff on NFS. For the > vast majority of vnodes on those machines, there never was a name > cache entry. It would be priceless if the vop_default readdir(..) > method could discover those names in procstat etc. > > -- > Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV > "All of this is for nothing if we don't go to the stars" - JMS/B5 > "If Java had true garbage collection, most programs would delete > themselves upon execution." 
-- Robert Sewell > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From marcus at FreeBSD.org Mon Dec 8 10:08:16 2008 From: marcus at FreeBSD.org (Joe Marcus Clarke) Date: Mon Dec 8 10:08:26 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <86tz9fynmf.fsf@ds4.des.no> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <86tz9fynmf.fsf@ds4.des.no> Message-ID: <1228759690.69132.28.camel@shumai.marcuscom.com> On Mon, 2008-12-08 at 11:20 +0100, Dag-Erling Sm?rgrav wrote: > Joe Marcus Clarke writes: > > Here is a patch to HEAD, along with a man page, for VOP_CNP. VOP_CNP > > translates a vnode to its component name. It is currently called from > > vn_fullpath1() to traverse a vnode hierarchy when cache lookups for > > those vnodes fail. I have currently implemented VOP_CNP for devfs and > > pseudofs. Kostik has thoroughly reviewed the devfs implementation. I > > only recently did the pseudofs implementation at his request. > > I would prefer pidbuf[PFS_NAMLEN] to pidbuf[11], and you can avoid two > strlen()s by storing the return value from snprintf(). Also, defining > pidbuf at the start of the block instead of the start of the function is > a style(9) violation. Other than that, the pseudofs part of the patch > has my approval. http://www.marcuscom.com/downloads/vop_vptocnp_5.diff Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081208/942796a2/attachment.pgp From marcus at FreeBSD.org Mon Dec 8 17:06:51 2008 From: marcus at FreeBSD.org (Joe Marcus Clarke) Date: Mon Dec 8 17:06:57 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> <1228670197.69753.24.camel@shumai.marcuscom.com> Message-ID: <1228784805.69132.66.camel@shumai.marcuscom.com> On Mon, 2008-12-08 at 18:08 +0000, Robert Watson wrote: > On Sun, 7 Dec 2008, Peter Wemm wrote: > > > Well, you already know I love the idea. Valgrind "knows" that you can > > obtain the pathname from a fd or mmap address reliably at any time because > > of procfs on linux, so its code is structured with that assumption. > > Just to give a general vote of "we need to do something here, whether the > details are exactly these or not" -- having better object->path resolution is > quite important for audit, as well as if we want to adopt a file system > notification services along the lines of Apple's fsevents (which is > path-centric and operates from close() events rather than open() events). I > don't think we should run in the Linux 'dentry' direction, but having a more > robust translation service would be quite valuable. One comment: I think we > should continue to have a KPI which does a sleep-free translation to call, but > with weaker semantics than one that is sleepable and can query for more robust > reverse lookup. Okay, what about a name? vn_fullpath_cache vn_fullpath_quick vn_fullpath_fast vn_fullpath_nosleep ... Joe > > Robert N M Watson > Computer Laboratory > University of Cambridge > > > > > Using procfs (on 4.x and 6.x) or the kinfo stuff on 7.x+ sort of > > works, but it quickly becomes unusable if any paths involve NFS. 
nfs > > times out its cached items very quickly. > > > > Anyway, I see this as a good first step. I very much want to see a > > real vop_default implementation that does the readdir("..") method to > > fill in the gaps. It isn't particularly important to me if > > vn_fullpath() recovers the original pathname or not, so long as it can > > find *a* valid pathname that will work. > > > > As for names.. VOP_CNP doesn't tell me what it does at a glance. Ideas: > > VOP_VPTOCNP (vnode to component name, or VOP_VNTOCNP) > > VOP_RLOOKUP (reverse lookup) > > > > We have precedent for the first form. VOP_FHTOVP(). > > > > I don't think VOP_VPTOCNP() is too unwieldy and I think it would be a > > little more intuitive to a casual observer. I don't want to get > > trapped in a bikeshed sized To:/CC: list over it though. I'd rather > > see it committed to head and get some progress. > > > > BTW: at work we do extensive open-by-filehandle stuff on NFS. For the > > vast majority of vnodes on those machines, there never was a name > > cache entry. It would be priceless if the vop_default readdir(..) > > method could discover those names in procstat etc. > > > > -- > > Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV > > "All of this is for nothing if we don't go to the stars" - JMS/B5 > > "If Java had true garbage collection, most programs would delete > > themselves upon execution." -- Robert Sewell > > _______________________________________________ > > freebsd-arch@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > > > -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081209/225ad509/attachment.pgp From rwatson at FreeBSD.org Tue Dec 9 03:22:30 2008 From: rwatson at FreeBSD.org (Robert Watson) Date: Tue Dec 9 03:22:37 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228784805.69132.66.camel@shumai.marcuscom.com> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> <1228670197.69753.24.camel@shumai.marcuscom.com> <1228784805.69132.66.camel@shumai.marcuscom.com> Message-ID: On Mon, 8 Dec 2008, Joe Marcus Clarke wrote: >> Just to give a general vote of "we need to do something here, whether the >> details are exactly these or not" -- having better object->path resolution >> is quite important for audit, as well as if we want to adopt a file system >> notification services along the lines of Apple's fsevents (which is >> path-centric and operates from close() events rather than open() events). >> I don't think we should run in the Linux 'dentry' direction, but having a >> more robust translation service would be quite valuable. One comment: I >> think we should continue to have a KPI which does a sleep-free translation >> to call, but with weaker semantics than one that is sleepable and can query >> for more robust reverse lookup. > > Okay, what about a name? Oh, I do love a good bikeshed. I'm actually fine with any of these, but vn_fullpath_cache() sounds good to me. One of the cases I have in mind is path-based MAC policies, which will convert from a vnode to a path while holding the vnode lock -- if something is going to run around locking vnodes and doing VOP_READDIR's, that is not the time to do it. Robert N M Watson Computer Laboratory University of Cambridge > > vn_fullpath_cache > vn_fullpath_quick > vn_fullpath_fast > vn_fullpath_nosleep > ... 
> > Joe > >> >> Robert N M Watson >> Computer Laboratory >> University of Cambridge >> >>> >>> Using procfs (on 4.x and 6.x) or the kinfo stuff on 7.x+ sort of >>> works, but it quickly becomes unusable if any paths involve NFS. nfs >>> times out its cached items very quickly. >>> >>> Anyway, I see this as a good first step. I very much want to see a >>> real vop_default implementation that does the readdir("..") method to >>> fill in the gaps. It isn't particularly important to me if >>> vn_fullpath() recovers the original pathname or not, so long as it can >>> find *a* valid pathname that will work. >>> >>> As for names.. VOP_CNP doesn't tell me what it does at a glance. Ideas: >>> VOP_VPTOCNP (vnode to component name, or VOP_VNTOCNP) >>> VOP_RLOOKUP (reverse lookup) >>> >>> We have precedent for the first form. VOP_FHTOVP(). >>> >>> I don't think VOP_VPTOCNP() is too unwieldy and I think it would be a >>> little more intuitive to a casual observer. I don't want to get >>> trapped in a bikeshed sized To:/CC: list over it though. I'd rather >>> see it committed to head and get some progress. >>> >>> BTW: at work we do extensive open-by-filehandle stuff on NFS. For the >>> vast majority of vnodes on those machines, there never was a name >>> cache entry. It would be priceless if the vop_default readdir(..) >>> method could discover those names in procstat etc. >>> >>> -- >>> Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV >>> "All of this is for nothing if we don't go to the stars" - JMS/B5 >>> "If Java had true garbage collection, most programs would delete >>> themselves upon execution." 
-- Robert Sewell
>>>
>>
> --
> Joe Marcus Clarke
> FreeBSD GNOME Team :: gnome@FreeBSD.org
> FreeNode / #freebsd-gnome
> http://www.FreeBSD.org/gnome
>

From des at des.no  Tue Dec  9 09:31:27 2008
From: des at des.no (Dag-Erling Smørgrav)
Date: Tue Dec  9 09:31:34 2008
Subject: RFC: New VOP to translate vnode to its component name
In-Reply-To: <1228759690.69132.28.camel@shumai.marcuscom.com> (Joe Marcus
	Clarke's message of "Mon, 08 Dec 2008 13:08:10 -0500")
References: <1228667168.69753.16.camel@shumai.marcuscom.com>
	<86tz9fynmf.fsf@ds4.des.no>
	<1228759690.69132.28.camel@shumai.marcuscom.com>
Message-ID: <861vwhz24y.fsf@ds4.des.no>

Joe Marcus Clarke writes:
> http://www.marcuscom.com/downloads/vop_vptocnp_5.diff

Looks good as far as pseudofs is concerned. Thank you.

DES
--
Dag-Erling Smørgrav - des@des.no

From jroberson at jroberson.net  Tue Dec  9 18:24:34 2008
From: jroberson at jroberson.net (Jeff Roberson)
Date: Tue Dec  9 18:24:43 2008
Subject: UMA & mbuf cache utilization.
Message-ID: <20081209155714.K960@desktop>

Hello,

Nokia has graciously allowed me to release a patch which I developed to
improve general mbuf and cluster cache behavior. This is based on others'
observations that due to simple alignment at 2k and 256 bytes we achieve a
poor cache distribution for the header area of packets and the most heavily
used mbuf header fields. In addition, modern machines stripe memory access
across several memories and even memory controllers. Accessing heavily
aligned locations such as these can also create load imbalances among
memories.

To solve this problem I have added two new features to UMA. The first is
the zone flag UMA_ZONE_CACHESPREAD.
This flag modifies the meaning of the alignment field such that start
addresses are staggered by at least align + 1 bytes. In the case of
clusters and mbufs this means adding uma_cache_align + 1 bytes to the amount
of storage allocated. This creates a certain constant amount of waste, 3%
and 12% respectively. It also means we must use contiguous physical and
virtual memory consisting of several pages to efficiently use the memory and
land on as many cache lines as possible.

Because contiguous physical memory is not always available, the allocator
had to have a fallback mechanism. We don't simply want to have all mbuf
allocations check two zones, because once we deplete available contiguous
memory the check on the first zone will always fail using the most expensive
code path. To resolve this issue, I added the ability for secondary zones
to stack on top of multiple primary zones. Secondary zones are zones which
get their storage from another zone but handle their own caching, ctors,
dtors, etc. By adding this feature a secondary zone can be created that can
allocate either from the contiguous memory pool or the non-contiguous
single-page pool depending on availability. It is also much faster to fail
over between them deep in the allocator because it is only required when we
exhaust the already available mbuf memory.

For mbufs and clusters there are now three zones each: a contigmalloc-backed
zone, a single-page allocator zone, and a secondary zone with the original
zone_mbuf or zone_clust name. The packet zone also takes from both
available mbuf zones. The individual backend zones are not exposed outside
of kern_mbuf.c. Currently, each backend zone can have its own limit. The
secondary zone only blocks when both are full. Statistics-wise, the limit
should be reported as the sum of the backend limits; however, that isn't
presently done. The secondary zone cannot have its own limit independent of
the backends at this time. I'm not sure if that's valuable or not.
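
The staggering described above can be illustrated with a small user-space
sketch. It is not taken from the patch: the 64-byte cache line, the 64-slot
(4096-byte) window used for counting, and the function name are all
assumptions chosen for illustration. It counts how many distinct cache-line
slots the start addresses of consecutive items land on for a given stride.

```c
#include <stddef.h>

#define SKETCH_LINE_SIZE	64u	/* assumed cache line size in bytes */
#define SKETCH_NSLOTS		64u	/* assumed window: 64 lines = 4096 bytes */

/*
 * Hypothetical helper, not a UMA function: count the distinct cache-line
 * slots (within the assumed window) on which the first byte of each of
 * `nitems` back-to-back items falls when items are `stride` bytes apart.
 */
static unsigned
sketch_distinct_start_slots(size_t stride, unsigned nitems)
{
	unsigned char seen[SKETCH_NSLOTS] = { 0 };
	unsigned distinct = 0;
	unsigned i;

	for (i = 0; i < nitems; i++) {
		/* Line index of this item's start, folded into the window. */
		size_t slot = ((size_t)i * stride / SKETCH_LINE_SIZE) %
		    SKETCH_NSLOTS;

		if (!seen[slot]) {
			seen[slot] = 1;
			distinct++;
		}
	}
	return (distinct);
}
```

Under these assumptions, 64 consecutive 2k items with a plain 2048-byte
stride start on only 2 of the 64 slots, while padding each item by one
assumed cache line (a 2112-byte stride) reaches all 64. Real cache geometry
and the allocator's actual layout will differ.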
I have test results from Nokia which show a dramatic improvement in several
workloads but which I am probably not at liberty to discuss. I'm in the
process of convincing Kip to help me get some benchmark data on our stack.

Also, as part of the patch I renamed a few functions, since many were
non-obvious, and grew new keg abstractions to tidy things up a bit. I
suspect those of you with UMA experience (Robert, Bosko) will find the
renaming a welcome improvement.

The patch is available at:
http://people.freebsd.org/~jeff/mbuf_contig.diff

I would love to hear any feedback you may have. I have been developing this
and testing various versions off and on for months; however, this is a fresh
port to current and it is a little green, so it should be considered
experimental. In particular, I'm most nervous about how the VM will respond
to new pressure on contig physical pages. I'm also interested in hearing
from embedded/limited memory people about how we might want to limit or tune
this.

Thanks,
Jeff

From marcus at FreeBSD.org  Tue Dec  9 23:32:25 2008
From: marcus at FreeBSD.org (Joe Marcus Clarke)
Date: Tue Dec  9 23:32:31 2008
Subject: RFC: New VOP to translate vnode to its component name
In-Reply-To:
References: <1228667168.69753.16.camel@shumai.marcuscom.com>
	<20081207170354.GI18652@hoeg.nl>
	<1228670197.69753.24.camel@shumai.marcuscom.com>
	<1228784805.69132.66.camel@shumai.marcuscom.com>
Message-ID: <1228894344.35477.40.camel@shumai.marcuscom.com>

On Tue, 2008-12-09 at 11:22 +0000, Robert Watson wrote:
> On Mon, 8 Dec 2008, Joe Marcus Clarke wrote:
>
> >> Just to give a general vote of "we need to do something here, whether the
> >> details are exactly these or not" -- having better object->path resolution
> >> is quite important for audit, as well as if we want to adopt a file system
> >> notification services along the lines of Apple's fsevents (which is
> >> path-centric and operates from close() events rather than open() events).
> >> I don't think we should run in the Linux 'dentry' direction, but having a > >> more robust translation service would be quite valuable. One comment: I > >> think we should continue to have a KPI which does a sleep-free translation > >> to call, but with weaker semantics than one that is sleepable and can query > >> for more robust reverse lookup. > > > > Okay, what about a name? > > Oh, I do love a good bikeshed. I'm actually fine with any of these, but > vn_fullpath_cache() sounds good to me. One of the cases I have in mind is > path-based MAC policies, which will convert from a vnode to a path while > holding the vnode lock -- if something is going to run around locking vnodes > and doing VOP_READDIR's, that is not the time to do it. I've duplicated vn_fullpath, vn_fullpath_global, and textvp_fullpath in this latest version. The new (well, old) functions are named *_cache. I even updated the vn_fullpath.9 man page. http://www.marcuscom.com/downloads/vop_vptocnp_7.diff Joe > > Robert N M Watson > Computer Laboratory > University of Cambridge > > > > > vn_fullpath_cache > > vn_fullpath_quick > > vn_fullpath_fast > > vn_fullpath_nosleep > > ... > > > > Joe > > > >> > >> Robert N M Watson > >> Computer Laboratory > >> University of Cambridge > >> > >>> > >>> Using procfs (on 4.x and 6.x) or the kinfo stuff on 7.x+ sort of > >>> works, but it quickly becomes unusable if any paths involve NFS. nfs > >>> times out its cached items very quickly. > >>> > >>> Anyway, I see this as a good first step. I very much want to see a > >>> real vop_default implementation that does the readdir("..") method to > >>> fill in the gaps. It isn't particularly important to me if > >>> vn_fullpath() recovers the original pathname or not, so long as it can > >>> find *a* valid pathname that will work. > >>> > >>> As for names.. VOP_CNP doesn't tell me what it does at a glance. 
Ideas: > >>> VOP_VPTOCNP (vnode to component name, or VOP_VNTOCNP) > >>> VOP_RLOOKUP (reverse lookup) > >>> > >>> We have precedent for the first form. VOP_FHTOVP(). > >>> > >>> I don't think VOP_VPTOCNP() is too unwieldy and I think it would be a > >>> little more intuitive to a casual observer. I don't want to get > >>> trapped in a bikeshed sized To:/CC: list over it though. I'd rather > >>> see it committed to head and get some progress. > >>> > >>> BTW: at work we do extensive open-by-filehandle stuff on NFS. For the > >>> vast majority of vnodes on those machines, there never was a name > >>> cache entry. It would be priceless if the vop_default readdir(..) > >>> method could discover those names in procstat etc. > >>> > >>> -- > >>> Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV > >>> "All of this is for nothing if we don't go to the stars" - JMS/B5 > >>> "If Java had true garbage collection, most programs would delete > >>> themselves upon execution." -- Robert Sewell > >>> _______________________________________________ > >>> freebsd-arch@freebsd.org mailing list > >>> http://lists.freebsd.org/mailman/listinfo/freebsd-arch > >>> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > >>> > >> > > -- > > Joe Marcus Clarke > > FreeBSD GNOME Team :: gnome@FreeBSD.org > > FreeNode / #freebsd-gnome > > http://www.FreeBSD.org/gnome > > > -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 195 bytes Desc: This is a digitally signed message part Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20081210/caa102b4/attachment.pgp From kostikbel at gmail.com Wed Dec 10 08:23:19 2008 From: kostikbel at gmail.com (Kostik Belousov) Date: Wed Dec 10 08:23:26 2008 Subject: RFC: New VOP to translate vnode to its component name In-Reply-To: <1228894344.35477.40.camel@shumai.marcuscom.com> References: <1228667168.69753.16.camel@shumai.marcuscom.com> <20081207170354.GI18652@hoeg.nl> <1228670197.69753.24.camel@shumai.marcuscom.com> <1228784805.69132.66.camel@shumai.marcuscom.com> <1228894344.35477.40.camel@shumai.marcuscom.com> Message-ID: <20081210162312.GO2038@deviant.kiev.zoral.com.ua> On Wed, Dec 10, 2008 at 02:32:24AM -0500, Joe Marcus Clarke wrote: > On Tue, 2008-12-09 at 11:22 +0000, Robert Watson wrote: > > On Mon, 8 Dec 2008, Joe Marcus Clarke wrote: > > > > >> Just to give a general vote of "we need to do something here, whether the > > >> details are exactly these or not" -- having better object->path resolution > > >> is quite important for audit, as well as if we want to adopt a file system > > >> notification services along the lines of Apple's fsevents (which is > > >> path-centric and operates from close() events rather than open() events). > > >> I don't think we should run in the Linux 'dentry' direction, but having a > > >> more robust translation service would be quite valuable. One comment: I > > >> think we should continue to have a KPI which does a sleep-free translation > > >> to call, but with weaker semantics than one that is sleepable and can query > > >> for more robust reverse lookup. > > > > > > Okay, what about a name? > > > > Oh, I do love a good bikeshed. I'm actually fine with any of these, but > > vn_fullpath_cache() sounds good to me. 
One of the cases I have in mind is
> > path-based MAC policies, which will convert from a vnode to a path while
> > holding the vnode lock -- if something is going to run around locking vnodes
> > and doing VOP_READDIR's, that is not the time to do it.
>
> I've duplicated vn_fullpath, vn_fullpath_global, and textvp_fullpath in
> this latest version. The new (well, old) functions are named *_cache.
> I even updated the vn_fullpath.9 man page.
>
> http://www.marcuscom.com/downloads/vop_vptocnp_7.diff

Main reason for vn_fullpath_cache() is to not sleep inside the function.
I think this shall be reflected in the man page changes you did.

I do not like having the old code pasted into the vfs_cache.c together
with new implementation. I think that vn_fullpath1 can take a flag
specifying whether to call the vop, and return ENOENT when call is
disabled. This shall give the same effect without code bloat.

From marcus at FreeBSD.org Wed Dec 10 15:23:18 2008
From: marcus at FreeBSD.org (Joe Marcus Clarke)
Date: Wed Dec 10 15:23:25 2008
Subject: RFC: New VOP to translate vnode to its component name
In-Reply-To: <20081210162312.GO2038@deviant.kiev.zoral.com.ua>
References: <1228667168.69753.16.camel@shumai.marcuscom.com>
	<20081207170354.GI18652@hoeg.nl>
	<1228670197.69753.24.camel@shumai.marcuscom.com>
	<1228784805.69132.66.camel@shumai.marcuscom.com>
	<1228894344.35477.40.camel@shumai.marcuscom.com>
	<20081210162312.GO2038@deviant.kiev.zoral.com.ua>
Message-ID: <1228951393.39938.7.camel@shumai.marcuscom.com>

On Wed, 2008-12-10 at 18:23 +0200, Kostik Belousov wrote:
> On Wed, Dec 10, 2008 at 02:32:24AM -0500, Joe Marcus Clarke wrote:
> > On Tue, 2008-12-09 at 11:22 +0000, Robert Watson wrote:
> > > On Mon, 8 Dec 2008, Joe Marcus Clarke wrote:
> > >
> > > >> Just to give a general vote of "we need to do something here, whether the
> > > >> details are exactly these or not" -- having better object->path resolution
> > > >> is quite important for audit, as well as if we want to adopt a file system
> > > >> notification services along the lines of Apple's fsevents (which is
> > > >> path-centric and operates from close() events rather than open() events).
> > > >> I don't think we should run in the Linux 'dentry' direction, but having a
> > > >> more robust translation service would be quite valuable. One comment: I
> > > >> think we should continue to have a KPI which does a sleep-free translation
> > > >> to call, but with weaker semantics than one that is sleepable and can query
> > > >> for more robust reverse lookup.
> > > >
> > > > Okay, what about a name?
> > >
> > > Oh, I do love a good bikeshed. I'm actually fine with any of these, but
> > > vn_fullpath_cache() sounds good to me. One of the cases I have in mind is
> > > path-based MAC policies, which will convert from a vnode to a path while
> > > holding the vnode lock -- if something is going to run around locking vnodes
> > > and doing VOP_READDIR's, that is not the time to do it.
> >
> > I've duplicated vn_fullpath, vn_fullpath_global, and textvp_fullpath in
> > this latest version. The new (well, old) functions are named *_cache.
> > I even updated the vn_fullpath.9 man page.
> >
> > http://www.marcuscom.com/downloads/vop_vptocnp_7.diff
>
> Main reason for vn_fullpath_cache() is to not sleep inside the function.
> I think this shall be reflected in the man page changes you did.
>
> I do not like having the old code pasted into the vfs_cache.c together
> with new implementation. I think that vn_fullpath1 can take a flag
> specifying whether to call the vop, and return ENOENT when call is
> disabled. This shall give the same effect without code bloat.

Of course. I should have thought of that. New diff is up.

http://www.marcuscom.com/downloads/vop_vptocnp_8.diff

Joe

--
Joe Marcus Clarke
FreeBSD GNOME Team :: gnome@FreeBSD.org
FreeNode / #freebsd-gnome
http://www.FreeBSD.org/gnome

From ed at 80386.nl Thu Dec 11 09:52:27 2008
From: ed at 80386.nl (Ed Schouten)
Date: Thu Dec 11 09:52:33 2008
Subject: What about strnlen(3)?
In-Reply-To: <49381DD4.2000506@kasimir.com>
References: <49381DD4.2000506@kasimir.com>
Message-ID: <20081211175519.GD1176@hoeg.nl>

Hello all,

News flash: P1003.1 Issue 7 just got released. The spec seems to mention
this routine:

	size_t strnlen(const char *s,size_t maxlen);

Maybe we should add this one as well?

--
Ed Schouten
WWW: http://80386.nl/

From pluknet at gmail.com Thu Dec 11 10:52:52 2008
From: pluknet at gmail.com (pluknet)
Date: Thu Dec 11 10:52:58 2008
Subject: What about strnlen(3)?
In-Reply-To: <20081211175519.GD1176@hoeg.nl>
References: <49381DD4.2000506@kasimir.com>
	<20081211175519.GD1176@hoeg.nl>
Message-ID:

2008/12/11 Ed Schouten :
> Hello all,
>
> News flash: P1003.1 Issue 7 just got released. The spec seems to mention
> this routine:
>
> 	size_t strnlen(const char *s,size_t maxlen);
>
> Maybe we should add this one as well?
>

And strnlen() is used by 3rd party software, such as GlusterFS.
my 2c.
--
wbr,
pluknet

From ed at 80386.nl Thu Dec 11 11:01:47 2008
From: ed at 80386.nl (Ed Schouten)
Date: Thu Dec 11 11:01:55 2008
Subject: [Patch] strnlen(3)
In-Reply-To: <20081211175519.GD1176@hoeg.nl>
References: <49381DD4.2000506@kasimir.com>
	<20081211175519.GD1176@hoeg.nl>
Message-ID: <20081211190436.GE1176@hoeg.nl>

Skipped content of type multipart/mixed

From pluknet at gmail.com Thu Dec 11 11:34:24 2008
From: pluknet at gmail.com (pluknet)
Date: Thu Dec 11 11:34:30 2008
Subject: [Patch] strnlen(3)
In-Reply-To: <20081211190436.GE1176@hoeg.nl>
References: <49381DD4.2000506@kasimir.com>
	<20081211175519.GD1176@hoeg.nl>
	<20081211190436.GE1176@hoeg.nl>
Message-ID:

2008/12/11 Ed Schouten :
> Hello all,
>
> I've attached a patch, that adds strnlen(3) to libc. It also moves
> strndup(3) out of __BSD_VISIBLE. I'll see if it survives `make universe'
> and commit it soonish. Any comments?
>

btw, we already have strnlen under BSD license in the tree (in contribs),
which is more optimized (uses one less instruction).

> --
> Ed Schouten
> WWW: http://80386.nl/
>

--
wbr,
pluknet

From kostikbel at gmail.com Thu Dec 11 11:57:47 2008
From: kostikbel at gmail.com (Kostik Belousov)
Date: Thu Dec 11 11:57:53 2008
Subject: [Patch] strnlen(3)
In-Reply-To: <20081211190436.GE1176@hoeg.nl>
References: <49381DD4.2000506@kasimir.com>
	<20081211175519.GD1176@hoeg.nl>
	<20081211190436.GE1176@hoeg.nl>
Message-ID: <20081211195741.GW2038@deviant.kiev.zoral.com.ua>

On Thu, Dec 11, 2008 at 08:04:36PM +0100, Ed Schouten wrote:
> Hello all,
>
> I've attached a patch, that adds strnlen(3) to libc. It also moves
> strndup(3) out of __BSD_VISIBLE. I'll see if it survives `make universe'
> and commit it soonish. Any comments?

strndup shall stay under __BSD_VISIBLE, and strnlen declaration shall
go under this define too. Not doing this will pollute namespace
for the POSIX revisions we are (partially) trying to support.

I think that style recommends to put empty statements constituting
loop body on the separate line, properly indented.

From stefan at fafoe.narf.at Thu Dec 11 15:48:09 2008
From: stefan at fafoe.narf.at (Stefan Farfeleder)
Date: Thu Dec 11 15:48:15 2008
Subject: [Patch] strnlen(3)
In-Reply-To: <20081211195741.GW2038@deviant.kiev.zoral.com.ua>
References: <49381DD4.2000506@kasimir.com>
	<20081211175519.GD1176@hoeg.nl>
	<20081211190436.GE1176@hoeg.nl>
	<20081211195741.GW2038@deviant.kiev.zoral.com.ua>
Message-ID: <20081211233244.GA1414@lizard.fafoe.narf.at>

On Thu, Dec 11, 2008 at 09:57:41PM +0200, Kostik Belousov wrote:
> On Thu, Dec 11, 2008 at 08:04:36PM +0100, Ed Schouten wrote:
> > Hello all,
> >
> > I've attached a patch, that adds strnlen(3) to libc. It also moves
> > strndup(3) out of __BSD_VISIBLE. I'll see if it survives `make universe'
> > and commit it soonish. Any comments?
> strndup shall stay under __BSD_VISIBLE, and strnlen declaration shall
> go under this define too. Not doing this will pollute namespace
> for the POSIX revisions we are (partially) trying to support.

It should probably be inside #if __POSIX_VISIBLE >= 2008xx for a
suitable value of 2008xx for P1003.1 Issue 7.
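
The points raised in this sub-thread (bounded scan, declaration behind a
visibility guard, empty loop body on its own line) can be sketched as
below. This is only an illustration, not Ed's attached patch, which is
not preserved in the archive. The names ex_strnlen, EX_BSD_VISIBLE and
EX_POSIX_VISIBLE are hypothetical stand-ins for strnlen, __BSD_VISIBLE
and __POSIX_VISIBLE, and 200809 is assumed for the "2008xx" placeholder
because that is the _POSIX_VERSION value of POSIX.1-2008.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the visibility macros from sys/cdefs.h. */
#define	EX_BSD_VISIBLE		1	/* BSD namespace, on by default */
#define	EX_POSIX_VISIBLE	200112	/* an older POSIX revision */

/*
 * Declaration guarded as suggested in the thread: visible in the BSD
 * namespace, or when a new enough POSIX revision is requested.
 * 200809 is an assumption for the "2008xx" placeholder.
 */
#if EX_POSIX_VISIBLE >= 200809 || EX_BSD_VISIBLE
size_t	ex_strnlen(const char *s, size_t maxlen);
#endif

/*
 * Return the length of s, examining at most maxlen bytes.
 * Named ex_strnlen so it cannot clash with a libc strnlen().
 */
size_t
ex_strnlen(const char *s, size_t maxlen)
{
	size_t len;

	assert(s != NULL);
	/* Empty loop body on its own line, per the style remark above. */
	for (len = 0; len < maxlen && s[len] != '\0'; len++)
		;
	return (len);
}
```

With these definitions, ex_strnlen("hello", 10) yields 5 and
ex_strnlen("hello", 3) yields 3, since the scan stops at maxlen even
when no terminator has been seen.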
From kostikbel at gmail.com Fri Dec 12 02:10:12 2008
From: kostikbel at gmail.com (Kostik Belousov)
Date: Fri Dec 12 02:10:18 2008
Subject: [Patch] strnlen(3)
In-Reply-To: <20081211233244.GA1414@lizard.fafoe.narf.at>
References: <49381DD4.2000506@kasimir.com>
	<20081211175519.GD1176@hoeg.nl>
	<20081211190436.GE1176@hoeg.nl>
	<20081211195741.GW2038@deviant.kiev.zoral.com.ua>
	<20081211233244.GA1414@lizard.fafoe.narf.at>
Message-ID: <20081212101006.GZ2038@deviant.kiev.zoral.com.ua>

On Fri, Dec 12, 2008 at 12:32:45AM +0100, Stefan Farfeleder wrote:
> On Thu, Dec 11, 2008 at 09:57:41PM +0200, Kostik Belousov wrote:
> > On Thu, Dec 11, 2008 at 08:04:36PM +0100, Ed Schouten wrote:
> > > Hello all,
> > >
> > > I've attached a patch, that adds strnlen(3) to libc. It also moves
> > > strndup(3) out of __BSD_VISIBLE. I'll see if it survives `make universe'
> > > and commit it soonish. Any comments?
> > strndup shall stay under __BSD_VISIBLE, and strnlen declaration shall
> > go under this define too. Not doing this will pollute namespace
> > for the POSIX revisions we are (partially) trying to support.
>
> It should probably be inside #if __POSIX_VISIBLE >= 2008xx for a
> suitable value of 2008xx for P1003.1 Issue 7.

Exactly. Since the 2008xx infrastructure work is not done yet, we shall
keep it in BSD namespace (that is enabled by default).

From mav at FreeBSD.org Sat Dec 13 16:43:55 2008
From: mav at FreeBSD.org (Alexander Motin)
Date: Sat Dec 13 16:44:01 2008
Subject: m_devget() and const buffer
Message-ID: <4944465F.1030102@FreeBSD.org>

Hi.

Does anybody know why m_devget() receives non-const char* as first
argument? Are there any destructive copy routines used with it? I have
searched all the kernel, but haven't found any. Maybe we could change
that?

--
Alexander Motin

From imp at bsdimp.com Sun Dec 14 10:30:45 2008
From: imp at bsdimp.com (M. Warner Losh)
Date: Sun Dec 14 10:30:52 2008
Subject: Adding strndup(3) to libc viable/useful?
In-Reply-To: <200812042313.mB4NDA6j045334@hergotha.csail.mit.edu>
References: <49382502.1040403@delphij.net>
	<200812042313.mB4NDA6j045334@hergotha.csail.mit.edu>
Message-ID: <20081214.112855.-1484714042.imp@bsdimp.com>

In message: <200812042313.mB4NDA6j045334@hergotha.csail.mit.edu>
            Garrett Wollman writes:
: >FWIW, there are a bunch of other useful utility str*() and mem*()
: >functions that glibc has that we do not.
:
: Any of the functions that POSIX has adopted should definitely be
: added.

Got a list?

Warner

From bugmaster at FreeBSD.org Mon Dec 15 03:06:49 2008
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Dec 15 03:07:29 2008
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200812151106.mBFB6mbG004261@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From imp at bsdimp.com Tue Dec 16 12:21:13 2008
From: imp at bsdimp.com (M. Warner Losh)
Date: Tue Dec 16 12:21:20 2008
Subject: Removing some cruft...
Message-ID: <20081216.131845.-1739986974.imp@bsdimp.com>

I was looking at the MIPS elf stuff based on a submission of some
64-bit support.
In doing so, I discovered a number of 'unused' types
that appear to have comments that indicate that they can be removed
now and were just slavishly copied from arch to arch to arch.

/*
 * The following non-standard values are used for passing information
 * from John Polstra's testbed program to the dynamic linker. These
 * are expected to go away soon.
 *
 * Unfortunately, these overlap the Linux non-standard values, so they
 * must not be used in the same context.
 */
#define	AT_BRK		10	/* Starting point for sbrk and brk. */
#define	AT_DEBUG	11	/* Debugging level. */

These have been slavishly copied to arm, powerpc, sparc64, ia64, mips,
sun4v and amd64. All these files have nearly identical comments
(except powerpc, which changes the value).

The only place these are used in the kernel is in the Linux!
emulation in i386/linux/linux_sysvec.c and
amd64/linux32/linux32_sysvec.c:

	if (args->trace)
		AUXARGS_ENTRY(pos, AT_DEBUG, 1);

Since AT_DEBUG and AT_UID have the same value, and we look at AT_UID
later, we wind up passing the wrong value for AT_UID. Fortunately, we
don't use AT_UID for anything in the tree....

So I'd like to remove all this stuff unless there's a compelling
reason to keep it.

Can anybody think of a reason to keep it? It seems completely
non-functional...

Warner

From rdivacky at freebsd.org Tue Dec 16 13:49:40 2008
From: rdivacky at freebsd.org (Roman Divacky)
Date: Tue Dec 16 13:49:46 2008
Subject: Removing some cruft...
In-Reply-To: <20081216.131845.-1739986974.imp@bsdimp.com>
References: <20081216.131845.-1739986974.imp@bsdimp.com>
Message-ID: <20081216212746.GA28834@freebsd.org>

On Tue, Dec 16, 2008 at 01:18:45PM -0700, M. Warner Losh wrote:
> I was looking at the MIPS elf stuff based on a submission of some
> 64-bit support. In doing so, I discovered a number of 'unused' types
> that appear to have comments that indicate that they can be removed
> now and were just slavishly copied from arch to arch to arch.
>
> /*
>  * The following non-standard values are used for passing information
>  * from John Polstra's testbed program to the dynamic linker. These
>  * are expected to go away soon.
>  *
>  * Unfortunately, these overlap the Linux non-standard values, so they
>  * must not be used in the same context.
>  */
> #define AT_BRK 10 /* Starting point for sbrk and brk. */
> #define AT_DEBUG 11 /* Debugging level. */
>
> These have been slavishly copied to arm, powerpc, sparc64, ia64, mips,
> sun4v and amd64. All these files have nearly identical comments
> (except powerpc, which changes the value).
>
> The only place these are used in the kernel is in the Linux!
> emulation in i386/linux/linux_sysvec.c and
> amd64/linux32/linux32_sysvec.c:
>
> 	if (args->trace)
> 		AUXARGS_ENTRY(pos, AT_DEBUG, 1);
>
> Since AT_DEBUG and AT_UID have the same value, and we look at AT_UID
> later, we wind up passing the wrong value for AT_UID. Fortunately, we
> don't use AT_UID for anything in the tree....

I cannot find any reference of AT_DEBUG in linux 2.6.16 sources and it
indeed looks bogus...

From imp at bsdimp.com Tue Dec 16 15:19:26 2008
From: imp at bsdimp.com (M. Warner Losh)
Date: Tue Dec 16 15:19:32 2008
Subject: Removing some cruft...
In-Reply-To: <20081216212746.GA28834@freebsd.org>
References: <20081216.131845.-1739986974.imp@bsdimp.com>
	<20081216212746.GA28834@freebsd.org>
Message-ID: <20081216.161638.644659879.imp@bsdimp.com>

In message: <20081216212746.GA28834@freebsd.org>
            Roman Divacky writes:
: On Tue, Dec 16, 2008 at 01:18:45PM -0700, M. Warner Losh wrote:
: > I was looking at the MIPS elf stuff based on a submission of some
: > 64-bit support. In doing so, I discovered a number of 'unused' types
: > that appear to have comments that indicate that they can be removed
: > now and were just slavishly copied from arch to arch to arch.
: >
: > /*
: >  * The following non-standard values are used for passing information
: >  * from John Polstra's testbed program to the dynamic linker. These
: >  * are expected to go away soon.
: >  *
: >  * Unfortunately, these overlap the Linux non-standard values, so they
: >  * must not be used in the same context.
: >  */
: > #define AT_BRK 10 /* Starting point for sbrk and brk. */
: > #define AT_DEBUG 11 /* Debugging level. */
: >
: > These have been slavishly copied to arm, powerpc, sparc64, ia64, mips,
: > sun4v and amd64. All these files have nearly identical comments
: > (except powerpc, which changes the value).
: >
: > The only place these are used in the kernel is in the Linux!
: > emulation in i386/linux/linux_sysvec.c and
: > amd64/linux32/linux32_sysvec.c:
: >
: > 	if (args->trace)
: > 		AUXARGS_ENTRY(pos, AT_DEBUG, 1);
: >
: > Since AT_DEBUG and AT_UID have the same value, and we look at AT_UID
: > later, we wind up passing the wrong value for AT_UID. Fortunately, we
: > don't use AT_UID for anything in the tree....
:
: I cannot find any reference of AT_DEBUG in linux 2.6.16 sources and it
: indeed looks bogus...

What do you think of the following patch?

Warner

Index: sys/amd64/linux32/linux32_sysvec.c
===================================================================
--- sys/amd64/linux32/linux32_sysvec.c	(revision 186097)
+++ sys/amd64/linux32/linux32_sysvec.c	(working copy)
@@ -254,8 +254,6 @@
 	args = (Elf32_Auxargs *)imgp->auxargs;
 	pos = base + (imgp->args->argc + imgp->args->envc + 2);
 
-	if (args->trace)
-		AUXARGS_ENTRY_32(pos, AT_DEBUG, 1);
 	if (args->execfd != -1)
 		AUXARGS_ENTRY_32(pos, AT_EXECFD, args->execfd);
 	AUXARGS_ENTRY_32(pos, AT_PHDR, args->phdr);
Index: sys/i386/linux/linux_sysvec.c
===================================================================
--- sys/i386/linux/linux_sysvec.c	(revision 186097)
+++ sys/i386/linux/linux_sysvec.c	(working copy)
@@ -245,8 +245,6 @@
 	args = (Elf32_Auxargs *)imgp->auxargs;
 	pos = *stack_base + (imgp->args->argc + imgp->args->envc + 2);
 
-	if (args->trace)
-		AUXARGS_ENTRY(pos, AT_DEBUG, 1);
 	if (args->execfd != -1)
 		AUXARGS_ENTRY(pos, AT_EXECFD, args->execfd);
 	AUXARGS_ENTRY(pos, AT_PHDR, args->phdr);

From peter at wemm.org Tue Dec 16 19:33:49 2008
From: peter at wemm.org (Peter Wemm)
Date: Tue Dec 16 19:33:56 2008
Subject: Removing some cruft...
In-Reply-To: <20081216.131845.-1739986974.imp@bsdimp.com>
References: <20081216.131845.-1739986974.imp@bsdimp.com>
Message-ID:

On Tue, Dec 16, 2008 at 12:18 PM, M. Warner Losh wrote:
> I was looking at the MIPS elf stuff based on a submission of some
> 64-bit support. In doing so, I discovered a number of 'unused' types
> that appear to have comments that indicate that they can be removed
> now and were just slavishly copied from arch to arch to arch.
>
> /*
>  * The following non-standard values are used for passing information
>  * from John Polstra's testbed program to the dynamic linker. These
>  * are expected to go away soon.
>  *
>  * Unfortunately, these overlap the Linux non-standard values, so they
>  * must not be used in the same context.
>  */
> #define AT_BRK 10 /* Starting point for sbrk and brk. */
> #define AT_DEBUG 11 /* Debugging level. */
>
> These have been slavishly copied to arm, powerpc, sparc64, ia64, mips,
> sun4v and amd64. All these files have nearly identical comments
> (except powerpc, which changes the value).
[..]
> So I'd like to remove all this stuff unless there's a compelling
> reason to keep it.
>
> Can anybody think of a reason to keep it? It seems completely
> non-functional...

Remove it completely. It probably should never have been committed to
the tree in the first place. In either case, it has been OBE for a
good ~10 years.

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell

From ps at mu.org Wed Dec 17 00:20:53 2008
From: ps at mu.org (Paul Saab)
Date: Wed Dec 17 00:21:00 2008
Subject: UMA & mbuf cache utilization.
In-Reply-To: <20081209155714.K960@desktop>
References: <20081209155714.K960@desktop>
Message-ID: <5c0ff6a70812162349n38395f84o45020f334cd09853@mail.gmail.com>

So far testing has shown in a pure transmit test, that this doesn't hurt
performance at all.

On Tue, Dec 9, 2008 at 6:22 PM, Jeff Roberson wrote:

> Hello,
>
> Nokia has graciously allowed me to release a patch which I developed to
> improve general mbuf and cluster cache behavior. This is based on others'
> observations that due to simple alignment at 2k and 256k we achieve a poor
> cache distribution for the header area of packets and the most heavily used
> mbuf header fields. In addition, modern machines stripe memory access
> across several memories and even memory controllers. Accessing heavily
> aligned locations such as these can also create load imbalances among
> memories.
>
> To solve this problem I have added two new features to UMA. The first is
> the zone flag UMA_ZONE_CACHESPREAD. This flag modifies the meaning of the
> alignment field such that start addresses are staggered by at least align +
> 1 bytes. In the case of clusters and mbufs this means adding
> uma_cache_align + 1 bytes to the amount of storage allocated. This creates
> a certain constant amount of waste, 3% and 12% respectively. It also means
> we must use contiguous physical and virtual memory consisting of several
> pages to efficiently use the memory and land on as many cache lines as
> possible.
>
> Because contiguous physical memory is not always available, the allocator
> had to have a fallback mechanism. We don't simply want to have all mbuf
> allocations check two zones as once we deplete available contiguous memory
> the check on the first zone will always fail using the most expensive code
> path.
>
> To resolve this issue, I added the ability for secondary zones to stack on
> top of multiple primary zones. Secondary zones are zones which get their
> storage from another zone but handle their own caching, ctors, dtors, etc.
> By adding this feature a secondary zone can be created that can allocate
> either from the contiguous memory pool or the non-contiguous single-page
> pool depending on availability. It is also much faster to fail between them
> deep in the allocator because it is only required when we exhaust the
> already available mbuf memory.
>
> For mbufs and clusters there are now three zones each. A contigmalloc
> backed zone, a single-page allocator zone, and a secondary zone with the
> original zone_mbuf or zone_clust name. The packet zone also takes from both
> available mbuf zones. The individual backend zones are not exposed outside
> of kern_mbuf.c.
>
> Currently, each backend zone can have its own limit. The secondary zone
> only blocks when both are full. Statistic wise the limit should be reported
> as the sum of the backend limits, however, that isn't presently done. The
> secondary zone can not have its own limit independent of the backends at
> this time. I'm not sure if that's valuable or not.
>
> I have test results from nokia which show a dramatic improvement in several
> workloads but which I am probably not at liberty to discuss. I'm in the
> process of convincing Kip to help me get some benchmark data on our stack.
>
> Also as part of the patch I renamed a few functions since many were
> non-obvious and grew new keg abstractions to tidy things up a bit. I
> suspect those of you with UMA experience (robert, bosko) will find the
> renaming a welcome improvement.
>
> The patch is available at:
> http://people.freebsd.org/~jeff/mbuf_contig.diff
>
> I would love to hear any feedback you may have. I have been developing
> this and testing various versions off and on for months, however, this is a
> fresh port to current and it is a little green so should be considered
> experimental.
>
> In particular, I'm most nervous about how the vm will respond to new
> pressure on contig physical pages. I'm also interested in hearing from
> embedded/limited memory people about how we might want to limit or tune
> this.
>
> Thanks,
> Jeff

From dchagin at freebsd.org Wed Dec 17 09:38:50 2008
From: dchagin at freebsd.org (Chagin Dmitry)
Date: Wed Dec 17 09:38:58 2008
Subject: Removing some cruft...
In-Reply-To: <20081216.161638.644659879.imp@bsdimp.com>
References: <20081216.131845.-1739986974.imp@bsdimp.com>
	<20081216212746.GA28834@freebsd.org>
	<20081216.161638.644659879.imp@bsdimp.com>
Message-ID: <20081217172047.GA2884@dchagin.dialup.corbina.ru>

On Tue, Dec 16, 2008 at 04:16:38PM -0700, M. Warner Losh wrote:
> In message: <20081216212746.GA28834@freebsd.org>
>             Roman Divacky writes:
> : On Tue, Dec 16, 2008 at 01:18:45PM -0700, M. Warner Losh wrote:
> : > I was looking at the MIPS elf stuff based on a submission of some
> : > 64-bit support. In doing so, I discovered a number of 'unused' types
> : > that appear to have comments that indicate that they can be removed
> : > now and were just slavishly copied from arch to arch to arch.
> : >
> : > /*
> : >  * The following non-standard values are used for passing information
> : >  * from John Polstra's testbed program to the dynamic linker. These
> : >  * are expected to go away soon.
> : >  *
> : >  * Unfortunately, these overlap the Linux non-standard values, so they
> : >  * must not be used in the same context.
> : >  */
> : > #define AT_BRK 10 /* Starting point for sbrk and brk. */
> : > #define AT_DEBUG 11 /* Debugging level. */
> : >
> : > These have been slavishly copied to arm, powerpc, sparc64, ia64, mips,
> : > sun4v and amd64. All these files have nearly identical comments
> : > (except powerpc, which changes the value).
> : >
> : > The only place these are used in the kernel is in the Linux!
> : > emulation in i386/linux/linux_sysvec.c and
> : > amd64/linux32/linux32_sysvec.c:
> : >
> : > 	if (args->trace)
> : > 		AUXARGS_ENTRY(pos, AT_DEBUG, 1);
> : >
> : > Since AT_DEBUG and AT_UID have the same value, and we look at AT_UID
> : > later, we wind up passing the wrong value for AT_UID. Fortunately, we
> : > don't use AT_UID for anything in the tree....
> :
> : I cannot find any reference of AT_DEBUG in linux 2.6.16 sources and it
> : indeed looks bogus...
>
> What do you think of the following patch?
>
> Warner
>

Hi, I am ready to offer a more radical patch :)

Move all Linux aux entry types to a new file, compat/linux/linux_elf.h.
Add two new aux entries which improve the work of futexes.

Please review. thnx!
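
The collision that motivates both patches in this thread can be sketched
in isolation. The values for AT_DEBUG and AT_UID (both 11) are the ones
quoted from the headers above; everything else is hypothetical: the
struct ex_auxent layout, the first-match lookup ex_auxv_find(), and the
helper ex_uid_seen_by_consumer() are simplified stand-ins, and a real
consumer of the aux vector may walk it differently.

```c
#include <assert.h>
#include <stddef.h>

/* Values as quoted in the thread; names prefixed to mark them as examples. */
#define	EX_AT_NULL	0	/* end of vector */
#define	EX_AT_DEBUG	11	/* FreeBSD testbed value */
#define	EX_AT_UID	11	/* Linux value: real uid */

/* Hypothetical, simplified aux entry: a (type, value) pair. */
struct ex_auxent {
	long	a_type;
	long	a_val;
};

/*
 * Hypothetical first-match lookup over an aux vector.
 * Returns 1 and stores the value if the type is present, 0 otherwise.
 */
int
ex_auxv_find(const struct ex_auxent *av, long type, long *val)
{
	assert(av != NULL && val != NULL);
	for (; av->a_type != EX_AT_NULL; av++) {
		if (av->a_type == type) {
			*val = av->a_val;
			return (1);
		}
	}
	return (0);
}

/*
 * Emit entries in the order the old fixup code did (debug entry first
 * when tracing, uid later) and report what a lookup of the uid finds.
 */
long
ex_uid_seen_by_consumer(int trace, long real_uid)
{
	struct ex_auxent av[3];
	size_t n = 0;
	long val = -1;

	if (trace) {
		av[n].a_type = EX_AT_DEBUG;	/* if (args->trace) ... */
		av[n].a_val = 1;
		n++;
	}
	av[n].a_type = EX_AT_UID;
	av[n].a_val = real_uid;
	n++;
	av[n].a_type = EX_AT_NULL;
	av[n].a_val = 0;

	(void)ex_auxv_find(av, EX_AT_UID, &val);
	return (val);
}
```

Under these assumptions a lookup of the uid returns the debug value 1
when tracing is on, and the real uid otherwise. Giving the Linux types
their own LINUX_AT_* names, as the patch below does, keeps the two
namespaces from being mixed up in the source.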

diff --git a/sys/amd64/include/elf.h b/sys/amd64/include/elf.h
index a4c7f79..3c2cd20 100644
--- a/sys/amd64/include/elf.h
+++ b/sys/amd64/include/elf.h
@@ -81,16 +81,8 @@ __ElfType(Auxinfo);
 #define	AT_BASE		7	/* Interpreter's base address. */
 #define	AT_FLAGS	8	/* Flags (unused for i386). */
 #define	AT_ENTRY	9	/* Where interpreter should transfer control. */
-/*
- * The following non-standard values are used in Linux ELF binaries.
- */
-#define	AT_NOTELF	10	/* Program is not ELF ?? */
-#define	AT_UID		11	/* Real uid. */
-#define	AT_EUID		12	/* Effective uid. */
-#define	AT_GID		13	/* Real gid. */
-#define	AT_EGID		14	/* Effective gid. */
 
-#define	AT_COUNT	15	/* Count of defined aux entry types. */
+#define	AT_COUNT	10	/* Count of defined aux entry types. */
 
 /*
  * Relocation types.
diff --git a/sys/amd64/linux32/linux.h b/sys/amd64/linux32/linux.h
index e0ffcdf..3f04555 100644
--- a/sys/amd64/linux32/linux.h
+++ b/sys/amd64/linux32/linux.h
@@ -108,6 +108,12 @@ typedef struct {
 
 #define	LINUX_CTL_MAXNAME	10
 
+#define	LINUX_AT_SYSINFO	32
+#define	LINUX_AT_SYSINFO_EHDR	33
+#define	LINUX_AT_COUNT		16	/* Count of used aux entry types.
+					 * Keep this synchronized with
+					 * elf_linux_fixup() code.
+					 */
 struct l___sysctl_args {
 	l_uintptr_t	name;
diff --git a/sys/amd64/linux32/linux32_sysvec.c b/sys/amd64/linux32/linux32_sysvec.c
index aaa7458..2777e84 100644
--- a/sys/amd64/linux32/linux32_sysvec.c
+++ b/sys/amd64/linux32/linux32_sysvec.c
@@ -76,6 +76,7 @@ __FBSDID("$FreeBSD$");
 #include
 #include
+#include
 #include
 #include
 #include
@@ -106,6 +107,8 @@ MALLOC_DEFINE(M_LINUX, "linux", "Linux mode structures");
 #define	LINUX_SYS_linux_rt_sendsig	0
 #define	LINUX_SYS_linux_sendsig		0
 
+const char linux_platform[] = "i686";
+static int linux_szplatform;
 extern char linux_sigcode[];
 extern int linux_szsigcode;
@@ -246,7 +249,12 @@ elf_linux_fixup(register_t **stack_base, struct image_params *imgp)
 {
 	Elf32_Auxargs *args;
 	Elf32_Addr *base;
-	Elf32_Addr *pos;
+	Elf32_Addr *pos, *uplatform;
+	struct linux32_ps_strings *arginfo;
+
+	arginfo = (struct linux32_ps_strings *)LINUX32_PS_STRINGS;
+	uplatform = (Elf32_Addr *)((caddr_t)arginfo - linux_szsigcode -
+	    linux_szplatform);
 
 	KASSERT(curthread->td_proc == imgp->proc,
 	    ("unsafe elf_linux_fixup(), should be curproc"));
@@ -254,8 +262,8 @@ elf_linux_fixup(register_t **stack_base, struct image_params *imgp)
 	args = (Elf32_Auxargs *)imgp->auxargs;
 	pos = base + (imgp->args->argc + imgp->args->envc + 2);
 
-	if (args->execfd != -1)
-		AUXARGS_ENTRY_32(pos, AT_EXECFD, args->execfd);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_HWCAP, cpu_feature);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_CLKTCK, hz);
 	AUXARGS_ENTRY_32(pos, AT_PHDR, args->phdr);
 	AUXARGS_ENTRY_32(pos, AT_PHENT, args->phent);
 	AUXARGS_ENTRY_32(pos, AT_PHNUM, args->phnum);
@@ -263,10 +271,14 @@ elf_linux_fixup(register_t **stack_base, struct image_params *imgp)
 	AUXARGS_ENTRY_32(pos, AT_FLAGS, args->flags);
 	AUXARGS_ENTRY_32(pos, AT_ENTRY, args->entry);
 	AUXARGS_ENTRY_32(pos, AT_BASE, args->base);
-	AUXARGS_ENTRY_32(pos, AT_UID, imgp->proc->p_ucred->cr_ruid);
-	AUXARGS_ENTRY_32(pos, AT_EUID, imgp->proc->p_ucred->cr_svuid);
-	AUXARGS_ENTRY_32(pos, AT_GID, imgp->proc->p_ucred->cr_rgid);
-	AUXARGS_ENTRY_32(pos, AT_EGID, imgp->proc->p_ucred->cr_svgid);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_SECURE, 0);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_UID, imgp->proc->p_ucred->cr_ruid);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_EUID, imgp->proc->p_ucred->cr_svuid);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_GID, imgp->proc->p_ucred->cr_rgid);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_EGID, imgp->proc->p_ucred->cr_svgid);
+	AUXARGS_ENTRY_32(pos, LINUX_AT_PLATFORM, PTROUT(uplatform));
+	if (args->execfd != -1)
+		AUXARGS_ENTRY_32(pos, AT_EXECFD, args->execfd);
 	AUXARGS_ENTRY_32(pos, AT_NULL, 0);
 
 	free(imgp->auxargs, M_TEMP);
@@ -851,27 +863,31 @@ static register_t *
 linux_copyout_strings(struct image_params *imgp)
 {
 	int argc, envc;
-	u_int32_t *vectp;
+	uint32_t *vectp;
 	char *stringp, *destp;
-	u_int32_t *stack_base;
+	uint32_t *stack_base;
 	struct linux32_ps_strings *arginfo;
-	int sigcodesz;
 
 	/*
 	 * Calculate string base and vector table pointers.
 	 * Also deal with signal trampoline code for this exec type.
 	 */
 	arginfo = (struct linux32_ps_strings *)LINUX32_PS_STRINGS;
-	sigcodesz = *(imgp->proc->p_sysent->sv_szsigcode);
-	destp = (caddr_t)arginfo - sigcodesz - SPARE_USRSPACE -
-		roundup((ARG_MAX - imgp->args->stringspace), sizeof(char *));
+	destp = (caddr_t)arginfo - linux_szsigcode - SPARE_USRSPACE -
+	    linux_szplatform - roundup((ARG_MAX - imgp->args->stringspace),
+	    sizeof(char *));
 
 	/*
 	 * install sigcode
 	 */
-	if (sigcodesz)
-		copyout(imgp->proc->p_sysent->sv_sigcode,
-			((caddr_t)arginfo - sigcodesz), sigcodesz);
+	copyout(imgp->proc->p_sysent->sv_sigcode,
+	    ((caddr_t)arginfo - linux_szsigcode), linux_szsigcode);
+
+	/*
+	 * Install LINUX_PLATFORM
+	 */
+	copyout(linux_platform, ((caddr_t)arginfo - linux_szsigcode -
+	    linux_szplatform), linux_szplatform);
 
 	/*
 	 * If we have a valid auxargs ptr, prepare some room
@@ -883,22 +899,22 @@ linux_copyout_strings(struct image_params *imgp)
 		 * lower compatibility.
 		 */
 		imgp->auxarg_size = (imgp->auxarg_size) ? imgp->auxarg_size
-			: (AT_COUNT * 2);
+		    : (LINUX_AT_COUNT * 2);
 		/*
 		 * The '+ 2' is for the null pointers at the end of each of
 		 * the arg and env vector sets,and imgp->auxarg_size is room
 		 * for argument of Runtime loader.
 		 */
-		vectp = (u_int32_t *) (destp - (imgp->args->argc + imgp->args->envc + 2 +
-			imgp->auxarg_size) * sizeof(u_int32_t));
+		vectp = (uint32_t *) (destp - (imgp->args->argc +
+		    imgp->args->envc + 2 + imgp->auxarg_size) * sizeof(uint32_t));
 
 	} else
 		/*
 		 * The '+ 2' is for the null pointers at the end of each of
 		 * the arg and env vector sets
 		 */
-		vectp = (u_int32_t *)
-			(destp - (imgp->args->argc + imgp->args->envc + 2) * sizeof(u_int32_t));
+		vectp = (uint32_t *) (destp - (imgp->args->argc +
+		    imgp->args->envc + 2) * sizeof(uint32_t));
 
 	/*
 	 * vectp also becomes our initial stack base
@@ -916,14 +932,14 @@ linux_copyout_strings(struct image_params *imgp)
 	/*
 	 * Fill in "ps_strings" struct for ps, w, etc.
 	 */
-	suword32(&arginfo->ps_argvstr, (u_int32_t)(intptr_t)vectp);
+	suword32(&arginfo->ps_argvstr, (uint32_t)(intptr_t)vectp);
 	suword32(&arginfo->ps_nargvstr, argc);
 
 	/*
 	 * Fill in argument portion of vector table.
 	 */
 	for (; argc > 0; --argc) {
-		suword32(vectp++, (u_int32_t)(intptr_t)destp);
+		suword32(vectp++, (uint32_t)(intptr_t)destp);
 		while (*stringp++ != 0)
 			destp++;
 		destp++;
@@ -932,14 +948,14 @@ linux_copyout_strings(struct image_params *imgp)
 	/* a null vector table pointer separates the argp's from the envp's */
 	suword32(vectp++, 0);
 
-	suword32(&arginfo->ps_envstr, (u_int32_t)(intptr_t)vectp);
+	suword32(&arginfo->ps_envstr, (uint32_t)(intptr_t)vectp);
 	suword32(&arginfo->ps_nenvstr, envc);
 
 	/*
 	 * Fill in environment portion of vector table.
*/ for (; envc > 0; --envc) { - suword32(vectp++, (u_int32_t)(intptr_t)destp); + suword32(vectp++, (uint32_t)(intptr_t)destp); while (*stringp++ != 0) destp++; destp++; @@ -1088,6 +1104,8 @@ linux_elf_modevent(module_t mod, int type, void *data) NULL, 1000); if (bootverbose) printf("Linux ELF exec handler installed\n"); + linux_szplatform = roundup(strlen(linux_platform) + 1, + sizeof(char *)); } else printf("cannot insert Linux ELF brand handler\n"); break; diff --git a/sys/arm/include/elf.h b/sys/arm/include/elf.h index c516864..48260e1 100644 --- a/sys/arm/include/elf.h +++ b/sys/arm/include/elf.h @@ -70,13 +70,8 @@ __ElfType(Auxinfo); #define AT_BASE 7 /* Interpreter's base address. */ #define AT_FLAGS 8 /* Flags (unused). */ #define AT_ENTRY 9 /* Where interpreter should transfer control. */ -#define AT_NOTELF 10 /* Program is not ELF ?? */ -#define AT_UID 11 /* Real uid. */ -#define AT_EUID 12 /* Effective uid. */ -#define AT_GID 13 /* Real gid. */ -#define AT_EGID 14 /* Effective gid. */ -#define AT_COUNT 15 /* Count of defined aux entry types. */ +#define AT_COUNT 10 /* Count of defined aux entry types. */ #define R_ARM_COUNT 33 /* Count of defined relocation types. */ diff --git a/sys/compat/linux/linux_elf.h b/sys/compat/linux/linux_elf.h new file mode 100644 index 0000000..680e39b --- /dev/null +++ b/sys/compat/linux/linux_elf.h @@ -0,0 +1,50 @@ +/*- + * Copyright (c) 2008 Chagin Dmitry + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer + * in this position and unchanged. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * $FreeBSD$ + */ + +#ifndef _LINUX_ELF_H_ +#define _LINUX_ELF_H_ + +/* + * Non-standard aux entry types used in Linux ELF binaries. + */ + +#define LINUX_AT_NOTELF 10 /* Program is not ELF ?? */ +#define LINUX_AT_UID 11 /* Real uid. */ +#define LINUX_AT_EUID 12 /* Effective uid. */ +#define LINUX_AT_GID 13 /* Real gid. */ +#define LINUX_AT_EGID 14 /* Effective gid. */ +#define LINUX_AT_PLATFORM 15 /* String identifying CPU */ +#define LINUX_AT_HWCAP 16 /* CPU capabilities */ +#define LINUX_AT_CLKTCK 17 /* frequency at which times() increments */ +#define LINUX_AT_SECURE 23 /* secure mode boolean */ +#define LINUX_AT_BASE_PLATFORM 24 /* string identifying real platform, may + * differ from AT_PLATFORM. + */ +#define LINUX_AT_EXECFN 31 /* filename of program */ + +#endif /* !_LINUX_ELF_H_ */ diff --git a/sys/compat/linux/linux_misc.c b/sys/compat/linux/linux_misc.c index 93f4297..cf14da3 100644 --- a/sys/compat/linux/linux_misc.c +++ b/sys/compat/linux/linux_misc.c @@ -92,10 +92,6 @@ __FBSDID("$FreeBSD$"); #include #include -#ifdef __i386__ -#include -#endif - #define BSD_TO_LINUX_SIGNAL(sig) \ (((sig) <= LINUX_SIGTBLSZ) ? 
bsd_to_linux_signal[_SIG_IDX(sig)] : sig) @@ -731,34 +727,8 @@ linux_newuname(struct thread *td, struct linux_newuname_args *args) *p = '\0'; break; } -#ifdef __i386__ - { - const char *class; + strlcpy(utsname.machine, linux_platform, LINUX_MAX_UTSNAME); - switch (cpu_class) { - case CPUCLASS_686: - class = "i686"; - break; - case CPUCLASS_586: - class = "i586"; - break; - case CPUCLASS_486: - class = "i486"; - break; - default: - class = "i386"; - } - strlcpy(utsname.machine, class, LINUX_MAX_UTSNAME); - } -#elif defined(__amd64__) /* XXX: Linux can change 'personality'. */ -#ifdef COMPAT_LINUX32 - strlcpy(utsname.machine, "i686", LINUX_MAX_UTSNAME); -#else - strlcpy(utsname.machine, "x86_64", LINUX_MAX_UTSNAME); -#endif /* COMPAT_LINUX32 */ -#else /* something other than i386 or amd64 - assume we and Linux agree */ - strlcpy(utsname.machine, machine, LINUX_MAX_UTSNAME); -#endif /* __i386__ */ mtx_lock(&hostname_mtx); strlcpy(utsname.domainname, V_domainname, LINUX_MAX_UTSNAME); mtx_unlock(&hostname_mtx); diff --git a/sys/compat/linux/linux_misc.h b/sys/compat/linux/linux_misc.h index c80a432..2cdb3c3 100644 --- a/sys/compat/linux/linux_misc.h +++ b/sys/compat/linux/linux_misc.h @@ -45,4 +45,6 @@ #define LINUX_MREMAP_MAYMOVE 1 #define LINUX_MREMAP_FIXED 2 +extern const char linux_platform[]; + #endif /* _LINUX_MISC_H_ */ diff --git a/sys/compat/svr4/svr4.h b/sys/compat/svr4/svr4.h index 84ee720..261e3e9 100644 --- a/sys/compat/svr4/svr4.h +++ b/sys/compat/svr4/svr4.h @@ -36,4 +36,11 @@ extern struct sysentvec svr4_sysvec; #define COMPAT_SVR4_SOLARIS2 -#endif +#define SVR4_AT_UID 11 /* Real uid. */ +#define SVR4_AT_EUID 12 /* Effective uid. */ +#define SVR4_AT_GID 13 /* Real gid. */ +#define SVR4_AT_EGID 14 /* Effective gid. */ + +#define SVR4_AT_COUNT 15 /* Count of defined aux entry types. 
*/ + +#endif /* !_LINUX_ELF_H_ */ diff --git a/sys/compat/svr4/svr4_sysvec.c b/sys/compat/svr4/svr4_sysvec.c index 0030e3a..24a742c 100644 --- a/sys/compat/svr4/svr4_sysvec.c +++ b/sys/compat/svr4/svr4_sysvec.c @@ -163,6 +163,115 @@ extern struct sysent svr4_sysent[]; extern int svr4_szsigcode; extern char svr4_sigcode[]; + +/* + * Copy strings out to the new process address space, constructing new arg + * and env vector tables. Return a pointer to the base so that it can be used + * as the initial stack pointer. + */ +static register_t * +svr4_copyout_strings(struct image_params *imgp) +{ + int argc, envc; + char **vectp; + char *stringp, *destp; + register_t *stack_base; + struct ps_strings *arginfo; + struct proc *p; + + /* + * Calculate string base and vector table pointers. + * Also deal with signal trampoline code for this exec type. + */ + p = imgp->proc; + arginfo = (struct ps_strings *)p->p_sysent->sv_psstrings; + destp = (caddr_t)arginfo - svr4_szsigcode - SPARE_USRSPACE - + roundup((ARG_MAX - imgp->args->stringspace), sizeof(char *)); + + copyout(p->p_sysent->sv_sigcode, ((caddr_t)arginfo - + svr4_szsigcode), svr4_szsigcode); + + /* + * If we have a valid auxargs ptr, prepare some room + * on the stack. + */ + if (imgp->auxargs) { + /* + * 'AT_COUNT*2' is size for the ELF Auxargs data. This is for + * lower compatibility. + */ + imgp->auxarg_size = (imgp->auxarg_size) ? imgp->auxarg_size : + (SVR4_AT_COUNT * 2); + /* + * The '+ 2' is for the null pointers at the end of each of + * the arg and env vector sets,and imgp->auxarg_size is room + * for argument of Runtime loader. 
+ */ + vectp = (char **)(destp - (imgp->args->argc + + imgp->args->envc + 2 + imgp->auxarg_size) * + sizeof(char *)); + + } else { + /* + * The '+ 2' is for the null pointers at the end of each of + * the arg and env vector sets + */ + vectp = (char **)(destp - (imgp->args->argc + imgp->args->envc + 2) * + sizeof(char *)); + } + + /* + * vectp also becomes our initial stack base + */ + stack_base = (register_t *)vectp; + + stringp = imgp->args->begin_argv; + argc = imgp->args->argc; + envc = imgp->args->envc; + + /* + * Copy out strings - arguments and environment. + */ + copyout(stringp, destp, ARG_MAX - imgp->args->stringspace); + + /* + * Fill in "ps_strings" struct for ps, w, etc. + */ + suword(&arginfo->ps_argvstr, (long)(intptr_t)vectp); + suword(&arginfo->ps_nargvstr, argc); + + /* + * Fill in argument portion of vector table. + */ + for (; argc > 0; --argc) { + suword(vectp++, (long)(intptr_t)destp); + while (*stringp++ != 0) + destp++; + destp++; + } + + /* a null vector table pointer separates the argp's from the envp's */ + suword(vectp++, 0); + + suword(&arginfo->ps_envstr, (long)(intptr_t)vectp); + suword(&arginfo->ps_nenvstr, envc); + + /* + * Fill in environment portion of vector table. 
+ */ + for (; envc > 0; --envc) { + suword(vectp++, (long)(intptr_t)destp); + while (*stringp++ != 0) + destp++; + destp++; + } + + /* end of vector table is a null pointer */ + suword(vectp, 0); + + return (stack_base); +} + struct sysentvec svr4_sysvec = { .sv_size = SVR4_SYS_MAXSYSCALL, .sv_table = svr4_sysent, @@ -187,7 +296,7 @@ struct sysentvec svr4_sysvec = { .sv_usrstack = USRSTACK, .sv_psstrings = PS_STRINGS, .sv_stackprot = VM_PROT_ALL, - .sv_copyout_strings = exec_copyout_strings, + .sv_copyout_strings = svr4_copyout_strings, .sv_setregs = exec_setregs, .sv_fixlimit = NULL, .sv_maxssiz = NULL, @@ -227,10 +336,10 @@ svr4_fixup(register_t **stack_base, struct image_params *imgp) AUXARGS_ENTRY(pos, AT_FLAGS, args->flags); AUXARGS_ENTRY(pos, AT_ENTRY, args->entry); AUXARGS_ENTRY(pos, AT_BASE, args->base); - AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_ucred->cr_ruid); - AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_ucred->cr_svuid); - AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_ucred->cr_rgid); - AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_ucred->cr_svgid); + AUXARGS_ENTRY(pos, SVR4_AT_UID, imgp->proc->p_ucred->cr_ruid); + AUXARGS_ENTRY(pos, SVR4_AT_EUID, imgp->proc->p_ucred->cr_svuid); + AUXARGS_ENTRY(pos, SVR4_AT_GID, imgp->proc->p_ucred->cr_rgid); + AUXARGS_ENTRY(pos, SVR4_AT_EGID, imgp->proc->p_ucred->cr_svgid); AUXARGS_ENTRY(pos, AT_NULL, 0); free(imgp->auxargs, M_TEMP); @@ -307,3 +416,4 @@ static moduledata_t svr4_elf_mod = { }; DECLARE_MODULE(svr4elf, svr4_elf_mod, SI_SUB_EXEC, SI_ORDER_ANY); MODULE_DEPEND(svr4elf, streams, 1, 1, 1); + diff --git a/sys/i386/linux/linux.h b/sys/i386/linux/linux.h index f9c7ee5..46478c0 100644 --- a/sys/i386/linux/linux.h +++ b/sys/i386/linux/linux.h @@ -102,6 +102,12 @@ typedef struct { #define LINUX_CTL_MAXNAME 10 +#define LINUX_AT_SYSINFO 32 +#define LINUX_AT_SYSINFO_EHDR 33 +#define LINUX_AT_COUNT 16 /* Count of used aux entry types. + * Keep this synchronized with + * elf_linux_fixup() code. 
+ */ struct l___sysctl_args { l_int *name; diff --git a/sys/i386/linux/linux_sysvec.c b/sys/i386/linux/linux_sysvec.c index 42365fb..b44fb00 100644 --- a/sys/i386/linux/linux_sysvec.c +++ b/sys/i386/linux/linux_sysvec.c @@ -58,11 +58,13 @@ __FBSDID("$FreeBSD$"); #include #include +#include #include #include #include #include +#include #include #include #include @@ -107,6 +109,10 @@ static void linux_prepsyscall(struct trapframe *tf, int *args, u_int *code, static void linux_sendsig(sig_t catcher, ksiginfo_t *ksi, sigset_t *mask); static void exec_linux_setregs(struct thread *td, u_long entry, u_long stack, u_long ps_strings); +static register_t *linux_copyout_strings(struct image_params *imgp); + +static int linux_szplatform; +const char *linux_platform; extern LIST_HEAD(futex_list, futex) futex_list; extern struct sx futex_sx; @@ -231,22 +237,30 @@ linux_fixup(register_t **stack_base, struct image_params *imgp) **stack_base = (intptr_t)(void *)argv; (*stack_base)--; **stack_base = imgp->args->argc; - return 0; + return (0); } static int elf_linux_fixup(register_t **stack_base, struct image_params *imgp) { + struct proc *p; Elf32_Auxargs *args; + Elf32_Addr *uplatform; + struct ps_strings *arginfo; register_t *pos; KASSERT(curthread->td_proc == imgp->proc, ("unsafe elf_linux_fixup(), should be curproc")); + + p = imgp->proc; + arginfo = (struct ps_strings *)p->p_sysent->sv_psstrings; + uplatform = (Elf32_Addr *)((caddr_t)arginfo - linux_szsigcode - + linux_szplatform); args = (Elf32_Auxargs *)imgp->auxargs; pos = *stack_base + (imgp->args->argc + imgp->args->envc + 2); - if (args->execfd != -1) - AUXARGS_ENTRY(pos, AT_EXECFD, args->execfd); + AUXARGS_ENTRY(pos, LINUX_AT_HWCAP, cpu_feature); + AUXARGS_ENTRY(pos, LINUX_AT_CLKTCK, hz); AUXARGS_ENTRY(pos, AT_PHDR, args->phdr); AUXARGS_ENTRY(pos, AT_PHENT, args->phent); AUXARGS_ENTRY(pos, AT_PHNUM, args->phnum); @@ -254,10 +268,14 @@ elf_linux_fixup(register_t **stack_base, struct image_params *imgp) AUXARGS_ENTRY(pos, 
AT_FLAGS, args->flags); AUXARGS_ENTRY(pos, AT_ENTRY, args->entry); AUXARGS_ENTRY(pos, AT_BASE, args->base); - AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_ucred->cr_ruid); - AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_ucred->cr_svuid); - AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_ucred->cr_rgid); - AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_ucred->cr_svgid); + AUXARGS_ENTRY(pos, LINUX_AT_SECURE, 0); + AUXARGS_ENTRY(pos, LINUX_AT_UID, imgp->proc->p_ucred->cr_ruid); + AUXARGS_ENTRY(pos, LINUX_AT_EUID, imgp->proc->p_ucred->cr_svuid); + AUXARGS_ENTRY(pos, LINUX_AT_GID, imgp->proc->p_ucred->cr_rgid); + AUXARGS_ENTRY(pos, LINUX_AT_EGID, imgp->proc->p_ucred->cr_svgid); + AUXARGS_ENTRY(pos, LINUX_AT_PLATFORM, PTROUT(uplatform)); + if (args->execfd != -1) + AUXARGS_ENTRY(pos, AT_EXECFD, args->execfd); AUXARGS_ENTRY(pos, AT_NULL, 0); free(imgp->auxargs, M_TEMP); @@ -265,9 +283,125 @@ elf_linux_fixup(register_t **stack_base, struct image_params *imgp) (*stack_base)--; **stack_base = (register_t)imgp->args->argc; - return 0; + return (0); +} + +/* + * Copied from kern/kern_exec.c + */ +static register_t * +linux_copyout_strings(struct image_params *imgp) +{ + int argc, envc; + char **vectp; + char *stringp, *destp; + register_t *stack_base; + struct ps_strings *arginfo; + struct proc *p; + + /* + * Calculate string base and vector table pointers. + * Also deal with signal trampoline code for this exec type. + */ + p = imgp->proc; + arginfo = (struct ps_strings *)p->p_sysent->sv_psstrings; + destp = (caddr_t)arginfo - linux_szsigcode - SPARE_USRSPACE - + linux_szplatform - roundup((ARG_MAX - imgp->args->stringspace), + sizeof(char *)); + + /* + * install sigcode + */ + copyout(p->p_sysent->sv_sigcode, ((caddr_t)arginfo - + linux_szsigcode), linux_szsigcode); + + /* + * install LINUX_PLATFORM + */ + copyout(linux_platform, ((caddr_t)arginfo - linux_szsigcode - + linux_szplatform), linux_szplatform); + + /* + * If we have a valid auxargs ptr, prepare some room + * on the stack. 
+ */ + if (imgp->auxargs) { + /* + * 'AT_COUNT*2' is size for the ELF Auxargs data. This is for + * lower compatibility. + */ + imgp->auxarg_size = (imgp->auxarg_size) ? imgp->auxarg_size : + (LINUX_AT_COUNT * 2); + /* + * The '+ 2' is for the null pointers at the end of each of + * the arg and env vector sets,and imgp->auxarg_size is room + * for argument of Runtime loader. + */ + vectp = (char **)(destp - (imgp->args->argc + + imgp->args->envc + 2 + imgp->auxarg_size) * sizeof(char *)); + } else { + /* + * The '+ 2' is for the null pointers at the end of each of + * the arg and env vector sets + */ + vectp = (char **)(destp - (imgp->args->argc + imgp->args->envc + 2) * + sizeof(char *)); + } + + /* + * vectp also becomes our initial stack base + */ + stack_base = (register_t *)vectp; + + stringp = imgp->args->begin_argv; + argc = imgp->args->argc; + envc = imgp->args->envc; + + /* + * Copy out strings - arguments and environment. + */ + copyout(stringp, destp, ARG_MAX - imgp->args->stringspace); + + /* + * Fill in "ps_strings" struct for ps, w, etc. + */ + suword(&arginfo->ps_argvstr, (long)(intptr_t)vectp); + suword(&arginfo->ps_nargvstr, argc); + + /* + * Fill in argument portion of vector table. + */ + for (; argc > 0; --argc) { + suword(vectp++, (long)(intptr_t)destp); + while (*stringp++ != 0) + destp++; + destp++; + } + + /* a null vector table pointer separates the argp's from the envp's */ + suword(vectp++, 0); + + suword(&arginfo->ps_envstr, (long)(intptr_t)vectp); + suword(&arginfo->ps_nenvstr, envc); + + /* + * Fill in environment portion of vector table. 
+ */ + for (; envc > 0; --envc) { + suword(vectp++, (long)(intptr_t)destp); + while (*stringp++ != 0) + destp++; + destp++; + } + + /* end of vector table is a null pointer */ + suword(vectp, 0); + + return (stack_base); } + + extern int _ucodesel, _udatasel; extern unsigned long linux_sznonrtsigcode; @@ -808,6 +942,29 @@ exec_linux_setregs(struct thread *td, u_long entry, fldcw(&control); } +static int +linux_get_machine(const char **dst) +{ + const char *class; + + switch (cpu_class) { + case CPUCLASS_686: + class = "i686"; + break; + case CPUCLASS_586: + class = "i586"; + break; + case CPUCLASS_486: + class = "i486"; + break; + default: + class = "i386"; + } + *dst = class; + return (0); +} + + struct sysentvec linux_sysvec = { .sv_size = LINUX_SYS_MAXSYSCALL, .sv_table = linux_sysent, @@ -863,7 +1020,7 @@ struct sysentvec elf_linux_sysvec = { .sv_usrstack = USRSTACK, .sv_psstrings = PS_STRINGS, .sv_stackprot = VM_PROT_ALL, - .sv_copyout_strings = exec_copyout_strings, + .sv_copyout_strings = linux_copyout_strings, .sv_setregs = exec_linux_setregs, .sv_fixlimit = NULL, .sv_maxssiz = NULL, @@ -929,6 +1086,9 @@ linux_elf_modevent(module_t mod, int type, void *data) NULL, 1000); linux_exec_tag = EVENTHANDLER_REGISTER(process_exec, linux_proc_exec, NULL, 1000); + linux_get_machine(&linux_platform); + linux_szplatform = roundup(strlen(linux_platform) + 1, + sizeof(char *)); if (bootverbose) printf("Linux ELF exec handler installed\n"); } else diff --git a/sys/ia64/include/elf.h b/sys/ia64/include/elf.h index faab8d1..982629c 100644 --- a/sys/ia64/include/elf.h +++ b/sys/ia64/include/elf.h @@ -82,16 +82,8 @@ __ElfType(Auxinfo); #define AT_BASE 7 /* Interpreter's base address. */ #define AT_FLAGS 8 /* Flags (unused for i386). */ #define AT_ENTRY 9 /* Where interpreter should transfer control. */ -/* - * The following non-standard values are used in Linux ELF binaries. - */ -#define AT_NOTELF 10 /* Program is not ELF ?? */ -#define AT_UID 11 /* Real uid. 
*/ -#define AT_EUID 12 /* Effective uid. */ -#define AT_GID 13 /* Real gid. */ -#define AT_EGID 14 /* Effective gid. */ -#define AT_COUNT 15 /* Count of defined aux entry types. */ +#define AT_COUNT 10 /* Count of defined aux entry types. */ /* * Values for e_flags. diff --git a/sys/powerpc/include/elf.h b/sys/powerpc/include/elf.h index 422a86a..d2b8e12 100644 --- a/sys/powerpc/include/elf.h +++ b/sys/powerpc/include/elf.h @@ -80,6 +80,9 @@ __ElfType(Auxinfo); #define AT_COUNT 13 /* Count of defined aux entry types. */ +/* Used in John Polstra's testbed stuff. */ +#define AT_DEBUG 14 /* Debugging level. */ + /* * Relocation types. */ diff --git a/sys/sparc64/include/elf.h b/sys/sparc64/include/elf.h index 108ade1..c826197 100644 --- a/sys/sparc64/include/elf.h +++ b/sys/sparc64/include/elf.h @@ -78,16 +78,8 @@ __ElfType(Auxinfo); #define AT_BASE 7 /* Interpreter's base address. */ #define AT_FLAGS 8 /* Flags (unused). */ #define AT_ENTRY 9 /* Where interpreter should transfer control. */ -/* - * The following non-standard values are used in Linux ELF binaries. - */ -#define AT_NOTELF 10 /* Program is not ELF ?? */ -#define AT_UID 11 /* Real uid. */ -#define AT_EUID 12 /* Effective uid. */ -#define AT_GID 13 /* Real gid. */ -#define AT_EGID 14 /* Effective gid. */ -#define AT_COUNT 15 /* Count of defined aux entry types. */ +#define AT_COUNT 10 /* Count of defined aux entry types. */ /* Define "machine" characteristics */ #if __ELF_WORD_SIZE == 32 diff --git a/sys/sun4v/include/elf.h b/sys/sun4v/include/elf.h index 108ade1..c826197 100644 --- a/sys/sun4v/include/elf.h +++ b/sys/sun4v/include/elf.h @@ -78,16 +78,8 @@ __ElfType(Auxinfo); #define AT_BASE 7 /* Interpreter's base address. */ #define AT_FLAGS 8 /* Flags (unused). */ #define AT_ENTRY 9 /* Where interpreter should transfer control. */ -/* - * The following non-standard values are used in Linux ELF binaries. - */ -#define AT_NOTELF 10 /* Program is not ELF ?? */ -#define AT_UID 11 /* Real uid. 
*/ -#define AT_EUID 12 /* Effective uid. */ -#define AT_GID 13 /* Real gid. */ -#define AT_EGID 14 /* Effective gid. */ -#define AT_COUNT 15 /* Count of defined aux entry types. */ +#define AT_COUNT 10 /* Count of defined aux entry types. */ /* Define "machine" characteristics */ #if __ELF_WORD_SIZE == 32 -- Have fun! chd From imp at bsdimp.com Wed Dec 17 13:39:42 2008 From: imp at bsdimp.com (M. Warner Losh) Date: Wed Dec 17 13:39:49 2008 Subject: Removing some cruft... In-Reply-To: <20081217172047.GA2884@dchagin.dialup.corbina.ru> References: <20081216212746.GA28834@freebsd.org> <20081216.161638.644659879.imp@bsdimp.com> <20081217172047.GA2884@dchagin.dialup.corbina.ru> Message-ID: <20081217.143702.-1286520955.imp@bsdimp.com> In message: <20081217172047.GA2884@dchagin.dialup.corbina.ru> Chagin Dmitry writes: : Hi, I am ready to offer more radical patch :) : : Move all Linux aux entry types to a new file compat/linux/linux_elf.h : Add two new aux entries which improve work of futexes. : : Please review. thnx! : : diff --git a/sys/amd64/include/elf.h b/sys/amd64/include/elf.h : index a4c7f79..3c2cd20 100644 : --- a/sys/amd64/include/elf.h : +++ b/sys/amd64/include/elf.h : @@ -81,16 +81,8 @@ __ElfType(Auxinfo); : #define AT_BASE 7 /* Interpreter's base address. */ : #define AT_FLAGS 8 /* Flags (unused for i386). */ : #define AT_ENTRY 9 /* Where interpreter should transfer control. */ : -/* : - * The following non-standard values are used in Linux ELF binaries. : - */ : -#define AT_NOTELF 10 /* Program is not ELF ?? */ : -#define AT_UID 11 /* Real uid. */ : -#define AT_EUID 12 /* Effective uid. */ : -#define AT_GID 13 /* Real gid. */ : -#define AT_EGID 14 /* Effective gid. */ : : -#define AT_COUNT 15 /* Count of defined aux entry types. */ It turns out that these are not non-standard Linux values. SYSV also uses these, and the MIPS ABI defines them as well. Let's leave them in place. 
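For readers unfamiliar with how these entries are consumed, here is a minimal sketch of walking an auxiliary vector to look up one of the AT_ values discussed above. The struct `aux_entry` and the helper `aux_lookup` are hypothetical names made up for illustration (they stand in for Elf32_Auxinfo and whatever the runtime loader really does). The numeric values 11 to 14 are taken from the quoted headers, and AT_NULL is assumed to be 0 as in the ELF ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entry type numbers as they appear in the quoted headers. */
#define AT_NULL	0	/* Terminates the vector (ELF ABI value). */
#define AT_UID	11	/* Real uid. */
#define AT_EUID	12	/* Effective uid. */
#define AT_GID	13	/* Real gid. */
#define AT_EGID	14	/* Effective gid. */

/* Hypothetical stand-in for Elf32_Auxinfo: one (type, value) pair. */
struct aux_entry {
	uint32_t a_type;
	uint32_t a_val;
};

/*
 * Hypothetical helper: scan the vector until the AT_NULL terminator.
 * Returns 1 and stores the value in *val if 'type' is present,
 * otherwise returns 0 and leaves *val untouched.
 */
static int
aux_lookup(const struct aux_entry *auxv, uint32_t type, uint32_t *val)
{
	assert(auxv != NULL && val != NULL);
	for (; auxv->a_type != AT_NULL; auxv++) {
		if (auxv->a_type == type) {
			*val = auxv->a_val;
			return (1);
		}
	}
	return (0);
}
```

Because consumers match on the numeric type, two ABIs that assign different meanings to the same number cannot share one set of AT_ definitions, which is the reason the location of these defines matters.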
It may make sense to move these AT_ definitions to a more central location, however, since I think they are the same on all platforms. A quick grep of the Binutils directory seems to support this, but I only have the MIPS ABI specs. I also think that we should be exporting them in the normal path, but that may open up a can of worms, so it needs to be tested/reviewed carefully before we pull the trigger. Warner From jroberson at jroberson.net Wed Dec 17 13:58:51 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Wed Dec 17 13:58:57 2008 Subject: UMA & mbuf cache utilization. In-Reply-To: <5c0ff6a70812162349n38395f84o45020f334cd09853@mail.gmail.com> References: <20081209155714.K960@desktop> <5c0ff6a70812162349n38395f84o45020f334cd09853@mail.gmail.com> Message-ID: <20081217115632.V960@desktop> On Tue, 16 Dec 2008, Paul Saab wrote: > So far testing has shown in a pure transmit test, that this doesn't hurt > performance at all. It would appear that our stack at present is not concurrent enough for this optimization to help as it does elsewhere. Rather than needlessly complicate things at this time I'm going to go ahead and commit the uma portion so that this work isn't lost and leave the mbuf changes until such time that they are advantageous. This will allow for further experimentation as well. Thanks, Jeff > > On Tue, Dec 9, 2008 at 6:22 PM, Jeff Roberson wrote: > >> Hello, >> >> Nokia has graciously allowed me to release a patch which I developed to >> improve general mbuf and cluster cache behavior. This is based on others' >> observations that due to simple alignment at 2k and 256k we achieve a poor >> cache distribution for the header area of packets and the most heavily used >> mbuf header fields. In addition, modern machines stripe memory access >> across several memories and even memory controllers. Accessing heavily >> aligned locations such as these can also create load imbalances among >> memories.
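The alignment problem described in the quoted paragraph can be illustrated with a small model. This is only a sketch under stated assumptions: a 64-byte cache line and a hypothetical cache with 64 sets, with the made-up helpers `cache_set` and `distinct_sets`; it is not code from the patch. It counts how many different cache sets the first line of each buffer lands on when buffers are laid out at a 2k stride, and again when each buffer is padded by one extra cache line, which is the kind of staggering the message goes on to describe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_SIZE	64	/* assumed cache line size in bytes */
#define NUM_SETS	64	/* assumed number of sets (hypothetical cache) */

/* Set index that the byte at offset 'addr' maps to in this model. */
static unsigned
cache_set(size_t addr)
{
	return ((unsigned)((addr / LINE_SIZE) % NUM_SETS));
}

/*
 * Count the distinct sets hit by the first line (the header area)
 * of 'count' buffers placed 'stride' bytes apart, starting at 0.
 */
static unsigned
distinct_sets(size_t stride, unsigned count)
{
	unsigned char seen[NUM_SETS];
	unsigned i, n = 0;

	assert(stride > 0);
	memset(seen, 0, sizeof(seen));
	for (i = 0; i < count; i++) {
		unsigned s = cache_set((size_t)i * stride);

		if (!seen[s]) {
			seen[s] = 1;
			n++;
		}
	}
	return (n);
}
```

In this model, `distinct_sets(2048, 64)` touches only 2 sets, while `distinct_sets(2048 + 64, 64)` spreads the same 64 headers over all 64 sets. Real caches differ in geometry and indexing, so the numbers are illustrative only.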
>> >> To solve this problem I have added two new features to UMA. The first is >> the zone flag UMA_ZONE_CACHESPREAD. This flag modifies the meaning of the >> alignment field such that start addresses are staggered by at least align + >> 1 bytes. In the case of clusters and mbufs this means adding >> uma_cache_align + 1 bytes to the amount of storage allocated. This creates >> a certain constant amount of waste, 3% and 12% respectively. It also means >> we must use contiguous physical and virtual memory consisting of several >> pages to efficiently use the memory and land on as many cache lines as >> possible. >> >> Because contiguous physical memory is not always available, the allocator >> had to have a fallback mechanism. We don't simply want to have all mbuf >> allocations check two zones as once we deplete available contiguous memory >> the check on the first zone will always fail using the most expensive code >> path. >> >> To resolve this issue, I added the ability for secondary zones to stack on >> top of multiple primary zones. Secondary zones are zones which get their >> storage from another zone but handle their own caching, ctors, dtors, etc. >> By adding this feature a secondary zone can be created that can allocate >> either from the contiguous memory pool or the non-contiguous single-page >> pool depending on availability. It is also much faster to fail between them >> deep in the allocator because it is only required when we exhaust the >> already available mbuf memory. >> >> For mbufs and clusters there are now three zones each. A contigmalloc >> backed zone, a single-page allocator zone, and a secondary zone with the >> original zone_mbuf or zone_clust name. The packet zone also takes from both >> available mbuf zones. The individual backend zones are not exposed outside >> of kern_mbuf.c. >> >> Currently, each backend zone can have its own limit. The secondary zone >> only blocks when both are full.
Statistics-wise the limit should be reported >> as the sum of the backend limits, however, that isn't presently done. The >> secondary zone cannot have its own limit independent of the backends at >> this time. I'm not sure if that's valuable or not. >> >> I have test results from Nokia which show a dramatic improvement in several >> workloads but which I am probably not at liberty to discuss. I'm in the >> process of convincing Kip to help me get some benchmark data on our stack. >> >> Also as part of the patch I renamed a few functions since many were >> non-obvious and grew new keg abstractions to tidy things up a bit. I >> suspect those of you with UMA experience (robert, bosko) will find the >> renaming a welcome improvement. >> >> The patch is available at: >> http://people.freebsd.org/~jeff/mbuf_contig.diff >> >> I would love to hear any feedback you may have. I have been developing >> this and testing various versions off and on for months, however, this is a >> fresh port to current and it is a little green so should be considered >> experimental. >> >> In particular, I'm most nervous about how the vm will respond to new >> pressure on contig physical pages. I'm also interested in hearing from >> embedded/limited memory people about how we might want to limit or tune >> this. >> >> Thanks, >> Jeff >> _______________________________________________ >> freebsd-arch@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-arch >> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" >> >> > From andy.lavr at reactor-xg.kiev.ua Wed Dec 17 22:19:11 2008 From: andy.lavr at reactor-xg.kiev.ua (Andrei V. Lavreniyuk) Date: Wed Dec 17 22:19:18 2008 Subject: UMA & mbuf cache utilization. Message-ID: <4949E7E1.8050802@reactor-xg.kiev.ua> Hi!
----------------------------- My system: # uname -a FreeBSD datacenter.technica-03.local 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Wed Dec 17 17:39:46 EET 2008 root@datacenter.technica-03.local:/usr/obj/usr/src/sys/SMP-DATACENTER i386 + SSP +ZFS + mbuf_contig.diff --------------------------- The system works 1-2 hours and hangs up. Without a patch mbuf works stably. What information is yet needed for my hand? -- Best regards, Andrei V. Lavreniyuk. From pluknet at gmail.com Thu Dec 18 05:39:19 2008 From: pluknet at gmail.com (pluknet) Date: Thu Dec 18 05:39:26 2008 Subject: UMA & mbuf cache utilization. In-Reply-To: <4949E7E1.8050802@reactor-xg.kiev.ua> References: <4949E7E1.8050802@reactor-xg.kiev.ua> Message-ID: 2008/12/18 Andrei V. Lavreniyuk : > Hi! > > > > ----------------------------- > My system: > > > # uname -a > FreeBSD datacenter.technica-03.local 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE > #0: Wed Dec 17 17:39:46 EET 2008 > root@datacenter.technica-03.local:/usr/obj/usr/src/sys/SMP-DATACENTER i386 > > + SSP > > +ZFS > > + mbuf_contig.diff > --------------------------- > > > > The system works 1-2 hours and hangs up. Without a patch mbuf works stably. > > What information is yet needed for my hand? > See http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html First, you need these options in the kernel config to extract useful information. options KDB # Enable kernel debugger support. options DDB # Support DDB. options GDB # Support remote GDB.
options INVARIANTS # Enable calls of extra sanity checking options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS options WITNESS # Enable checks to detect deadlocks and cycles options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed -- wbr, pluknet From jroberson at jroberson.net Thu Dec 18 22:47:30 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Thu Dec 18 22:47:38 2008 Subject: UMA & mbuf cache utilization.
In-Reply-To: <4949E7E1.8050802@reactor-xg.kiev.ua> References: <4949E7E1.8050802@reactor-xg.kiev.ua> Message-ID: <20081218201934.Q960@desktop> On Thu, 18 Dec 2008, Andrei V. Lavreniyuk wrote: > Hi! > > > > ----------------------------- > My system: > > > # uname -a > FreeBSD datacenter.technica-03.local 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE > #0: Wed Dec 17 17:39:46 EET 2008 > root@datacenter.technica-03.local:/usr/obj/usr/src/sys/SMP-DATACENTER i386 > > + SSP > > +ZFS > > + mbuf_contig.diff > --------------------------- > > > > The system works 1-2 hours and hangs up. Without a patch mbuf works stably. > > What information is yet needed for my hand? Did you apply this to -CURRENT or 7.1? It is not safe for use in 7.1. There is also currently a problem with ipx that must be fixed. Thanks, Jeff > > > > > -- > Best regards, Andrei V. Lavreniyuk. > From bugmaster at FreeBSD.org Mon Dec 22 03:06:48 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Dec 22 03:07:29 2008 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200812221106.mBMB6lNe060502@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From bugmaster at FreeBSD.org Mon Dec 29 03:06:51 2008 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Dec 29 03:07:25 2008 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200812291106.mBTB6pBd024372@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). 
The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From Andre.Albsmeier at siemens.com Mon Dec 29 21:52:03 2008 From: Andre.Albsmeier at siemens.com (Andre Albsmeier) Date: Mon Dec 29 21:52:10 2008 Subject: Two drivers, one physical device: How to deal with that? Message-ID: <20081229212020.GA1809@curry.mchp.siemens.de> Hello, I have written a driver which attaches to the host bridge in order to periodically read the appropriate registers and inform the user about ECC errors (ECC-Monitor). Now I have run across a mainboard where the host bridge is already taken by the agp driver. Of course, I can detach the agp driver and attach myself and everything is working, but what if someone does not want to lose the agp functionality? How does one deal with the case when two separate drivers have to access the same device (the host bridge in my case)? I assume the correct way would be to join the AGP and ECC functionality in one driver but maybe there are other tricks I am not aware of? Thanks, -Andre From jroberson at jroberson.net Tue Dec 30 00:38:19 2008 From: jroberson at jroberson.net (Jeff Roberson) Date: Tue Dec 30 00:38:25 2008 Subject: Two drivers, one physical device: How to deal with that? In-Reply-To: <20081229212020.GA1809@curry.mchp.siemens.de> References: <20081229212020.GA1809@curry.mchp.siemens.de> Message-ID: <20081229143221.X1076@desktop> On Mon, 29 Dec 2008, Andre Albsmeier wrote: > Hello, > > I have written a driver which attaches to the host bridge in > order to periodically read the appropriate registers and > inform the user about ECC errors (ECC-Monitor).
Now I have > run across a mainboard where the host bridge is already > taken by the agp driver. Of course, I can detach the agp > driver and attach myself and everything is working, but > what if someone does not want to lose the agp > functionality? > > How does one deal with the case when two separate drivers > have to access the same device (the host bridge in my case)? > > I assume the correct way would be to join the AGP and > ECC functionality in one driver but maybe there are other > tricks I am not aware of? Well I don't think it would be correct to merge two conceptually separate drivers into one just to share the same device. It sounds like the right solution is to make a generic layer that attaches to the host bridge and arbitrates access to it. Then allow other devices to find and communicate with this generic layer. For the host bridge this doesn't have to be particularly fancy. I am curious; how do you test the ECC functionality? Is there a way to induce an error? Thanks, Jeff > > Thanks, > > -Andre > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From Andre.Albsmeier at siemens.com Tue Dec 30 14:23:56 2008 From: Andre.Albsmeier at siemens.com (Andre Albsmeier) Date: Tue Dec 30 14:24:02 2008 Subject: Two drivers, one physical device: How to deal with that? In-Reply-To: <20081229143221.X1076@desktop> References: <20081229212020.GA1809@curry.mchp.siemens.de> <20081229143221.X1076@desktop> Message-ID: <20081230135216.GA2182@curry.mchp.siemens.de> On Mon, 29-Dec-2008 at 14:35:21 -1000, Jeff Roberson wrote: > On Mon, 29 Dec 2008, Andre Albsmeier wrote: > > > Hello, > > > > I have written a driver which attaches to the host bridge in > > order to periodically read the appropriate registers and > > inform the user about ECC errors (ECC-Monitor).
Now I have > > run across a mainboard where the host bridge is already > > taken by the agp driver. Of course, I can detach the agp > > driver and attach myself and everything is working, but > > what if someone does not want to lose the agp > > functionality? > > > > How does one deal with the case when two separate drivers > > have to access the same device (the host bridge in my case)? > > > > I assume the correct way would be to join the AGP and > > ECC functionality in one driver but maybe there are other > > tricks I am not aware of? > > Well I don't think it would be correct to merge two conceptually separate > drivers into one just to share the same device. It sounds like the right > solution is to make a generic layer that attaches to the host bridge and > arbitrates access to it. Then allow other devices to find and communicate I see, yes, that sounds like a good idea. I also didn't like the idea of uniting the two functionalities. However, I assume my kernel programming skills are not good enough to implement something like this ;-) > with this generic layer. For the host bridge this doesn't have to be > particularly fancy. > > I am curious; how do you test the ECC functionality? Is there a way to > induce an error? The most common method is to lower the voltage and heat up the DIMMs. Some chips react rather quickly, others nearly have to be melted down ;-). Another possibility is to use a not too weak radioactive source (an old Radiomir watch is not enough) to bomb the RAMs with betas and gammas (this is of course not for everybody ;-)). But the easiest and safest way is to buy an Asus P5W board and enable the "Quick Boot" option in the BIOS. With this setting, lots of ECC errors are produced in a short time. The rate goes down as the uptime rises. I don't know why this happens but I assume the chipset reads memory cells which have never been written to and therefore the data is inconsistent.
As soon as you disable the "Quick Boot" option (which implies a memory writing test being performed by the BIOS) the errors go away. You can then even enable "Quick Boot" again, as long as you don't switch off the power... Thanks, -Andre -- GNU is Not Unix / Linux Is Not UniX
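The generic layer suggested earlier in this thread (one owner attaches to the host bridge, and the AGP and ECC code both go through it) might look roughly like the userland model below. This is only a sketch under stated assumptions: every name here (`struct hostb`, `hostb_init`, `hostb_read`, `hostb_modify`, `HOSTB_NREGS`) is hypothetical, the register array merely stands in for the bridge's config space, and a pthread mutex stands in for whatever locking a real kernel driver would use. It is not FreeBSD's newbus API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define HOSTB_NREGS 64  /* hypothetical number of 32-bit registers */

/* The single owner of the host bridge; clients never touch regs directly. */
struct hostb {
    pthread_mutex_t lock;         /* serializes every register access */
    uint32_t regs[HOSTB_NREGS];   /* stand-in for the bridge's config space */
};

/* Set up the lock and zero the simulated registers. */
void hostb_init(struct hostb *hb)
{
    assert(hb != NULL);
    pthread_mutex_init(&hb->lock, NULL);
    for (size_t i = 0; i < HOSTB_NREGS; i++)
        hb->regs[i] = 0;
}

/* Read one register on behalf of a client; returns 0 on success, -1 on
 * a bad register index or NULL output pointer. */
int hostb_read(struct hostb *hb, size_t reg, uint32_t *val)
{
    if (reg >= HOSTB_NREGS || val == NULL)
        return -1;
    pthread_mutex_lock(&hb->lock);
    *val = hb->regs[reg];
    pthread_mutex_unlock(&hb->lock);
    return 0;
}

/* Read-modify-write under one lock hold, so that two clients updating
 * different bits of the same register cannot interleave and lose an update. */
int hostb_modify(struct hostb *hb, size_t reg, uint32_t clear, uint32_t set)
{
    if (reg >= HOSTB_NREGS)
        return -1;
    pthread_mutex_lock(&hb->lock);
    hb->regs[reg] = (hb->regs[reg] & ~clear) | set;
    pthread_mutex_unlock(&hb->lock);
    return 0;
}
```

In this model an ECC monitor would periodically call `hostb_read()` on the error-status registers while the AGP code uses `hostb_modify()` for its own bits; neither needs to know about the other, which is the point of not merging the two drivers.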