From bugmaster at FreeBSD.org Mon May 4 11:07:41 2009
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon May 4 11:08:01 2009
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200905041107.n44B7eeU098367@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From avg at icyb.net.ua Wed May 6 16:38:33 2009
From: avg at icyb.net.ua (Andriy Gapon)
Date: Wed May 6 16:38:40 2009
Subject: shutdown_nice during boot
Message-ID: <4A01B9A3.2030806@icyb.net.ua>

First, let me simply paste the whole body of shutdown_nice function:

void
shutdown_nice(int howto)
{

	shutdown_howto = howto;

	/* Send a signal to init(8) and have it shutdown the world */
	if (initproc != NULL) {
		PROC_LOCK(initproc);
		psignal(initproc, SIGINT);
		PROC_UNLOCK(initproc);
	} else {
		/* No init(8) running, so simply reboot */
		boot(RB_NOSYNC);
	}
	return;
}

Now, initproc is initialized quite early during boot to make sure that PID
of 1 is reserved for init. Actual init process is executed at the very end
of boot. Right after init is forked it ignores all signals because this is
how proc0 is set up. Only when it is actually executed it explicitly
re-enables signals and installs certain handlers.

Because of the above there is a time frame where initproc != NULL but any
signal for init gets ignored.

There are not many places where shutdown_nice can be called during that
time frame, but I think that there are some.
Very unlikely, but theoretically possible situation: a system starts
overheating immediately after power on, acpi_tz driver detects this and
calls shutdown_nice at the wrong time, the system keeps booting up and
eventually melts down.

It may be possible to make sure that shutdown_nice is never called at the
wrong time by tweaking all the places where it's used.
But maybe there is a way to make shutdown_nice behave in a usual way even
during that inconvenient timeframe.
It's possible to re-enable SIGINT right after init is forked, but this way
it will be delivered to init before it installs signal handlers and thus
init would simply terminate resulting in "Going nowhere without my init!"
panic.

Please share your ideas.
Thank you!

--
Andriy Gapon

From avg at icyb.net.ua Thu May 7 08:03:56 2009
From: avg at icyb.net.ua (Andriy Gapon)
Date: Thu May 7 08:04:03 2009
Subject: shutdown_nice during boot
In-Reply-To: <20090507080048.GA64648@server.vk2pj.dyndns.org>
References: <4A01B9A3.2030806@icyb.net.ua> <20090507080048.GA64648@server.vk2pj.dyndns.org>
Message-ID: <4A0295E0.4020609@icyb.net.ua>

on 07/05/2009 11:00 peterjeremy@optushome.com.au said the following:
> On 2009-May-06 19:24:03 +0300, Andriy Gapon wrote:
>> It's possible to re-enable SIGINT right after init is forked, but
>> this way it will be delivered to init before it installs signal
>> handlers and thus init would simply terminate resulting in "Going
>> nowhere without my init!" panic.
>
> The best option would seem to be for init(8) to call sigprocmask(2)
> immediately it starts up and block all signals.

But a signal still can be delivered after init is exec-ed and before
sigprocmask(2) is called or not?

> This causes signals
> to be deferred until they are unblocked. Once it sorts out its signal
> handlers, it can then unblock the signals - at which point it will
> receive any signals that were sent in the interim.
>
> Note that I haven't looked into init(8) to see if there are other
> reasons why this approach would not be appropriate
>

--
Andriy Gapon

From peterjeremy at optushome.com.au Thu May 7 12:16:44 2009
From: peterjeremy at optushome.com.au (peterjeremy@optushome.com.au)
Date: Thu May 7 12:16:51 2009
Subject: shutdown_nice during boot
In-Reply-To: <4A01B9A3.2030806@icyb.net.ua>
References: <4A01B9A3.2030806@icyb.net.ua>
Message-ID: <20090507080048.GA64648@server.vk2pj.dyndns.org>

On 2009-May-06 19:24:03 +0300, Andriy Gapon wrote:
>It's possible to re-enable SIGINT right after init is forked, but
>this way it will be delivered to init before it installs signal
>handlers and thus init would simply terminate resulting in "Going
>nowhere without my init!" panic.

The best option would seem to be for init(8) to call sigprocmask(2)
immediately it starts up and block all signals. This causes signals
to be deferred until they are unblocked. Once it sorts out its signal
handlers, it can then unblock the signals - at which point it will
receive any signals that were sent in the interim.

Note that I haven't looked into init(8) to see if there are other
reasons why this approach would not be appropriate

--
Peter Jeremy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090507/a19ab6f1/attachment.pgp

From scholz at scriptolutions.com Sat May 9 16:48:47 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Sat May 9 16:48:54 2009
Subject: Are named posix semaphores not implemented?
Message-ID: <976698487.20090509182307@scriptolutions.com>

Hello,

i tried to port a program using PCBSD based on FreeBSD 7.1
and the small test program

#include
#include
#include

int main() {
  sem_t* s = sem_open("foobar", O_CREAT|O_EXCL);
  if (s == SEM_FAILED) perror("sem_open");
}

raises a "bad system call 12" signal
But from the manpage of sem_open tells me that it should
be there since FreeBSD 5.0?

Please don't tell me that i have to rewrite the code.

--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From gary.jennejohn at freenet.de Sat May 9 17:42:36 2009
From: gary.jennejohn at freenet.de (Gary Jennejohn)
Date: Sat May 9 17:42:50 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To: <976698487.20090509182307@scriptolutions.com>
References: <976698487.20090509182307@scriptolutions.com>
Message-ID: <20090509194231.57543f41@lap.jennejohn.org>

On Sat, 9 May 2009 18:23:07 +0200
Lothar Scholz wrote:

> Hello,
>
> i tried to port a program using PCBSD based on FreeBSD 7.1
> and the small test program
>
> #include
> #include
> #include
>
> int main() {
>   sem_t* s = sem_open("foobar", O_CREAT|O_EXCL);
>   if (s == SEM_FAILED) perror("sem_open");
> }
>
> raises a "bad system call 12" signal
> But from the manpage of sem_open tells me that it should
> be there since FreeBSD 5.0?
>
> Please don't tell me that i have to rewrite the code.
>

According to the man page name MUST start with '/'.

---
Gary Jennejohn

From scholz at scriptolutions.com Sat May 9 18:01:32 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Sat May 9 18:01:38 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To: <20090509194231.57543f41@lap.jennejohn.org>
References: <976698487.20090509182307@scriptolutions.com> <20090509194231.57543f41@lap.jennejohn.org>
Message-ID: <892763905.20090509195954@scriptolutions.com>

Hello Gary,

GJ> According to the man page name MUST start with '/'.

Thanks, but this is not the problem.

--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From ed at 80386.nl Sat May 9 18:18:44 2009
From: ed at 80386.nl (Ed Schouten)
Date: Sat May 9 18:18:50 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To: <976698487.20090509182307@scriptolutions.com>
References: <976698487.20090509182307@scriptolutions.com>
Message-ID: <20090509181842.GB58540@hoeg.nl>

* Lothar Scholz wrote:
> raises a "bad system call 12" signal

From sem(4):

	options P1003_1B_SEMAPHORES

--
 Ed Schouten
 WWW: http://80386.nl/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090509/853cd200/attachment.pgp

From pluknet at gmail.com Sat May 9 18:20:53 2009
From: pluknet at gmail.com (pluknet)
Date: Sat May 9 18:21:00 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To: <976698487.20090509182307@scriptolutions.com>
References: <976698487.20090509182307@scriptolutions.com>
Message-ID:

2009/5/9 Lothar Scholz :
> Hello,
>
> i tried to port a program using PCBSD based on FreeBSD 7.1
> and the small test program
>
> #include
> #include
> #include
>
> int main() {
>   sem_t* s = sem_open("foobar", O_CREAT|O_EXCL);
>   if (s == SEM_FAILED) perror("sem_open");
> }
>
> raises a "bad system call 12" signal
> But from the manpage of sem_open tells me that it should
> be there since FreeBSD 5.0?
>
> Please don't tell me that i have to rewrite the code.
>

First, you should have sem(4) capacity enabled in kernel
(via kldload or statically built). It seems you haven't.
Second, as already mentioned and per manpage, you must specify
an abs. path in the first arg.

--
wbr,
pluknet

From jille at quis.cx Sat May 9 18:25:29 2009
From: jille at quis.cx (Jille Timmermans)
Date: Sat May 9 18:25:38 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To: <976698487.20090509182307@scriptolutions.com>
References: <976698487.20090509182307@scriptolutions.com>
Message-ID: <4A05C6CA.1080104@quis.cx>

Lothar Scholz wrote:
> Hello,
>
> i tried to port a program using PCBSD based on FreeBSD 7.1
> and the small test program
>
> #include
> #include
> #include
>
> int main() {
>   sem_t* s = sem_open("foobar", O_CREAT|O_EXCL);
>   if (s == SEM_FAILED) perror("sem_open");
> }
>
> raises a "bad system call 12" signal
> But from the manpage of sem_open tells me that it should
> be there since FreeBSD 5.0?
>
> Please don't tell me that i have to rewrite the code.
>
>

Have you removed "options SYSVSEM" from your kernel configuration ?

-- Jille

From scholz at scriptolutions.com Sat May 9 18:33:19 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Sat May 9 18:33:26 2009
Subject: Posix shared memory problem
Message-ID: <588815840.20090509203115@scriptolutions.com>

Hello,

Thanks for solving the posix semaphore problem. But with shared memory
there comes the next issue:

int main() {
  int m;
  shm_unlink("/barfoo");
  m = shm_open("/barfoo", O_RDWR|O_CREAT|O_EXCL, S_IRWXU);
  if (m == 1) perror("shm_open error");
}

i always get permission denied error, and i tried many values
for flags and mode? I can only get this working as root but not
as a normal user.

--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From kostikbel at gmail.com Sat May 9 18:36:06 2009
From: kostikbel at gmail.com (Kostik Belousov)
Date: Sat May 9 18:36:13 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To: <892763905.20090509195954@scriptolutions.com>
References: <976698487.20090509182307@scriptolutions.com> <20090509194231.57543f41@lap.jennejohn.org> <892763905.20090509195954@scriptolutions.com>
Message-ID: <20090509180930.GN1948@deviant.kiev.zoral.com.ua>

On Sat, May 09, 2009 at 07:59:54PM +0200, Lothar Scholz wrote:
> Hello Gary,
>
>
> GJ> According to the man page name MUST start with '/'.
>
> Thanks, but this is not the problem.

Do "kldload sem".
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090509/ba474a0c/attachment.pgp

From scholz at scriptolutions.com Sat May 9 18:38:16 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Sat May 9 18:38:23 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To:
References: <976698487.20090509182307@scriptolutions.com>
Message-ID: <123863765.20090509203610@scriptolutions.com>

Hello pluknet,

p> First, you should have sem(4) capacity enabled in kernel
p> (via kldload or statically built). It seems you haven't.

Yes - but it was a total fresh PC-BSD build.

I must say that i really don't like this start off with anything
disabled. They will never get a good desktop os if the user can't
run simple programs without the need to learn about kernel modules.

Well this is the wrong list to discuss this anyway.

--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From jilles at stack.nl Sat May 9 20:07:27 2009
From: jilles at stack.nl (Jilles Tjoelker)
Date: Sat May 9 20:07:33 2009
Subject: Posix shared memory problem
In-Reply-To: <588815840.20090509203115@scriptolutions.com>
References: <588815840.20090509203115@scriptolutions.com>
Message-ID: <20090509200724.GA25714@stack.nl>

On Sat, May 09, 2009 at 08:31:15PM +0200, Lothar Scholz wrote:
> Thanks for solving the posix semaphore problem. But with shared memory
> there comes the next issue:

> int main() {
>   int m;
>   shm_unlink("/barfoo");
>   m = shm_open("/barfoo", O_RDWR|O_CREAT|O_EXCL, S_IRWXU);
>   if (m == 1) perror("shm_open error");
> }

> i always get permission denied error, and i tried many values
> for flags and mode? I can only get this working as root but not
> as a normal user.

shm_open/shm_unlink refer to the filesystem; they are fairly direct
wrappers around open and unlink.

POSIX suggests making the pathname a configuration option;
alternatively, using a directory for temporary files such as /tmp could
work.
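If shm_open here really is little more than open(2), as described above,
the /tmp suggestion can be sketched with plain open(2) and mmap(2) on a file
in a configurable directory. This is only an illustration under that
assumption; shared_file_map and its parameters are hypothetical names, not
an existing API. Note that failure is reported as -1, so the check is
against -1 rather than 1:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map len bytes of the regular file dir/name as shared memory, creating
 * the file if needed. Returns the mapping, or NULL on any failure.
 * Both dir and name are chosen by the caller (e.g. from configuration).
 */
void *
shared_file_map(const char *dir, const char *name, size_t len)
{
	char path[1024];
	void *p;
	int fd, n;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return (NULL);

	fd = open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
	if (fd == -1)			/* open(2) fails with -1, not 1 */
		return (NULL);
	if (ftruncate(fd, (off_t)len) == -1) {
		close(fd);
		return (NULL);
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);			/* the mapping outlives the descriptor */
	return (p == MAP_FAILED ? NULL : p);
}
```

Two callers that pass the same directory and name share the same bytes;
the directory only has to be writable by the user running the program.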
--
Jilles Tjoelker

From ed at 80386.nl Sat May 9 20:29:51 2009
From: ed at 80386.nl (Ed Schouten)
Date: Sat May 9 20:29:58 2009
Subject: Posix shared memory problem
In-Reply-To: <20090509200724.GA25714@stack.nl>
References: <588815840.20090509203115@scriptolutions.com> <20090509200724.GA25714@stack.nl>
Message-ID: <20090509202949.GD58540@hoeg.nl>

Hi Jilles,

* Jilles Tjoelker wrote:
> shm_open/shm_unlink refer to the filesystem; they are fairly direct
> wrappers around open and unlink.

Achtung: this is no longer the case in CURRENT if I remember correctly.

--
 Ed Schouten
 WWW: http://80386.nl/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090509/9c7790c9/attachment.pgp

From scholz at scriptolutions.com Sun May 10 04:53:25 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Sun May 10 04:53:32 2009
Subject: Posix shared memory problem
In-Reply-To: <20090509200724.GA25714@stack.nl>
References: <588815840.20090509203115@scriptolutions.com> <20090509200724.GA25714@stack.nl>
Message-ID: <19461540.20090510064924@scriptolutions.com>

Hello Jilles,

Saturday, May 9, 2009, 10:07:24 PM, you wrote:

JT> On Sat, May 09, 2009 at 08:31:15PM +0200, Lothar Scholz wrote:
>> Thanks for solving the posix semaphore problem. But with shared memory
>> there comes the next issue:

>> int main() {
>>   int m;
>>   shm_unlink("/barfoo");
>>   m = shm_open("/barfoo", O_RDWR|O_CREAT|O_EXCL, S_IRWXU);
>>   if (m == 1) perror("shm_open error");
>> }

>> i always get permission denied error, and i tried many values
>> for flags and mode? I can only get this working as root but not
>> as a normal user.

JT> shm_open/shm_unlink refer to the filesystem; they are fairly direct
JT> wrappers around open and unlink.

Question is where are they stored?
In Linux it is "/dev/shm".
In my case it looks like the directory for shm_open files has
some wrong access rights so that a normal user can't generate it.

JT> POSIX suggests making the pathname a configuration option;
JT> alternatively, using a directory for temporary files such as /tmp could
JT> work.

I will try this hack soon if nobody comes up with a solution.

--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From wollman at hergotha.csail.mit.edu Sun May 10 05:13:04 2009
From: wollman at hergotha.csail.mit.edu (Garrett Wollman)
Date: Sun May 10 05:13:11 2009
Subject: Posix shared memory problem
In-Reply-To:
References:
Message-ID: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu>

In scholz@scriptolutions.com writes:

>JT> shm_open/shm_unlink refer to the filesystem; they are fairly direct
>JT> wrappers around open and unlink.
>
>Question is where are they stored?

In the fileststem, in the path that you specify. They are just
ordinary files.

There was some thought that this was a bad (or at least
not-like-Linux) way of implementing this feature, so I believe
more-recent versions of FreeBSD do it differently. When I wrote this
code, I could not see any reason for the "path" argument to be
interpreted differently from any other path.

-GAWollman

From scholz at scriptolutions.com Sun May 10 06:25:28 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Sun May 10 06:25:35 2009
Subject: Posix shared memory problem
In-Reply-To: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu>
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu>
Message-ID: <7710650619.20090510075706@scriptolutions.com>

Hello Garrett,

Sunday, May 10, 2009, 7:00:16 AM, you wrote:

GW> In
GW>
GW> scholz@scriptolutions.com writes:

>>JT> shm_open/shm_unlink refer to the filesystem; they are fairly direct
>>JT> wrappers around open and unlink.
>>
>>Question is where are they stored?

GW> In the fileststem, in the path that you specify. They are just
GW> ordinary files.
GW> There was some thought that this was a bad (or at least
GW> not-like-Linux) way of implementing this feature, so I believe
GW> more-recent versions of FreeBSD do it differently. When I wrote this
GW> code, I could not see any reason for the "path" argument to be
GW> interpreted differently from any other path.

Oh thats a very very bad idea.

First of all you can't use '/' if you want stay portable.
It is also just a maximum of 13 char long (says the FreeBSD 6.X man page)
and usually you now pass names like "com.mycompany.myproduct.mypurpose" as
names to prevent namespace collisons.

The path has nothing to do with the filesystem, it's a separate
namespace. Let alone that semaphores and shared memory already use the
same namespace is something i didn't expect on Linux.

Now it is clear where my problem is and i go to a mmap to a $HOME/.
file. Not nice but if anybody gives a shit about compatibility
(backward and to other systems before implementing stuff) it is
the only way.

--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From kmsujit at gmail.com Sun May 10 09:01:33 2009
From: kmsujit at gmail.com (Sujit K M)
Date: Sun May 10 09:01:40 2009
Subject: Posix shared memory problem
In-Reply-To: <7710650619.20090510075706@scriptolutions.com>
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com>
Message-ID: <74fe56020905100135y7f44b5fapfee3ef2ae70a2a0b@mail.gmail.com>

> Now it is clear where my problem is and i go to a mmap to a $HOME/.
> file. Not nice but if anybody gives a shit about compatibility
> (backward and to other systems before implementing stuff) it is
> the only way.

i donot understand why this is an compatility issue. Just use /path/to/file.
say /proc/shm/shm[0-9]+[a-z]+[0-9]+

From wollman at bimajority.org Sun May 10 15:54:39 2009
From: wollman at bimajority.org (Garrett Wollman)
Date: Sun May 10 15:54:46 2009
Subject: Posix shared memory problem
In-Reply-To: <7710650619.20090510075706@scriptolutions.com>
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com>
Message-ID: <18950.63671.323324.756287@hergotha.csail.mit.edu>

< said:

> First of all you can't use '/' if you want stay portable.

The Standard says otherwise.

> It is also just a maximum of 13 char long (says the FreeBSD 6.X man page)

Not in the manual page I have, and the Standard says otherwise.

> The path has nothing to do with the filesystem, it's a separate
> namespace.

Again, the Standard says otherwise (or rather, it says that this is an
implementation option). To quote the 2001 edition of the standard
(XSH6 page 1313):

	It is unspecified whether the name appears in the file system
	and is visible to other functions that take pathnames as
	arguments. The name argument conforms to the construction
	rules for a pathname.

-GAWollman

From ntarmos at cs.uoi.gr Sun May 10 20:02:46 2009
From: ntarmos at cs.uoi.gr (Nikos Ntarmos)
Date: Sun May 10 20:02:54 2009
Subject: Posix shared memory problem
In-Reply-To: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu>
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu>
Message-ID: <20090510194133.GG20749@ace.cs.uoi.gr>

On Sun, May 10, 2009 at 01:00:16AM -0400, Garrett Wollman wrote:
> In
> scholz@scriptolutions.com writes:
>
> >JT> shm_open/shm_unlink refer to the filesystem; they are fairly direct
> >JT> wrappers around open and unlink.
> >
> >Question is where are they stored?
>
> In the fileststem, in the path that you specify. They are just
> ordinary files.
>
> There was some thought that this was a bad (or at least
> not-like-Linux) way of implementing this feature, so I believe
> more-recent versions of FreeBSD do it differently.
> When I wrote this
> code, I could not see any reason for the "path" argument to be
> interpreted differently from any other path.

FWIW the test code in the original email still fails even if an absolute
path is used as a sem name, ie:

	sem_t *s = sem_open("/path/to/foobar", O_CREAT | O_EXCL, S_IWUSR, 0);

with /path/to/foobar pointing to a user writable directory, segfaults
with "invalid system call". Note that the error is not printed by
perror(3) but by the system itself. A backtrace of the resulting core
shows that the problem is burried deep in ksem_open():

ntarmos@ace:~% ./ts
zsh: invalid system call (core dumped) ./ts
ntarmos@ace:~% gdb -q ./ts ts.core
Core was generated by `ts'.
Program terminated with signal 12, Bad system call.
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 0x280c214b in ksem_open () from /lib/libc.so.7
(gdb) bt
#0 0x280c214b in ksem_open () from /lib/libc.so.7
#1 0x280b78fc in sem_open () from /lib/libc.so.7
#2 0x080484e5 in main () at test-sem.c:7
(gdb)

This is on i386/7.2-RELEASE.

Cheers.

From scholz at scriptolutions.com Mon May 11 09:29:54 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Mon May 11 09:30:00 2009
Subject: Posix shared memory problem
In-Reply-To: <18950.63671.323324.756287@hergotha.csail.mit.edu>
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com> <18950.63671.323324.756287@hergotha.csail.mit.edu>
Message-ID: <1393224851.20090511112537@scriptolutions.com>

Hello Garrett,

Sunday, May 10, 2009, 5:54:31 PM, you wrote:

GW> < said:

>> First of all you can't use '/' if you want stay portable.

GW> The Standard says otherwise.

It's not a standard think. Read about the real world programming hints.
You see recommendations to only use a starting '/'

>> It is also just a maximum of 13 char long (says the FreeBSD 6.X man page)

GW> Not in the manual page I have, and the Standard says otherwise.

This time you are right. It was about named semaphores and there the
limit seems to be removed - it was ridiculous low anyway.

GW> Again, the Standard says otherwise (or rather, it says that this is an
GW> implementation option). To quote the 2001 edition of the standard
GW> (XSH6 page 1313):

GW>     It is unspecified whether the name appears in the file system
GW>     and is visible to other functions that take pathnames as
GW>     arguments. The name argument conforms to the construction
GW>     rules for a pathname.

Thats why the man page calls this parameter 'name' and not 'path'.

Some idiots started to think about this as a file path. But it isn't
and it shouldn't. Thats what this spec is saying in the typical commitee
polite form when some members made a mistake but are to important to
be blamed in public.

So this needs to be fixed.

If i have to find a useable filefile location anyway the whole function does
not make any sense, then i can directly use mmap. The purpose is to
have a unique name (and in 2009 it is an URI not a file path). Thats
how serious non kiddy operating systems are doing like
Linux/Solaris/MacOSX-Darwin/HP-UX.

And i guess also the accounting functions are wrong then. shm_open
does not open a file so the (internal used) file should not add to the
file space quota but to the memory allocation quota.
--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From des at des.no Mon May 11 10:16:37 2009
From: des at des.no (Dag-Erling Smørgrav)
Date: Mon May 11 10:16:44 2009
Subject: Posix shared memory problem
In-Reply-To: <1393224851.20090511112537@scriptolutions.com> (Lothar Scholz's message of "Mon, 11 May 2009 11:25:37 +0200")
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com> <18950.63671.323324.756287@hergotha.csail.mit.edu> <1393224851.20090511112537@scriptolutions.com>
Message-ID: <86hbzsot8x.fsf@ds4.des.no>

Lothar Scholz writes:
> Some idiots started to think about this as a file path. But it isn't
> and it shouldn't. Thats what this spec is saying in the typical commitee
> polite form when some members made a mistake but are to important to
> be blamed in public.

You are wrong. The standard says what it says specifically because it
makes it possible to implement semaphores and shared memory in terms of
file operations. I've been there and done that.

> So this needs to be fixed.

You've already been told that it *has* been fixed (or rather changed,
since it was not broken to begin with) in head.

> If i have to find a useable filefile location anyway the whole function does
> not make any sense, then i can directly use mmap. The purpose is to
> have a unique name (and in 2009 it is an URI not a file path). Thats
> how serious non kiddy operating systems are doing like
> Linux/Solaris/MacOSX-Darwin/HP-UX.

Insulting the developers will get you nowhere.

DES
--
Dag-Erling Smørgrav - des@des.no

From bugmaster at FreeBSD.org Mon May 11 11:06:50 2009
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon May 11 11:07:24 2009
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200905111106.n4BB6nEv085871@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).
The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From kmsujit at gmail.com Mon May 11 11:11:00 2009
From: kmsujit at gmail.com (Sujit K M)
Date: Mon May 11 11:11:07 2009
Subject: Posix shared memory problem
In-Reply-To: <1393224851.20090511112537@scriptolutions.com>
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com> <18950.63671.323324.756287@hergotha.csail.mit.edu> <1393224851.20090511112537@scriptolutions.com>
Message-ID: <74fe56020905110410y430bf76yacf5c5a308a99865@mail.gmail.com>

Any sort of frustration here?

> Some idiots started to think about this as a file path. But it isn't
> and it shouldn't. Thats what this spec is saying in the typical commitee
> polite form when some members made a mistake but are to important to
> be blamed in public.
>

What ever the Idiots are saying is correct. Read up some decent Unix manual.

> So this needs to be fixed.

What needs to be fixed? Could you be more specific?

> If i have to find a useable filefile location anyway the whole function does
> not make any sense, then i can directly use mmap. The purpose is to
> have a unique name (and in 2009 it is an URI not a file path). Thats
> how serious non kiddy operating systems are doing like
> Linux/Solaris/MacOSX-Darwin/HP-UX.

Try using these giant, great, long lasting things.

>
> And i guess also the accounting functions are wrong then. shm_open
> does not open a file so the (internal used) file should not add to the
> file space quota but to the memory allocation quota.

I think you need a check up. You seem to be contradicting what ever you
said before.
From des at des.no Mon May 11 12:16:01 2009
From: des at des.no (Dag-Erling Smørgrav)
Date: Mon May 11 12:16:08 2009
Subject: Posix shared memory problem
In-Reply-To: <20090510194133.GG20749@ace.cs.uoi.gr> (Nikos Ntarmos's message of "Sun, 10 May 2009 22:41:33 +0300")
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <20090510194133.GG20749@ace.cs.uoi.gr>
Message-ID: <86prefomww.fsf@ds4.des.no>

Nikos Ntarmos writes:
> FWIW the test code in the original email still fails even if an absolute
> path is used as a sem name, ie:
> sem_t *s = sem_open("/path/to/foobar", O_CREAT | O_EXCL, S_IWUSR, 0);
> with /path/to/foobar pointing to a user writable directory, segfaults
> with "invalid system call".

As previously mentioned, 'kldload sem'.

To forestall any further gripes about the POSIX IPC system calls not
being compiled in by default: they are very rarely used, because the
SysV IPC API is almost universally available and is generally considered
superior to the POSIX API, which it predates by more than ten years.
The SysV IPC system calls are in GENERIC, and are used by e.g. Sendmail,
X.org and PostgreSQL.

DES
--
Dag-Erling Smørgrav - des@des.no

From scholz at scriptolutions.com Mon May 11 15:30:05 2009
From: scholz at scriptolutions.com (Lothar Scholz)
Date: Mon May 11 15:30:12 2009
Subject: Posix shared memory problem
In-Reply-To: <74fe56020905110410y430bf76yacf5c5a308a99865@mail.gmail.com>
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com> <18950.63671.323324.756287@hergotha.csail.mit.edu> <1393224851.20090511112537@scriptolutions.com> <74fe56020905110410y430bf76yacf5c5a308a99865@mail.gmail.com>
Message-ID: <981850520.20090511172605@scriptolutions.com>

Hello Sujit,

Monday, May 11, 2009, 1:10:27 PM, you wrote:

>>
SKM> What ever the Idiots are saying is correct. Read up some decent Unix manual.
I read and even better i ported and finally yes i thought about the
function, why it is there and why it is better then System V IPC.

SKM> What needs to be fixed? Could you be more specific?

That the name argument is just that "a name" (in its own name space)
not a path.

--
Best regards,
 Lothar Scholz                mailto:scholz@scriptolutions.com

From zachary.loafman at isilon.com Mon May 11 16:29:36 2009
From: zachary.loafman at isilon.com (Zachary Loafman)
Date: Mon May 11 16:29:47 2009
Subject: FAIL: kernel fault injection
Message-ID: <20090511162928.GD17203@isilon.com>

Arch -

I'd like to contribute the kernel fault injection system that Isilon
uses. Before contributing it, I'd like to get approval for the APIs
involved.

Testing errors is hard. Let's say you have:

int foo(void)
{
	[...]
	error = bar();
	if (error) {
		/* do stuff */
	}
}

.. but some_func() can't reliably be made to fail. How do you test it? We
added error injection macros that look like this:

int foo(void)
{
	[...]
	error = bar();
	KFAIL_POINT_CODE(FP_KERN, bar_fails_foo, error = RETURN_VALUE);
	if (error) {
		/* do stuff */
	}
}

The KFAIL_POINT_CODE macro adds a sysctl MIB that allows you to inject
errors into the above code. For example:

# sysctl fail_point.kern.bar_fails_foo=".1%return(5)"

This says, ".1% of the time, evaluate the fail point code with 5 as the
RETURN_VALUE". If this were a standard errno, you could read the above
setting as "1/1000th of the time, pretend bar() returned EIO".

We also have a few wrappers around KFAIL_POINT_CODE that essentially
wrap common uses.
For example, the above use can be shorthanded to:

    KFAIL_POINT_ERROR(FP_KERN, bar_fails_foo, error)

Currently, the sysctl parser accepts the following variants:

    return(x)   - triggers the code with RETURN_VALUE set to x
    sleep(t)    - sleep t milliseconds
    panic/break - panic or break into the debugger
    print       - print that the fail point was hit

In addition to the commands, we have a syntax to express when to evaluate those commands:

    p% - evaluate command p% of the time (example above)
    5* - evaluate command 5 times, then disable the expression

And you can compound with expr1->expr2, so, e.g.:

    5%return(5)->1%return(22):
        5% of the time, return 5, 1% of the remaining time, return 22
    5*return(0)->10*return(5)->1%return(19)
        return 0 for 5 times, then 5 for 10 times, and after those, return 19 1% of the time.
    1%5*return(22):
        1/100th of the time, return 22, but only do it 5 times total.

I've also attached an ascii rendering of a (rough draft) man page that goes into more detail.

Comments?

...Zach
-------------- next part --------------
FAIL(9)                FreeBSD Kernel Developer's Manual               FAIL(9)

NAME
     KFAIL_POINT_CODE, KFAIL_POINT_RETURN, KFAIL_POINT_RETURN_VOID, KFAIL_POINT_ERROR, KFAIL_POINT_GOTO, fail_point, FP_KERN -- fail points

SYNOPSIS
     #include

     KFAIL_POINT_CODE(parent, name, code);

     KFAIL_POINT_RETURN(parent, name);

     KFAIL_POINT_RETURN_VOID(parent, name);

     KFAIL_POINT_ERROR(parent, name, error_var);

     KFAIL_POINT_GOTO(parent, name, error_var, label);

DESCRIPTION
     Fail points are used to add code points where errors may be injected in a user controlled fashion. Fail points provide a convenient wrapper around user provided error injection code, providing a sysctl(9) MIB, and a parser for that MIB that describes how the error injection code should fire.
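
As a rough illustration of the term syntax in the message above (a probability, a count, and -> chaining), here is a small user-space model in C. It is only a sketch under stated assumptions: the names fp_term, fp_term_eval and fp_chain_eval are made up for this example and are not part of the proposed API, the caller supplies the random samples so that the behaviour is repeatable, and a term that does not fire is assumed to hand over to the next term in the chain.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of one term such as "1%5*return(22)".
 * None of these names come from the proposed kernel code.
 */
struct fp_term {
	double	prob;	/* "p%" as a fraction in [0,1]; 1.0 when omitted */
	int	count;	/* "n*" executions left; -1 when omitted (no limit) */
	int	retval;	/* argument given to return() */
};

/*
 * Evaluate a single term.  "roll" is a uniform sample in [0,1) chosen by
 * the caller.  Returns 1 and stores the return() argument when the term
 * fires, 0 otherwise.
 */
static int
fp_term_eval(struct fp_term *t, double roll, int *retval)
{
	assert(t != NULL && retval != NULL);
	if (t->count == 0)
		return (0);	/* count used up: the term is disabled */
	if (roll >= t->prob)
		return (0);	/* the probability is checked first */
	if (t->count > 0)
		t->count--;	/* a count is only consumed on a hit */
	*retval = t->retval;
	return (1);
}

/*
 * Evaluate a "->" chain: terms are tried in order with one sample each,
 * and the first term that fires wins.
 */
static int
fp_chain_eval(struct fp_term *terms, size_t nterms, const double *rolls,
    int *retval)
{
	size_t i;

	for (i = 0; i < nterms; i++) {
		if (fp_term_eval(&terms[i], rolls[i], retval))
			return (1);
	}
	return (0);
}
```

In this model "1%5*return(22)" would be the single term { 0.01, 5, 22 }, and "5%return(5)->1%return(22)" would be the two terms { 0.05, -1, 5 } and { 0.01, -1, 22 }.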

     The base fail point macro is KFAIL_POINT_CODE() where parent is a sysctl tree (frequently FP_KERN for kernel fail points, but various subsystems may wish to provide their own fail point trees), name is the name of the MIB in that tree, and code is the error injection code. The code argument does not require braces, but it is considered good style to use braces for any multi-line code arguments. Inside the code argument, the evaluation of RETURN_VALUE is derived from the return() value set in the sysctl MIB. See SYSCTL VARIABLES below.

     The remaining KFAIL_POINT_*() macros are wrappers around common error injection paths:

     KFAIL_POINT_RETURN(parent, name)
             is the equivalent of KFAIL_POINT_CODE(..., return RETURN_VALUE)

     KFAIL_POINT_RETURN_VOID(parent, name)
             is the equivalent of KFAIL_POINT_CODE(..., return)

     KFAIL_POINT_ERROR(parent, name, error_var)
             is the equivalent of KFAIL_POINT_CODE(..., error_var = RETURN_VALUE)

     KFAIL_POINT_GOTO(parent, name, error_var, label)
             is the equivalent of KFAIL_POINT_CODE(..., { error_var = RETURN_VALUE; goto label;})

SYSCTL VARIABLES
     The KFAIL_POINT_*() macros add sysctl MIBs where specified. Many base kernel MIBs can be found in the fail_point.kern tree (referenced in code by FP_KERN).

     The sysctl setting recognizes the following grammar:

           <fail_point> :: <term> ( "->" <term> )*
           <term>       :: ( ( <float> "%" ) | ( <integer> "*" ) )* <type> [ "(" <integer> ")" ]
           <float>      :: <integer> [ "." <integer> ] | "." <integer>
           <type>       :: "off" | "return" | "sleep" | "panic" | "break" | "print"

     The <type> argument specifies which action to take:

           off     Take no action (does not trigger fail point code)
           return  Trigger fail point code with specified argument
           panic   Panic
           break   Break into the debugger.
           print   Print that the fail point executed

     The <float>% and <integer>* modifiers prior to <type> control when <type> is executed. The <float>% form (e.g. "1.2%") can be used to specify a probability that <type> will execute. The <integer>* form (e.g. "5*") can be used to specify the number of times <type> should be executed before this <term> is disabled. Only the last probability and the last count are used if multiple are specified, i.e. "1.2%2%" is the same as "2%". When both a probability and a count are specified, the probability is evaluated before the count, i.e. "2%5*" means "2% of the time, but only execute it 5 times total".

     The operator -> can be used to express cascading terms. If you specify <term1>-><term2>, it means that if <term1> doesn't 'execute', <term2> is evaluated. For the purpose of this operator, the return() and print() operators are the only types that cascade. A return() term only cascades if the code executes, and a print() term only cascades when passed a non-zero argument.

EXAMPLES
     sysctl fail_point.kern.foobar="2.1%return(5)"
             21/1000ths of the time, execute code with RETURN_VALUE set to 5.

     sysctl fail_point.kern.foobar="2%return(5)->5%return(22)"
             2/100th of the time, execute code with RETURN_VALUE set to 5. If that doesn't happen, 5% of the time execute code with RETURN_VALUE set to 22.

     sysctl fail_point.kern.foobar="5*return(5)->0.1%return(22)"
             For 5 times, return 5. After that, 1/1000th of the time, return 22.

     sysctl fail_point.kern.foobar="0.1%5*return(5)"
             Return 5 for 1 in 1000 executions, but only execute 5 times total.

CAVEATS
     It's easy to shoot yourself in the foot by setting fail points too aggressively or setting too many in combination. For example, forcing malloc() to fail consistently is potentially harmful to uptime.

     The sleep() sysctl setting may not be appropriate in all situations. Currently, fail_point_eval() does not verify whether the context is appropriate for calling msleep().

FreeBSD 8.0 May 10, 2009 FreeBSD 8.0 From wollman at bimajority.org Mon May 11 16:35:46 2009 From: wollman at bimajority.org (Garrett Wollman) Date: Mon May 11 16:35:53 2009 Subject: Posix shared memory problem In-Reply-To: <1393224851.20090511112537@scriptolutions.com> References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com> <18950.63671.323324.756287@hergotha.csail.mit.edu> <1393224851.20090511112537@scriptolutions.com> Message-ID: <18952.21468.748665.878710@hergotha.csail.mit.edu> < said: > Some idiots started to think about this as a file path. But it isn't > and it shouldn't. Actually, it really should be. Ask a security person or a virtualization person to explain why an unnecessary multiplicity of namespaces is a bad idea. > If i have to find a useable filefile location anyway the whole > function does not make any sense, then i can directly use mmap. But of course you won't get the same behavior, because open()/mmap() guarantees that the backing store will get updated. That's why there's a separate interface. -GAWollman From jhb at freebsd.org Mon May 11 16:49:35 2009 From: jhb at freebsd.org (John Baldwin) Date: Mon May 11 16:50:14 2009 Subject: Are named posix semaphores not implemented? In-Reply-To: <123863765.20090509203610@scriptolutions.com> References: <976698487.20090509182307@scriptolutions.com> <123863765.20090509203610@scriptolutions.com> Message-ID: <200905110840.32980.jhb@freebsd.org> On Saturday 09 May 2009 2:36:10 pm Lothar Scholz wrote: > Hello pluknet, > > > p> First, you should have sem(4) capacity enabled in kernel > p> (via kldload or statically built). It seems you haven't. > > Yes - but it was a total fresh PC-BSD build. > > I must say that i really don't like this start off with > anything disabled. They will never get a good desktop os > if the user can't run simple programs without the need to > learn about kernel modules. > > Well this is the wrong list to discuss this anyway. 
They are disabled by default because they used to be badly broken. They have been overhauled in 7.2 and perhaps should be enabled by default now. -- John Baldwin From jhb at freebsd.org Mon May 11 16:49:36 2009 From: jhb at freebsd.org (John Baldwin) Date: Mon May 11 16:50:15 2009 Subject: Posix shared memory problem In-Reply-To: <1393224851.20090511112537@scriptolutions.com> References: <18950.63671.323324.756287@hergotha.csail.mit.edu> <1393224851.20090511112537@scriptolutions.com> Message-ID: <200905110843.22543.jhb@freebsd.org> On Monday 11 May 2009 5:25:37 am Lothar Scholz wrote: > Some idiots started to think about this as a file path. But it isn't > and it shouldn't. Thats what this spec is saying in the typical commitee > polite form when some members made a mistake but are to important to > be blamed in public. Hmm, why don't you head down to your local bookstore and buy a copy of "Solaris Internals" and then come back and explain to us all how the developers of Solaris are idiots. -- John Baldwin From jroberson at jroberson.net Tue May 12 03:50:26 2009 From: jroberson at jroberson.net (Jeff Roberson) Date: Tue May 12 03:50:33 2009 Subject: lockless file descriptor lookup Message-ID: http://people.freebsd.org/~jeff/locklessfd.diff This patch implements a lockless lookup path for file descriptors. The meat of the algorithm is in fget_unlocked(). This returns a referenced file descriptor, unlike fget_locked(). In the common case this reduces the number of atomics required for fget() while allowing for lookups to proceed concurrently with modifications to the table and preventing preemption from causing context switches. Using the libMicro 4.0 benchmarking suite with a thread count of 16 on an 8core box yields improvements by as much as 428% in descriptor heavy tests. There were no performance regressions with this benchmark. The code works by allowing lookup threads to follow two previously unsafe pointers. 
First, the file descriptor table itself is never freed on expansion until the process exits. That ensures that no pagefaults or random memory access can occur if expansion happens after the table pointer is fetched. Given that the vast majority of processes never expand their descriptor table, it is not any significant memory overhead to save them. I shamelessly stole this idea from NetBSD. The struct files themselves are marked as UMA_ZONE_NOFREE and never reclaimed. This allows us to safely attempt to reference count them without any locks. To prevent fdrop() races fget_unlocked() uses a cmpset loop to ensure that it never raises the reference count above zero. In this way it can never reference a free'd or recently allocated file. Once the file descriptor is resolved, we verify the path via the descriptor table once more to ensure that it has not changed. At this point, we have a valid reference or we drop an invalid reference and retry. This gives us the overhead of only one atomic instruction for common case file access. In the worst case there can be some spinning in the loop in fget_unlocked(), but some thread always makes forward progress for each iteration of the loop. I'm going to see if the usual suspects will stress test this but I'd like to see it in 8.0. This is your chance to make any counter arguments. I'd also appreciate it if someone could look at my volatile cast and make sure I'm actually forcing the compiler to refresh the fd_ofiles array here: + if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd]) Thanks, Jeff From peterjeremy at optushome.com.au Tue May 12 09:49:24 2009 From: peterjeremy at optushome.com.au (peterjeremy@optushome.com.au) Date: Tue May 12 09:49:31 2009 Subject: Are named posix semaphores not implemented? 
In-Reply-To: <123863765.20090509203610@scriptolutions.com>
References: <976698487.20090509182307@scriptolutions.com> <123863765.20090509203610@scriptolutions.com>
Message-ID: <20090512094916.GA41857@server.vk2pj.dyndns.org>

On 2009-May-09 20:36:10 +0200, Lothar Scholz wrote:
>I must say that i really don't like this start off with
>anything disabled.

OTOH, people run FreeBSD because they don't want a default configuration that is bloated by lots of "features" that they will never use and (in some cases) reduce system performance.

> They will never get a good desktop os
>if the user can't run simple programs without the need to
>learn about kernel modules.

This depends on your definition of "simple". I've been running FreeBSD for over 10 years and I don't think I've ever needed sem(4).

>Well this is the wrong list to discuss this anyway.

It might be relevant if you want to propose the default inclusion of sem(4) into GENERIC (though I'm not sure if PC-BSD is a FreeBSD GENERIC kernel or one that has been adapted for PC-BSD).

--
Peter Jeremy

From des at des.no Tue May 12 10:55:41 2009
From: des at des.no (Dag-Erling Smørgrav)
Date: Tue May 12 10:55:48 2009
Subject: Posix shared memory problem
In-Reply-To: <981850520.20090511172605@scriptolutions.com> (Lothar Scholz's message of "Mon, 11 May 2009 17:26:05 +0200")
References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com> <18950.63671.323324.756287@hergotha.csail.mit.edu> <1393224851.20090511112537@scriptolutions.com> <74fe56020905110410y430bf76yacf5c5a308a99865@mail.gmail.com> <981850520.20090511172605@scriptolutions.com>
Message-ID: <86fxfa615g.fsf@ds4.des.no>

Lothar Scholz writes:
> Sujit K M writes:
> > What needs to be fixed? Could you be more specific?
> That the name argument is just that "a name" (in its own name space)
> not a path.

Allow me to quote from the SUSv3 again:

DESCRIPTION

The shm_open() function shall establish a connection between a shared memory object and a file descriptor. [...] The name argument points to a string naming a shared memory object. It is unspecified whether the name appears in the file system and is visible to other functions that take pathnames as arguments. The name argument conforms to the construction rules for a pathname. [...]

RATIONALE

[...] Note that such shared memory objects can actually be implemented as mapped files. In both cases, the size can be set after the open using ftruncate(). The shm_open() function itself does not create a shared object of a specified size because this would duplicate an extant function that set the size of an object referenced by a file descriptor.
On implementations where memory objects are implemented using the existing file system, the shm_open() function may be implemented using a macro that invokes open(), and the shm_unlink() function may be implemented using a macro that invokes unlink(). [...]

DES
--
Dag-Erling Smørgrav - des@des.no

From des at des.no Tue May 12 11:02:51 2009
From: des at des.no (Dag-Erling Smørgrav)
Date: Tue May 12 11:03:00 2009
Subject: lockless file descriptor lookup
In-Reply-To: (Jeff Roberson's message of "Mon, 11 May 2009 17:32:17 -1000 (HST)")
References:
Message-ID: <86bppy60ti.fsf@ds4.des.no>

Jeff Roberson writes:
> I'd also appreciate it if someone could look at my volatile cast and
> make sure I'm actually forcing the compiler to refresh the fd_ofiles
> array here:
>
> + if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd])

The problem is that since it is not declared as volatile, some other piece of code may have modified it but not yet flushed it to RAM.

DES
--
Dag-Erling Smørgrav - des@des.no

From jhb at freebsd.org Tue May 12 15:48:13 2009
From: jhb at freebsd.org (John Baldwin)
Date: Tue May 12 15:48:28 2009
Subject: Remove d_thread_t for 8.0
Message-ID: <200905121020.18497.jhb@freebsd.org>

In the same vein as purging BURN_BRIDGES stuff, is there any objection to removing d_thread_t from 8.0? It is intended as a compat shim to reduce diffs with 4.x. However, at this point drivers are not actively being merged back to 4.x, so I think it is no longer necessary.

--
John Baldwin

From gnn at neville-neil.com Tue May 12 16:12:35 2009
From: gnn at neville-neil.com (George Neville-Neil)
Date: Tue May 12 16:12:46 2009
Subject: FAIL: kernel fault injection
In-Reply-To: <20090511162928.GD17203@isilon.com>
References: <20090511162928.GD17203@isilon.com>
Message-ID: <62B236A8-E303-4200-A8E4-CFF6C022875D@neville-neil.com>

On May 11, 2009, at 12:29 , Zachary Loafman wrote:

> Arch -
>
> I'd like to contribute the kernel fault injection system that Isilon
> uses.
Before contributing it, I'd like to get approval for the APIs > involved. > > Testing errors is hard. Let's say you have: > > int foo(void) { > [...] > error = bar(); > if (error) { > /* do stuff */ > } > } > > .. but some_func() can't reliably be made to fail. How do you test it? > We added error injection macros that look like this: > > int foo(void) { > [...] > error = bar(); > KFAIL_POINT_CODE(FP_KERN, bar_fails_foo, error = RETURN_VALUE); > if (error) { > /* do stuff */ > } > } > > The KFAIL_POINT_CODE macro adds a sysctl MIB that allows > you to inject errors into the above code. For example: > > # sysctl fail_point.kern.bar_fails_foo=".1%return(5)" > > This says, ".1% of the time, evaluate the fail point code with 5 as > the RETURN_VALUE". If this were a standard errno, you could read the > above setting as "1/1000th of the time, pretend bar() returned EIO". > > We also have a few wrappers around KFAIL_POINT_CODE that essentially > wrap common uses. For example, the above use can be shorthanded to: > KFAIL_POINT_ERROR(FP_KERN, bar_fails_foo, error) > > Currently, the sysctl parser accepts the following variants: > return(x) - triggers the code with RETURN_VALUE set to x > sleep(t) - sleep t milliseconds, > panic/break - panic or break into the debugger > print - print that the fail point was hit > > In addition to the commands, we have a syntax to express the > when to evaluate those commands: > p% - evaluate command p% of the time (example above) > 5* - evaluate command 5 times, then disable the expression > > And you can compound with expr1->expr2, so, e.g.: > 5%return(5)->1%return(22): > 5% of the time, return 5, 1% of the remaining time, return 22 > 5*return(0)->10*return(5)->1%return(19) > return 0 for 5 times, then 5 for 10 times, and after those, > return 19 1% of the time. > 1%5*return(22): > 1/100th of the time, return 22, but only do it 5 times total. 
> > I've also attached an ascii rendering of a (rough draft) man page that > goes into more detail. > > Comments? > Hi Zach, I've taken a brief look at the email and the man page you have sent. I don't see any glaring problems that would prevent us from using this code. Hopefully others will also see its usefulness. Any idea how soon you'd like to commit this? It would be great to get it in before the 8.0 branch so that the APIs are available throughout the duration of that branch, and then moving forwards. Best, George From rpaulo at freebsd.org Tue May 12 16:19:49 2009 From: rpaulo at freebsd.org (Rui Paulo) Date: Tue May 12 16:20:21 2009 Subject: FAIL: kernel fault injection In-Reply-To: <20090511162928.GD17203@isilon.com> References: <20090511162928.GD17203@isilon.com> Message-ID: On 11 May 2009, at 17:29, Zachary Loafman wrote: > Arch - > > I'd like to contribute the kernel fault injection system that Isilon > uses. Before contributing it, I'd like to get approval for the APIs > involved. > > Testing errors is hard. Let's say you have: > > int foo(void) { > [...] > error = bar(); > if (error) { > /* do stuff */ > } > } > > .. but some_func() can't reliably be made to fail. How do you test it? > We added error injection macros that look like this: > > int foo(void) { > [...] > error = bar(); > KFAIL_POINT_CODE(FP_KERN, bar_fails_foo, error = RETURN_VALUE); > if (error) { > /* do stuff */ > } > } > > The KFAIL_POINT_CODE macro adds a sysctl MIB that allows > you to inject errors into the above code. For example: > > # sysctl fail_point.kern.bar_fails_foo=".1%return(5)" > > This says, ".1% of the time, evaluate the fail point code with 5 as > the RETURN_VALUE". If this were a standard errno, you could read the > above setting as "1/1000th of the time, pretend bar() returned EIO". > > We also have a few wrappers around KFAIL_POINT_CODE that essentially > wrap common uses. 
For example, the above use can be shorthanded to:
> KFAIL_POINT_ERROR(FP_KERN, bar_fails_foo, error)
>
> Currently, the sysctl parser accepts the following variants:
> return(x) - triggers the code with RETURN_VALUE set to x
> sleep(t) - sleep t milliseconds,
> panic/break - panic or break into the debugger
> print - print that the fail point was hit
>
> In addition to the commands, we have a syntax to express the
> when to evaluate those commands:
> p% - evaluate command p% of the time (example above)
> 5* - evaluate command 5 times, then disable the expression
>
> And you can compound with expr1->expr2, so, e.g.:
> 5%return(5)->1%return(22):
> 5% of the time, return 5, 1% of the remaining time, return 22
> 5*return(0)->10*return(5)->1%return(19)
> return 0 for 5 times, then 5 for 10 times, and after those,
> return 19 1% of the time.
> 1%5*return(22):
> 1/100th of the time, return 22, but only do it 5 times total.
>
> I've also attached an ascii rendering of a (rough draft) man page that
> goes into more detail.
>
> Comments?

This is great and I would like to see this go in. I just have two minor modifications (possible bikeshed, but whatever):

* What about kern.fail_point instead of fail_point.kern ? This framework seems to be only for kernel.
* On the man page, you don't explain the 'sleep' type. Is that on purpose?

About the CAVEAT section on the man page (second paragraph), do you have any ideas to evaluate if msleep is being called on a correct context?

Thanks.

--
Rui Paulo

From zachary.loafman at isilon.com Tue May 12 16:26:39 2009
From: zachary.loafman at isilon.com (Zachary Loafman)
Date: Tue May 12 16:26:45 2009
Subject: FAIL: kernel fault injection
In-Reply-To:
References: <20090511162928.GD17203@isilon.com>
Message-ID: <20090512162630.GC7250@isilon.com>

On Tue, May 12, 2009 at 03:56:51PM +0100, Rui Paulo wrote:
> On 11 May 2009, at 17:29, Zachary Loafman wrote:
> This is great and I would like to see this go in. I just have two minor
> modifications (possible bikeshed, but whatever):
> * What about kern.fail_point instead of fail_point.kern ? This framework
> seems to be only for kernel.

It's only for the kernel, but loadable modules can use it as well. It's kind of a question of whether you want the available fail points in one tree, but away from their respective namespace, or whether you want them in the tree associated with their namespace. Thinking about it more, kern.fail_point might make more sense - I like seeing testing related things grouped with the thing it's testing.

> * On the man page, you don't explain the 'sleep' type. Is that on
> purpose?

Oversight, I'll include it in the final.

> About the CAVEAT section on the man page (second paragraph), do you have
> any ideas to evaluate if msleep is being called on a correct context?

I haven't actually thought about it too long. If we factored out the "for (lle = td->td_sleeplocks ..." section from witness_warn, we could potentially make the sleepable lock state. But witness isn't always enabled. Nor are sleepable locks the only real issue with msleep()ing. If you build the fail point manually, you can add fail points that invoke a defined sleep function on timeout() (instead of just calling msleep).
You can build fail points manually using a set of lower level fail_point_* routines, one of which is fail_point_set_sleep_fn. I haven't had a chance to write up the man page for those yet. Thanks for the comments! ...Zach From zachary.loafman at isilon.com Tue May 12 16:38:40 2009 From: zachary.loafman at isilon.com (Zachary Loafman) Date: Tue May 12 16:38:55 2009 Subject: FAIL: kernel fault injection In-Reply-To: <62B236A8-E303-4200-A8E4-CFF6C022875D@neville-neil.com> References: <20090511162928.GD17203@isilon.com> <62B236A8-E303-4200-A8E4-CFF6C022875D@neville-neil.com> Message-ID: <20090512163506.GE7250@isilon.com> On Tue, May 12, 2009 at 11:44:58AM -0400, George Neville-Neil wrote: > Any idea how soon you'd like to commit this? It would be great to > get it in before the 8.0 branch so that the APIs are available > throughout the duration of that branch, and then moving forwards. I'm pretty much ready to commit once we have API consensus, barring code review issues cropping up. Getting it in before the slush doesn't seem like a real issue. I think it's even MFCable to 7.x, given that it's a completely new set of APIs. It would certainly help any 8.x->7.x merges that included new fail points, though it's not like it's hard to delete them on merge. Just sad, especially if you also write unit tests that use them. ...Zach From ed at 80386.nl Tue May 12 16:59:51 2009 From: ed at 80386.nl (Ed Schouten) Date: Tue May 12 16:59:59 2009 Subject: lockless file descriptor lookup In-Reply-To: References: Message-ID: <20090512165949.GF58540@hoeg.nl> Hello Jeff, * Jeff Roberson wrote: > Once the file descriptor is resolved, we verify the path via the > descriptor table once more to ensure that it has not changed. At this > point, we have a valid reference or we drop an invalid reference and > retry. It's nice to see someone stepped up to implement this. 
Just out of curiosity, have you done any benchmarks to see how many percent of the time a thread needs more than one attempt to obtain a valid reference on a common workload?

Maybe it would be nice for diagnostic purposes to add two sysctls to obtain the amount of successful and unsuccessful attempts.

--
Ed Schouten
WWW: http://80386.nl/

From rpaulo at gmail.com Tue May 12 17:10:18 2009
From: rpaulo at gmail.com (Rui Paulo)
Date: Tue May 12 17:10:24 2009
Subject: FAIL: kernel fault injection
In-Reply-To: <20090512163506.GE7250@isilon.com>
References: <20090511162928.GD17203@isilon.com> <62B236A8-E303-4200-A8E4-CFF6C022875D@neville-neil.com> <20090512163506.GE7250@isilon.com>
Message-ID:

On 12 May 2009, at 17:35, Zachary Loafman wrote:

> On Tue, May 12, 2009 at 11:44:58AM -0400, George Neville-Neil wrote:
>
>> Any idea how soon you'd like to commit this? It would be great to
>> get it in before the 8.0 branch so that the APIs are available
>> throughout the duration of that branch, and then moving forwards.
>
> I'm pretty much ready to commit once we have API consensus, barring
> code
> review issues cropping up. Getting it in before the slush doesn't seem
> like a real issue.
>
> I think it's even MFCable to 7.x, given that it's a completely new set
> of APIs. It would certainly help any 8.x->7.x merges that included new
> fail points, though it's not like it's hard to delete them on merge.
> Just sad, especially if you also write unit tests that use them.

Ok, so please send the patch for review whenever you're ready.

Regards,
--
Rui Paulo

From peterjeremy at optushome.com.au Tue May 12 20:01:25 2009
From: peterjeremy at optushome.com.au (Peter Jeremy)
Date: Tue May 12 20:01:32 2009
Subject: shutdown_nice during boot
In-Reply-To: <4A0295E0.4020609@icyb.net.ua>
References: <4A01B9A3.2030806@icyb.net.ua> <20090507080048.GA64648@server.vk2pj.dyndns.org> <4A0295E0.4020609@icyb.net.ua>
Message-ID: <20090512200118.GC99304@server.vk2pj.dyndns.org>

On 2009-May-07 11:03:44 +0300, Andriy Gapon wrote:
>on 07/05/2009 11:00 peterjeremy@optushome.com.au said the following:
>> On 2009-May-06 19:24:03 +0300, Andriy Gapon wrote:
>>> It's possible to re-enable SIGINT right after init is forked, but
>>> this way it will be delivered to init before it installs signal
>>> handlers and thus init would simply terminate resulting in "Going
>>> nowhere without my init!" panic.
>>
>> The best option would seem to be for init(8) to call sigprocmask(2)
>> immediately it starts up and block all signals.
>
>But a signal still can be delivered after init is exec-ed and before
>sigprocmask(2) is called or not?

True - there is still a window there where signal dispositions are inappropriate.

Thinking about it some more, maybe the solution is to change the test in shutdown_nice() - rather than testing for the existence of initproc, it could test a sysctl variable that init sets once it has its signal handlers in place. There's already a kern.shutdown node so maybe "kern.shutdown.via_init". (This adds the option for other subsystems to clear it if desired - maybe as part of a watchdog function).

Maybe shutdown_nice should also initiate a 30-60 second timer that invokes boot() whether or not init responds.

--
Peter Jeremy

From jroberson at jroberson.net Wed May 13 00:25:56 2009
From: jroberson at jroberson.net (Jeff Roberson)
Date: Wed May 13 00:26:02 2009
Subject: lockless file descriptor lookup
In-Reply-To: <86bppy60ti.fsf@ds4.des.no>
References: <86bppy60ti.fsf@ds4.des.no>
Message-ID:

On Tue, 12 May 2009, Dag-Erling Smørgrav wrote:

> Jeff Roberson writes:
>> I'd also appreciate it if someone could look at my volatile cast and
>> make sure I'm actually forcing the compiler to refresh the fd_ofiles
>> array here:
>>
>> + if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd])
>
> The problem is that since it is not declared as volatile, some other
> piece of code may have modified it but not yet flushed it to RAM.

That is an acceptable race due to other guarantees. If it hasn't been committed to memory yet, the old table still contains valid data. I only need to be certain that the compiler doesn't cache the original ofiles value. It can't anyway because atomics use inline assembly on all platforms but I'd like it to be explicit anyway.

Thanks,
Jeff

>
> DES
> --
> Dag-Erling Smørgrav - des@des.no
>

From jroberson at jroberson.net Wed May 13 00:29:08 2009
From: jroberson at jroberson.net (Jeff Roberson)
Date: Wed May 13 00:29:15 2009
Subject: lockless file descriptor lookup
In-Reply-To: <20090512165949.GF58540@hoeg.nl>
References: <20090512165949.GF58540@hoeg.nl>
Message-ID:

On Tue, 12 May 2009, Ed Schouten wrote:

> Hello Jeff,
>
> * Jeff Roberson wrote:
>> Once the file descriptor is resolved, we verify the path via the
>> descriptor table once more to ensure that it has not changed. At this
>> point, we have a valid reference or we drop an invalid reference and
>> retry.
>
> It's nice to see someone stepped up to implement this.
Just out of
> curiosity, have you done any benchmarks to see how many percent of the
> time a thread needs more than one attempt to obtain a valid reference on
> a common workload?
>
> Maybe it would be nice for diagnostic purposes to add two sysctls to
> obtain the amount of successful and unsuccessful attempts.

Hi Ed,

I have had trouble triggering it at all in testing. I'd prefer not to commit the counters because they would re-introduce a global point of cache contention unless we made them per-cpu. This effectively implements ll/sc semantics on all architectures via cmpset. I suspect the overhead is minimal even in degenerate cases.

Thanks,
Jeff

>
> --
> Ed Schouten
> WWW: http://80386.nl/
>

From ivoras at freebsd.org Wed May 13 12:42:25 2009
From: ivoras at freebsd.org (Ivan Voras)
Date: Wed May 13 12:42:31 2009
Subject: Are named posix semaphores not implemented?
In-Reply-To: <123863765.20090509203610@scriptolutions.com>
References: <976698487.20090509182307@scriptolutions.com> <123863765.20090509203610@scriptolutions.com>
Message-ID:

Lothar Scholz wrote:
> Hello pluknet,
>
>
> p> First, you should have sem(4) capacity enabled in kernel
> p> (via kldload or statically built). It seems you haven't.
>
> Yes - but it was a total fresh PC-BSD build.

For future troubleshooting, you should get used to reading the man pages - sem(4) describes how to enable it.

> I must say that i really don't like this start off with
> anything disabled. They will never get a good desktop os
> if the user can't run simple programs without the need to
> learn about kernel modules.

They are disabled by default because they don't get much use - apparently it's not a very popular API. If you make a good case for its inclusion (reference large or popular projects that use it), there is probably no reason not to include it by default. The sem.ko loadable module is something like 20 kB in size.

From jhb at freebsd.org Wed May 13 14:27:29 2009
From: jhb at freebsd.org (John Baldwin)
Date: Wed May 13 14:27:35 2009
Subject: lockless file descriptor lookup
In-Reply-To:
References:
Message-ID: <200905130935.42795.jhb@freebsd.org>

On Monday 11 May 2009 11:32:17 pm Jeff Roberson wrote:
> http://people.freebsd.org/~jeff/locklessfd.diff
>
> This patch implements a lockless lookup path for file descriptors. The
> meat of the algorithm is in fget_unlocked(). This returns a referenced
> file descriptor, unlike fget_locked(). In the common case this reduces
> the number of atomics required for fget() while allowing for lookups to
> proceed concurrently with modifications to the table and preventing
> preemption from causing context switches.

Looks good. My only comment would be to not remove the 'hold' comment completely from _fget(), but instead say that it always returns a refcount that must be dropped. Basically:

 * The file's refcount will be bumped on return. It should be dropped
 * with fdrop().

or something like that in place of the old paragraph about the 'hold' parameter.

--
John Baldwin

From jroberson at jroberson.net Thu May 14 01:58:00 2009
From: jroberson at jroberson.net (Jeff Roberson)
Date: Thu May 14 01:58:06 2009
Subject: lockless file descriptor lookup
In-Reply-To: <200905130935.42795.jhb@freebsd.org>
References: <200905130935.42795.jhb@freebsd.org>
Message-ID:

On Wed, 13 May 2009, John Baldwin wrote:

> On Monday 11 May 2009 11:32:17 pm Jeff Roberson wrote:
>> http://people.freebsd.org/~jeff/locklessfd.diff
>>
>> This patch implements a lockless lookup path for file descriptors. The
>> meat of the algorithm is in fget_unlocked(). This returns a referenced
>> file descriptor, unlike fget_locked().
In the common case this reduces >> the number of atomics required for fget() while allowing for lookups to >> proceed concurrently with modifications to the table and preventing >> preemption from causing context switches. > > Looks good. My only comment would be to not remove the 'hold' comment > completely from _fget(), but instead say that it always returns a refcount > that must be dropped. Basically: > > * The file's refcount will be bumped on return. It should be dropped > * with fdrop(). > > or something like that in place of the old paragraph about the 'hold' > parameter. Yeah, I think I'll add a comment in the header too. pho has tested it with some targeted tests and stress2 for a day or two. I'm going to commit this evening so it gets as much exposure as possible before release. Thanks, Jeff > > -- > John Baldwin > From brde at optusnet.com.au Thu May 14 04:47:52 2009 From: brde at optusnet.com.au (Bruce Evans) Date: Thu May 14 04:47:58 2009 Subject: lockless file descriptor lookup In-Reply-To: References: <86bppy60ti.fsf@ds4.des.no> Message-ID: <20090514131613.T1224@besplex.bde.org> On Tue, 12 May 2009, Jeff Roberson wrote: > On Tue, 12 May 2009, Dag-Erling Smørgrav wrote: > >> Jeff Roberson writes: >>> I'd also appreciate it if someone could look at my volatile cast and >>> make sure I'm actually forcing the compiler to refresh the fd_ofiles >>> array here: >>> >>> + if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd]) This has 2 style bugs (missing space after first '*' and missing space before second '*'). It isn't clear whether you want to refresh the fd_ofiles pointer to the (first element of) the array, or the fd'th element. It is clear that you don't want to refresh the whole array. The above refreshes the fd'th element. Strangely, in my tests gcc refreshes the fd'th element even without the cast. E.g., test(fdp->fd_ofiles[fd], fdp->fd_ofiles[fd]); results in 1 memory access for each of the [fd]'s.
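For readers following the thread, the reference-then-recheck pattern under discussion can be sketched in userland C. This is a minimal sketch under stated assumptions, not the code from the patch: struct file, struct fdtable, file_tryref(), file_unref() and lookup_unlocked() are hypothetical stand-ins, and C11 atomics take the place of the kernel's atomic operations. The cast is written with the spacing suggested above and, as noted, it forces a fresh load of the fd'th element only, not of the table pointer itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct file. */
struct file {
	atomic_uint f_count;	/* reference count; 0 means "being freed" */
};

/* Hypothetical stand-in for the per-process descriptor table. */
struct fdtable {
	struct file **ofiles;	/* slots may change concurrently */
	int nfiles;
};

/* Take a reference unless the count has already dropped to zero. */
static int
file_tryref(struct file *fp)
{
	unsigned int old = atomic_load(&fp->f_count);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&fp->f_count, &old, old + 1))
			return (1);
	}
	return (0);
}

/* Drop a reference (the sketch never frees anything). */
static void
file_unref(struct file *fp)
{
	atomic_fetch_sub(&fp->f_count, 1);
}

/*
 * Lockless lookup: read the slot, take a reference, then read the
 * slot again through a volatile-qualified lvalue so the compiler
 * must issue a second load of that element.  If the slot changed in
 * between, drop the reference and retry.
 */
static struct file *
lookup_unlocked(struct fdtable *fdp, int fd)
{
	struct file *fp;

	assert(fdp != NULL);
	if (fd < 0 || fd >= fdp->nfiles)
		return (NULL);
	for (;;) {
		fp = ((struct file * volatile *)fdp->ofiles)[fd];
		if (fp == NULL)
			return (NULL);
		if (!file_tryref(fp))
			continue;	/* file is going away; re-read slot */
		if (fp == ((struct file * volatile *)fdp->ofiles)[fd])
			return (fp);	/* slot unchanged: reference is valid */
		file_unref(fp);		/* slot changed under us: retry */
	}
}
```

A caller would pair each successful lookup_unlocked() with a later file_unref(), which mirrors the "must be dropped with fdrop()" wording proposed for the comment.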
>> The problem is that since it is not declared as volatile, some other >> piece of code may have modified it but not yet flushed it to RAM. > > That is an acceptable race due to other guarantees. If it hasn't been > committed to memory yet, the old table still contains valid data. I only > need to be certain that the compiler doesn't cache the original ofiles value. > It can't anyway because atomics use inline assembly on all platforms but I'd > like it to be explicit anyway. It shouldn't matter that atomics use inline asm. Non-broken inline asm declares all its inputs and outputs, so compilers can see what it changes just as easily as for C code (and more easily than for non-inline asm or C). Anyway, you probably need atomics that have suitable memory barriers. Memory barriers must affect the compiler and make it perform refreshes for them to work, so you shouldn't need any volatile casts. E.g., all atomic store operations (including cmpset) have release semantics even if they aren't spelled with "_rel" or implemented using inline asm. On amd64 and i386, they happen to be implemented using inline asm with "memory" clobbers. The "memory" clobbers force refreshes of all non-local variables. Bruce From alc at cs.rice.edu Thu May 14 16:42:48 2009 From: alc at cs.rice.edu (Alan Cox) Date: Thu May 14 16:42:55 2009 Subject: PTE modified bit emulation trap In-Reply-To: <86prec4kwj.fsf@ds4.des.no> References: <86tz3o4lb9.fsf@ds4.des.no> <86prec4kwj.fsf@ds4.des.no> Message-ID: <4A0C49FF.1070707@cs.rice.edu> Dag-Erling Smørgrav wrote: > [from -alpha, -hackers] > > Dag-Erling Smørgrav writes: > >> Coverity complains about the lack of error checking in the following >> code in sys/kern/kern_sysctl.c, around line 1390: >> >> /* >> * Touch all the wired pages to avoid PTE modified >> * bit emulation traps on Alpha while holding locks >> * in the sysctl handler.
>> */ >> for (i = (wiredlen + PAGE_SIZE - 1) / PAGE_SIZE, >> cp = req->oldptr; i > 0; i--, cp += PAGE_SIZE) { >> copyin(cp, &dummy, 1); >> copyout(&dummy, cp, 1); >> } >> >> Since Alpha is dead, can we remove this, or is it still needed for other >> platforms? >> > > kmacy suggested you might be the right person to ask... the conclusion > so far is that it *might* be necessary on sparc64 and / or mips. > I think that this code may no longer be needed, but I want to double-check. I faced a related problem implementing superpages support, so I introduced an additional "access type" parameter to pmap_enter(). This parameter was specifically intended to allow a pmap_enter() implementation to preset the PTE's modified bit. I think that the simulated page fault that occurs on vslock()-style wiring passes "write access" to pmap_enter(). If so, then it's just a matter of tweaking the MIPS or any other pmap_enter() to actually do something with the "access type" parameter. Currently, only the architectures that implement the pmap-level support for superpages, i.e., amd64 and i386, do anything with this parameter. Alan From jhb at freebsd.org Fri May 15 13:11:06 2009 From: jhb at freebsd.org (John Baldwin) Date: Fri May 15 13:11:22 2009 Subject: PTE modified bit emulation trap In-Reply-To: <4A0C49FF.1070707@cs.rice.edu> References: <86tz3o4lb9.fsf@ds4.des.no> <86prec4kwj.fsf@ds4.des.no> <4A0C49FF.1070707@cs.rice.edu> Message-ID: <200905150804.36977.jhb@freebsd.org> On Thursday 14 May 2009 12:42:39 pm Alan Cox wrote: > Dag-Erling Smørgrav wrote: > > [from -alpha, -hackers] > > > > Dag-Erling Smørgrav writes: > > > >> Coverity complains about the lack of error checking in the following > >> code in sys/kern/kern_sysctl.c, around line 1390: > >> > >> /* > >> * Touch all the wired pages to avoid PTE modified > >> * bit emulation traps on Alpha while holding locks > >> * in the sysctl handler.
> >> */ > >> for (i = (wiredlen + PAGE_SIZE - 1) / PAGE_SIZE, > >> cp = req->oldptr; i > 0; i--, cp += PAGE_SIZE) { > >> copyin(cp, &dummy, 1); > >> copyout(&dummy, cp, 1); > >> } > >> > >> Since Alpha is dead, can we remove this, or is it still needed for other > >> platforms? > >> > > > > kmacy suggested you might be the right person to ask... the conclusion > > so far is that it *might* be necessary on sparc64 and / or mips. > > > > I think that this code may no longer be needed, but I want to > double-check. I faced a related problem implementing superpages > support, so I introduced an additional "access type" parameter to > pmap_enter(). This parameter was specifically intended to allow a > pmap_enter() implementation to preset the PTE's modified bit. I think > that the simulated page fault that occurs on vslock()-style wiring > passes "write access" to pmap_enter(). If so, then it's just a matter > of tweaking the MIPS or any other pmap_enter() to actually do something > with the "access type" parameter. Currently, only the architectures > that implement the pmap-level support for superpages, i.e., amd64 and > i386, do anything with this parameter. Then it sounds like the code should definitely be removed and that if any problems do crop up, they can be fixed in pmap_enter() instead. 
-- John Baldwin From alc at cs.rice.edu Fri May 15 16:24:16 2009 From: alc at cs.rice.edu (Alan Cox) Date: Fri May 15 16:24:22 2009 Subject: PTE modified bit emulation trap In-Reply-To: <200905150804.36977.jhb@freebsd.org> References: <86tz3o4lb9.fsf@ds4.des.no> <86prec4kwj.fsf@ds4.des.no> <4A0C49FF.1070707@cs.rice.edu> <200905150804.36977.jhb@freebsd.org> Message-ID: <4A0D9727.6030703@cs.rice.edu> John Baldwin wrote: > On Thursday 14 May 2009 12:42:39 pm Alan Cox wrote: > >> Dag-Erling Smørgrav wrote: >> >>> [from -alpha, -hackers] >>> >>> Dag-Erling Smørgrav writes: >>> >>> >>>> Coverity complains about the lack of error checking in the following >>>> code in sys/kern/kern_sysctl.c, around line 1390: >>>> >>>> /* >>>> * Touch all the wired pages to avoid PTE modified >>>> * bit emulation traps on Alpha while holding locks >>>> * in the sysctl handler. >>>> */ >>>> for (i = (wiredlen + PAGE_SIZE - 1) / PAGE_SIZE, >>>> cp = req->oldptr; i > 0; i--, cp += PAGE_SIZE) { >>>> copyin(cp, &dummy, 1); >>>> copyout(&dummy, cp, 1); >>>> } >>>> >>>> Since Alpha is dead, can we remove this, or is it still needed for other >>>> platforms? >>>> >>>> >>> kmacy suggested you might be the right person to ask... the conclusion >>> so far is that it *might* be necessary on sparc64 and / or mips. >>> >>> >> I think that this code may no longer be needed, but I want to >> double-check. I faced a related problem implementing superpages >> support, so I introduced an additional "access type" parameter to >> pmap_enter(). This parameter was specifically intended to allow a >> pmap_enter() implementation to preset the PTE's modified bit. I think >> that the simulated page fault that occurs on vslock()-style wiring >> passes "write access" to pmap_enter().
If so, then it's just a matter >> of tweaking the MIPS or any other pmap_enter() to actually do something >> with the "access type" parameter. Currently, only the architectures >> that implement the pmap-level support for superpages, i.e., amd64 and >> i386, do anything with this parameter. >> > > Then it sounds like the code should definitely be removed and that if any > problems do crop up, they can be fixed in pmap_enter() instead. > > I've had a chance to verify what I said above, so you can remove the code. I don't think that sparc64 will require any changes, but MIPS needs a two-line change to pmap_enter(). I'll see that the change gets made. Alan From bugmaster at FreeBSD.org Mon May 18 11:06:48 2009 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon May 18 11:07:26 2009 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200905181106.n4IB6lGg075571@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total.
From rmacklem at uoguelph.ca Tue May 19 15:29:48 2009 From: rmacklem at uoguelph.ca (Rick Macklem) Date: Tue May 19 15:29:54 2009 Subject: nfs server resource exhaustion (before it's too late) Message-ID: In the experimental nfs server (sys/fs/nfsserver), there is a function that, when it returns non-zero, causes the server to reply NFSERR_DELAY to the client so that it will try the RPC again a little later. (Or, for NFSv2 over UDP, which doesn't have NFSERR_DELAY, it simply drops the request and assumes the client will time out and try it again.) This is intended to avoid the situation where the server cannot m_get/m_getcl/malloc part way through processing a request, due to resource exhaustion. (The malloc case isn't as critical, since I have high water marks set to limit the # of allocations for the various NFSv4 state related structures that are malloc'd.) At this point the function is just a stub: int nfsrv_mallocmget_limit(void) { return (0); } I just took a quick look (I don't know anything about UMA, except that it seems to be used by m_get and m_getcl) and this was what I could think of for doing the above on FreeBSD8. (It wasn't obvious to me if there was a limit set for the various zones used by malloc(), so I didn't include them.) int nfsrv_mallocmget_limit(void) { u_int32_t pages, maxpages; uma_zone_get_pagecnts(zone_clust, &pages, &maxpages); if (maxpages != 0 && (pages * 12 / 10) > maxpages) return (1); return (0); } At this point, the only function I could see that would return the above information is sysctl_vm_zone_stats() and it looks like overkill. Also, the function needs to be relatively low overhead, since it is called for every nfs rpc the server gets so I thought this might be ok?
/* added to sys/vm/uma_core.c */ void uma_zone_get_pagecnts(uma_zone_t zone, u_int32_t *pages, u_int32_t *maxpages) { uma_keg_t keg; ZONE_LOCK(zone); keg = zone_first_keg(zone); *pages = keg->uk_pages; *maxpages = keg->uk_maxpages; ZONE_UNLOCK(zone); } Does this look reasonable or can anyone suggest a better alternative? Thanks in advance for any suggestions, rick From jhb at freebsd.org Tue May 19 18:59:01 2009 From: jhb at freebsd.org (John Baldwin) Date: Tue May 19 18:59:17 2009 Subject: sglist(9) Message-ID: <200905191458.50764.jhb@freebsd.org> So one of the things I worked on while hacking away at unmapped disk I/O requests was a little API to manage scatter/gather lists of physical addresses. The basic premise is that a sglist describes a logical object that is backed by one or more physical address ranges. To minimize locking, the sglist objects themselves are immutable once they are shared. The unmapped disk I/O project is still very much a WIP (and I'm not even working on any of the really hard bits myself). However, I actually found this object to be useful for something else I have been working on: the mmap() extensions for the Nvidia amd64 driver. For the Nvidia patches I have created a new type of VM object that is very similar to OBJT_DEVICE objects except that it uses a sglist to determine the physical pages backing the object instead of calling the d_mmap() method for each page. Anyway, adding this little API is just the first in a series of patches needed for the Nvidia driver work. I plan to MFC them to 7.x relatively soon in the hopes that we can soon have a supported Nvidia driver on amd64 on 7.x. The current patches for all the Nvidia stuff are at http://www.FreeBSD.org/~jhb/pat/ This particular patch to just add the sglist(9) API is at http://www.FreeBSD.org/~jhb/patches/sglist.patch and is slightly more polished in that it includes a manpage.
:) -- John Baldwin From jhb at FreeBSD.org Tue May 19 19:11:05 2009 From: jhb at FreeBSD.org (John Baldwin) Date: Tue May 19 19:11:11 2009 Subject: [PATCH] Adding support for multiple boot-time passes of the device tree Message-ID: <200905191510.23039.jhb@FreeBSD.org> If you were at BSDCan a few weeks ago you may have seen my proposal for extending new-bus to support multiple scans of the device tree during boot-time probing. This patch is the infrastructure work to allow multiple passes. It does not move any drivers (except root0 which is already special) into an early pass, so all devices will still probe as a single pass for now. However, getting this in now before 8.0 will enable folks to start working on other problems such as resource discovery and management and will get the ABI set before the 8.0 feature freeze. The paper where I go into greater detail about the rationale and implementation is available at http://www.FreeBSD.org/~jhb/papers/bsdcan/2009/. The actual patch is available for review at http://www.FreeBSD.org/~jhb/patches/multipass.patch -- John Baldwin From julian at elischer.org Tue May 19 20:45:57 2009 From: julian at elischer.org (Julian Elischer) Date: Tue May 19 20:46:03 2009 Subject: sglist(9) In-Reply-To: <200905191458.50764.jhb@freebsd.org> References: <200905191458.50764.jhb@freebsd.org> Message-ID: <4A1317C7.4000509@elischer.org> John Baldwin wrote: > So one of the things I worked on while hacking away at unmapped disk I/O > requests was a little API to manage scatter/gather lists of physical > addresses. The basic premise is that a sglist describes a logical object I was JUST looking at this because of some Linux code I was looking at, that uses a predefined sg list that I think it is getting from Linux. (you may look to see what the Linux sg list code does/has). > that is backed by one or more physical address ranges. To minimize locking, > the sglist objects themselves are immutable once they are shared.
The > unmapped disk I/O project is still very much a WIP (and I'm not even working > on any of the really hard bits myself). However, I actually found this > object to be useful for something else I have been working on: the mmap() > extensions for the Nvidia amd64 driver. For the Nvidia patches I have > created a new type of VM object that is very similar to OBJT_DEVICE objects > except that it uses a sglist to determine the physical pages backing the > object instead of calling the d_mmap() method for each page. Anyway, adding > this little API is just the first in a series of patches needed for the > Nvidia driver work. I plan to MFC them to 7.x relatively soon in the hopes > that we can soon have a supported Nvidia driver on amd64 on 7.x. > > The current patches for all the Nvidia stuff are at > http://www.FreeBSD.org/~jhb/pat/ > > This particular patch to just add the sglist(9) API is at > http://www.FreeBSD.org/~jhb/patches/sglist.patch and is slightly more > polished in that it includes a manpage. :) > From jhb at freebsd.org Wed May 20 14:02:27 2009 From: jhb at freebsd.org (John Baldwin) Date: Wed May 20 14:02:53 2009 Subject: sglist(9) In-Reply-To: <4A1317C7.4000509@elischer.org> References: <200905191458.50764.jhb@freebsd.org> <4A1317C7.4000509@elischer.org> Message-ID: <200905201002.00533.jhb@freebsd.org> On Tuesday 19 May 2009 4:34:15 pm Julian Elischer wrote: > John Baldwin wrote: > > So one of the things I worked on while hacking away at unmapped disk I/O > > requests was a little API to manage scatter/gather lists of physical > > addresses. The basic premise is that a sglist describes a logical object > > I was JUST looking at this because of some Linux code I was looking > at, that uses a predefined sg list that I think it is getting from > Linux. (you may look to see what the Linux sg list code does/has). I looked at scatterlist yesterday and it appears to be a bit more DMA-centric whereas sglist is more intended to describe a range of memory pages.
However, the APIs are somewhat similar (sg_chain() is a lot like sglist_join() for example). They have a header structure and a list of scatter/gather elements which is also very similar. The one thing they do differently is that whereas sglist(9) always uses a single array of scatter/gather list elements of variable length, they allocate "blocks" of scatter/gather list elements and then chain multiple blocks together if needed. -- John Baldwin From imp at bsdimp.com Wed May 20 15:01:24 2009 From: imp at bsdimp.com (M. Warner Losh) Date: Wed May 20 15:01:30 2009 Subject: Remove d_thread_t for 8.0 In-Reply-To: <200905121020.18497.jhb@freebsd.org> References: <200905121020.18497.jhb@freebsd.org> Message-ID: <20090520.085924.-1935226744.imp@bsdimp.com> In message: <200905121020.18497.jhb@freebsd.org> John Baldwin writes: : In the same vein as purging BURN_BRIDGES stuff, is there any objection to : removing d_thread_t from 8.0? It is intended as a compat shim to reduce : diffs with 4.x. However, at this point drivers are not actively being merged : back to 4.x, so I think it is no longer necessary. It was also intended to allow easier sharing for folks that were using FreeBSD 4.x, 5.x, etc. I know that at least one user still has some 4.x deployments, but I suspect that they are otherwise off 4.x so it might not be a problem for them. It would be yet another thing to change when going from 7.x to 8.x for them... We certainly should remove it from the drivers in the tree for 8.0. Right now it is used in about a two dozen places. Warner From jhb at freebsd.org Wed May 20 15:24:37 2009 From: jhb at freebsd.org (John Baldwin) Date: Wed May 20 15:24:44 2009 Subject: Remove d_thread_t for 8.0 In-Reply-To: <20090520.085924.-1935226744.imp@bsdimp.com> References: <200905121020.18497.jhb@freebsd.org> <20090520.085924.-1935226744.imp@bsdimp.com> Message-ID: <200905201124.24747.jhb@freebsd.org> On Wednesday 20 May 2009 10:59:24 am M.
Warner Losh wrote: > In message: <200905121020.18497.jhb@freebsd.org> > John Baldwin writes: > : In the same vein as purging BURN_BRIDGES stuff, is there any objection to > : removing d_thread_t from 8.0? It is intended as a compat shim to reduce > : diffs with 4.x. However, at this point drivers are not actively being merged > : back to 4.x, so I think it is no longer necessary. > > It was also intended to allow easier sharing for folks that were using > FreeBSD 4.x, 5.x, etc. I know that at least one user still has some > 4.x deployments, but I suspect that they are otherwise off 4.x so it > might not be a problem for them. It would be yet another thing to > change when going from 7.x to 8.x for them... > > We certainly should remove it from the drivers in the tree for 8.0. > Right now it is used in about a two dozen places. Even in a shared driver I believe the function prototypes for devsw routines would already have to be #ifdef'd due to the 'dev_t' -> 'struct cdev *' change which does have a similar foo_t typedef to ease the transition. Given that, any code compiled for 7.0+ is already using a function prototype that is not compatible with 4.x and there isn't a need for it to use d_thread_t. They can just use 'struct thread *' always when using 'struct cdev *'. -- John Baldwin From imp at bsdimp.com Wed May 20 16:27:58 2009 From: imp at bsdimp.com (M. Warner Losh) Date: Wed May 20 16:28:05 2009 Subject: Remove d_thread_t for 8.0 In-Reply-To: <200905201124.24747.jhb@freebsd.org> References: <200905121020.18497.jhb@freebsd.org> <20090520.085924.-1935226744.imp@bsdimp.com> <200905201124.24747.jhb@freebsd.org> Message-ID: <20090520.102612.-1795528612.imp@bsdimp.com> In message: <200905201124.24747.jhb@freebsd.org> John Baldwin writes: : On Wednesday 20 May 2009 10:59:24 am M. 
Warner Losh wrote: : > In message: <200905121020.18497.jhb@freebsd.org> : > John Baldwin writes: : > : In the same vein as purging BURN_BRIDGES stuff, is there any objection to : > : removing d_thread_t from 8.0? It is intended as a compat shim to reduce : > : diffs with 4.x. However, at this point drivers are not actively being : merged : > : back to 4.x, so I think it is no longer necessary. : > : > It was also intended to allow easier sharing for folks that were using : > FreeBSD 4.x, 5.x, etc. I know that at least one user still has some : > 4.x deployments, but I suspect that they are otherwise off 4.x so it : > might not be a problem for them. It would be yet another thing to : > change when going from 7.x to 8.x for them... : > : > We certainly should remove it from the drivers in the tree for 8.0. : > Right now it is used in about a two dozen places. : : Even in a shared driver I believe the function prototypes for devsw routines : would already have to be #ifdef'd due to the 'dev_t' -> 'struct cdev *' : change which does have a similar foo_t typedef to ease the transition. Given : that, any code compiled for 7.0+ is already using a function prototype that : is not compatible with 4.x and there isn't a need for it to use d_thread_t. : They can just use 'struct thread *' always when using 'struct cdev *'. Yes. Let's eliminate it from the tree, and then talk about removing it from conf.h :) There's other ways to paper over those issues, and I know that they are relatively small in header files. But those headers are likely beyond the scope of what the project has to support.. 
Warner From jroberson at jroberson.net Wed May 20 18:45:32 2009 From: jroberson at jroberson.net (Jeff Roberson) Date: Wed May 20 18:45:39 2009 Subject: sglist(9) In-Reply-To: <200905191458.50764.jhb@freebsd.org> References: <200905191458.50764.jhb@freebsd.org> Message-ID: On Tue, 19 May 2009, John Baldwin wrote: > So one of the things I worked on while hacking away at unmapped disk I/O > requests was a little API to manage scatter/gather lists of phyiscal > addresses. The basic premise is that a sglist describes a logical object > that is backed by one or more physical address ranges. To minimize locking, > the sglist objects themselves are immutable once they are shared. The > unmapped disk I/O project is still very much a WIP (and I'm not even working > on any of the really hard bits myself). However, I actually found this > object to be useful for something else I have been working on: the mmap() > extensions for the Nvidia amd64 driver. For the Nvidia patches I have > created a new type of VM object that is very similar to OBJT_DEVICE objects > except that it uses a sglist to determine the physical pages backing the > object instead of calling the d_mmap() method for each page. Anyway, adding > this little API is just the first in a series of patches needed for the > Nvidia driver work. I plan to MFC them to 7.x relatively soon in the hopes > that we can soon have a supported Nvidia driver on amd64 on 7.x. > > The current patches for all the Nvidia stuff is at > http://www.FreeBSD.org/~jhb/pat/ > > This particular patch to just add the sglist(9) API is at > http://www.FreeBSD.org/~jhb/patches/sglist.patch and is slightly more > polished in that it includes a manpage. :) I have a couple of minor comments: 1) SGLIST_APPEND() contains a return() within a macro. Shouldn't this be an inline that returns an error code that is always checked? These kinds of macros get people into trouble. 
It also could be written in such a way that you don't have to handle nseg == 0 at each callsite and then it's big enough that it probably shouldn't be a macro or an inline. 2) I worry that if all users do sglist_count() followed by a dynamic allocation and then an _append() they will be very expensive. pmap_kextract() is much more expensive than it may first seem to be. Do you have a user of count already? 3) Rather than having sg_segs be an actual pointer, did you consider making it an unsized array? This removes the overhead of one pointer from the structure while enforcing that it's always contiguously allocated. 4) SGLIST_INIT might be better off as an inline, and may not even belong in the header file. In general I think this is a good idea. It'd be nice to work on replacing the buf layer's implementation with something like this that could be used directly by drivers. Have you considered a busdma operation to load from a sglist? Thanks, Jeff > > -- > John Baldwin > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From jroberson at jroberson.net Wed May 20 18:55:55 2009 From: jroberson at jroberson.net (Jeff Roberson) Date: Wed May 20 18:56:01 2009 Subject: lockless file descriptor lookup In-Reply-To: <20090514131613.T1224@besplex.bde.org> References: <86bppy60ti.fsf@ds4.des.no> <20090514131613.T1224@besplex.bde.org> Message-ID: On Thu, 14 May 2009, Bruce Evans wrote: > On Tue, 12 May 2009, Jeff Roberson wrote: > >> On Tue, 12 May 2009, Dag-Erling Sm?rgrav wrote: >> >>> Jeff Roberson writes: >>>> I'd also appreciate it if someone could look at my volatile cast and >>>> make sure I'm actually forcing the compiler to refresh the fd_ofiles >>>> array here: >>>> >>>> + if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd]) > > This has 2 style bugs (missing space after first '*' and missing 
space > before second '*'. > > It isn't clear whether you want to refresh the fd_ofiles pointer to the > (first element of) the array, or the fd'th element. It is clear that > you don't want to refresh the whole array. The above refreshes the > fd'th element. Strangely, in my tests gcc refreshes the fd'th element > even without the cast. E.g., This is actually intended to catch cases where the descriptor array has expanded and the pointer to fd_ofiles has changed, or the file has been closed and the pointer at the fd'th element has changed. I'm attempting to force the compiler to reload the fd_ofiles array pointer from the fdp structure. If it has done that, it can not have the fd'th element cached and so that must be a fresh memory reference. > > test(fdp->fd_ofiles[fd], fdp->fd_ofiles[fd]); > > results in 1 memory access for each of the [fd]'s. > >>> The problem is that since it is not declared as volatile, some other >>> piece of code may have modified it but not yet flushed it to RAM. >> >> That is an acceptable race due to other guarantees. If it hasn't been >> committed to memory yet, the old table still contains valid data. I only >> need to be certain that the compiler doesn't cache the original ofiles >> value. It can't anyway because atomics use inline assembly on all platforms >> but I'd like it to be explicit anyway. > > It shouldn't matter that atomics use inline asm. Non-broken inline > asm declares all its inputs and outputs, so compilers can see what it > changes just as easily as for C code (and more easily than for non- > inline asm or C). This is a good point. It's all the more important that we get the volatile/memory barrier worked out correctly then. I don't believe there are bugs today but it may be due to side-effects we shouldn't count on. > > Anyway, you probably need atomics that have suitable memory barriers. 
> Memory barriers must affect the compiler and make it perform refreshes > for them to work, so you shouldn't need any volatile casts. E.g., all > atomic store operations (including cmpset) have release semantics even > if they aren't spelled with "_rel" or implemented using inline asm. > On amd64 and i386, they happen to be implemented using inline asm with > "memory" clobbers. The "memory" clobbers force refreshes of all > non-local variables. So I think I need an _acq memory barrier on the atomic cmpset of the refcount to prevent speculative loading of the fd_ofiles array pointer by the processor and the volatile in the second dereference as I have it now to prevent caching of the pointer by the compiler. What do you think? The references prior to the atomic increment have no real ordering requirements. Only the ones afterwards need to be strict so that we can verify the results. Thanks, Jeff > > Bruce > From jhb at freebsd.org Wed May 20 19:36:49 2009 From: jhb at freebsd.org (John Baldwin) Date: Wed May 20 19:37:01 2009 Subject: sglist(9) In-Reply-To: References: <200905191458.50764.jhb@freebsd.org> Message-ID: <200905201522.58501.jhb@freebsd.org> On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote: > On Tue, 19 May 2009, John Baldwin wrote: > > > So one of the things I worked on while hacking away at unmapped disk I/O > > requests was a little API to manage scatter/gather lists of phyiscal > > addresses. The basic premise is that a sglist describes a logical object > > that is backed by one or more physical address ranges. To minimize locking, > > the sglist objects themselves are immutable once they are shared. The > > unmapped disk I/O project is still very much a WIP (and I'm not even working > > on any of the really hard bits myself). However, I actually found this > > object to be useful for something else I have been working on: the mmap() > > extensions for the Nvidia amd64 driver. 
For the Nvidia patches I have > > created a new type of VM object that is very similar to OBJT_DEVICE objects > > except that it uses a sglist to determine the physical pages backing the > > object instead of calling the d_mmap() method for each page. Anyway, adding > > this little API is just the first in a series of patches needed for the > > Nvidia driver work. I plan to MFC them to 7.x relatively soon in the hopes > > that we can soon have a supported Nvidia driver on amd64 on 7.x. > > > > The current patches for all the Nvidia stuff is at > > http://www.FreeBSD.org/~jhb/pat/ > > > > This particular patch to just add the sglist(9) API is at > > http://www.FreeBSD.org/~jhb/patches/sglist.patch and is slightly more > > polished in that it includes a manpage. :) > > I have a couple of minor comments: > > 1) SGLIST_APPEND() contains a return() within a macro. Shouldn't this be > an inline that returns an error code that is always checked? These kinds > of macros get people into trouble. It also could be written in such a way > that you don't have to handle nseg == 0 at each callsite and then it's big > enough that it probably shouldn't be a macro or an inline. Mostly I was trying to avoid having to duplicate a lot of code. The reason I didn't handle nseg == 0 directly is a possibly dubious attempt to optimize the _sglist_append() inline so that it doesn't have to do the extra branch inside the main loop for virtual address regions that span multiple pages. > 2) I worry that if all users do sglist_count() followed by a dynamic > allocation and then an _append() they will be very expensive. > pmap_kextract() is much more expensive than it may first seem to be. Do > you have a user of count already? The only one that does now is sglist_build() and nothing currently uses that. VOP_GET/PUTPAGES would not need to do this since they could simply append the physical addresses extracted directly from vm_page_t's for example. 
I'm not sure this will be used very much now as originally I thought I would be changing all storage drivers to do all DMA operations using sglists and this sort of thing would have been used for non-bio requests like firmware commands; however, as expounded on below, it actually appears better to still treat bio's separate from non-bio requests for bus_dma so that the non-bio requests can continue to use bus_dmamap_load_buffer() as they do now. > 3) Rather than having sg_segs be an actual pointer, did you consider > making it an unsized array? This removes the overhead of one pointer from > the structure while enforcing that it's always contiguously allocated. It's actually a feature to be able to have the header in separate storage from segs array. I use this in the jhb_bio branch in the bus_dma implementations where a pre-allocated segs array is stored in the bus dma tag and the header is allocated on the stack. > 4) SGLIST_INIT might be better off as an inline, and may not even belong > in the header file. That may be true. I currently only use it in the jhb_bio branch for the bus_dma implementations. > In general I think this is a good idea. It'd be nice to work on replacing > the buf layer's implementation with something like this that could be used > directly by drivers. Have you considered a busdma operation to load from > a sglist? So in regards to the bus_dma stuff, I did work on this a while ago in my jhb_bio branch. I do have a bus_dmamap_load_sglist() and I had planned on using that in storage drivers directly. However, I ended up circling back to preferring a bus_dmamap_load_bio() and adding a new 'bio_start' field to 'struct bio' that is an offset into an attached sglist. 
This let me carve up I/O requests in geom_dev to satisfy a disk device's max
request size while still sharing the same read-only sglist across the various
BIOs (by simply adjusting bio_length and bio_start to be a subrange of the
sglist) as opposed to doing memory allocations to allocate specific ranges of
an sglist (using something like sglist_slice()) for each I/O request. I then
have bus_dmamap_load_bio() use the subrange of the sglist internally or fall
back to using the KVA pointer if the sglist isn't present.

However, I'm not really trying to get the bio stuff into the tree; this is
mostly for the Nvidia case, and for that use case the driver is simply
creating single-entry lists and using sglist_append_phys(). An example of
doing something like this is from my sample patdev test module where it
creates a VM object that maps the local APIC uncacheable like so:

	/* Create a scatter/gather list that maps the local APIC. */
	sc->sg = sglist_alloc(1, M_WAITOK);
	sglist_append_phys(sc->sg, lapic_paddr, LAPIC_LEN);

	/* Create a VM object that is backed by the scatter/gather list. */
	sc->sgobj = vm_pager_allocate(OBJT_SG, sc->sg, LAPIC_LEN, VM_PROT_READ, 0);
	VM_OBJECT_LOCK(sc->sgobj);
	vm_object_set_cache_mode(sc->sgobj, VM_CACHE_UNCACHEABLE);
	VM_OBJECT_UNLOCK(sc->sgobj);

The same approach can be used to map PCI BARs, etc. into userland as well.

-- 
John Baldwin

From jhb at freebsd.org Wed May 20 19:36:50 2009
From: jhb at freebsd.org (John Baldwin)
Date: Wed May 20 19:37:01 2009
Subject: lockless file descriptor lookup
In-Reply-To:
References: <20090514131613.T1224@besplex.bde.org>
Message-ID: <200905201524.49090.jhb@freebsd.org>

On Wednesday 20 May 2009 2:59:52 pm Jeff Roberson wrote:
> On Thu, 14 May 2009, Bruce Evans wrote:
> > Anyway, you probably need atomics that have suitable memory barriers.
> > Memory barriers must affect the compiler and make it perform refreshes
> > for them to work, so you shouldn't need any volatile casts.
E.g., all > > atomic store operations (including cmpset) have release semantics even > > if they aren't spelled with "_rel" or implemented using inline asm. > > On amd64 and i386, they happen to be implemented using inline asm with > > "memory" clobbers. The "memory" clobbers force refreshes of all > > non-local variables. > > So I think I need an _acq memory barrier on the atomic cmpset of the > refcount to prevent speculative loading of the fd_ofiles array pointer by > the processor and the volatile in the second dereference as I have it > now to prevent caching of the pointer by the compiler. What do you think? > > The references prior to the atomic increment have no real ordering > requirements. Only the ones afterwards need to be strict so that we can > verify the results. I think having the _acq is correct and that the "memory" clobber it contains will force the compiler to reload fd_ofiles without needing the volatile cast (and thus that you can remove the volatile cast altogether and just add the _acq barrier). -- John Baldwin From brde at optusnet.com.au Thu May 21 08:03:27 2009 From: brde at optusnet.com.au (Bruce Evans) Date: Thu May 21 08:03:39 2009 Subject: lockless file descriptor lookup In-Reply-To: References: <86bppy60ti.fsf@ds4.des.no> <20090514131613.T1224@besplex.bde.org> Message-ID: <20090521174647.R21310@delplex.bde.org> On Wed, 20 May 2009, Jeff Roberson wrote: > On Thu, 14 May 2009, Bruce Evans wrote: > >> On Tue, 12 May 2009, Jeff Roberson wrote: >> >>> On Tue, 12 May 2009, Dag-Erling Sm?rgrav wrote: >>> >>>> Jeff Roberson writes: >>>>> I'd also appreciate it if someone could look at my volatile cast and >>>>> make sure I'm actually forcing the compiler to refresh the fd_ofiles >>>>> array here: >>>>> >>>>> + if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd]) >> >> This has 2 style bugs (missing space after first '*' and missing space >> before second '*'. 
>> >> It isn't clear whether you want to refresh the fd_ofiles pointer to the >> (first element of) the array, or the fd'th element. It is clear that >> you don't want to refresh the whole array. The above refreshes the >> fd'th element. Strangely, in my tests gcc refreshes the fd'th element >> even without the cast. E.g., > > This is actually intended to catch cases where the descriptor array has > expanded and the pointer to fd_ofiles has changed, or the file has been > closed and the pointer at the fd'th element has changed. I'm attempting to > force the compiler to reload the fd_ofiles array pointer from the fdp > structure. If it has done that, it can not have the fd'th element cached and > so that must be a fresh memory reference. So you want to refresh both (the array element implicitly from the pointer). The above cast is clearly no use for refreshing fdp->fd_ofiles, since its type is that of fdp_ofiles (modulo a '*' or two), while to affect fdp->fd_ofiles it would need to make (at least the fd_ofile part of) (*fdp) volatile, and for that it would need to have the type of fdp (modulo a '*' or two), which is quite different (struct filedesc instead of struct file). It is simplest to make all of (*fdp) volatile. The cast for that is (I think) (volatile struct filedesc *)fdp (normal spelling) or (struct filedesc volatile *)fdp (better spelling). Continued in my reply to jhb's reply (on use of atomic instructions/barriers -- we should be able to drop the volatile cast instead of fixing it as above, but should be more careful about the barriers). 
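The point about where the volatile qualifier lands can be shown with a small userland sketch. The two structs are stripped-down stand-ins for the kernel's struct file and struct filedesc, and the helper names load_elem_volatile() and load_table_volatile() are made up for this illustration; neither appears in the patch under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-ins for the kernel structures (assumption). */
struct file {
	int	f_count;
};

struct filedesc {
	struct file	**fd_ofiles;	/* descriptor table */
	int		fd_nfiles;	/* number of slots in the table */
};

/*
 * Cast applied to the value of fdp->fd_ofiles: only the load of slot
 * [fd] is a volatile access.  The load of the fd_ofiles pointer itself
 * stays an ordinary access that the compiler may reuse from earlier.
 */
static struct file *
load_elem_volatile(struct filedesc *fdp, int fd)
{
	assert(fd >= 0 && fd < fdp->fd_nfiles);
	return (((struct file * volatile *)fdp->fd_ofiles)[fd]);
}

/*
 * Cast applied to fdp: the load of the fd_ofiles member is a volatile
 * access, so the table pointer is fetched from memory here.  The slot
 * load that follows is an ordinary access through that fresh pointer.
 */
static struct file *
load_table_volatile(struct filedesc *fdp, int fd)
{
	assert(fd >= 0 && fd < fdp->fd_nfiles);
	return (((volatile struct filedesc *)fdp)->fd_ofiles[fd]);
}
```

Both helpers return the same value when nothing changes underneath them; they differ only in which load the compiler is forbidden to cache, which is the distinction drawn above.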
Bruce From rwatson at FreeBSD.org Thu May 21 09:29:42 2009 From: rwatson at FreeBSD.org (Robert Watson) Date: Thu May 21 09:29:53 2009 Subject: HEADS UP: old UMich nfs4client to be removed, replaced with new NFSv234 client/server Message-ID: Dear all: This is advance warning that we'll be garbage-collecting the UMich NFSv4 client (src/sys/nfs4client and supporting RPC code, daemons, and mount tool) prior to 8.0 now that Rick Macklems NFSv234 client and server are in the base tree. This removal will likely be in the next week, as the 8.0 feature freeze is at the end of the month. The new client and server provide significantly improved support for NFSv4, and while they remain experimental, they should offer both more reliable, more complete, and actively maintained NFSv4 support. Anyone using nfs4client (probably not many) is encouraged to try out and provide feedback on the new NFSv4 code as soon as possible. Robert N M Watson Computer Laboratory University of Cambridge From rwatson at FreeBSD.org Thu May 21 09:36:33 2009 From: rwatson at FreeBSD.org (Robert Watson) Date: Thu May 21 09:36:39 2009 Subject: Posix shared memory problem In-Reply-To: <18952.21468.748665.878710@hergotha.csail.mit.edu> References: <200905100500.n4A50GOa050728@hergotha.csail.mit.edu> <7710650619.20090510075706@scriptolutions.com> <18950.63671.323324.756287@hergotha.csail.mit.edu> <1393224851.20090511112537@scriptolutions.com> <18952.21468.748665.878710@hergotha.csail.mit.edu> Message-ID: On Mon, 11 May 2009, Garrett Wollman wrote: > < said: > >> Some idiots started to think about this as a file path. But it isn't >> and it shouldn't. > > Actually, it really should be. Ask a security person or a virtualization > person to explain why an unnecessary multiplicity of namespaces is a bad > idea. Despite having been partly responsible for the new POSIX shm code in 8.x that removes file system namespace use for POSIX shm, I strongly agree with your statement. 
The hierarchal and access-controlled structure of the file system namespace is a key feature that makes it preferable to the plethora of other weird global namespaces arriving with various new IPC models. A hierarchal namespace with access control allows reliable delegation of portions of the namespace -- for example, administrators can authorize a user to use any name in "/home/username" without worrying that users will spoof each others services based on application start order, crashes, etc. The existence of additional flat namespaces, such as used by System V IPC, POSIX shm, POSIX sem, etc, is quite problematic from this perspective, and significantly increases the risk of vulnerability. Robert N M Watson Computer Laboratory University of Cambridge From brde at optusnet.com.au Thu May 21 09:37:48 2009 From: brde at optusnet.com.au (Bruce Evans) Date: Thu May 21 09:37:55 2009 Subject: lockless file descriptor lookup In-Reply-To: <200905201524.49090.jhb@freebsd.org> References: <20090514131613.T1224@besplex.bde.org> <200905201524.49090.jhb@freebsd.org> Message-ID: <20090521180328.W21310@delplex.bde.org> On Wed, 20 May 2009, John Baldwin wrote: > On Wednesday 20 May 2009 2:59:52 pm Jeff Roberson wrote: >> On Thu, 14 May 2009, Bruce Evans wrote: >>> Anyway, you probably need atomics that have suitable memory barriers. >>> Memory barriers must affect the compiler and make it perform refreshes >>> for them to work, so you shouldn't need any volatile casts. E.g., all >>> atomic store operations (including cmpset) have release semantics even >>> if they aren't spelled with "_rel" or implemented using inline asm. >>> On amd64 and i386, they happen to be implemented using inline asm with >>> "memory" clobbers. The "memory" clobbers force refreshes of all >>> non-local variables. Actually, it is the "acquire" operations that happen to be implemented with "memory" clobbers on amd64 and i386. "release" semantics are (completely?) 
automatic on amd64 and i386 so no "memory" clobbers are used for them
(except IIRC in old versions).

>> So I think I need an _acq memory barrier on the atomic cmpset of the
>> refcount to prevent speculative loading of the fd_ofiles array pointer by
>> the processor and the volatile in the second dereference as I have it
>> now to prevent caching of the pointer by the compiler. What do you think?

I thought that it was a _rel barrier that was needed due to my misreading
of the "memory" clobbers corrected above. Perhaps both _acq and _rel
are needed in cases like yours where a single cmpset corresponds to a
(lock, unlock) pair. On amd64 and i386, plain atomic_cmpset already
has both (_acq via the explicit "memory" clobber, and _rel implicitly),
but the man page doesn't say that this is generic. It only says that
all stores have _rel semantics, and it uses explicit _acq suffixes
in examples of how to use cmpset to implement locking (the examples
are rotted copies of locking in sys/mutex.h). Since a
successful plain cmpset does a store, this implicitly says that plain
cmpset's have _rel semantics and cmpset_acq has both _acq and _rel
semantics.

Mutex locking has always been careful to use an explicit _acq suffix,
but most code in /sys isn't. In a /sys tree dated ~March 30, there
are 280 lines matching atomic_cmpset but only 72 lines matching
atomic_cmpset_acq and 47 lines matching atomic_cmpset_rel. Excluding
the implementation (atomic.h), there are 153 lines matching atomic_cmpset,
35 matching atomic_cmpset_acq and 12 matching atomic_cmpset_rel; this
gives 106 lines that are probably missing an _acq or a _rel suffix.
No one replied to my previous mails about this. I would require an
explicit suffix by not supporting plain cmpset, or not support the
_rel suffix for stores because, since stores are always _rel, it is hard
to tell if an atomic store without the suffix really wants non-_rel or
is sloppy.
Despite the proliferation of interfaces, there is no
_acq_rel suffix to indicate that cmpset_acq is also _rel.

>> The references prior to the atomic increment have no real ordering
>> requirements. Only the ones afterwards need to be strict so that we can
>> verify the results.

Most references are in a loop, so "before" and "after" are sort of the same:

% 	for (;;) {
% 		fp = fdp->fd_ofiles[fd];
% 		if (fp == NULL)
% 			break;
% 		count = fp->f_count;
% 		if (count == 0)
% 			continue;
% 		if (atomic_cmpset_int(&fp->f_count, count, count + 1) != 1)
% 			continue;

I think we do depend on both _acq and _rel semantics here -- the missing
_acq to volatilize everything, and the implicit _rel just (?) to force
the memory copy of f_count to actually be incremented, as is required
for an atomic store to actually work.

% 		if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd])
% 			break;

The RHS here could be used again at the top of the loop. The load for
the RHS is ordered after the cmpset, and so is the one at the top of the
loop, except for the first iteration. I think this is unimportant.

% 		fdrop(fp, curthread);
% 	}

>
> I think having the _acq is correct and that the "memory" clobber it contains
> will force the compiler to reload fd_ofiles without needing the volatile cast
> (and thus that you can remove the volatile cast altogether and just add the
> _acq barrier).

I agree.

Please look at whether some of the ~106 other plain cmpset's need an
_acq suffix or should have a _rel suffix for clarity. You should be
able to do this much faster than me, having written some of them :-).
E.g., the one in sio.c is for implementing a lock so it should use _acq
(though it might work without _acq since the lock is only used once),
but the ones in sx.h and kern_sx.c might be correct since they are
mostly for "trylock"-type operations.
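For readers who want to experiment with the loop outside the kernel, here is a rough userland model of it. It is only a sketch under stated assumptions: C11 <stdatomic.h> stands in for atomic(9), with memory_order_acquire on the compare-and-swap playing the role of the _acq variant; the structs are simplified; and fget_unlocked_sketch(), fdrop_sketch() and file_slot_t are hypothetical names. It is not the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct file {
	atomic_uint	f_count;	/* reference count */
};

typedef _Atomic(struct file *) file_slot_t;

struct filedesc {
	_Atomic(file_slot_t *)	fd_ofiles;	/* may be swapped on growth */
	int			fd_nfiles;
};

/* Drop a reference.  A real fdrop() would also free at zero. */
static void
fdrop_sketch(struct file *fp)
{
	atomic_fetch_sub_explicit(&fp->f_count, 1, memory_order_release);
}

/*
 * Lockless lookup: take a reference, then re-check that the slot still
 * points at the same file.  The acquire ordering on the successful
 * compare-and-swap keeps the re-check loads from moving ahead of it.
 */
static struct file *
fget_unlocked_sketch(struct filedesc *fdp, int fd)
{
	file_slot_t *tab;
	struct file *fp;
	unsigned int count;

	assert(fd >= 0 && fd < fdp->fd_nfiles);
	for (;;) {
		tab = atomic_load_explicit(&fdp->fd_ofiles,
		    memory_order_acquire);
		fp = atomic_load_explicit(&tab[fd], memory_order_relaxed);
		if (fp == NULL)
			return (NULL);
		count = atomic_load_explicit(&fp->f_count,
		    memory_order_relaxed);
		if (count == 0)
			continue;	/* file is being torn down; retry */
		if (!atomic_compare_exchange_strong_explicit(&fp->f_count,
		    &count, count + 1, memory_order_acquire,
		    memory_order_relaxed))
			continue;	/* lost a race on f_count; retry */
		/* Reload the table pointer and the slot after the acquire. */
		tab = atomic_load_explicit(&fdp->fd_ofiles,
		    memory_order_acquire);
		if (fp == atomic_load_explicit(&tab[fd],
		    memory_order_relaxed))
			return (fp);
		fdrop_sketch(fp);	/* slot changed underneath us */
	}
}
```

The model keeps the shape of the quoted loop, including the retry when the count is observed as zero, so a slot whose file never leaves a zero count would spin here just as it would there.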
Bruce From rwatson at FreeBSD.org Thu May 21 09:39:26 2009 From: rwatson at FreeBSD.org (Robert Watson) Date: Thu May 21 09:39:33 2009 Subject: lockless file descriptor lookup In-Reply-To: References: <20090512165949.GF58540@hoeg.nl> Message-ID: On Tue, 12 May 2009, Jeff Roberson wrote: >> It's nice to see someone stepped up to implement this. Just out of >> curiosity, have you done any benchmarks to see how many percent of the time >> a thread needs more than one attempt to obtain a valid reference on a >> common workload? >> >> Maybe it would be nice for diagnostic purposes to add two sysctls to obtain >> the amount of successful and unsuccessful attempts. > > I have had trouble triggering it at all in testing. I'd prefer not to > commit the counters because they would re-introduce a global point of cache > contention unless we made them per-cpu. Just as a general observation here: our recent experience with the sysctl counters for microtime(), et al, in the kernel strongly support this view: once the per-CPU allocator is available in the base kernel for 8.0, we should attempt to purge as many of these casually strewn counters in critical paths as we can. Robert N M Watson Computer Laboratory University of Cambridge From jhb at freebsd.org Thu May 21 13:33:46 2009 From: jhb at freebsd.org (John Baldwin) Date: Thu May 21 13:33:57 2009 Subject: lockless file descriptor lookup In-Reply-To: <20090521180328.W21310@delplex.bde.org> References: <200905201524.49090.jhb@freebsd.org> <20090521180328.W21310@delplex.bde.org> Message-ID: <200905210933.28676.jhb@freebsd.org> On Thursday 21 May 2009 5:37:09 am Bruce Evans wrote: > On Wed, 20 May 2009, John Baldwin wrote: > > > On Wednesday 20 May 2009 2:59:52 pm Jeff Roberson wrote: > >> On Thu, 14 May 2009, Bruce Evans wrote: > >>> Anyway, you probably need atomics that have suitable memory barriers. 
> >>> Memory barriers must affect the compiler and make it perform refreshes > >>> for them to work, so you shouldn't need any volatile casts. E.g., all > >>> atomic store operations (including cmpset) have release semantics even > >>> if they aren't spelled with "_rel" or implemented using inline asm. > >>> On amd64 and i386, they happen to be implemented using inline asm with > >>> "memory" clobbers. The "memory" clobbers force refreshes of all > >>> non-local variables. > > Actually, it is the "acquire" operations that happen to be implemented > with "memory" clobbers on amd64 and i386. "release" semantics are > (completely?) automatic on amd64 and i386 so no "memory" clobbers are > used for them (except IIRC in old versions). However, that may be a bug as when I removed them I did so because the CPUs did not need them. They may still be needed to prevent the compiler from breaking things. Specifically, I was under the (possibly mistaken) impression that '__asm __volatile()' was sufficient to prevent GCC from reordering an atomic operation with other operations. However, I'm not sure that is the case based on some discussions I had with ups@ about a year ago. I think that __volatile may only ensure that the compiler may not optimize the operation out, but doesn't prevent it from moving it around. > >> So I think I need an _acq memory barrier on the atomic cmpset of the > >> refcount to prevent speculative loading of the fd_ofiles array pointer by > >> the processor and the volatile in the second dereference as I have it > >> now to prevent caching of the pointer by the compiler. What do you think? > > I thought that it was a _rel barrier that was needed due to my misreading > of the "memory" clobbers corrected above. Perhaps both _acq and _rel > are needed in cases like yours where a single cmpset corresponds to a > (lock, unlock) pair. 
On amd64 and i386, plain atomic_cmpset already > has both (_acq via the explicit "memory" clobber, and _rel implicitly), > but the man page doesn't say that this is generic. It only says that > all stores have _rel semantics, and it uses an explicit _aqu suffixes > in examples of how to use cmpset to implement locking (the examples > are rotted copies of locking in sys/mutex.h). Since a > successful plain cmpset does a store, this implicitly says that plain > cmpset's have _rel semantics and cmpset_acq has both _acq and _rel > semantics. Ah, I think the manpage is confusing. The sentence "The atomic_store() functions always have release semantics." refers to the fact that there are not any "atomic_store_acq_*() or atomic_store_*()" functions. That the only store operations provided by the atomic(9) API include a "_rel" memory barrier. It does not mean that all store operations imply "_rel" semantics. Similarly for the statement about all atomic_load() operations and "_acq" semantics. I can probably update that part of the manpage to be clearer. Thus, given that, plain atomics and atomic_acq's do not have _rel semantics. In Jeff's case I think he only needs _acq semantics. He does not need prior memory store operations to be drained before the atomic_cmpset() is performed. Rather, he needs the compiler and the CPU to not reorder the read of fd_ofiles before performing the atomic_cmpset(). An _acq barrier should be sufficient for this. > Mutex locking has always been careful to use an explicit _acq suffix, > but most code in /sys isn't. In a /sys tree deated ~March 30, there > are 280 lines matching atomic_cmpset but only 72 lines matching > atomic_cmpset_acq and 47 lines matching atomic_cmpset_rel. Excluding > the implementation (atomic.h), there are 153 lines matching atomic_cmpset, > 35 matching atomic_cmpset_acq and 12 matching atomic_cmpset_rel; this > gives 106 lines that are probably missing an _acq or a _rel suffix. 
> No one replied to my previous mails about this. I would require an
> explicit suffix by not supporting plain cmpset, or not support the
> _rel suffix for stores since, because stores are always _rel, it is hard
> to tell if an atomic store without the suffix really wants non-_rel or
> is sloppy. Despite the proliferation of interfaces, there is no
> _acq_rel suffix to indicate that cmpset_acq is also _rel.

Not all places that do atomics need memory barriers. They are needed only if the atomic operations on an item in memory need to be ordered with respect to other memory accesses (e.g. with respect to the data a lock protects, or in this specific case fd_ofiles needs to be read after the cmpset to f_count). There are no atomic stores without a _rel suffix. (Well, actually, there are an absolute ton of them, but they are not encoded as atomic_*(), instead they look like 'x = y' :).)

> >> The references prior to the atomic increment have no real ordering
> >> requirements. Only the ones afterwards need to be strict so that we can
> >> verify the results.
>
> Most references are in a loop, so "before" and "after" are sort of the same:
>
> %     for (;;) {
> %         fp = fdp->fd_ofiles[fd];
> %         if (fp == NULL)
> %             break;
> %         count = fp->f_count;
> %         if (count == 0)
> %             continue;
> %         if (atomic_cmpset_int(&fp->f_count, count, count + 1) != 1)
> %             continue;
>
> I think we do depend on both _acq and _rel semantics here -- the missing
> _acq to volatilize everything, and the implicit _rel just (?) to force
> the memory copy of f_count to actually be incremented, as is required
> for an atomic store to actually work.

No, you do not need the _rel for f_count. The atomic operation is always required to perform the actual "atomic operation" atomically. Memory barriers are not supposed to control ordering/timing of the atomic ops themselves. The atomic op is always synchronous, and the memory barriers are solely to order other memory accesses with respect to the atomic operation.
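
A minimal user-space sketch of this point, assuming C11 <stdatomic.h> rather than the kernel's atomic(9) API: the compare-and-swap that takes the reference asks for acquire ordering on success (roughly what atomic_cmpset_acq_int() requests), so reads issued after it cannot be moved ahead of it, and no release ordering is requested because there are no earlier stores to publish. The names struct file_ref, ref_acquire() and payload are hypothetical and exist only for illustration. Unlike the quoted kernel loop, which retries the lookup, this sketch simply gives up when the count is zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted object such as struct file. */
struct file_ref {
	atomic_uint	f_count;	/* reference count; 0 means "going away" */
	int		payload;	/* data to be read only once a ref is held */
};

/*
 * Try to take a reference.  Returns 1 on success and 0 if the count had
 * already dropped to zero.
 */
static int
ref_acquire(struct file_ref *fp)
{
	unsigned int count;

	assert(fp != NULL);
	/* A relaxed load is enough here: the CAS below revalidates the value. */
	count = atomic_load_explicit(&fp->f_count, memory_order_relaxed);
	while (count != 0) {
		/*
		 * Acquire ordering on success keeps later loads (e.g. of
		 * fp->payload) from being reordered before the increment.
		 * On failure 'count' is refreshed with the current value.
		 */
		if (atomic_compare_exchange_weak_explicit(&fp->f_count, &count,
		    count + 1, memory_order_acquire, memory_order_relaxed))
			return (1);
	}
	return (0);
}
```

A caller would read fp->payload only after ref_acquire() returns 1; the increment itself is atomic regardless of which ordering is requested.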
Specifically, a _rel would only be needed to ensure that an earlier store operation completed before the f_count update. In this case there aren't any earlier stores. Also, the prior reads all must be satisfied before the atomic op can be performed since they are dependencies of reading 'count'.

> I agree.
>
> Please look at whether some of the ~106 other plain cmpset's need an
> _acq suffix or should have a _rel suffix for clarity. You should be
> able to do this much faster than me, having written some of them :-).
> E.g., the one in sio.c is for implementing a lock so it should use _acq
> (though it might work without _acq since the lock is only used once),
> but the ones in sx.h and kern_sx.c might be correct since they are
> mostly for "trylock"-type operations.

Well, even trylock operations should use _acq since you need to not read data a lock protects until you have acquired the lock. Many of the plain atomic_cmpset's are ok though, such as the ones in sys/refcount.h. I looked at (mtx, rw, sx) and found atomic_cmpset() used without memory barriers in the following places:

- unlocking a read/shared lock. Releasing an exclusive lock requires a _rel barrier to drain any writes to the locked data. However, none of the locked data should be modified under a read lock, so no barrier is needed here.
- setting contested flags. This is when a waiter sets a flag to force a "hard" unlock in the owning thread so that the waiter gets woken up. No memory barrier is needed here as the waiting thread will have to successfully complete some other atomic_cmpset_acq() before it obtains the lock and that _acq provides sufficient protection.
- upgrading a read/shared lock to a write/exclusive lock.
No _acq barrier is needed in these cases since the previous read/shared lock acquisition already had an _acq barrier and a successful upgrade is fully "atomic" in that there is no window in between releasing the shared lock and acquiring the write lock where another thread could obtain a write lock and modify the data.

All the other atomic operations in those three primitives use appropriate memory barriers.

--
John Baldwin

From rwatson at FreeBSD.org Fri May 22 12:36:22 2009
From: rwatson at FreeBSD.org (Robert Watson)
Date: Fri May 22 12:36:29 2009
Subject: HEADS UP: old UMich nfs4client to be removed, replaced with new NFSv234 client/server
In-Reply-To:
References:
Message-ID:

On Thu, 21 May 2009, Robert Watson wrote:

> This is advance warning that we'll be garbage-collecting the UMich NFSv4
> client (src/sys/nfs4client and supporting RPC code, daemons, and mount tool)
> prior to 8.0 now that Rick Macklem's NFSv234 client and server are in the
> base tree. This removal will likely be in the next week, as the 8.0 feature
> freeze is at the end of the month.
>
> The new client and server provide significantly improved support for NFSv4,
> and while they remain experimental, they should offer both more reliable,
> more complete, and actively maintained NFSv4 support. Anyone using
> nfs4client (probably not many) is encouraged to try out and provide feedback
> on the new NFSv4 code as soon as possible.

This has now been committed.

Robert N M Watson
Computer Laboratory
University of Cambridge

From bugmaster at FreeBSD.org Mon May 25 11:06:48 2009
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon May 25 11:07:25 2009
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200905251106.n4PB6l2F092706@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From pjd at FreeBSD.org Tue May 26 14:18:54 2009
From: pjd at FreeBSD.org (Pawel Jakub Dawidek)
Date: Tue May 26 14:19:01 2009
Subject: IP_NONLOCALOK improvements.
Message-ID: <20090526135547.GE1491@garage.freebsd.pl>

Now that we have IP_NONLOCALOK IP socket option (which is something I need a lot for my company's stuff) I started to hack on it a bit.

OpenBSD has SO_BINDANY SOL_SOCKET option for some time now. So first of all I wanted to do the same for FreeBSD. Unfortunately we ran out of space in so_options - it is u_short and all possible values are already taken. As a side note there is SO_NO_DDP option that is used only in cxgb driver and nowhere else. This seems like a waste of very important bit (sooner or later someone will need yet another socket option).

All in all I went with rename to make it at least similar to OpenBSD's option. I left it as IPPROTO_IP option: IP_BINDANY.

I also implemented support for IPv6 and raw IP sockets (based on OpenBSD sources) (IPV6_BINDANY).

I added new privilege - PRIV_NETINET_BINDANY, because we do have to check for privilege before allowing to use it.

I removed kernel option to enable it, I see no reason not to have it in GENERIC.

I also removed sysctl to enable it - we have privilege for limiting its use.

The patch is here:

	http://people.freebsd.org/~pjd/patches/bindany.patch

I tested it for AF_INET TCP, UDP and RAW (ICMP) sockets, but I'm not setup to test it for IPv6. If someone could test it for IPv6, it'd be great. SCTP also has to be tested.

All you need to do after creating a socket is:

	int opt = 1;
	/* For IPv4. */
	setsockopt(sock, IPPROTO_IP, IP_BINDANY, &opt, sizeof(opt));
	/* For IPv6. */
	setsockopt(sock, IPPROTO_IPV6, IPV6_BINDANY, &opt, sizeof(opt));

Then you should be able to call bind(2) with any address you want (doesn't have to be bound to any of your interfaces anymore).

Once you do that you might want to send a packet to test it and observe incoming packets on connected machine.

For UDP/TCP testing I've a small program, which I can provide. For RAW IP socket, I slightly modified ping (just added the above setsockopt() call), so I was able to use -S option with any address.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090526/0722a07d/attachment.pgp

From pjd at FreeBSD.org Wed May 27 05:30:02 2009
From: pjd at FreeBSD.org (Pawel Jakub Dawidek)
Date: Wed May 27 05:30:09 2009
Subject: IP_NONLOCALOK improvements.
In-Reply-To: <3c1674c90905262217k7a75b73fsab25c2ef93993e18@mail.gmail.com>
References: <20090526135547.GE1491@garage.freebsd.pl> <3c1674c90905262217k7a75b73fsab25c2ef93993e18@mail.gmail.com>
Message-ID: <20090527052954.GC4204@garage.freebsd.pl>

On Tue, May 26, 2009 at 10:17:32PM -0700, Kip Macy wrote:
> On Tue, May 26, 2009 at 6:55 AM, Pawel Jakub Dawidek wrote:
> > Now that we have IP_NONLOCALOK IP socket option (which is something I
> > need a lot for my company's stuff) I started to hack on it a bit.
> >
> > OpenBSD has SO_BINDANY SOL_SOCKET option for some time now. So first of
> > all I wanted to do the same for FreeBSD. Unfortunately we ran out of
> > space in so_options - it is u_short and all possible values are already

Actually so_options is short, not u_short, sorry about that. The size stays the same.

> > taken. As a side note there is SO_NO_DDP option that is used only in
> > cxgb driver and nowhere else.
> > This seems like a waste of very important
> > bit (sooner or later someone will need yet another socket option).
>
> Wouldn't now (before 8.0) be a good time to expand it beyond 16 bits
> rather than artificially restricting ourselves?

We can do that anyway. I'd prefer not to change it to SO_BINDANY, because I'd like to MFC it and we won't be able to MFC so_options enlargement. There is also an argument that this functionality fits better as an IP socket option than as a socket-level option.

We could do something more complex, though:

- Remove SO_NO_DDP from 7 (replace it with SO_BINDANY), as I don't see any users of SO_NO_DDP, at least in our tree.
- Expand so_options in HEAD and add SO_NO_DDP back.

But I'll leave this for others to decide, as I might not be aware of the consequences of so_options type change. All I know is that there are places in the code that assume so_options is 16 bits long (like tw_so_options field in tcptw structure) and xsocket structure visible in userland will also be changed.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090527/54d66a4b/attachment.pgp

From kmacy at freebsd.org Wed May 27 05:45:39 2009
From: kmacy at freebsd.org (Kip Macy)
Date: Wed May 27 05:45:46 2009
Subject: IP_NONLOCALOK improvements.
In-Reply-To: <20090526135547.GE1491@garage.freebsd.pl>
References: <20090526135547.GE1491@garage.freebsd.pl>
Message-ID: <3c1674c90905262217k7a75b73fsab25c2ef93993e18@mail.gmail.com>

On Tue, May 26, 2009 at 6:55 AM, Pawel Jakub Dawidek wrote:
> Now that we have IP_NONLOCALOK IP socket option (which is something I
> need a lot for my company's stuff) I started to hack on it a bit.
>
> OpenBSD has SO_BINDANY SOL_SOCKET option for some time now.
> So first of
> all I wanted to do the same for FreeBSD. Unfortunately we ran out of
> space in so_options - it is u_short and all possible values are already
> taken. As a side note there is SO_NO_DDP option that is used only in
> cxgb driver and nowhere else. This seems like a waste of very important
> bit (sooner or later someone will need yet another socket option).

Wouldn't now (before 8.0) be a good time to expand it beyond 16 bits rather than artificially restricting ourselves?

From julian at elischer.org Wed May 27 06:06:02 2009
From: julian at elischer.org (Julian Elischer)
Date: Wed May 27 06:06:09 2009
Subject: IP_NONLOCALOK improvements.
In-Reply-To: <20090526135547.GE1491@garage.freebsd.pl>
References: <20090526135547.GE1491@garage.freebsd.pl>
Message-ID: <4A1CD562.9040706@elischer.org>

Pawel Jakub Dawidek wrote:
> Now that we have IP_NONLOCALOK IP socket option (which is something I
> need a lot for my company's stuff) I started to hack on it a bit.
>
> OpenBSD has SO_BINDANY SOL_SOCKET option for some time now. So first of
> all I wanted to do the same for FreeBSD. Unfortunately we ran out of
> space in so_options - it is u_short and all possible values are already
> taken. As a side note there is SO_NO_DDP option that is used only in
> cxgb driver and nowhere else. This seems like a waste of very important
> bit (sooner or later someone will need yet another socket option).

When I wrote the NONLOCAL stuff I was abstracting functionality that IronPort have in their system. What they have though can not be turned off or disabled. That part was added just for the public version. I didn't know of the OpenBSD code or I might have tried to make it compatible. The test is done in the IP code so therefore it was easiest to make it an IP option, though I implement it in a slightly non-IP specific manner.

>
> All in all I went with rename to make it at least similar to OpenBSD's
> option. I left it as IPPROTO_IP option: IP_BINDANY.
Well, ok, a rose by any other name would smell as sweet. As I said I was not aware of the OpenBSD code, but I don't like their choice of name as it doesn't really describe what it does.

>
> I also implemented support for IPv6 and raw IP sockets (based on OpenBSD
> sources) (IPV6_BINDANY).

Ok, good idea.

>
> I added new privilege - PRIV_NETINET_BINDANY, because we do have to
> check for privilege before allowing to use it.

I am not sure about this. If a system has this enabled then I presume it is a special system and not a generally available time-sharing system.

How do you allow a process to have this privilege? Are you forcing them to be root for now?

>
> I removed kernel option to enable it, I see no reason not to have it in
> GENERIC.

Because it adds complexity and because some people do not want it even possible. You are enabling NON-standard (in fact "Standard-ignoring") behaviour.

>
> I also removed sysctl to enable it - we have privilege for limiting its use.

I disagree very strongly about this one. I would like to

1/ have to explicitly compile in this non-standard behaviour and

2/ turn it on

before we start doing this.

I know how useful this is to have (from my own experience), but feel strongly that this is pretty bad behaviour for most systems and can facilitate all sorts of security worries.

>
> The patch is here:
>
> http://people.freebsd.org/~pjd/patches/bindany.patch
>
> I tested it for AF_INET TCP, UDP and RAW (ICMP) sockets, but I'm not
> setup to test it for IPv6. If someone could test it for IPv6, it'd be
> great. SCTP also has to be tested.
>
> All you need to do after creating a socket is:
>
> int opt = 1;
> /* For IPv4. */
> setsockopt(sock, IPPROTO_IP, IP_BINDANY, &opt, sizeof(opt));
> /* For IPv6. */
> setsockopt(sock, IPPROTO_IPV6, IPV6_BINDANY, &opt, sizeof(opt));
>
> Then you should be able to call bind(2) with any address you want
> (doesn't have to be bound to any of your interfaces anymore).
>
> Once you do that you might want to send a packet to test it and observe
> incoming packets on connected machine.
>
> For UDP/TCP testing I've a small program, which I can provide. For RAW
> IP socket, I slightly modified ping (just added the above setsockopt()
> call), so I was able to use -S option with any address.

I notice that you don't say how to enable the priv.

>

From pjd at FreeBSD.org Wed May 27 06:51:28 2009
From: pjd at FreeBSD.org (Pawel Jakub Dawidek)
Date: Wed May 27 06:51:40 2009
Subject: IP_NONLOCALOK improvements.
In-Reply-To: <4A1CD562.9040706@elischer.org>
References: <20090526135547.GE1491@garage.freebsd.pl> <4A1CD562.9040706@elischer.org>
Message-ID: <20090527065121.GD4204@garage.freebsd.pl>

On Tue, May 26, 2009 at 10:53:38PM -0700, Julian Elischer wrote:
> Pawel Jakub Dawidek wrote:
> >Now that we have IP_NONLOCALOK IP socket option (which is something I
> >need a lot for my company's stuff) I started to hack on it a bit.
> >
> >OpenBSD has SO_BINDANY SOL_SOCKET option for some time now. So first of
> >all I wanted to do the same for FreeBSD. Unfortunately we ran out of
> >space in so_options - it is u_short and all possible values are already
> >taken. As a side note there is SO_NO_DDP option that is used only in
> >cxgb driver and nowhere else. This seems like a waste of very important
> >bit (sooner or later someone will need yet another socket option).
>
>
> when I wrote the NONLOCAL stuff I was abstracting functionality that
> IronPort have in their system. What they have though can not be
> turned off or disabled. That part was added just for the public
> version. I didn't know of the OpenBSD code or I might have tried to
> make it compatible. [...]

I know that, Julian, and forgive me if it sounded like an accusation. I also wasn't aware that OpenBSD (and FreeBSD too) has it until yesterday:) I'm very grateful that you did the work, because now I can simplify at least three of my company's projects.
All I'm trying to do is to improve it, not to nitpick.

> [...] The test is done in the IP code so therefore it
> was easiest to make it an IP option, though I implement it in a
> slightly non-IP specific manner.
>
> >
> >All in all I went with rename to make it at least similar to OpenBSD's
> >option. I left it as IPPROTO_IP option: IP_BINDANY.
>
> well, ok, a rose by any other name would smell as sweet.
> As I said I was not aware of the OpenBSD code, but I don't like
> their choice of name as it doesn't really describe what it does.

I changed the name just to be more similar to OpenBSD's (and BSD/OS') so one can more easily find it by grepping. I'm really fine with any name.

> >I added new privilege - PRIV_NETINET_BINDANY, because we do have to
> >check for privilege before allowing to use it.
>
> I am not sure about this. if a system has this enabled then I presume
> it is a special system and not a generally available time-sharing system.
>
> How do you allow a process to have this privilege? are you forcing
> them to be root for now?

Our current privilege model is that we have fine-grained privileges in the kernel, but those are not _yet_ exposed to userland. All privileges defined in sys/priv.h are available for unjailed root and some (take a look at prison_priv_check() function) for jailed root.

Today this new privilege will only be available for unjailed root.

At some point we will grow the possibility to selectively add/remove privileges just like Solaris, but we can't do that now.

> >I removed kernel option to enable it, I see no reason not to have it in
> >GENERIC.
>
> Because it adds complexity and because some people do not want it even
> possible.
> You are enabling NON-standard, (in fact "Standard-ignoring")
> behaviour.
>
> >
> >I also removed sysctl to enable it - we have privilege for limiting its
> >use.
>
> I disagree very strongly about this one.
> I would like to
> 1/ have to explicitly compile in this non-standard behaviour and
>
> 2/ turn it on
>
> before we start doing this.
>
>
> I know how useful this is to have, (from my own experience)
> but feel strongly that this is pretty bad behaviour for most systems
> and can facilitate all sorts of security worries.

Well, this behaviour is similar to adding an IP address to an interface and binding to that address. There is even no securelevel that denies modifying interfaces, so in my opinion if one needs to explicitly ask for this to be enabled for a socket and one needs a special privilege to do it, it should be enough protection to make user's life a bit less complex by not requiring kernel recompilation and sysctl modification.

I'm not sure if this was on purpose, but currently even an unprivileged user can use this functionality if the sysctl is on, which I find hard to accept. Having this always enabled and requiring a privilege is IMHO more secure than allowing anyone to use it once the sysctl is on. But again, combining the two (privilege and sysctl) is redundant IMHO.

If it doesn't convince you, and I also don't feel convinced, we need to wait for more votes:)

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090527/af8d580e/attachment.pgp

From jhb at freebsd.org Wed May 27 15:42:29 2009
From: jhb at freebsd.org (John Baldwin)
Date: Wed May 27 15:42:36 2009
Subject: IP_NONLOCALOK improvements.
In-Reply-To: <20090527065121.GD4204@garage.freebsd.pl>
References: <20090526135547.GE1491@garage.freebsd.pl> <4A1CD562.9040706@elischer.org> <20090527065121.GD4204@garage.freebsd.pl>
Message-ID: <200905270809.50275.jhb@freebsd.org>

On Wednesday 27 May 2009 2:51:21 am Pawel Jakub Dawidek wrote:
> > I know how useful this is to have, (from my own experience)
> > but feel strongly that this is pretty bad behaviour for most systems
> > and can facilitate all sorts of security worries.
>
> Well, this behaviour is similar to adding an IP address to an
> interface and binding to that address. There is even no securelevel that
> denies modifying interfaces, so in my opinion if one needs to explicitly
> ask for this to be enabled for a socket and one needs a special
> privilege to do it, it should be enough protection to make user's life a
> bit less complex by not requiring kernel recompilation and sysctl
> modification.
>
> I'm not sure if this was on purpose, but currently even an unprivileged
> user can use this functionality if the sysctl is on, which I find hard
> to accept. Having this always enabled and requiring a privilege is IMHO
> more secure than allowing anyone to use it once the sysctl is on.
> But again, combining the two (privilege and sysctl) is redundant IMHO.

I think it is fine to have it in the kernel by default if it is restricted by privilege. I also agree that a root user could already accomplish this by adding an alias to the desired interface and then binding the socket (and then removing the alias if desired).
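
For reference, the per-socket opt-in being debated in this thread can be sketched roughly as below. This assumes a FreeBSD system whose headers define IP_BINDANY and IPV6_BINDANY as in Pawel's patch; enable_bindany() is a hypothetical helper name, and on systems without those options the sketch simply fails with ENOPROTOOPT. With the patch as posted, the call would presumably succeed only for a process holding PRIV_NETINET_BINDANY (unjailed root today).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>

/*
 * Hypothetical helper: ask for "bind to any address" on one socket.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
enable_bindany(int sock, int family)
{
	int opt = 1;

	(void)opt;	/* unused when neither option is defined */
	switch (family) {
	case AF_INET:
#ifdef IP_BINDANY
		return (setsockopt(sock, IPPROTO_IP, IP_BINDANY,
		    &opt, sizeof(opt)));
#else
		errno = ENOPROTOOPT;	/* option not known to these headers */
		return (-1);
#endif
	case AF_INET6:
#ifdef IPV6_BINDANY
		return (setsockopt(sock, IPPROTO_IPV6, IPV6_BINDANY,
		    &opt, sizeof(opt)));
#else
		errno = ENOPROTOOPT;
		return (-1);
#endif
	default:
		errno = EAFNOSUPPORT;	/* only IPv4 and IPv6 are handled */
		return (-1);
	}
}
```

After a successful call, bind(2) on that socket may be given an address that is not configured on any local interface; other sockets in the same process are unaffected.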
--
John Baldwin

From zml at FreeBSD.org Wed May 27 17:03:28 2009
From: zml at FreeBSD.org (Zachary Loafman)
Date: Wed May 27 17:04:00 2009
Subject: FAIL: kernel fault injection
In-Reply-To: <20090511162928.GD17203@isilon.com>
References: <20090511162928.GD17203@isilon.com>
Message-ID: <20090527165120.GB9662@isilon.com>

On Mon, May 11, 2009 at 09:29:28AM -0700, Zachary Loafman wrote:
> Arch -
>
> I'd like to contribute the kernel fault injection system that Isilon
> uses. Before contributing it, I'd like to get approval for the APIs
> involved.

There were no large objections to the API. I added sleep to the man page, and I moved the tree under debug.fail_point instead of introducing a top-level. Committed as r192908:

http://svn.freebsd.org/viewvc/base?view=revision&sortby=log&revision=192908

Have fun with it!

--
Zach Loafman | Staff Engineer | Isilon Systems

From julian at elischer.org Wed May 27 18:05:04 2009
From: julian at elischer.org (Julian Elischer)
Date: Wed May 27 18:05:11 2009
Subject: IP_NONLOCALOK improvements.
In-Reply-To: <20090527065121.GD4204@garage.freebsd.pl>
References: <20090526135547.GE1491@garage.freebsd.pl> <4A1CD562.9040706@elischer.org> <20090527065121.GD4204@garage.freebsd.pl>
Message-ID: <4A1D80CA.4020702@elischer.org>

Pawel Jakub Dawidek wrote:
>
>>> All in all I went with rename to make it at least similar to OpenBSD's
>>> option. I left it as IPPROTO_IP option: IP_BINDANY.
>> well, ok, a rose by any other name would smell as sweet.
>> As I said I was not aware of the OpenBSD code, but I don't like
>> their choice of name as it doesn't really describe what it does.
>
> I changed the name just to be more similar to OpenBSD's (and BSD/OS') so
> one can more easily find it by grepping. I'm really fine with any name.
>
>>> I added new privilege - PRIV_NETINET_BINDANY, because we do have to
>>> check for privilege before allowing to use it.

Are we sure we want to make a whole PRIV just for this function?

>> I am not sure about this.
>> if a system has this enabled then I presume
>> it is a special system and not a generally available time-sharing system.
>>
>> How do you allow a process to have this privilege? are you forcing
>> them to be root for now?
>
> Our current privilege model is that we have fine-grained privileges in
> the kernel, but those are not _yet_ exposed to userland. All privileges
> defined in sys/priv.h are available for unjailed root and some (take a
> look at prison_priv_check() function) for jailed root.

When we have vimage, this may be required in some vimages and not others. The priv should be inherited but gated on the new jail. In other words, when you create a child jail, it can do it if:

1/ the parent can do it AND
2/ the parent allows the child to do it.

>
> Today this new privilege will only be available for unjailed root.
>
> At some point we will grow the possibility to selectively add/remove
> privileges just like Solaris, but we can't do that now.
>
>>> I removed kernel option to enable it, I see no reason not to have it in
>>> GENERIC.
>> Because it adds complexity and because some people do not want it even
>> possible.
>> You are enabling NON-standard, (in fact "Standard-ignoring")
>> behaviour.
>>
>>
>>> I also removed sysctl to enable it - we have privilege for limiting its
>>> use.
>> I disagree very strongly about this one. I would like to
>> 1/ have to explicitly compile in this non-standard behaviour and
>>
>> 2/ turn it on
>>
>> before we start doing this.
>>
>>
>> I know how useful this is to have, (from my own experience)
>> but feel strongly that this is pretty bad behaviour for most systems
>> and can facilitate all sorts of security worries.
>
> Well, this behaviour is similar to adding an IP address to an
> interface and binding to that address.
> There is even no securelevel that
> denies modifying interfaces, so in my opinion if one needs to explicitly
> ask for this to be enabled for a socket and one needs a special
> privilege to do it, it should be enough protection to make user's life a
> bit less complex by not requiring kernel recompilation and sysctl
> modification.
>
> I'm not sure if this was on purpose, but currently even an unprivileged
> user can use this functionality if the sysctl is on, which I find hard
> to accept. Having this always enabled and requiring a privilege is IMHO
> more secure than allowing anyone to use it once the sysctl is on.
> But again, combining the two (privilege and sysctl) is redundant IMHO.

It was on purpose as it was assumed, as I said, that anyone compiling it in would be creating a special "appliance" kernel and be fully in charge of the machine.

>
> If it doesn't convince you and I also don't feel convinced we need to
> wait for more votes:)

I can live with having it there by default in GENERIC, but I'm not sure I don't still want to be able to remove it. The sysctl could go, but I felt I wanted a way for the admin to disable it on a system if it shared a kernel with other systems that need it. I.e. if it's in GENERIC, then I think the admin should be able to stop it from being available.

>

From zml at FreeBSD.org Thu May 28 00:14:14 2009
From: zml at FreeBSD.org (Zachary Loafman)
Date: Thu May 28 00:14:21 2009
Subject: pthread_setugid_np
Message-ID: <20090528000147.GB3704@isilon.com>

arch@ -

Isilon has need of per-thread impersonation. We're looking at implementing something like the pthread_setugid_np mechanism found on OS X, loosely documented in the code:

http://fxr.watson.org/fxr/source/bsd/kern/kern_prot.c?v=xnu-1228
(see settid and setgroups1)

and some here:
http://lists.apple.com/archives/perfoptimization-dev/2008/Jan/msg00043.html

Does anyone have strong objections to Apple's APIs here?
There's obviously no portable interface to handle it, and it seems a little saner to just adopt someone else's API/semantics rather than reinvent.

--
Zach Loafman | Staff Engineer | Isilon Systems

From zml at FreeBSD.org Thu May 28 02:54:25 2009
From: zml at FreeBSD.org (Zachary Loafman)
Date: Thu May 28 02:54:31 2009
Subject: pthread_setugid_np
In-Reply-To: <74fe56020905271931l4c8d4677h3bbcce6d8c8a8605@mail.gmail.com>
References: <20090528000147.GB3704@isilon.com> <74fe56020905271931l4c8d4677h3bbcce6d8c8a8605@mail.gmail.com>
Message-ID: <20090528024640.GC9388@isilon.com>

On Thu, May 28, 2009 at 08:01:26AM +0530, Sujit K M wrote:
> On Thu, May 28, 2009 at 5:31 AM, Zachary Loafman wrote:
> > http://fxr.watson.org/fxr/source/bsd/kern/kern_prot.c?v=xnu-1228
> > (see settid and setgroups1)
>
> How about the licensing. Darwin was open source under Apple's public
> license, but no longer. Or is it Mach you are talking about?

I'm not proposing porting the code directly, I'm merely asking whether the API and associated semantics are acceptable. It would be fairly straightforward for us to write a unit test that could run on both FreeBSD and OS X after this exercise.

--
Zach Loafman | Staff Engineer | Isilon Systems

From kmsujit at gmail.com Thu May 28 03:04:46 2009
From: kmsujit at gmail.com (Sujit K M)
Date: Thu May 28 03:04:53 2009
Subject: pthread_setugid_np
In-Reply-To: <20090528000147.GB3704@isilon.com>
References: <20090528000147.GB3704@isilon.com>
Message-ID: <74fe56020905271931l4c8d4677h3bbcce6d8c8a8605@mail.gmail.com>

How about the licensing. Darwin was open source under Apple's public license, but no longer. Or is it Mach you are talking about?

On Thu, May 28, 2009 at 5:31 AM, Zachary Loafman wrote:
> arch@ -
>
> Isilon has need of per-thread impersonation.
We're looking at > implementing something like the pthread_setugid_np mechanism found on > OS X, loosely documented in the code: > > http://fxr.watson.org/fxr/source/bsd/kern/kern_prot.c?v=xnu-1228 > (see settid and setgroups1) > > and some here: > http://lists.apple.com/archives/perfoptimization-dev/2008/Jan/msg00043.html > > Does anyone have strong objections to Apple's APIs here? There's > obviously no portable itnerface to handle it, and it seems a little > saner to just adopt someone else's API/semantics rather than reinvent. > > -- > Zach Loafman | Staff Engineer | Isilon Systems > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From kmsujit at gmail.com Thu May 28 03:39:32 2009 From: kmsujit at gmail.com (Sujit K M) Date: Thu May 28 03:39:42 2009 Subject: pthread_setugid_np In-Reply-To: <20090528024640.GC9388@isilon.com> References: <20090528000147.GB3704@isilon.com> <74fe56020905271931l4c8d4677h3bbcce6d8c8a8605@mail.gmail.com> <20090528024640.GC9388@isilon.com> Message-ID: <74fe56020905272039h6aed0724u38dbc25d0a1be6a7@mail.gmail.com> These are posix unix standards that you are going to be implementing. So if you are talking of only taking the interfaces, why is there any need to have objections. By the way these are a part of specification that austin group maintains at http://www.opengroup.org/certification/ On Thu, May 28, 2009 at 8:16 AM, Zachary Loafman wrote: > > On Thu, May 28, 2009 at 08:01:26AM +0530, Sujit K M wrote: >> On Thu, May 28, 2009 at 5:31 AM, Zachary Loafman wrote: >> > http://fxr.watson.org/fxr/source/bsd/kern/kern_prot.c?v=xnu-1228 >> > (see settid and setgroups1) >> >> How about the licensing. Darwin was open source under Apple's public >> license, but no longer. Or is it Mach you are taking about? 
> > I'm not proposing porting the code directly, I'm merely asking whether > the API and associated semantics are acceptable. It would be fairly > straightforward for us to write a unit test that could run on both > FreeBSD and OS X after this exercise. > > -- > Zach Loafman | Staff Engineer | Isilon Systems > From zml at FreeBSD.org Thu May 28 04:12:57 2009 From: zml at FreeBSD.org (Zachary Loafman) Date: Thu May 28 04:13:03 2009 Subject: pthread_setugid_np In-Reply-To: <74fe56020905272039h6aed0724u38dbc25d0a1be6a7@mail.gmail.com> References: <20090528000147.GB3704@isilon.com> <74fe56020905271931l4c8d4677h3bbcce6d8c8a8605@mail.gmail.com> <20090528024640.GC9388@isilon.com> <74fe56020905272039h6aed0724u38dbc25d0a1be6a7@mail.gmail.com> Message-ID: <20090528041236.GA14687@isilon.com> On Thu, May 28, 2009 at 09:09:28AM +0530, Sujit K M wrote: > These are posix unix standards that you are going to be implementing. > So if you are talking of only taking the interfaces, why is there any need > to have objections. pthread_setugid_np is a non-portable pthread extension for per-thread user/group impersonation on OS X. The _np on the function name is to indicate its lack of portability to other OSes - it is not part of any standard. There is no posix standard way to impersonate a user/group on a per-thread basis - and, in fact, the OS X pthread_setugid_np interface is the only one I know of in common use. I'm proposing introducing the same API and semantics to FreeBSD, thereby vaguely pushing it further towards a standard. I don't really claim it's the most elegant interface, though. 
-- Zach Loafman | Staff Engineer | Isilon Systems From kmsujit at gmail.com Thu May 28 04:33:09 2009 From: kmsujit at gmail.com (Sujit K M) Date: Thu May 28 04:33:19 2009 Subject: pthread_setugid_np In-Reply-To: <20090528041236.GA14687@isilon.com> References: <20090528000147.GB3704@isilon.com> <74fe56020905271931l4c8d4677h3bbcce6d8c8a8605@mail.gmail.com> <20090528024640.GC9388@isilon.com> <74fe56020905272039h6aed0724u38dbc25d0a1be6a7@mail.gmail.com> <20090528041236.GA14687@isilon.com> Message-ID: <74fe56020905272133r3f2ab491t962c6d0fe900e9d0@mail.gmail.com> The Source code licensing show two license information. One the Apple license and Other the BSD License. The BSD License is to the Mach code that is present in the source code, presumably I assume. And this includes the pthread_setugid_np, but with some amount of rework with the apple OS X implementation. Are you sure that this feature was never present in any of the BSD. Or has it been moved out due to some performance requirement. As far I see if present in OS X, It is an high performance piece of code. But it need to be checked whether the code was present in earlier version of BSD. Which might make it easier for you to have it in your internal version. On Thu, May 28, 2009 at 9:42 AM, Zachary Loafman wrote: > On Thu, May 28, 2009 at 09:09:28AM +0530, Sujit K M wrote: >> These are posix unix standards that you are going to be implementing. >> So if you are talking of only taking the interfaces, why is there any need >> to have objections. > > pthread_setugid_np is a non-portable pthread extension for per-thread > user/group impersonation on OS X. The _np on the function name is to > indicate its lack of portability to other OSes - it is not part of any > standard. There is no posix standard way to impersonate a user/group on > a per-thread basis - and, in fact, the OS X pthread_setugid_np interface > is the only one I know of in common use. 
> > I'm proposing introducing the same API and semantics to FreeBSD, thereby > vaguely pushing it further towards a standard. I don't really claim it's > the most elegant interface, though. > > -- > Zach Loafman | Staff Engineer | Isilon Systems > > From kmsujit at gmail.com Thu May 28 04:48:19 2009 From: kmsujit at gmail.com (Sujit K M) Date: Thu May 28 04:48:25 2009 Subject: pthread_setugid_np In-Reply-To: <74fe56020905272133r3f2ab491t962c6d0fe900e9d0@mail.gmail.com> References: <20090528000147.GB3704@isilon.com> <74fe56020905271931l4c8d4677h3bbcce6d8c8a8605@mail.gmail.com> <20090528024640.GC9388@isilon.com> <74fe56020905272039h6aed0724u38dbc25d0a1be6a7@mail.gmail.com> <20090528041236.GA14687@isilon.com> <74fe56020905272133r3f2ab491t962c6d0fe900e9d0@mail.gmail.com> Message-ID: <74fe56020905272148q680cdc05tb572d576a4c3ff2b@mail.gmail.com> As per the Apple Documentation: In some cases it is helpful to impersonate the user, at least as far as the permissions checking done by the BSD subsystem of the kernel. A single-threaded daemon can do this using seteuid and setegid. These set the effective user and group ID of the process as a whole. This will cause problems if your daemon is using multiple threads to handle requests from different users. In that case you can set the effective user and group ID of a thread using pthread_setugid_np. This was introduced in Mac OS X 10.4. (AT) http://developer.apple.com/technotes/tn2005/tn2083.html I think this is a part of the BSD (Mach) subsystem. 
From jhb at freebsd.org Thu May 28 13:53:09 2009 From: jhb at freebsd.org (John Baldwin) Date: Thu May 28 13:53:29 2009 Subject: pthread_setugid_np In-Reply-To: <74fe56020905272148q680cdc05tb572d576a4c3ff2b@mail.gmail.com> References: <20090528000147.GB3704@isilon.com> <74fe56020905272133r3f2ab491t962c6d0fe900e9d0@mail.gmail.com> <74fe56020905272148q680cdc05tb572d576a4c3ff2b@mail.gmail.com> Message-ID: <200905280812.52431.jhb@freebsd.org> On Thursday 28 May 2009 12:48:17 am Sujit K M wrote: > As per the Apple Documentation: > > In some cases it is helpful to impersonate the user, at least as far > as the permissions checking done by the BSD subsystem of the kernel. A > single-threaded daemon can do this using seteuid and setegid. These > set the effective user and group ID of the process as a whole. This > will cause problems if your daemon is using multiple threads to handle > requests from different users. In that case you can set the effective > user and group ID of a thread using pthread_setugid_np. This was > introduced in Mac OS X 10.4. > > (AT) http://developer.apple.com/technotes/tn2005/tn2083.html > > > I think this is a part of the BSD (Mach) subsystem. It has never been in BSD outside of OS X. BSD from UC Berkeley did not support kernel threads and you are free to check the CVS history of the various kern_prot.c files on other BSD's yourself. There is no BSD code to do this, and you could not use Darwin's code directly on FreeBSD anyway since the two OS's manage credential state differently. -- John Baldwin From jhb at freebsd.org Thu May 28 13:53:10 2009 From: jhb at freebsd.org (John Baldwin) Date: Thu May 28 13:53:29 2009 Subject: pthread_setugid_np In-Reply-To: <20090528000147.GB3704@isilon.com> References: <20090528000147.GB3704@isilon.com> Message-ID: <200905280816.29617.jhb@freebsd.org> On Wednesday 27 May 2009 8:01:48 pm Zachary Loafman wrote: > arch@ - > > Isilon has need of per-thread impersonation. 
We're looking at > implementing something like the pthread_setugid_np mechanism found on > OS X, loosely documented in the code: > > http://fxr.watson.org/fxr/source/bsd/kern/kern_prot.c?v=xnu-1228 > (see settid and setgroups1) > > and some here: > http://lists.apple.com/archives/perfoptimization-dev/2008/Jan/msg00043.html > > Does anyone have strong objections to Apple's APIs here? There's > obviously no portable itnerface to handle it, and it seems a little > saner to just adopt someone else's API/semantics rather than reinvent. I suppose you would implement this by having a new flag in td_pflags to indicate that the thread is using a private credential and use that to disable the automatic updating of td_ucred on syscall return and then just point td_ucred at the thread-specific credential? Hmm, the XXX in Darwin's source about P_SUGID is probably meaningful for us as we still use that flag. I would defer to Robert on how that should work though. -- John Baldwin From rwatson at FreeBSD.org Thu May 28 13:56:13 2009 From: rwatson at FreeBSD.org (Robert Watson) Date: Thu May 28 13:56:19 2009 Subject: IP_NONLOCALOK improvements. In-Reply-To: <4A1D80CA.4020702@elischer.org> References: <20090526135547.GE1491@garage.freebsd.pl> <4A1CD562.9040706@elischer.org> <20090527065121.GD4204@garage.freebsd.pl> <4A1D80CA.4020702@elischer.org> Message-ID: On Wed, 27 May 2009, Julian Elischer wrote: >>>> I added new privilege - PRIV_NETINET_BINDANY, because we do have to check >>>> for privilege before allowing to use it. > > are we sure we want to make a whole PRIV just for this function? We have lots of privs, and in fact that's the point of privs: they should narrowly describe a specific privilege so that granting that privilege doesn't (generally) imply the granting of other privileges. 
Some privileges are necessarily more broad (for example, privilege to enable I/O port access from userspace implies all other privileges), but mostly this is not true of access control privileges such as this one. >>> I am not sure about this. if a system has this enabled then I presume it >>> is a special system and not a generally available time-sharing system. >>> >>> How do you allow a process to have this privilege? are you forcing them to >>> be root for now? >> >> Our current privilege model is that we have fine-grained privileges in the >> kernel, but those are not _yet_ exposed to userland. All privileges defined >> in sys/priv.h are available for unjailed root and some (take a look at >> prison_priv_check() function) for jailed root. > > when we have vimage, this may be required in some vimages and not others.. > the priv should be inherrited but gated on the new jail.. in other words, > when yo create a child jail, it can do it if: 1/ the parent can do it AND 2/ > the parent allows the child to do it. The eventual intent is to support privilege masks in FreeBSD, which would include a jail cap on privileges granted in the jail, as well as the ability to assign more specific privilege sets to specific non-root processes. I had hoped to get this done for FreeBSD 8.0, but due to other obligations, that hasn't happened. Perhaps if it goes well, we'll get one for 8.1 or 8.2. The main priority in doing this is doing it safely, since there are lots of worked examples in how not to get it right. Robert N M Watson Computer Laboratory University of Cambridge > >> >> Today this new privilege will only be available for unjailed root. >> >> At some point we will grow possibility to selectively add/remove >> privileges just like Solaris, but we can't do that now. >> >>>> I removed kernel option to enable it, I see to reason not to have it in >>>> GENERIC. >>> Because it adds complexity and because some people do not want it even >>> possible. 
>>> You are eneabling NON-standard, (in fact "Standard-ignoring") >>> behaviour. >>> >>> >>>> I also removed sysctl to enable it - we have privilege for limiting its >>>> use. >>> I disagree very strongly about this one. I would liek to >>> 1/ have to explicitly compile in thi snon standard behaviour and >>> >>> 2/ turn it on >>> >>> before we start doing this. >>> >>> >>> I know how useful this is to have, (from my own experience) >>> but feel strongly that this is pretty bad behaviour for most systems >>> and can facilitate all sorts security worries. >> >> Well, this is behaviour is similar to adding an IP address to an >> interface and binding to that address. There is even no securelevel that >> denies modifing interfaces, so in my opinion if one needs to explicitly >> ask for this to be enabled for a socket and one needs a special >> privilege to do it, it should be enough protection to make user's live a >> bit less complex by not requiring kernel recompilation and sysctl >> modification. >> >> I'm not sure if this was on purpose, but currently even unprivileged >> user can use this functionality if the sysctl is on, which I find hard >> to accept. Having this always enabled and requiring a privilege is IMHO >> more secure than allowing anyone to use it once the sysctl is on. >> But again, combining the two (privilege and sysctl) is redundant IMHO. > > it was on purpose as it was assumed, as I said, that anyone compiling > it in would be creating a special "appliance" kernel and be fully > in charge of the machine. > >> >> If it doesn't convince you and I also don't feel convinced we need to >> wait for more votes:) > > I can live with having it there by default in GENERIC, but I'm > not sure I don't still want to be able to remove it.. > the sysctl could go , but I felt I wanted a way for the admin > to disble it on a system if it shared a kernel with other > systems that need it. > > > i.e. 
if it's in GENERIC, then I think the admin should be able to stop it > from being available. > > > >> > > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From zml at FreeBSD.org Thu May 28 15:18:18 2009 From: zml at FreeBSD.org (Zachary Loafman) Date: Thu May 28 15:18:24 2009 Subject: pthread_setugid_np In-Reply-To: <200905280816.29617.jhb@freebsd.org> References: <20090528000147.GB3704@isilon.com> <200905280816.29617.jhb@freebsd.org> Message-ID: <20090528151800.GA18467@isilon.com> On Thu, May 28, 2009 at 08:16:29AM -0400, John Baldwin wrote: > I suppose you would implement this by having a new flag in td_pflags to > indicate that the thread is using a private credential and use that to > disable the automatic updating of td_ucred on syscall return and then just > point td_ucred at the thread-specific credential? That sounds about right, though is actually more detailed than I had gotten in my cursory investigation. > Hmm, the XXX in Darwin's source about P_SUGID is probably meaningful for us as > we still use that flag. I would defer to Robert on how that should work > though. Hm, given the intent of issetugid(2), it seems like P_SUGID should instead become a count of tainted threads rather than a flag. -- Zach Loafman | Staff Engineer | Isilon Systems From rwatson at FreeBSD.org Fri May 29 08:48:35 2009 From: rwatson at FreeBSD.org (Robert Watson) Date: Fri May 29 08:48:41 2009 Subject: pthread_setugid_np In-Reply-To: <20090528000147.GB3704@isilon.com> References: <20090528000147.GB3704@isilon.com> Message-ID: On Wed, 27 May 2009, Zachary Loafman wrote: > Isilon has need of per-thread impersonation. 
We're looking at implementing > something like the pthread_setugid_np mechanism found on OS X, loosely > documented in the code: > > http://fxr.watson.org/fxr/source/bsd/kern/kern_prot.c?v=xnu-1228 (see settid > and setgroups1) > > and some here: > http://lists.apple.com/archives/perfoptimization-dev/2008/Jan/msg00043.html > > Does anyone have strong objections to Apple's APIs here? There's obviously > no portable interface to handle it, and it seems a little saner to just > adopt someone else's API/semantics rather than reinvent. I'm not opposed to adding APIs along these lines, as long as it is done Very Carefully(tm). Some experience here suggests these sorts of things are very easy to do wrong, anyway. :-) Having spent some time investigating and using the APIs on Mac OS X in the last year, I can report that their usage is at times inconsistent. Applications frequently fail to properly update their full thread credentials, assuming that updating only the euid or egid is sufficient, perhaps neglecting other IDs or additional groups. While this is definitely an application bug, it is also relevant to the base OS because we do provide a set of credential-management library functions (especially setlogincontext(3)) that will not be aware of thread credentials. Per-thread credentials also require semantics that effectively preclude M:N threading with userspace context switching being used in the future (or, at least, require user threads with different credentials to use different kernel-visible threads, or the addition of explicit ucred descriptors to allow credential context to be saved and restored), which while not currently a huge concern, is worth thinking about. There are also potential concerns about other credential elements, such as MAC labels provided by policies that assume timely update of the label across all threads (i.e., on next entry to the kernel) as part of their semantics, and might not respond well to individual threads having other labels.
This might be addressed by MAC policies having the opportunity to force an update to the per-thread credential even when running in per-thread mode in order to propagate their own changes, but we'd have to think a bit about the specific requirements. Finally: one of the things Apple found with lots of use of daemons that switched credentials a lot in order to impersonate many users out of a single process is that they ended up with a lot more different credentials in use at a time, as the fleeting credentials get referenced for the long-term by file descriptors opened when the credential was active. Our reference counting model is intended to save memory in the case where lots of credentials are the same but changes are infrequent, and so you can see kernel memory use balloon. The per-thread case is a bit better behaved than the simple per-process case frequently switching, but it's worth watching out for this. Apple addressed the problem by doing a coalesce stage after creating and initializing a new credential, in which potential existing credentials with identical contents are searched for and then used instead, discarding the new one, which comes with some overhead. Robert N M Watson Computer Laboratory University of Cambridge From Alexander at Leidinger.net Fri May 29 09:17:13 2009 From: Alexander at Leidinger.net (Alexander Leidinger) Date: Fri May 29 09:17:26 2009 Subject: Profile rc idea In-Reply-To: <4A1F12E4.1060404@comcast.net> References: <4A1F12E4.1060404@comcast.net> Message-ID: <20090529110044.200461pczbdmklk4@webmail.leidinger.net> Quoting Nathan Lay (from Thu, 28 May 2009 18:40:36 -0400): > Hi list, > Wasn't sure which list this idea belongs, so I sent it here. It arch@ (cced) is the generic place to discuss architectural changes of subsystems. > would be interesting if rc was extended to support profiles. Each > profile would reflect a different system configuration.
For example > profiles could describe the computing environment at: home, work, > friend's house, airplane, etc... The active profile the system uses > could be chosen based on some contingency condition. For example, > simply prompting the user to choose an rc profile at boot, or using > hardware to choose the profile (e.g. like location based contingency > using GPS hardware), or whatever... I guess this only pertains to > booting though, but rc seems like a natural place to do this. > Thoughts, comments? Yet another idea I have no time to try... You can already do this in rc.conf: ---snip--- location=at_home case ${location} in at_home) ifconfig_xx0="..." ... ;; at_work) ifconfig_xx0="..." ... ;; *) echo wrong location set exit 1 # alternatively use some kind of default setup ;; esac ---snip--- This way you need to know before where you are, or boot into single-user mode. You can also extend it to read a kenv ("location=$(/bin/kenv profile.location)"), this way you can specify the location in the loader (bonus points to implement a loader extension in forth to read a file which lists possible profiles and offer them in the menu). Bye, Alexander. -- "I keep seeing spots in front of my eyes." "Did you ever see a doctor?" "No, just spots." http://www.Leidinger.net Alexander @ Leidinger.net: PGP ID = B0063FE7 http://www.FreeBSD.org netchild @ FreeBSD.org : PGP ID = 72077137 From zml at FreeBSD.org Fri May 29 22:54:48 2009 From: zml at FreeBSD.org (Zachary Loafman) Date: Fri May 29 22:54:54 2009 Subject: pthread_setugid_np In-Reply-To: References: <20090528000147.GB3704@isilon.com> Message-ID: <20090529225432.GC27779@isilon.com> First off, let me just say that I really appreciate such a thorough response. It was a pleasure to read. :) On Fri, May 29, 2009 at 09:48:33AM +0100, Robert Watson wrote: [...] > Having spent some time investigating and using the APIs on Mac OS X in > the last year, I can report that their usage is at times inconsistent. [...] 
This is one of the things I don't really like about the standard APIs, either. If I were to deviate from the Mac OS X API, I would propose something more along the lines of two calls: int pthread_setcred_np(uid_t uid, int ngroups, const gid_t *gidset); void pthread_clearcred_np(); .. where clearcred is used to change the thread back to per-process credentials. I don't really like using an artificial ID for clearing it. Setting the uid and groups in the same call would provide a clue to the application writer that the supplemental group list should be considered as well. Combining the setuid and setgroups into one call also has internal advantages in that no intermediate cred is ever created. As it is, the OS X APIs are a little kludgy around setgroups. Unless I'm misunderstanding something, you can't really tell if setgroups() is modifying the per-process or per-thread credentials unless you also know whether the thread is running with per-thread credentials. > Per-thread credentials also require semantics that effectively preclude > M:N threading with userspace context switching being used in the future I hadn't really thought much about that, because I thought M:N was effectively dead. :) > There are also potential concerns about other credential elements, such > as MAC labels provided by policies that assume timely update of the label I'm going to have to research this point a little more thoroughly, I haven't looked at the MAC interaction here. > but it's worth watching out for this. Apple addressed the problem by > doing a coalesce stage after creating and initializing a new credential, > in which potential existing credentials with identical contents are > searched for and then used instead, discarding the new one, which comes > with some overhead. This is somewhat of a concern for us, yes. In theory, the number of unique creds for us is roughly bounded by the number of active connections.
However, in a multithreaded server that swaps to per-thread credentials to do Real Work, the number of creds in a naive system would end up more on the order of the number of open files. I'm not sure that's a huge concern for us, though. In theory, we might also be able to de-dupe in the background. Serialization is a concern, but it seems doable. That wouldn't incur the active overhead of keeping a cred cache. Longer term, we may be interested in working on a cred cache, too. We have other motives behind this, though: we're looking at storing alternate identities in the cred, and given the expense of mapping in certain circumstances, it makes a lot more sense to just hunt for an existing cred that matches. -- Zach Loafman | Staff Engineer | Isilon Systems
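The coalesce step discussed in this thread (look for an existing credential with identical contents and reuse it instead of keeping a new one) can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: struct cred, cred_intern(), cred_release() and cred_cache_size() are hypothetical names, not FreeBSD's struct ucred or Darwin's code; group order is treated as significant; a linear list stands in for whatever hash table a real cache would use; and there is no locking, so the serialization concern raised above is not addressed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define CRED_MAXGROUPS 16	/* arbitrary limit for this sketch */

/* Hypothetical, simplified credential; NOT FreeBSD's struct ucred. */
struct cred {
	uid_t		 uid;
	int		 ngroups;
	gid_t		 groups[CRED_MAXGROUPS];
	unsigned	 refcount;
	struct cred	*next;
};

/* All live credentials. Unlocked: callers must serialize access. */
static struct cred *cred_cache;

static int
cred_equal(const struct cred *c, uid_t uid, int ngroups, const gid_t *groups)
{
	if (c->uid != uid || c->ngroups != ngroups)
		return (0);
	return (ngroups == 0 ||
	    memcmp(c->groups, groups, (size_t)ngroups * sizeof(gid_t)) == 0);
}

/*
 * Return a referenced credential with the given contents, reusing an
 * identical cached one when it exists.  Returns NULL on bad input or
 * allocation failure.
 */
struct cred *
cred_intern(uid_t uid, int ngroups, const gid_t *groups)
{
	struct cred *c;

	if (ngroups < 0 || ngroups > CRED_MAXGROUPS ||
	    (ngroups > 0 && groups == NULL))
		return (NULL);
	for (c = cred_cache; c != NULL; c = c->next) {
		if (cred_equal(c, uid, ngroups, groups)) {
			c->refcount++;		/* coalesce: share it */
			return (c);
		}
	}
	c = calloc(1, sizeof(*c));
	if (c == NULL)
		return (NULL);
	c->uid = uid;
	c->ngroups = ngroups;
	if (ngroups > 0)
		memcpy(c->groups, groups, (size_t)ngroups * sizeof(gid_t));
	c->refcount = 1;
	c->next = cred_cache;
	cred_cache = c;
	return (c);
}

/* Drop one reference; unlink and free the credential on the last one. */
void
cred_release(struct cred *c)
{
	struct cred **pp;

	assert(c != NULL && c->refcount > 0);
	if (--c->refcount > 0)
		return;
	for (pp = &cred_cache; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			break;
		}
	}
	free(c);
}

/* Number of distinct credentials currently cached. */
size_t
cred_cache_size(void)
{
	const struct cred *c;
	size_t n = 0;

	for (c = cred_cache; c != NULL; c = c->next)
		n++;
	return (n);
}
```

With a scheme like this, two threads impersonating the same uid and group list would end up sharing one credential, so the number of distinct creds tracks the number of distinct identities rather than the number of open files. The lookup on every switch is the "active overhead" mentioned above; a background de-dupe pass would trade that for temporarily duplicated creds.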