From bugmaster at FreeBSD.org Mon Jun 1 11:06:47 2009
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Jun 1 11:07:28 2009
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200906011106.n51B6krJ020977@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From julian at elischer.org Tue Jun 2 16:30:07 2009
From: julian at elischer.org (Julian Elischer)
Date: Tue Jun 2 16:30:13 2009
Subject: WIP: ATA to CAM integration
In-Reply-To: <4A254FB5.3030504@FreeBSD.org>
References: <4A254FB5.3030504@FreeBSD.org>
Message-ID: <4A25538E.5020302@elischer.org>

Alexander Motin wrote:
> Hi.
>
> After replying to several similar questions about my ATA plans
> lately, I have decided to announce the things I am working on now
> together with Scott Long.
>

I can't imagine a team I'd rather have doing it.

great news

From mav at mavhome.dp.ua Tue Jun 2 16:54:55 2009
From: mav at mavhome.dp.ua (Alexander Motin)
Date: Tue Jun 2 16:55:02 2009
Subject: WIP: ATA to CAM integration
Message-ID: <4A254B45.8050800@mavhome.dp.ua>

Hi.

After replying to several similar questions about my ATA plans lately, I
have decided to announce the things I am working on now together with
Scott Long.

While learning the FreeBSD ATA implementation, I have found that it has
numerous deep problems, from quite fuzzy APIs to various issues in
device detection and error recovery.
Also, since this infrastructure was written many years ago, it has no
support at all for any kind of command queuing, which is normal for most
modern drives and controllers. Fixing all of this requires many
significant changes.

Also, as you may know, SAS controllers and expanders allow attaching SATA
devices and port multipliers to them, by transporting ATA commands over
the SCSI (SAS) bus. ATAPI, at the same time, is a way to transport SCSI
commands over the ATA interface. Such deep interoperation of technologies
pushes us to have similarly integrated infrastructures at the software
level. We already have the atapicam driver, which gives the CAM SCSI
infrastructure access to ATAPI devices by translating driver APIs and
emulating a SCSI bus. But it works only one way and is also not perfect.

Looking at all of this, I have decided to join Scott to revive his
project of making CAM the system's universal infrastructure for both SCSI
and ATA. This project is not about some kind of SCSI-to-ATA translation,
as used by some OSes, like OpenBSD. It is about extending CAM to equally
support both the SCSI and ATA worlds natively and integrate them as
tightly as possible.

This project is going to have several main steps:
- separate the SCSI command set and SCSI bus management code from the
  abstract CAM code and create an abstract transport API (this step is
  mostly done now by Scott),
- implement ATA command set device drivers, ATA bus management code and
  ATA host controller drivers (this step is now in progress by me),
- update CAM to use newbus, to make the interoperation of its components
  more transparent and formalized (this is now in the planning and
  preparation stage),
- when the above is finished, port the rest of the existing ATA
  controller drivers one by one to the new world order.

Our code now lives in PERFORCE in the scottl-camlock project branch.
It is in early development, but we already have there a working CAM
driver for AHCI controllers with command queuing and basic NCQ support,
simple SATA bus management code and an ATA disk driver. I am now able to
boot my system and work from a SATA drive on an AHCI controller, using
the ATA disk driver with command queuing and NCQ enabled, and to read and
write discs with a SATA ATAPI DVD-RW drive, using the native SCSI CD
driver. And all of that only with CAM, without using any part of the ATA
infrastructure.

--
Alexander Motin

From bf2006a at yahoo.com Tue Jun 2 22:43:46 2009
From: bf2006a at yahoo.com (bf)
Date: Tue Jun 2 22:44:17 2009
Subject: WIP: ATA to CAM integration
Message-ID: <185581.64029.qm@web39107.mail.mud.yahoo.com>

Alexander:

I was happy to see your recent P4 commit messages, hoping for good
things to come, but your announcement surpasses my expectations. Thanks
to you and Scott for undertaking this effort. Will there be a trickle of
commits to the main source tree, or are you going to hold off until the
project is finished? If the latter, when do you think that you will be
done?

Regards,
b.

From mav at FreeBSD.org Wed Jun 3 06:06:23 2009
From: mav at FreeBSD.org (Alexander Motin)
Date: Wed Jun 3 06:06:30 2009
Subject: WIP: ATA to CAM integration
In-Reply-To: <185581.64029.qm@web39107.mail.mud.yahoo.com>
References: <185581.64029.qm@web39107.mail.mud.yahoo.com>
Message-ID: <4A2612D7.7030206@FreeBSD.org>

bf wrote:
> Will there be a trickle of commits
> to the main source tree, or are you going to
> hold off until the project is finished? If the
> latter, when do you think that you will be done?

CURRENT is in code slush now, so most of the code will not hit the tree
for at least 3-4 months, not before 8.x branching. I hope to stabilize
the CAM ATA code in the next 2 months. But I don't know yet how much time
the next stage will take; it is going to be very advantageous, but also
quite invasive.

--
Alexander Motin

From bf2006a at yahoo.com Wed Jun 3 22:11:44 2009
From: bf2006a at yahoo.com (bf)
Date: Wed Jun 3 22:11:50 2009
Subject: WIP: ATA to CAM integration
Message-ID: <316278.59068.qm@web39102.mail.mud.yahoo.com>

--- On Wed, 6/3/09, Alexander Motin wrote:

> From: Alexander Motin
> Subject: Re: WIP: ATA to CAM integration
> To: "bf"
> Cc: freebsd-arch@FreeBSD.org
> Date: Wednesday, June 3, 2009, 2:06 AM
> bf wrote:
> > Will there be a trickle of commits
> > to the main source tree, or are you going to
> > hold off until the project is finished? If the
> > latter, when do you think that you will be done?
>
> CURRENT is in code slush now, so most of the code
> will not hit the tree for at least 3-4 months, not before
> 8.x branching. I hope to stabilize the CAM ATA code in the
> next 2 months. But I don't know yet how much time the next
> stage will take; it is going to be very advantageous, but
> also quite invasive.
>
> -- Alexander Motin
>

Thanks for letting us know. I hope that you will coordinate with the USB
developers, to ensure that your changes can be used to best advantage
with devices on that bus.

b.

From jroberson at jroberson.net Thu Jun 4 06:50:18 2009
From: jroberson at jroberson.net (Jeff Roberson)
Date: Thu Jun 4 06:50:25 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc.
 help needed
Message-ID:

http://people.freebsd.org/~jeff/dpcpu.diff

This patch implements dynamic per-cpu areas such that kernel code can do
the following in a header:

DPCPU_DECLARE(uint64_t, foo);

and this in source:

DPCPU_DEFINE(uint64_t, foo) = 10;

local = DPCPU_GET(foo);
DPCPU_SET(foo, 11);

The dynamic per-cpu area of non-local cpus is accessible via
DPCPU_ID_{GET,SET,PTR}.

If you provide an initializer as I used above, that will be the default
value when all cpus come up. Otherwise it defaults to zero. This is
presently slightly more expensive than PCPU but much more flexible.
Things like id and curthread should stay in PCPU forever.

I had to change the pcpu_init() call on every architecture to pass in
storage for the dynamic area. I didn't change the following three calls
because it wasn't immediately obvious how to allocate the memory:

./powerpc/booke/machdep.c: pcpu_init(pc, 0, sizeof(struct pcpu));
./mips/mips/machdep.c: pcpu_init(&__pcpu[0], 0, sizeof(struct pcpu));
./mips/mips/machdep.c: pcpu_init(pcpup, 0, sizeof(struct pcpu));

I have not tested anything other than amd64. If you have a !amd64
architecture, in particular any of the embedded architectures, I would
really appreciate it. Some of the arm boards postincrement the end
address to allocate early memory and some pre-decrement. Hopefully I got
it right.

Thanks,
Jeff

From rwatson at FreeBSD.org Thu Jun 4 09:46:43 2009
From: rwatson at FreeBSD.org (Robert Watson)
Date: Thu Jun 4 09:46:50 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed
In-Reply-To:
References:
Message-ID:

On Wed, 3 Jun 2009, Jeff Roberson wrote:

> I have not tested anything other than amd64. If you have a !amd64
> architecture, in particular any of the embedded architectures, I would
> really appreciate it. Some of the arm boards postincrement the end address
> to allocate early memory and some pre-decrement. Hopefully I got it right.
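
For readers unfamiliar with per-CPU data, the DPCPU interface from Jeff's
message can be pictured with a small userland model. This is only a
sketch under stated assumptions, not the code in dpcpu.diff: every
MODEL_* and model_* name is invented here, the CPU count is fixed at 4,
and the "current CPU" is an ordinary variable rather than something the
kernel provides.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NCPU 4                 /* hypothetical CPU count */

/* Userland stand-in for one dynamic per-CPU uint64_t variable. */
struct model_dpcpu_u64 {
	uint64_t init;               /* initializer given at definition */
	uint64_t slot[MODEL_NCPU];   /* one private copy per CPU */
};

static int model_curcpu;             /* stands in for the current CPU id */

/* "When all cpus come up": copy the initializer into every CPU's slot. */
static void
model_cpus_up(struct model_dpcpu_u64 *v)
{
	for (int cpu = 0; cpu < MODEL_NCPU; cpu++)
		v->slot[cpu] = v->init;
}

/* Rough analogues of DPCPU_ID_GET/SET (any CPU) and DPCPU_GET/SET (local). */
#define MODEL_ID_GET(v, cpu)       ((v)->slot[(cpu)])
#define MODEL_ID_SET(v, cpu, val)  ((v)->slot[(cpu)] = (val))
#define MODEL_GET(v)               MODEL_ID_GET((v), model_curcpu)
#define MODEL_SET(v, val)          MODEL_ID_SET((v), model_curcpu, (val))

/* Mirrors the example in the message: foo defaults to 10 on every CPU. */
static struct model_dpcpu_u64 model_foo = { .init = 10 };
```

After model_cpus_up(&model_foo), setting model_curcpu to 1 and calling
MODEL_SET(&model_foo, 11) changes only CPU 1's copy, while
MODEL_ID_GET(&model_foo, 0) still reads the initializer. The real patch
presumably lays the variables out in a per-CPU memory area instead of an
array per variable; that part is not modeled here.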

I appear to get an instant reboot early during the kernel startup on i386
with this patch applied:

OK lsmod
 0x400000: /boot/kernel/kernel (elf kernel, 0xcd8920)
  modules: elink.1 io.1 hptrr.1 ufs.1 kernel_mac_support.4 krpc.1
    nfslockd.1 nfssvc.1 nfsserver.1 nfslock.1 nfs.1 wlan_sta.1 wlan.1
    wlan_wep.1 wlan_tkip.1 wlan_ccmp.1 wlan_amrr.1 if_gif.1 if_firewire.1
    if_faith.1 ether.1 sysvshm.1 sysvsem.1 sysvmsg.1 firmware.1
    kernel.800096 cd9660.1 isa.1 pseudofs.1 procfs.1 msdosfs.1
    usb_quirk.1 ucom.1 uvscom.1 uslcom.1 uplcom.1 uether.1 cdce.1 usb.1
    random.1 ppbus.1 pci.1 pccard.1 null.1 mpt_user.1 mpt_raid.1 mpt.1
    mpt_cam.1 mpt_core.1 miibus.1 mem.1 isp.1 sbp.1 fwip.1 fwe.1
    firewire.1 splash.1 exca.1 dcons.2 dcons_crom.1 cardbus.1 bt.1 ath.1
    ast.1 afd.1 acd.1 ataraid.1 ad.1 ata_via.1 ata_sis.1 ata_sii.1
    ata_serverworks.1 ata_promise.1 ata_nvidia.1 ata_netcell.1
    ata_national.1 ata_micron.1 ata_marvell.1 ata_jmicron.1 ata_ite.1
    ata_intel.1 ata_highpoint.1 ata_cyrix.1 ata_cypress.1 ata_cenatek.1
    ata_ati.1 ata_amd.1 ata_adaptec.1 ata_ali.1 ata_acard.1 ata_ahci.1
    atapci.1 ata.1 ahc.1 ahd.1 ahd_pci.1 ahc_pci.1 ahc_isa.1 ahc_eisa.1
    agp.1 acpi_pci.1 acpi.1 scsi_low.1 cam.1
OK boot -s

Robert N M Watson
Computer Laboratory
University of Cambridge

From imp at bsdimp.com Thu Jun 4 18:18:24 2009
From: imp at bsdimp.com (M. Warner Losh)
Date: Thu Jun 4 18:18:30 2009
Subject: HEADS UP: Removing functions from driver API
Message-ID: <20090604.121611.1057477291.imp@bsdimp.com>

I'd like to remove the following from the driver API by making them
static:
	devclass_add_driver
	devclass_delete_driver
	devclass_find_driver

They aren't used, nor generally useful, by drivers in the current
tree. The devclass_t routines are generally harder to lock than
necessary because they touch so much global data. By eliminating
these from the API, it's three fewer functions that need to be robustly
locked for external consumers. Since they are basically unused today
anyway, I think it would be better to just reduce their scope.

Comments?

Warner

From jhb at freebsd.org Thu Jun 4 20:07:39 2009
From: jhb at freebsd.org (John Baldwin)
Date: Thu Jun 4 20:07:45 2009
Subject: HEADS UP: Removing functions from driver API
In-Reply-To: <20090604.121611.1057477291.imp@bsdimp.com>
References: <20090604.121611.1057477291.imp@bsdimp.com>
Message-ID: <200906041519.47552.jhb@freebsd.org>

On Thursday 04 June 2009 2:16:11 pm M. Warner Losh wrote:
> I'd like to remove the following from the driver API by making them
> static:
> 	devclass_add_driver
> 	devclass_delete_driver
> 	devclass_find_driver
>
> They aren't used, nor generally useful, by drivers in the current
> tree. The devclass_t routines are generally harder to lock than
> necessary because they touch so much global data. By eliminating
> these from the API, it's three fewer functions that need to be robustly
> locked for external consumers. Since they are basically unused today
> anyway, I think it would be better to just reduce their scope.
>
> Comments?

Go for it.

--
John Baldwin

From dillon at apollo.backplane.com Fri Jun 5 07:14:21 2009
From: dillon at apollo.backplane.com (Matthew Dillon)
Date: Fri Jun 5 07:14:28 2009
Subject: WIP: ATA to CAM integration
References: <4A254B45.8050800@mavhome.dp.ua>
Message-ID: <200906050703.n5573x5Q071765@apollo.backplane.com>

:Hi.
:
:After replying to several similar questions about my ATA plans
:lately, I have decided to announce the things I am working on now
:together with Scott Long.
:
:While learning the FreeBSD ATA implementation, I have found that it has
:numerous deep problems, from quite fuzzy APIs to various issues in
:device detection and error recovery. Also, since this
:infrastructure was written many years ago, it has no support at all
:for any kind of command queuing, which is normal for most modern
:drives and controllers. Fixing all of this requires many significant
:changes.
:
:Also, as you may know, SAS controllers and expanders allow attaching
:SATA devices and port multipliers to them, by transporting ATA commands
:...
:project of making CAM the system's universal infrastructure for both SCSI
:and ATA. This project is not about some kind of SCSI-to-ATA translation,
:as used by some OSes, like OpenBSD. It is about extending CAM to equally
:support both the SCSI and ATA worlds natively and integrate them as
:tightly as possible.
:
:...
:Our code now lives in PERFORCE in the scottl-camlock project branch. It is
:in early development, but we already have there a working CAM driver for
:AHCI controllers with command queuing and basic NCQ support, simple SATA
:bus management code and an ATA disk driver. I am now able to boot my system
:and work from a SATA drive on an AHCI controller, using the ATA disk driver
:with command queuing and NCQ enabled, and to read and write discs with a
:SATA ATAPI DVD-RW drive, using the native SCSI CD driver. And all of that
:only with CAM, without using any part of the ATA infrastructure.
:
:--
:Alexander Motin

The biggest issue with AHCI (and ATA) interfacing is that AHCI devices
attach either as DISK or ATAPI. A device which attaches as a DISK
does not typically support ATAPI commands (though it would be an
interesting experiment to see if some did). This means that no matter
what you do a SCSI<->ATA translation layer needs to do some significant
fake-ups for DISK attachments, similar to what OpenBSD does in their
SCSI<->ATA layer. ATA DISK attachments simply do not support enough
of the SCSI command set for direct integration into CAM (IMHO).

The second biggest issue is that it is really unclear to me how the
hell one probes an ATAPI device for NCQ support. The OpenBSD driver
only uses AHCI-NCQ for DISK attachments, where the NCQ support is
returned in the IDENTIFY command. I'm sure there must be a way to
probe an ATAPI device for NCQ support but I don't know what it is.

I don't think it is possible to get much cleaner than OpenBSD's AHCI
driver (/usr/src/sys/dev/pci/ahci.c in OpenBSD). There is also a
DragonFly port of the OpenBSD AHCI driver (/usr/src/sys/dev/disk/ahci
in DragonFly) which you may want to look at. You are already familiar
with OpenBSD's SCSI<->ATA layer (in /usr/src/sys/dev/ata in OpenBSD).

The DragonFly port of the OpenBSD driver will be done in about a week,
and maybe a bit longer for the port-multiplier additions. It
essentially works now. I'm still working on hot-plug support (the
OpenBSD driver doesn't have it), some error reporting / SENSE issues,
and CAM bus rescan. The DFly port is a closer match to FreeBSD since
we use busdma and CAM. There are some API differences but far fewer
versus the OpenBSD driver.

It sounds like your own AHCI driver is well underway, though. Going
with a separate AHCI-only driver and then just using the ATA driver
to pick up non-AHCI ATA ports is probably the correct way to go.
That is what we intend to do.
It's really amazing how *CLEAN* an AHCI-only driver is without all
those old ATA hardware interfaces to deal with. The entire DragonFly
driver is only 3700 lines of code.

--

A couple of notes on the OpenBSD AHCI driver. OpenBSD only allocates
a 24-entry PRDT (DMA chain) table per tag per port. The PRDT table
can be up to 65535 entries and should be large enough to at least
handle MAXPHYS transfers (56 entries would do the job). Their ATAPI
implementation does not appear to do compatibility translations for
INQUIRY, READ_6, or WRITE_6 (the DragonFly version does). OpenBSD's
port reset code also doesn't work perfectly, something I will be
fixing for hot plug support in the DragonFly port over the next week.

The OpenBSD driver does not have port multiplier support but adding
it to the DFly driver will be pretty easy... I just need some hardware
to test it with (it's on the way). Unfortunately the AHCI-1.0
specification says it cannot be used for high-performance multi-disk
I/O because all parallel commands in operation are only allowed to go
to a single target at a time. i.e. you can't mix parallel commands
to different targets on the same port. That's a serious problem.

(Does anyone know if the AHCI-1.1 or 1.2 specifications do anything
about that?).

It is unclear to me from reading the specification as to whether
AHCI-NCQ is supported for SATA ATAPI attachments. If anyone has an
answer to that I'm looking for a way to probe the device's
max-commands for ATAPI. For DISKs the IDENTIFY command has the
necessary feature bits and information. I'm sure the host controller
supports it natively but the real question is how to probe
the capability on the attached device and whether the device(s)
support it.

Ultimately the best way to expand-out an AHCI interface is with
SCSI pass-through over ATAPI, assuming NCQ can be supported somehow.
The port-multiplier spec is badly broken (at least in Intel's AHCI-1.0
spec).
It is a bit annoying, actually, I wouldn't have thought that
Intel would have made such a basic mistake. All they had to do was
implement 4 bits in the FIS and the problem would have been solved.
Instead they have routing bits in a port register. Sigh.

-Matt
Matthew Dillon

From dillon at apollo.backplane.com Fri Jun 5 16:01:02 2009
From: dillon at apollo.backplane.com (Matthew Dillon)
Date: Fri Jun 5 16:01:15 2009
Subject: WIP: ATA to CAM integration
References: <4A254B45.8050800@mavhome.dp.ua>
	<200906050703.n5573x5Q071765@apollo.backplane.com>
Message-ID: <200906051601.n55G10Mi075734@apollo.backplane.com>

More on the port multiplier spec. It turns out that the port-multiplier
port selector is in the command table, so it is per command-tag. There
is confusion in the spec though:

section 9.1:

In this mode of operation, a communication path is opened between the
HBA and a device through the Port Multiplier. Since Port Multipliers are
meant to be simple, the burden of making a connection is on the AHCI
software, to ensure that multiple commands are not outstanding to
different devices behind the Port Multiplier.

section 9.1.2:

"Since queued commands result in two different operations (command issue,
clear of BSY, then data transfer), if commands were sent to different
ports, the Port Multiplier may issue FISes back to the HBA in
an interleaved manner from different ports. This will break an HBA that
only supports command-based switching. Therefore, when executing native
command queueing commands, system software must only add commands
to the command list that target a single port behind the Port Multiplier,
wait for the commands to finish (PxSACT bits all cleared), then add
commands for a different port. Additionally, the tags used
must match the command slot entries."

--

It's unclear to me what this means. Can we use NCQ to queue multiple
commands to multiple ports behind a single port multiplier in parallel
or can't we? It's very confusing.

-Matt

From gcorcoran at rcn.com Fri Jun 5 16:34:21 2009
From: gcorcoran at rcn.com (Gary Corcoran)
Date: Fri Jun 5 16:34:33 2009
Subject: WIP: ATA to CAM integration
In-Reply-To: <200906051601.n55G10Mi075734@apollo.backplane.com>
References: <4A254B45.8050800@mavhome.dp.ua>
	<200906050703.n5573x5Q071765@apollo.backplane.com>
	<200906051601.n55G10Mi075734@apollo.backplane.com>
Message-ID: <4A294AD1.6040809@rcn.com>

Matthew Dillon wrote:
> More on the port multiplier spec. It turns out that the port-multiplier
> port selector is in the command table, so it is per command-tag. There
> is confusion in the spec though:
>
> section 9.1:
>
> In this mode of operation, a communication path is opened between the
> HBA and a device through the Port Multiplier. Since Port Multipliers are
> meant to be simple, the burden of making a connection is on the AHCI
> software, to ensure that multiple commands are not outstanding to
> different devices behind the Port Multiplier.
>
> section 9.1.2:
>
> "Since queued commands result in two different operations (command issue,
> clear of BSY, then data transfer), if commands were sent to different
> ports, the Port Multiplier may issue FISes back to the HBA in
> an interleaved manner from different ports. This will break an HBA that
> only supports command-based switching. Therefore, when executing native
> command queueing commands, system software must only add commands
> to the command list that target a single port behind the Port Multiplier,
> wait for the commands to finish (PxSACT bits all cleared), then add
> commands for a different port. Additionally, the tags used
> must match the command slot entries."
>
> --
>
> It's unclear to me what this means. Can we use NCQ to queue multiple
> commands to multiple ports behind a single port multiplier in parallel
> or can't we? It's very confusing.

As I read the above, this:
> ensure that multiple commands are not outstanding to
> different devices behind the Port Multiplier
combined with this:
> system software must only add commands
> to the command list that target a *single port* behind the Port Multiplier,
> *wait for the commands to finish*
suggests strongly that you may not send multiple commands out to a single
port multiplier. However, I agree that it's not crystal clear, and may not
be what was intended.

Gary

From mav at mavhome.dp.ua Fri Jun 5 16:54:36 2009
From: mav at mavhome.dp.ua (Alexander Motin)
Date: Fri Jun 5 16:54:43 2009
Subject: WIP: ATA to CAM integration
In-Reply-To: <200906050703.n5573x5Q071765@apollo.backplane.com>
References: <4A254B45.8050800@mavhome.dp.ua>
	<200906050703.n5573x5Q071765@apollo.backplane.com>
Message-ID: <4A294DC3.5010008@mavhome.dp.ua>

Hi.

Matthew Dillon wrote:
> The biggest issue with AHCI (and ATA) interfacing is that AHCI devices
> attach either as DISK or ATAPI. A device which attaches as a DISK
> does not typically support ATAPI commands (though it would be an
> interesting experiment to see if some did). This means that no matter
> what you do a SCSI<->ATA translation layer needs to do some significant
> fake-ups for DISK attachments, similar to what OpenBSD does in their
> SCSI<->ATA layer. ATA DISK attachments simply do not support enough
> of the SCSI command set for direct integration into CAM (IMHO).

I think an ATAPI disk device is theoretically possible, but I believe it
does not exist in practice, as the industry does not need it. It would
be a rival to the SCSI disk from the command set point of view, but with
all the ugly growth problems of the ATA interface. And I am not sure that
it would have more pros than cons compared to SCSI or plain ATA.

When I was talking about a common CAM layer, I was saying directly that
there will be _no_command_translation_ for ATA disks! There will be a
separate native ATA disk driver within the CAM infrastructure.
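
To picture what "no command translation" could look like, here is a
minimal sketch. It is not taken from the scottl-camlock branch: the
model_io structure and the model_* functions are hypothetical, and only
the two opcodes (SCSI READ(10) = 0x28, ATA READ DMA EXT = 0x25) are real.
The idea illustrated is that one request container carries either a SCSI
CDB or an ATA command, each filled in by its own native driver, so a
controller driver sees the protocol the device actually speaks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical protocol tag carried by every request. */
enum model_proto { MODEL_PROTO_SCSI, MODEL_PROTO_ATA };

/* Hypothetical request container: one of two native command formats. */
struct model_io {
	enum model_proto proto;
	union {
		struct { uint8_t cdb[10]; } scsi;
		struct { uint8_t command; uint64_t lba; uint16_t count; } ata;
	} u;
};

/* What a SCSI disk driver would build: a READ(10) CDB. */
static void
model_scsi_read(struct model_io *io, uint32_t lba, uint16_t blocks)
{
	memset(io, 0, sizeof(*io));
	io->proto = MODEL_PROTO_SCSI;
	io->u.scsi.cdb[0] = 0x28;                   /* READ(10) opcode */
	io->u.scsi.cdb[2] = (uint8_t)(lba >> 24);   /* LBA, big-endian */
	io->u.scsi.cdb[3] = (uint8_t)(lba >> 16);
	io->u.scsi.cdb[4] = (uint8_t)(lba >> 8);
	io->u.scsi.cdb[5] = (uint8_t)lba;
	io->u.scsi.cdb[7] = (uint8_t)(blocks >> 8); /* transfer length */
	io->u.scsi.cdb[8] = (uint8_t)blocks;
}

/* What a native ATA disk driver would build: no CDB is ever involved. */
static void
model_ata_read(struct model_io *io, uint64_t lba, uint16_t sectors)
{
	memset(io, 0, sizeof(*io));
	io->proto = MODEL_PROTO_ATA;
	io->u.ata.command = 0x25;                   /* READ DMA EXT */
	io->u.ata.lba = lba;
	io->u.ata.count = sectors;
}
```

In a translating design, by contrast, the disk driver would always call
something like model_scsi_read() and a lower layer would have to convert
the CDB into an ATA command, which is the approach the message rules out.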

> The second biggest issue is that it is really unclear to me how the
> hell one probes an ATAPI device for NCQ support. The OpenBSD driver
> only uses AHCI-NCQ for DISK attachments, where the NCQ support is
> returned in the IDENTIFY command. I'm sure there must be a way to
> probe an ATAPI device for NCQ support but I don't know what it is.

I have never seen the opposite, and I believe that NCQ is just not
implemented for ATAPI devices. NCQ requires ATA disk drives to use
different ATA commands, and that makes the drive behave very differently
at the FIS level. NCQ uses First Party DMA and special command completion
FISes, which IMHO are just not implemented for the ATA PACKET command.

> The OpenBSD driver does not have port multiplier support but adding
> it to the DFly driver will be pretty easy... I just need some hardware
> to test it with (it's on the way). Unfortunately the AHCI-1.0
> specification says it cannot be used for high-performance multi-disk
> I/O because all parallel commands in operation are only allowed to go
> to a single target at a time. i.e. you can't mix parallel commands
> to different targets on the same port. That's a serious problem.
>
> (Does anyone know if the AHCI-1.1 or 1.2 specifications do anything
> about that?).

The latest AHCI specifications define a feature named FIS Based
Switching. It allows the controller to independently track the state of
every device behind a port multiplier. It should be quite easy to use,
but actually none of my controllers have that capability.

> It is unclear to me from reading the specification as to whether
> AHCI-NCQ is supported for SATA ATAPI attachments. If anyone has an
> answer to that I'm looking for a way to probe the device's
> max-commands for ATAPI. For DISKs the IDENTIFY command has the
> necessary feature bits and information. I'm sure the host controller
> supports it natively but the real question is how to probe
> the capability on the attached device and whether the device(s)
> support it.
ATAPI devices have their own equivalent of the ATA IDENTIFY command.

> Ultimately the best way to expand-out an AHCI interface is with
> SCSI pass-through over ATAPI, assuming NCQ can be supported somehow.
> The port-multiplier spec is badly broken (at least in Intel's AHCI-1.0
> spec). It is a bit annoying, actually, I wouldn't have thought that
> Intel would have made such a basic mistake. All they had to do was
> implement 4 bits in the FIS and the problem would have been solved.
> Instead they have routing bits in a port register. Sigh.

The latest AHCI specifications are definitely better, but there is now a lot of hardware conforming to the early 1.0 and 1.1 revisions.

-- 
Alexander Motin

From mav at mavhome.dp.ua Fri Jun 5 16:58:33 2009
From: mav at mavhome.dp.ua (Alexander Motin)
Date: Fri Jun 5 16:58:45 2009
Subject: WIP: ATA to CAM integration
In-Reply-To: <200906051601.n55G10Mi075734@apollo.backplane.com>
References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <200906051601.n55G10Mi075734@apollo.backplane.com>
Message-ID: <4A294EAF.3080706@mavhome.dp.ua>

Matthew Dillon wrote:
> It's unclear to me what this means. Can we use NCQ to queue multiple
> commands to multiple ports behind a single port multiplier in parallel
> or can't we? It's very confusing.

As I have said, without the controller's FIS Based Switching capability it is impossible. FBS defines separate memory areas for the controller, in which it tracks the state of each drive behind the PM. Without it, only one drive can be active at a time, as the controller will not be able to track when each drive is able to receive its next command.
-- 
Alexander Motin

From mav at FreeBSD.org Fri Jun 5 17:05:44 2009
From: mav at FreeBSD.org (Alexander Motin)
Date: Fri Jun 5 17:05:50 2009
Subject: WIP: ATA to CAM integration
In-Reply-To: <4A294AD1.6040809@rcn.com>
References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <200906051601.n55G10Mi075734@apollo.backplane.com> <4A294AD1.6040809@rcn.com>
Message-ID: <4A29505E.6070902@FreeBSD.org>

Gary Corcoran wrote:
> suggests strongly that you may not send multiple commands out to a single
> port multiplier. However, I agree that it's not crystal clear, and may not
> be what was intended.

AHCI rev. 1.3:

9.3 FIS-based Switching

FIS-based switching requires the HBA to keep track of additional device specific context within each HBA port. The HBA must be able to issue commands to a device while there are still commands outstanding to other devices that are connected to the same host port through the Port Multiplier. The HBA must be able to switch DMA context on the fly; e.g. a Data FIS is received from target X, followed by a Data FIS from target X+1.

9.3.1.1.1 Enable

When FIS-based switching is enabled, the hardware shall maintain a distinct BSY/DRQ bit for up to 16 devices. These bits are distinguished in the state machine as the pBsy and pDrq arrays. Hardware shall fetch commands in such a way as to try to ensure commands are issued to all devices that have BSY/DRQ cleared to '0' and have commands in the command list. Instances of the BSY/DRQ bits are updated based on the Port Multiplier port field in Device to Host FISes.
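
The difference the quoted text makes for driver software can be sketched roughly as below. This is only a simplified model of the issue rule, not code from any real driver: the function name may_queue_command and the "outstanding" bitmask (one bit per device behind the port multiplier that has commands in flight) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PM_MAX_DEVICES 16	/* the spec text above mentions up to 16 devices */

/*
 * Hypothetical helper: may software add a command for device pm_dev
 * to the command list right now?
 *
 * outstanding: bit N set means device N behind the PM has commands
 *              that were issued and have not completed yet.
 */
static bool
may_queue_command(bool fbs_enabled, uint16_t outstanding, unsigned pm_dev)
{
	assert(pm_dev < PM_MAX_DEVICES);

	/*
	 * With FBS the HBA keeps a BSY/DRQ bit per device, so commands
	 * for different devices may be outstanding at the same time.
	 */
	if (fbs_enabled)
		return (true);

	/*
	 * Without FBS there is one context per port: software may only
	 * add commands when no other device has commands outstanding.
	 */
	return ((outstanding & (uint16_t)~(1u << pm_dev)) == 0);
}
```

In this model a non-FBS port with device 0 active accepts more commands for device 0 (as NCQ allows) but none for device 1 until device 0 goes idle.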
-- 
Alexander Motin

From dillon at apollo.backplane.com Fri Jun 5 17:28:16 2009
From: dillon at apollo.backplane.com (Matthew Dillon)
Date: Fri Jun 5 17:28:23 2009
Subject: WIP: ATA to CAM integration
References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <4A294DC3.5010008@mavhome.dp.ua>
Message-ID: <200906051728.n55HSFf0076644@apollo.backplane.com>

:Latest AHCI specifications define feature named FIS Based Switching. It
:allows controller independently track state of every device beyond port
:multiplier. It should be quite easy to use it, but actually none of my
:controllers have that capability.

Damn. The FBSS capability bit is not set on my (AMD) MCP77 based AHCI SATA controller. That sucks.

ahci0: ...
ahci0: AHCI 1.2 capabilities 0xe3229f05, 6 port

Do you know of any host controllers which support FBS? Any of the Intel parts or machines perchance?

:As I have said, without controller FIS Based Switching capability it is
:impossible. FBS defines separate memory areas for controller, to track
:there state of each drive behind PM. Without it, only one drive can be
:active at a time, as controller will not be able to track when each
:drive is able to receive next command.

Now it makes sense... the 1.0 spec only had one RFIS per port. With only one RFIS area per port it is impossible to track multiple ports behind the PM simultaneously. The 1.3 specification (along with FBSS being set) has 16 RFIS areas per port, plus BSY bits for each, thus fixing the problem.

This is really annoying. It effectively serializes access to multiple disks behind a port multiplier on non-FBSS controllers. That makes non-FBSS port-multiplied disk enclosures almost worthless from a performance standpoint.
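
For reference, the capabilities value printed above can be decoded with a few masks. This is a minimal sketch, assuming the CAP register layout as I read it in the AHCI specification (NP in bits 4:0, NCS in bits 12:8, FBSS at bit 16, SPM at bit 17, SNCQ at bit 30); the macro and function names are local to this example, not taken from any driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field positions are an assumption based on the AHCI CAP layout. */
#define CAP_NP_MASK	0x1fu		/* number of ports, zero-based */
#define CAP_NCS_SHIFT	8
#define CAP_NCS_MASK	0x1fu		/* command slots, zero-based */
#define CAP_FBSS	(1u << 16)	/* FIS-based switching supported */
#define CAP_SPM		(1u << 17)	/* port multiplier supported */
#define CAP_SNCQ	(1u << 30)	/* native command queuing supported */

/* Number of ports the HBA reports (field is stored minus one). */
static unsigned
cap_ports(uint32_t cap)
{
	return ((cap & CAP_NP_MASK) + 1);
}

/* Number of command slots per port (field is stored minus one). */
static unsigned
cap_slots(uint32_t cap)
{
	return (((cap >> CAP_NCS_SHIFT) & CAP_NCS_MASK) + 1);
}

/* Test a single capability flag such as CAP_FBSS. */
static bool
cap_has(uint32_t cap, uint32_t flag)
{
	assert(flag != 0);
	return ((cap & flag) != 0);
}
```

Applied to 0xe3229f05 this gives 6 ports (matching the dmesg line), 32 command slots, NCQ and port multiplier support present, and FBSS clear.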
-Matt Matthew Dillon From scottl at samsco.org Fri Jun 5 18:08:50 2009 From: scottl at samsco.org (Scott Long) Date: Fri Jun 5 18:08:56 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <200906050703.n5573x5Q071765@apollo.backplane.com> References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> Message-ID: <4A2956C6.5070902@samsco.org> Matthew Dillon wrote: > :Hi. > : > :After replying to several similar questions about my ATA plans last > :time, I have decided to announce things I am working on now together > :with Scott Long. > : > :While learning FreeBSD ATA implementation, I have found, that it has > :numerous deep problems, from quite fuzzy APIs to different issues in > :device detection and error recovery. Also, as soon as this > :infrastructure was written many years ago, it has completely no support > :for any kind of command queuing, which is normal for the most of modern > :drives and controllers. Fixing all of this require many significant changes. > : > :Also, you may know, that SAS controllers and expanders allow attaching > :SATA devices and port multipliers to them, by transporting ATA commands > :... > :project of making CAM a system's universal infrastructure for both SCSI > :and ATA. This project is not about some kind of SCSI-to-ATA translation, > :used by some OS, like OpenBSD. It is about extending CAM, to equally > :support both SCSI and ATA worlds natively and integrate them as tight as > :possible. > : > :... > :Our code now lives in PERFORCE in scottl-camlock project branch. It is > :in early development, but we already have there working CAM driver for > :AHCI controller with command queuing and basic NCQ support, simple SATA > :bus management code and ATA disk driver. I am able now to boot my system > :and work from SATA drive on AHCI controller, using ATA disk driver, > :having command queuing and NCQ enabled, read and write disks with SATA > :ATAPI DVD-RW drive, using native SCSI CD driver. 
And all of that only > :with CAM, without using any part of ATA infrastructure. > : > :-- > :Alexander Motin > > The biggest issue with AHCI (and ATA) interfacing is that AHCI devices > attach either as DISK or ATAPI. A device which attaches as a DISK > does not typically support ATAPI commands (though it would be an > interesting experiment to see if some did). This means that no matter > what you do a SCSI<->ATA translation layer needs to do some significant > fake-ups for DISK attachments, similar to what OpenBSD does in their > SCSI<->ATA layer. ATA DISK attachments simply do not support enough > of the SCSI command set for direct integration into CAM (IMHO). CAM is being expanded to be a framework for scheduling, recovery, and topology management, agnostic to the transport and protocol being used. SPI and SCSI are being separated into transport and protocol modules, and Alexander has been amazing and kind enough to start a SATA transport and ATA protocol module. Unlike Linux, OpenBSD, or anything else out there, this is not a tacked-on library for speaking SCSI/SPI at the top level and then translating it to something else at a lower level. This is about speaking native SBP/RBP/ATA at the periph level and native SPI/PATA/SATA/FCAL/SMP/USB at the transport level. So, before you continue to cast ignorant doubts on our approach and hawk your incomplete wares, please at least look at what is being done on our end, and make an attempt to ask some reasonable questions. 
Scott From julian at elischer.org Fri Jun 5 18:51:56 2009 From: julian at elischer.org (Julian Elischer) Date: Fri Jun 5 18:52:03 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <4A2956C6.5070902@samsco.org> References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <4A2956C6.5070902@samsco.org> Message-ID: <4A29694B.2090606@elischer.org> Scott Long wrote: > > So, before you continue to cast ignorant doubts on our approach and hawk > your incomplete wares, please at least look at what is being done on our > end, and make an attempt to ask some reasonable questions. > I think that was a little of an over-reaction, and uncalled for.. Matt's tone was very friendly. > Scott > _______________________________________________ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org" From dillon at apollo.backplane.com Fri Jun 5 19:17:40 2009 From: dillon at apollo.backplane.com (Matthew Dillon) Date: Fri Jun 5 19:17:54 2009 Subject: WIP: ATA to CAM integration References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <4A2956C6.5070902@samsco.org> Message-ID: <200906051917.n55JHcLO077306@apollo.backplane.com> :CAM is being expanded to be a framework for scheduling, recovery, and :topology management, agnostic to the transport and protocol being used. :SPI and SCSI are being separated into transport and protocol modules, :and Alexander has been amazing and kind enough to start a SATA transport :and ATA protocol module. Unlike Linux, OpenBSD, or anything else out :there, this is not a tacked-on library for speaking SCSI/SPI at the top :level and then translating it to something else at a lower level. This :is about speaking native SBP/RBP/ATA at the periph level and native :SPI/PATA/SATA/FCAL/SMP/USB at the transport level. 
: :So, before you continue to cast ignorant doubts on our approach and hawk :your incomplete wares, please at least look at what is being done on our :end, and make an attempt to ask some reasonable questions. : :Scott Huh. Get up on the wrong side of the bed, Scott? Just remember who started making the shit comments this time. I have no interest with what FreeBSD is doing with CAM. My only interest vis-a-vie this thread are the AHCI driver efforts by the various BSDs, including ours. In particular, my interest is in NCQ, hot-plug support, and port multiplier support, as these three items can put SATA/AHCI on-par with dedicated SCSI controllers (at least once AHCI hardware revs past the original spec). It is something very important to all Open-Source OS projects as it consolidates the storage device driver spec and removes a huge thorn in the sides of all the projects with regards to the forward-support of new hardware. -- IMHO the only SCSI command non-SCSI devices have to fake-up is IDENTIFY. Everything else is a straight translation, and an easy one at that. Even SENSE doesn't need to be faked-up all that much, one just sets an AUTOSENSE flag bit and include the sense in the CCB. So interfacing to CAM is not really a big deal. The SCSI command set is the only cross-device portable command set that exists today, after all. The only problem I have with the original CAM is that it didn't use a dedicated thread for bus-reset/probe/scan/identify/attachment and detachment. That's the only reason the original API was such a bitch to deal with by device drivers. Fixing that fixes all the device interaction issues for attachment/detachment. The API doesn't actually change, but the recursive nature of the direct calls goes away and greatly simplifies device drivers. That's the only thing I see wrong with CAM, frankly. 
So I applaud your efforts on cleaning up the attach/detach stuff in FreeBSD, but it isn't my focus in this thread and not something I'm interested in doing for the DragonFly project, beyond what I mentioned above. Your comments are improper. -Matt Matthew Dillon From scottl at samsco.org Fri Jun 5 19:27:19 2009 From: scottl at samsco.org (Scott Long) Date: Fri Jun 5 19:27:26 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <4A29694B.2090606@elischer.org> References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <4A2956C6.5070902@samsco.org> <4A29694B.2090606@elischer.org> Message-ID: <4A297187.2090701@samsco.org> Julian Elischer wrote: > Scott Long wrote: > >> >> So, before you continue to cast ignorant doubts on our approach and hawk >> your incomplete wares, please at least look at what is being done on our >> end, and make an attempt to ask some reasonable questions. >> > > I think that was a little of an over-reaction, and uncalled for.. > Matt's tone was very friendly. > > Every one of Matt's emails follow this formula: 1. I don't know how FOO works, but how you're doing it is wrong 2. Here's how I think FOO should work 3. I'm going to work on FOO in DragonFlyBSD and have it done in a week. 1 and 2 are ignorant and worthless in a technical conversation, and 3 is off topic for FreeBSD mailing lists. So yes, I'm calling him out. 
Scott From scottl at samsco.org Fri Jun 5 19:28:38 2009 From: scottl at samsco.org (Scott Long) Date: Fri Jun 5 19:28:50 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <200906051917.n55JHcLO077306@apollo.backplane.com> References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <4A2956C6.5070902@samsco.org> <200906051917.n55JHcLO077306@apollo.backplane.com> Message-ID: <4A2971DC.9060608@samsco.org> Matthew Dillon wrote: > :CAM is being expanded to be a framework for scheduling, recovery, and > :topology management, agnostic to the transport and protocol being used. > :SPI and SCSI are being separated into transport and protocol modules, > :and Alexander has been amazing and kind enough to start a SATA transport > :and ATA protocol module. Unlike Linux, OpenBSD, or anything else out > :there, this is not a tacked-on library for speaking SCSI/SPI at the top > :level and then translating it to something else at a lower level. This > :is about speaking native SBP/RBP/ATA at the periph level and native > :SPI/PATA/SATA/FCAL/SMP/USB at the transport level. > : > :So, before you continue to cast ignorant doubts on our approach and hawk > :your incomplete wares, please at least look at what is being done on our > :end, and make an attempt to ask some reasonable questions. > : > :Scott > > Huh. Get up on the wrong side of the bed, Scott? Just remember who > started making the shit comments this time. > > I have no interest with what FreeBSD is doing with CAM. If you have no interest with what FreeBSD is doing with CAM, then your discussion is off topic for this thread and this mailing list. Please take it somewhere more appropriate. 
Scott From dillon at apollo.backplane.com Fri Jun 5 19:48:46 2009 From: dillon at apollo.backplane.com (Matthew Dillon) Date: Fri Jun 5 19:48:57 2009 Subject: WIP: ATA to CAM integration References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <4A2956C6.5070902@samsco.org> <4A29694B.2090606@elischer.org> <4A297187.2090701@samsco.org> Message-ID: <200906051948.n55Jmh8X077810@apollo.backplane.com> :Every one of Matt's emails follow this formula: : :1. I don't know how FOO works, but how you're doing it is wrong :2. Here's how I think FOO should work :3. I'm going to work on FOO in DragonFlyBSD and have it done in a week. : :1 and 2 are ignorant and worthless in a technical conversation, and 3 is :off topic for FreeBSD mailing lists. So yes, I'm calling him out. : :Scott Well, so far about the only one talking shit here is you, Scott. Insofar as 3. goes, I already provided references to that code, because the purpose is to share code. I wonder if you even bothered to look at it. Judging from your comments, I guess not. It isn't quite in the decrepit shape you make it out to be. I mean, come on, not even the ATA driver has hot-plug support (not without crashing, anyhow), and I would not characterize IT as being decrepit! It was simply a code reference, along with OpenBSD's code reference. For anyone writing a driver having multiple code references is always beneficial, it saves a lot of time and puts things in context. After all, OpenBSD's and Linux's AHCI driver is what really opened up the space for the rest of us. OpenBSD saved me at least 100 man-hours of work and Alex just now saved me another 8 or 9 at the cost of a 5 minute email. That seems to be a good use of the mailing lists in my view. I'm not above asking other driver writers for information that I do not have complete knowledge of. 
I learned a lot from Alexander Motin's posting with regards to port multipliers, enough that I am now quite comfortable in my ability to add the feature in my own work. Clearly I am not totally deficient in my knowledege since I actually did port the OpenBSD driver to DragonFly. As far as I can tell, I haven't said anything about anyone doing it wrong, certainly not with the tone your extremely jaded portrayal seems to be applying to my posting. I have my opinion and you have yours, but that's all it is... an opinion. My opinion is that the only portable device I/O command set available in the world today is the SCSI command set. There is no ulterior motive, it's just an opinion. I guess it is in the eye of the beholder. -Matt From brooks at freebsd.org Fri Jun 5 22:36:27 2009 From: brooks at freebsd.org (Brooks Davis) Date: Fri Jun 5 22:36:34 2009 Subject: RFT: Allow large values of NGROUPS_MAX Message-ID: <20090605223636.GA24364@lor.one-eyed-alien.net> Skipped content of type multipart/mixed-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 187 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090605/29b1c6be/attachment.pgp From peterjeremy at optushome.com.au Sat Jun 6 01:47:58 2009 From: peterjeremy at optushome.com.au (Peter Jeremy) Date: Sat Jun 6 01:48:05 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <4A297187.2090701@samsco.org> References: <4A254B45.8050800@mavhome.dp.ua> <200906050703.n5573x5Q071765@apollo.backplane.com> <4A2956C6.5070902@samsco.org> <4A29694B.2090606@elischer.org> <4A297187.2090701@samsco.org> Message-ID: <20090606014745.GC9161@server.vk2pj.dyndns.org> On 2009-Jun-05 13:27:03 -0600, Scott Long wrote: >Every one of Matt's emails follow this formula: > >1. I don't know how FOO works, but how you're doing it is wrong >2. Here's how I think FOO should work >3. 
I'm going to work on FOO in DragonFlyBSD and have it done in a week. > >1 and 2 are ignorant and worthless in a technical conversation, and 3 is >off topic for FreeBSD mailing lists. So yes, I'm calling him out. Well, I have seen little evidence of 1. 2 _can_ be relevant in an architectural discussion. 3 may or may not be relevant - it depends whether FreeBSD can make use of the work or not. I agree with Julian that you are being unnecessarily provocative. IMHO, Matt has made a positive contribution to this thread. Rather than quoting him out of context, how about you take a chill pill and consider what benefits FreeBSD can gain by reusing work done by other free Unices. -- Peter Jeremy -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090606/5b20014f/attachment.pgp From brde at optusnet.com.au Sat Jun 6 10:31:56 2009 From: brde at optusnet.com.au (Bruce Evans) Date: Sat Jun 6 10:32:03 2009 Subject: RFT: Allow large values of NGROUPS_MAX In-Reply-To: <20090605223636.GA24364@lor.one-eyed-alien.net> References: <20090605223636.GA24364@lor.one-eyed-alien.net> Message-ID: <20090606184344.N16744@delplex.bde.org> On Fri, 5 Jun 2009, Brooks Davis wrote: > I've been working on fixing the limit of 15 groups per user. This has > primarily consisted of merging a patch from Isilon Systems which breaks > the group array out of struct ucred and stores in in malloc'd storage. > > I've attached a patch which includes that change and increases the value > of NGROUPS_MAX to 32767. It also changes references to cr_groups[0] to > use the cr_gid macro ... I don't like this macro. At least the implementation should keep using the unobfuscated reference. kern_prot.c consistently didn't use the obfuscation except once in a comment. 
> Before any merge a couple decisions need to be made: > > - How large should NGROUPS_MAX be? Linux uses 65536 and we could > extend things to support that, but it's probably unnecessary. 32767 seems excessive too. Hopefully NGROUPS[_MAX] is used only in broken unconverted applications so a large value rarely wastes space. > - Should we make any attempt to support old binaries when there > are more than 16 groups? There is no way to support broken old binaries. They will have a limit of 16. > The POSIX getgroups/setgroups APIs did not > anticipate this change and thus either will fail outright. No, as you explained in previous mail, POSIX anticipated this perfectly unusably by making NGROUPS_MAX a Runtime Increasable Value despite it being constant. This puts the burden of not failing on applications. Only broken apps that don't do any of dynamic allocation using sysconf(_SC_NGROUPS_MAX), or dynamic allocation using getgroups(0, ...), or dynamic allocation after failing using getgroups(old_NGROUPS_MAX, ...) will fail outright. > We can't > fix setgroups, but we might want to make an optional accommodation for > getgroups to allow for truncated returns to old code. If done unconditionally, this would break the unbroken apps that do dynamic allocation after failing using getgroups(old_NGROUPS_MAX, ...). These apps depend on getgroups() failing with the Standard and documented errno EINVAL when the `gidsetlen' parameter is too small. Maybe some per-app or per-environment branding could be used to configure not generating an error. You would apply it only where security is not very important. getgrouplist(3) guarantees filling of the array when the array is too small. getgroups(2) doesn't guarantee anything, and in fact doesn't fill the array. It wouldn't hurt to fill it. % ... 
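
The "unbroken" application pattern mentioned above (size the buffer with getgroups(0, ...) instead of relying on a compile-time NGROUPS_MAX) might look roughly like this. It is only a sketch: the helper name get_all_groups is made up for illustration, and the retry on EINVAL assumes the documented behaviour of failing when the supplied array is too small.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return a malloc'd array holding the caller's
 * group list and store its length in *countp.  Returns NULL on error.
 * The caller frees the result.
 */
static gid_t *
get_all_groups(int *countp)
{
	gid_t *list;
	int n, got;

	assert(countp != NULL);
	for (;;) {
		/* A zero gidsetlen asks only for the number of groups. */
		n = getgroups(0, NULL);
		if (n < 0)
			return (NULL);
		list = malloc((size_t)(n > 0 ? n : 1) * sizeof(*list));
		if (list == NULL)
			return (NULL);
		if (n == 0) {
			*countp = 0;
			return (list);
		}
		got = getgroups(n, list);
		if (got >= 0) {
			*countp = got;
			return (list);
		}
		free(list);
		/* EINVAL: the list grew between the two calls; retry. */
		if (errno != EINVAL)
			return (NULL);
	}
}
```

Code written this way never consults NGROUPS_MAX at compile time, so it should keep working whatever value is chosen for the limit.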
% diff -ru --exclude='.glimpse*' --exclude=.svn --exclude=compile --ignore-matching='$FreeBSD' /usr/src/sys/kern/kern_prot.c ngroups/sys/kern/kern_prot.c % --- /usr/src/sys/kern/kern_prot.c 2009-06-05 15:33:50.000000000 -0500 % +++ ngroups/sys/kern/kern_prot.c 2009-06-05 16:02:28.000000000 -0500 % @@ -243,16 +246,11 @@ % % td->td_retval[0] = td->td_ucred->cr_rgid; % #if defined(COMPAT_43) % - td->td_retval[1] = td->td_ucred->cr_groups[0]; % + td->td_retval[1] = td->td_ucred->cr_gid; % #endif % return (0); % } % % -/* % - * Get effective group ID. The "egid" is groups[0], and could be obtained % - * via getgroups. This syscall exists because it is somewhat painful to do % - * correctly in a library function. % - */ % #ifndef _SYS_SYSPROTO_H_ % struct getegid_args { % int dummy; This comment still seems to apply. I wonder if you noticed and/or fixed the related off-by-1 errors in getgroups() and {NGROUPS_MAX}: POSIX.1-2001-draft7 says: % 16834 IDs of the calling process. It is implementation-defined whether getgroups( ) also returns the % 16835 effective group ID in the grouplist array. FreeBSD does return the egid in the array, always as the first element. Neither of these details is documented in getgroups.2 or setgroups.2. Even the implementation doesn't seem to understand this where it is most important -- the implementation of setgroups() seems to be missing some of the things done by setegid(). getgrouplist(3) has explicit support for the egid being in both the initial and final lists (or missing in the initial list?). % ... % 16841 If the effective group ID of the process is returned with the supplementary group IDs, the value % 16842 returned shall always be greater than or equal to one and less than or equal to the value of % 16843 {NGROUPS_MAX}+1. This and probably other things indicate that _supplementary_ actually means supplementary -- it doesn't count the egid. But in FreeBSD, the egid is counted. 
This gives 2 off-by-1 errors that partially compensate for each other: - NGROUPS_MAX is 15, not 16, since it is impossible to have 16 _supplementary_ groups in cr_groups[0..15] after using cr_groups[0] for the egid - getgroups() cannot return {NGROUPS_MAX}+1 as is needed for it to return the egid plus {NGROUPS_MAX} supplementary groups - getgroups() does return {NGROUPS_MAX}+1 once {NGROUPS_MAX} is corrected. Some userland code depends on these bugs -- the static array size should be NGROUPS_MAX + 1, but it is typically plain NGROUPS. OTOH, libc/gen/initgroups.c uses NGROUPS + 1 and getgrouplist() followed by setgroups(). I think getgrouplist() can produce the extra 1 and then setgroups() and thus initgroups() will fail. I thought again about making cr_gid separate from the array, so that it can be named cr_gid without obfuscation. This wouldn't be good since it would complicate more than the copyin/out in get/setgroups() -- there are a few temporary gid arrays, and some iterations over the arrays that want to see the egid. Bruce From phk at phk.freebsd.dk Sat Jun 6 23:01:25 2009 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Sat Jun 6 23:01:31 2009 Subject: WIP: ATA to CAM integration In-Reply-To: Your message of "Fri, 05 Jun 2009 19:54:27 +0300." <4A294DC3.5010008@mavhome.dp.ua> Message-ID: <6657.1244328220@critter.freebsd.dk> In message <4A294DC3.5010008@mavhome.dp.ua>, Alexander Motin writes: >I think ATAPI disk device is theoretically possible, but I believe it >does not exist in practice, as industry do not need it. Maxtor ZIP ? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From mav at FreeBSD.org Sat Jun 6 23:15:14 2009 From: mav at FreeBSD.org (Alexander Motin) Date: Sat Jun 6 23:15:20 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <6657.1244328220@critter.freebsd.dk> References: <6657.1244328220@critter.freebsd.dk> Message-ID: <4A2AF876.1030103@FreeBSD.org> Poul-Henning Kamp wrote: > In message <4A294DC3.5010008@mavhome.dp.ua>, Alexander Motin writes: >> I think ATAPI disk device is theoretically possible, but I believe it >> does not exist in practice, as industry do not need it. > > Maxtor ZIP ? May be, never had an ATA version. But it is more FDD, then HDD. Also it existed in Parallel Port and SCSI versions, so it could be done in ATAPI way just for unification. -- Alexander Motin From dillon at apollo.backplane.com Sat Jun 6 23:33:20 2009 From: dillon at apollo.backplane.com (Matthew Dillon) Date: Sat Jun 6 23:33:27 2009 Subject: WIP: ATA to CAM integration References: <6657.1244328220@critter.freebsd.dk> <4A2AF876.1030103@FreeBSD.org> Message-ID: <200906062333.n56NXH6a090341@apollo.backplane.com> I found the ATAPI_C_ATAPI_IDENTIFY command that was mentioned and it works fine, returning the same sort of information for ATAPI attachments that ATA_C_IDENTIFY returns for DISK attachments. That takes care of the queue length negotiation by the device. However, there is no fis->command that I can find that would allow NCQ to operate in ATAPI mode. In ATAPI mode fis->command is typically set to ATA_C_PACKET. In DISK mode fis->command is set to ATA_C_READ_FPDMA or ATA_C_WRITE_FPDMA (the first-person DMA mode used by AHCI's NCQ). So unless the *_FPDMA FIS commands work for an ATAPI attached device, we are S.O.L. Section 5.6.4.1: The ATA/ATAPI-7 queued feature set is not supported by AHCI (including READ QUEUED (EXT), WRITE QUEUED (EXT), and SERVICE commands). Queued operations are supported in AHCI using the READ FPDMA QUEUED and WRITE FPDMA QUEUED commands when the HBA and device support native command queueing. 
It is unclear whether an ATAPI device would accept a non-packet command, aka ATA_C_READ_FPDMA or ATA_C_WRITE_FPDMA, instead of ATA_C_PACKET. ATAPI devices do support the ATAPI_C_ATAPI_IDENTIFY command, which is non-packet command, so maybe its possible. If it is possible it would only work for READ and WRITE commands, since those are the only commands the FPDMA modes can be used for. The AHCI spec doesn't explicitly say that the FPDMA commands would not work for an ATAPI attached device, so there's hope. What we need is a SATA ATAPI device which says it supports NCQ + has a queue length > 1 to test with. -Matt Matthew Dillon From mav at FreeBSD.org Mon Jun 8 09:15:39 2009 From: mav at FreeBSD.org (Alexander Motin) Date: Mon Jun 8 09:15:52 2009 Subject: Multiple MSI on SMP, misrouting or misunderstanding? Message-ID: <4A2CD6AC.80407@FreeBSD.org> Hi. While experimenting with using multiple MSIs support on AHCI controller I have got the problem. When system boots as UP - everything is fine, driver allocates all available 16 MSIs and works. But when system booted as SMP, interrupts begin to behave strange: I didn't receive expected AHCI IRQs, but instead receive IRQ1 interrupts of atkbd0, while I have no PS/2 keyboard/mouse attached. As I have found, problem appears due to IRQ rebalancing between CPUs. As I have got, MSI requires that all vectors from the same group to be allocated sequentially, but IRQ rebalancing breaks correct order, that happed during initial allocation. I was quite surprised by this issue. If multiple MSI vectors of the same device have to be allocated sequentially and bound to the same CPU, then they will be unable to give any SMP scalability benefits. Am I right, or there is some special technique expected to be used to somehow distribute grouped MSI vectors between CPUs which we don't have? I have made small patch that denies rebalancing for grouped MSIs, to make them work at least somehow. 
It works fine for me, but I am not sure that it is the best solution. -- Alexander Motin -------------- next part -------------- --- msi.c.prev 2009-06-08 11:30:13.000000000 +0300 +++ msi.c 2009-06-08 11:30:06.000000000 +0300 @@ -210,6 +210,8 @@ msi_assign_cpu(struct intsrc *isrc, u_in old_id = msi->msi_cpu; if (old_vector && old_id == apic_id) return; + if (old_vector && !msi->msi_msix && msi->msi_first->msi_count > 1) + return; /* Allocate IDT vector on this cpu. */ vector = apic_alloc_vector(apic_id, msi->msi_irq); if (vector == 0) From bugmaster at FreeBSD.org Mon Jun 8 11:06:49 2009 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Jun 8 11:07:34 2009 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200906081106.n58B6mG2020550@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From jhb at freebsd.org Mon Jun 8 15:33:59 2009 From: jhb at freebsd.org (John Baldwin) Date: Mon Jun 8 15:34:47 2009 Subject: Multiple MSI on SMP, misrouting or misunderstanding? In-Reply-To: <4A2CD6AC.80407@FreeBSD.org> References: <4A2CD6AC.80407@FreeBSD.org> Message-ID: <200906081116.40462.jhb@freebsd.org> On Monday 08 June 2009 5:15:24 am Alexander Motin wrote: > Hi. > > While experimenting with using multiple MSIs support on AHCI controller > I have got the problem. When system boots as UP - everything is fine, > driver allocates all available 16 MSIs and works. 
But when system booted > as SMP, interrupts begin to behave strangely: I didn't receive expected > AHCI IRQs, but instead receive IRQ1 interrupts of atkbd0, while I have > no PS/2 keyboard/mouse attached. > > As I have found, problem appears due to IRQ rebalancing between CPUs. As > I understand it, MSI requires all vectors from the same group to be > allocated sequentially, but IRQ rebalancing breaks correct order, that > happened during initial allocation. > > I was quite surprised by this issue. If multiple MSI vectors of the same > device have to be allocated sequentially and bound to the same CPU, then > they will be unable to give any SMP scalability benefits. Am I right, or > is there some special technique expected to be used to somehow > distribute grouped MSI vectors between CPUs which we don't have? > > I have made a small patch that denies rebalancing for grouped MSIs, to > make them work at least somehow. It works fine for me, but I am not sure > that it is the best solution. It is a limitation of MSI. With MSI, you have a single address register for the entire group of messages (the individual messages are just distinguished by toggling the lower N bits in the message data register). On x86 the address register includes the APIC ID. That means that all of the messages get sent to the same CPU. With MSI-X, there is a table with separate address and data registers for each message. This allows a driver to distribute interrupts across CPUs. I had old patches prior to the per-CPU IDT stuff to handle this quirk of MSI groups. The approach I used there was that I would only allow reassigning of the entire group by assigning to the first interrupt in the group. With per-CPU IDTs that gets trickier though, as you need to allocate a whole block of aligned, consecutive IDT vectors on the new CPU.
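The MSI group limitation described above can be illustrated with a small user-space sketch. It is not code from msi.c: the toy_* names are hypothetical, and the only facts assumed are the x86 message address layout (0xFEE prefix in bits 31:20, destination APIC ID in bits 19:12) and that a device selects a message by changing the low bits of the data value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build the x86 MSI message address for one
 * destination CPU. A plain MSI group has only one such address, so
 * every message in the group is delivered to the same APIC ID.
 */
uint32_t
toy_msi_addr(uint8_t apic_id)
{
	return (0xFEE00000u | ((uint32_t)apic_id << 12));
}

/*
 * Hypothetical helper: the IDT vector signalled for message 'msg' of a
 * group of 'count' messages. The device only rewrites the low
 * log2(count) bits of the data value, so the group needs an aligned
 * block of 'count' consecutive vectors.
 */
uint8_t
toy_msi_group_vector(uint8_t base_vector, unsigned count, unsigned msg)
{
	/* MSI groups hold 1, 2, 4, 8, 16 or 32 messages. */
	assert(count != 0 && count <= 32 && (count & (count - 1)) == 0);
	assert(msg < count);
	/* The base vector must be aligned to the group size. */
	assert((base_vector & (count - 1)) == 0);
	return ((uint8_t)(base_vector | msg));
}
```

Moving a single message of the group to another CPU is not expressible in this model, because toy_msi_addr() is computed once per group; that is the reason the whole block has to move together, while an MSI-X table would carry one address and data pair per message.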
-- John Baldwin From das at FreeBSD.ORG Mon Jun 8 19:16:54 2009 From: das at FreeBSD.ORG (David Schultz) Date: Mon Jun 8 19:17:00 2009 Subject: RFT: Allow large values of NGROUPS_MAX In-Reply-To: <20090605223636.GA24364@lor.one-eyed-alien.net> References: <20090605223636.GA24364@lor.one-eyed-alien.net> Message-ID: <20090608185122.GA65737@zim.MIT.EDU> On Fri, Jun 05, 2009, Brooks Davis wrote: > - Should we make any attempt to support old binaries when there > are more than 16 groups? The POSIX getgroups/setgroups APIs did not > anticipate this change and thus either will fail outright. We can't > fix setgroups, but we might want to make an optional accommodation for > getgroups to allow for truncated returns to old code. Awesome. I think the ABI breakage is fine as long as it only affects systems where users are actually in more than 16 groups. It's perfectly reasonable to expect people to recompile in order to take advantage of a new feature. As for the value of NGROUPS_MAX, there are systems with more than 32k groups out there, but I doubt there are interesting cases where a single user is a member of more than 32k groups. The permission checking code would not realistically scale to group lists that long anyway. From attilio at freebsd.org Mon Jun 8 21:04:28 2009 From: attilio at freebsd.org (Attilio Rao) Date: Mon Jun 8 21:04:34 2009 Subject: [PATCH] Adaptive spinning for lockmgr Message-ID: <3bbf2fe10906081342i6ef418e0n75e22d0b9e2543b3@mail.gmail.com> This patch enables adaptive spinning for lockmgr: http://www.freebsd.org/~attilio/adaptive_lockmgr.diff and it should presumably improve performance on disks/vfs/buffer cache based benchmarks, so, if you want to try out and report any benchmarks result, I'd love to see it. 
Please note that there are some parameters to tune: for example, you may prefer not to enable adaptive spinning by default and instead enable it only for a class of locks (in which case you would reverse the logic the patch uses now), or you may want to use different values for retries and loops. Interested developers can refer to the 3 variables involved. Peter Holm already tested that patch for about 24 hours without any regression to report. Also note that the patch is not 100% complete yet, as it still needs UPDATES and manpage updates, but they will be added in time for the commit. The code changes are all there. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From kostikbel at gmail.com Tue Jun 9 08:50:25 2009 From: kostikbel at gmail.com (Kostik Belousov) Date: Tue Jun 9 08:50:33 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: References: Message-ID: <20090609082643.GB75569@deviant.kiev.zoral.com.ua> On Thu, Jun 04, 2009 at 10:46:42AM +0100, Robert Watson wrote: > On Wed, 3 Jun 2009, Jeff Roberson wrote: > > >I have not tested anything other than amd64. If you have a !amd64 > >architecture, in particular any of the embedded architectures, I would > >really appreciate it. Some of the arm boards postincrement the end > >address to allocate early memory and some pre-decrement. Hopefully I got > >it right.
> > I appear to get an instant reboot early during the kernel startup on i386 > with this patch applied: > > OK lsmod > 0x400000: /boot/kernel/kernel (elf kernel, 0xcd8920) > modules: elink.1 io.1 hptrr.1 ufs.1 kernel_mac_support.4 krpc.1 > nfslockd.1 nfssvc.1 nfsserver.1 nfslock.1 nfs.1 wlan_sta.1 wlan.1 > wlan_wep.1 wlan_tkip.1 wlan_ccmp.1 wlan_amrr.1 if_gif.1 if_firewire.1 > if_faith.1 ether.1 sysvshm.1 sysvsem.1 sysvmsg.1 firmware.1 kernel.800096 > cd9660.1 isa.1 pseudofs.1 procfs.1 msdosfs.1 usb_quirk.1 ucom.1 uvscom.1 > uslcom.1 uplcom.1 uether.1 cdce.1 usb.1 random.1 ppbus.1 pci.1 pccard.1 > null.1 mpt_user.1 mpt_raid.1 mpt.1 mpt_cam.1 mpt_core.1 miibus.1 mem.1 > isp.1 sbp.1 fwip.1 fwe.1 firewire.1 splash.1 exca.1 dcons.2 dcons_crom.1 > cardbus.1 bt.1 ath.1 ast.1 afd.1 acd.1 ataraid.1 ad.1 ata_via.1 ata_sis.1 > ata_sii.1 ata_serverworks.1 ata_promise.1 ata_nvidia.1 ata_netcell.1 > ata_national.1 ata_micron.1 ata_marvell.1 ata_jmicron.1 ata_ite.1 > ata_intel.1 ata_highpoint.1 ata_cyrix.1 ata_cypress.1 ata_cenatek.1 > ata_ati.1 ata_amd.1 ata_adaptec.1 ata_ali.1 ata_acard.1 ata_ahci.1 atapci.1 > ata.1 ahc.1 ahd.1 ahd_pci.1 ahc_pci.1 ahc_isa.1 ahc_eisa.1 agp.1 acpi_pci.1 > acpi.1 scsi_low.1 cam.1 > OK boot -s > The reason for the reboot is the fact that memory after physfree is not mapped, and init386() tries to carve a piece of it for dpcpu for BSP. Since IDT/exceptions are not initialized yet, generated #pf is translated into triple fault. Please, see the patch at http://people.freebsd.org/~kib/misc/dcpu.1.patch that allocates area for dpcpu0 using the same technique as the memory for proc0kstack. Another minor issue is that per-cpu sysmaps were allocated without clearing corresponding ptes, causing panic in pmap_zero_page etc due to changed KVA layout. AMD64 is fine thanks to the direct map. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090609/f02ef29c/attachment.pgp From marius at alchemy.franken.de Tue Jun 9 20:37:49 2009 From: marius at alchemy.franken.de (Marius Strobl) Date: Tue Jun 9 20:37:56 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: References: Message-ID: <20090609201127.GA50903@alchemy.franken.de> On Wed, Jun 03, 2009 at 08:55:39PM -1000, Jeff Roberson wrote: > http://people.freebsd.org/~jeff/dpcpu.diff > > This patch implements dynamic per-cpu areas such that kernel code can do > the following in a header: > > DPCPU_DECLARE(uint64_t, foo); > > and this in source: > > DPCPU_DEFINE(uint64_t, foo) = 10; > > local = DPCPU_GET(foo); > DPCPU_SET(foo, 11); > > The dynamic per-cpu area of non-local cpus is accessible via > DPCPU_ID_{GET,SET,PTR}. > > If you provide an initializer as I used above that will be the default > value when all cpus come up. Otherwise it defaults to zero. This is > presently slightly more expensive than PCPU but much more flexible. > Things like id and curthread should stay in PCPU forever. > > I had to change the pcpu_init() call on every architecture to pass in > storage for the dynamic area. I didn't change the following three calls > because it wasn't immediately obvious how to allocate the memory: > > ./powerpc/booke/machdep.c: pcpu_init(pc, 0, sizeof(struct pcpu)); > ./mips/mips/machdep.c: pcpu_init(&__pcpu[0], 0, sizeof(struct pcpu)); > ./mips/mips/machdep.c: pcpu_init(pcpup, 0, sizeof(struct pcpu)); > > > I have not tested anything other than amd64. If you have a !amd64 > architecture, in particular any of the embedded architectures, I would > really appreciate it. Some of the arm boards postincrement the end > address to allocate early memory and some pre-decrement. Hopefully I got > it right.
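The quoted description of dynamic per-CPU areas can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: the TOY_* macros and toy_* names are hypothetical stand-ins, not the DPCPU_* macros from the patch, and the real implementation presumably relies on a linker section rather than a struct. It shows the two properties the text states: each CPU gets its own copy seeded from the initializer, and another CPU's copy can be reached by id.

```c
#include <stdint.h>
#include <string.h>

#define	TOY_NCPU	4	/* assumed CPU count for the model */

/* Template holding the initial values, like "DPCPU_DEFINE(...) = 10". */
struct toy_dpcpu {
	uint64_t foo;	/* has an initializer */
	uint32_t bar;	/* no initializer, so it defaults to zero */
};

static const struct toy_dpcpu toy_template = { .foo = 10 };

/* One private copy of the template per CPU. */
static struct toy_dpcpu toy_area[TOY_NCPU];

/* Called once per CPU as it comes up: copy the template into its area. */
void
toy_dpcpu_init(int cpu)
{
	memcpy(&toy_area[cpu], &toy_template, sizeof(toy_template));
}

/* Accessors by CPU id, modelled on the DPCPU_ID_{GET,SET,PTR} idea. */
#define	TOY_DPCPU_ID_PTR(cpu, name)	(&toy_area[(cpu)].name)
#define	TOY_DPCPU_ID_GET(cpu, name)	(toy_area[(cpu)].name)
#define	TOY_DPCPU_ID_SET(cpu, name, v)	(toy_area[(cpu)].name = (v))
```

In this model the storage for every CPU is a static array, which sidesteps the question the thread is actually about: where the boot processor's area can be allocated before the VM system is usable.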
As for sparc64 allocating the storage for the dynamic area from end probably isn't a good idea as the pmap code assumes that the range from KERNBASE to end is covered by the pages allocated by and locked into the TLB for the kernel by the loader, so depending on the actual kernel size the DPCPU storage may be outside that region. The printf() will also blow at the location you moved it in sparc64_init() as the console isn't initialized at that point, yet. I think the best thing to do for the BSP would be to allocate its DPCPU storage via pmap_bootstrap_alloc() and use the direct mapping. This causes a bit of a chicken and egg problem though as MD parts of the per-CPU data are used to get pmap_bootstrap_alloc() working. Could you move the initialization of the DPCPU part to a dpcpu_init()? Marius From imp at bsdimp.com Tue Jun 9 23:42:32 2009 From: imp at bsdimp.com (M. Warner Losh) Date: Tue Jun 9 23:42:39 2009 Subject: devclass_find_free_unit Message-ID: <20090609.174249.-1435625969.imp@bsdimp.com> What purpose does devclass_find_free_unit serve? I think it can safely be eliminated from the tree. The current design is racy. Comments? It is currently used: ./arm/xscale/ixp425/.svn/text-base/avila_ata.c.svn-base: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); ./arm/xscale/ixp425/avila_ata.c: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); ./arm/at91/.svn/text-base/at91_cfata.c.svn-base: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); ./arm/at91/at91_cfata.c: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); ./powerpc/psim/.svn/text-base/ata_iobus.c.svn-base: devclass_find_free_unit(ata_devclass, 0)); # All the above can be replaced with a simple '-1'. 
ata/ata-pci.c: unit : devclass_find_free_unit(ata_devclass, 2)); ata/ata-usb.c: devclass_find_free_unit(ata_devclass, 2))) == NULL) { These can likely be replaced by '2', but that may result in a warning message being printed that likely can be eliminated... comments? Warner From grehan at freebsd.org Wed Jun 10 00:03:52 2009 From: grehan at freebsd.org (Peter Grehan) Date: Wed Jun 10 00:03:58 2009 Subject: devclass_find_free_unit In-Reply-To: <20090609.174249.-1435625969.imp@bsdimp.com> References: <20090609.174249.-1435625969.imp@bsdimp.com> Message-ID: <4A2EF4C4.408@freebsd.org> Hi Warner, > What purpose does devclass_find_free_unit serve? I think it can safely be > eliminated from the tree. The current design is racy. ... > ./powerpc/psim/.svn/text-base/ata_iobus.c That would be a cut'n'paste relic from the ATA code that it originated from. Going to -1 is fine. later, Peter. From grehan at freebsd.org Wed Jun 10 01:50:11 2009 From: grehan at freebsd.org (Peter Grehan) Date: Wed Jun 10 01:50:17 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: <20090609201127.GA50903@alchemy.franken.de> References: <20090609201127.GA50903@alchemy.franken.de> Message-ID: <4A2F1148.9090706@freebsd.org> > As for sparc64 allocating the storage for the dynamic area > from end probably isn't a good idea as the pmap code assumes > that the range from KERNBASE to end is covered by the pages > allocated by and locked into the TLB for the kernel by the > loader Ditto for ppc. It's possible to get the additional space from within or after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf). later, Peter. 
From jhb at freebsd.org Wed Jun 10 12:24:30 2009 From: jhb at freebsd.org (John Baldwin) Date: Wed Jun 10 12:24:38 2009 Subject: devclass_find_free_unit In-Reply-To: <20090609.174249.-1435625969.imp@bsdimp.com> References: <20090609.174249.-1435625969.imp@bsdimp.com> Message-ID: <200906100822.15516.jhb@freebsd.org> On Tuesday 09 June 2009 7:42:49 pm M. Warner Losh wrote: > What purpose does devclass_find_free_unit serve? I think it can safely be > eliminated from the tree. The current design is racy. > > Comments? > > It is currently used: > > ./arm/xscale/ixp425/.svn/text-base/avila_ata.c.svn-base: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > ./arm/xscale/ixp425/avila_ata.c: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > ./arm/at91/.svn/text-base/at91_cfata.c.svn-base: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > ./arm/at91/at91_cfata.c: device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > ./powerpc/psim/.svn/text-base/ata_iobus.c.svn-base: devclass_find_free_unit(ata_devclass, 0)); > > # All the above can be replaced with a simple '-1'. > > ata/ata-pci.c: unit : devclass_find_free_unit(ata_devclass, 2)); > ata/ata-usb.c: devclass_find_free_unit(ata_devclass, 2))) == NULL) { > > These can likely be replaced by '2', but that may result in a warning > message being printed that likely can be eliminated... ata does this so it can reserve ata0 and ata1 for the "legacy" ATA channels on legacy ATA PCI adapters. That is, if you have both SATA controllers and a PATA controller, this allows the two PATA channels to always be ata0 and ata1 and the PATA drivers to always be ad0 - ad3. You could perhaps implement this in 8.x now by a really horrendous hack of having ISA hints for ata0 and ata1 and letting bus_hint_device_unit() in the atapci driver claim those hints for the channels on PATA controllers. 
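The numbering policy described above (ata0 and ata1 held back for legacy channels, everything else taking the first free unit from 2 upward) can be sketched as a toy allocator. The names below are hypothetical and this is not the newbus API; it only models the policy, under the assumption that a request for an occupied unit fails rather than being renumbered.

```c
#include <stdbool.h>

#define	TOY_MAX_UNITS		16	/* assumed table size */
#define	TOY_LEGACY_UNITS	2	/* ata0 and ata1 are held back */

static bool toy_unit_used[TOY_MAX_UNITS];

/* Claim one specific unit; returns it, or -1 if taken or out of range. */
int
toy_claim_unit(int unit)
{
	if (unit < 0 || unit >= TOY_MAX_UNITS || toy_unit_used[unit])
		return (-1);
	toy_unit_used[unit] = true;
	return (unit);
}

/* Claim the first free unit at or above 'first'; -1 if none is left. */
int
toy_claim_free_unit(int first)
{
	for (int unit = first; unit < TOY_MAX_UNITS; unit++) {
		if (!toy_unit_used[unit]) {
			toy_unit_used[unit] = true;
			return (unit);
		}
	}
	return (-1);
}

/*
 * Policy: channel 0 or 1 of a legacy controller keeps its own number;
 * any other channel is numbered from TOY_LEGACY_UNITS upward.
 */
int
toy_add_channel(int channel, bool legacy)
{
	if (legacy && channel >= 0 && channel < TOY_LEGACY_UNITS)
		return (toy_claim_unit(channel));
	return (toy_claim_free_unit(TOY_LEGACY_UNITS));
}
```

Because the search and the claim happen in one step here, two callers cannot be handed the same number; a separate "find a free unit" call followed later by "add a child with that unit" leaves a window between the two.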
-- John Baldwin From imp at bsdimp.com Wed Jun 10 16:22:33 2009 From: imp at bsdimp.com (M. Warner Losh) Date: Wed Jun 10 16:22:40 2009 Subject: devclass_find_free_unit In-Reply-To: <200906100822.15516.jhb@freebsd.org> References: <20090609.174249.-1435625969.imp@bsdimp.com> <200906100822.15516.jhb@freebsd.org> Message-ID: <20090610.102144.324381338.imp@bsdimp.com> In message: <200906100822.15516.jhb@freebsd.org> John Baldwin writes: : On Tuesday 09 June 2009 7:42:49 pm M. Warner Losh wrote: : > What purpose does devclass_find_free_unit serve? I think it can safely be : > eliminated from the tree. The current design is racy. : > : > Comments? : > : > It is currently used: : > : > ./arm/xscale/ixp425/.svn/text-base/avila_ata.c.svn-base: : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); : > ./arm/xscale/ixp425/avila_ata.c: device_add_child(dev, "ata", : devclass_find_free_unit(ata_devclass, 0)); : > ./arm/at91/.svn/text-base/at91_cfata.c.svn-base: : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); : > ./arm/at91/at91_cfata.c: device_add_child(dev, "ata", : devclass_find_free_unit(ata_devclass, 0)); : > ./powerpc/psim/.svn/text-base/ata_iobus.c.svn-base: : devclass_find_free_unit(ata_devclass, 0)); : > : > # All the above can be replaced with a simple '-1'. : > : > ata/ata-pci.c: unit : devclass_find_free_unit(ata_devclass, 2)); : > ata/ata-usb.c: devclass_find_free_unit(ata_devclass, 2))) == : NULL) { : > : > These can likely be replaced by '2', but that may result in a warning : > message being printed that likely can be eliminated... : : ata does this so it can reserve ata0 and ata1 for the "legacy" ATA channels on : legacy ATA PCI adapters. That is, if you have both SATA controllers and a : PATA controller, this allows the two PATA channels to always be ata0 and ata1 : and the PATA drivers to always be ad0 - ad3. 
You could perhaps implement : this in 8.x now by a really horrendous hack of having ISA hints for ata0 and : ata1 and letting bus_hint_device_unit() in the atapci driver claim those : hints for the channels on PATA controllers. I think it already does something akin to this: /* attach all channels on this controller */ for (unit = 0; unit < ctlr->channels; unit++) { if ((ctlr->ichannels & (1 << unit)) == 0) continue; child = device_add_child(dev, "ata", ((unit == 0 || unit == 1) && ctlr->legacy) ? unit : devclass_find_free_unit(ata_devclass, 2)); if (child == NULL) device_printf(dev, "failed to add ata child device\n"); else device_set_ivars(child, (void *)(intptr_t)unit); } Why not just replace devclass_find_free_unit with '2'? All the other users in the tree aer bogus and should be replaced by -1. Well, I'm not 100% sure about the ata-usb.c patch, since that would also be necessary to avoid collision. And the above code really only applies to x86-based machine, right? There's no need to do that for non-intel boxes. Or is the assumption on those boxes the controller would never be in legacy. Warner -------------- next part -------------- Index: arm/xscale/ixp425/avila_ata.c =================================================================== --- arm/xscale/ixp425/avila_ata.c (revision 193873) +++ arm/xscale/ixp425/avila_ata.c (working copy) @@ -248,7 +248,7 @@ NULL, ata_avila_intr, sc, &sc->sc_ih); /* attach channel on this controller */ - device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); + device_add_child(dev, "ata", -1); bus_generic_attach(dev); return 0; Index: arm/at91/at91_cfata.c =================================================================== --- arm/at91/at91_cfata.c (revision 193873) +++ arm/at91/at91_cfata.c (working copy) @@ -94,7 +94,7 @@ /* XXX: init CF controller? */ callout_init(&sc->tick, 1); /* Callout to poll the device. 
*/ - device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); + device_add_child(dev, "ata", -1); bus_generic_attach(dev); return (0); } Index: arm/at91/files.at91 =================================================================== --- arm/at91/files.at91 (revision 193873) +++ arm/at91/files.at91 (working copy) @@ -18,6 +18,11 @@ arm/at91/uart_bus_at91usart.c optional uart arm/at91/uart_cpu_at91rm9200usart.c optional uart arm/at91/uart_dev_at91usart.c optional uart + +dev/cfi/cfi_bus_lbc.c optional cfi +dev/cfi/cfi_core.c optional cfi +dev/cfi/cfi_dev.c optional cfi + # # All the boards we support # Index: arm/at91/at91_machdep.c =================================================================== --- arm/at91/at91_machdep.c (revision 193873) +++ arm/at91/at91_machdep.c (working copy) @@ -179,6 +179,14 @@ PTE_NOCACHE, }, { + AT91RM92_FLS_BASE, + AT91RM92_FLS_PA_BASE, + AT91RM92_FLS_SIZE, + VM_PROT_READ|VM_PROT_WRITE, + PTE_NOCACHE, + }, + + { /* CompactFlash controller. */ AT91RM92_CF_BASE, AT91RM92_CF_PA_BASE, Index: arm/at91/at91_mci.c =================================================================== --- arm/at91/at91_mci.c (revision 193873) +++ arm/at91/at91_mci.c (working copy) @@ -62,8 +62,15 @@ #include "mmcbr_if.h" -#define BBSZ 512 +// #define AT91_MCI_DEBUG +#define MAX_SIZE 1 +#define BBSZ 512 * MAX_SIZE +/* + * Note: This driver only supports the SlotA card. No attempt has been made + * to support SlotB. + */ + struct at91_mci_softc { void *intrhand; /* Interrupt handle */ device_t dev; @@ -204,10 +211,16 @@ sc->host.f_min = 375000; sc->host.f_max = at91_master_clock / 2; /* Typically 30MHz */ sc->host.host_ocr = MMC_OCR_320_330 | MMC_OCR_330_340; + sc->host.caps = 0; + /* + * The in-tree Linux driver doesn't allow 4-wire operation for the + * at91rm9200, but does for other members of the family. The atmel + * patches to this do allow it, or have in the past. 
It is unclear + * that the hardware even works, but my boot loader uses 4-bit bus + * in polling mode successfully. + */ if (sc->sc_cap & CAP_HAS_4WIRE) - sc->host.caps = MMC_CAP_4_BIT_DATA; - else - sc->host.caps = 0; + sc->host.caps |= MMC_CAP_4_BIT_DATA; child = device_add_child(dev, "mmc", 0); device_set_ivars(dev, &sc->host); err = bus_generic_attach(dev); @@ -300,9 +313,9 @@ clkdiv = (at91_master_clock / ios->clock) / 2; } if (ios->bus_width == bus_width_4) - WR4(sc, MCI_SDCR, RD4(sc, MCI_SDCR) | MCI_SDCR_SDCBUS); + WR4(sc, MCI_SDCR, MCI_SDCR_SDCBUS); else - WR4(sc, MCI_SDCR, RD4(sc, MCI_SDCR) & ~MCI_SDCR_SDCBUS); + WR4(sc, MCI_SDCR, 0); WR4(sc, MCI_MR, (RD4(sc, MCI_MR) & ~MCI_MR_CLKDIV) | clkdiv); /* Do we need a settle time here? */ /* XXX We need to turn the device on/off here with a GPIO pin */ @@ -341,7 +354,9 @@ if (!data) { // The no data case is fairly simple at91_mci_pdc_disable(sc); -// printf("CMDR %x ARGR %x\n", cmdr, cmd->arg); +#ifdef AT91_MCI_DEBUG + printf("CMDR %x ARGR %x\n", cmdr, cmd->arg); +#endif WR4(sc, MCI_ARGR, cmd->arg); WR4(sc, MCI_CMDR, cmdr); WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_CMDRDY); @@ -399,7 +414,9 @@ ier = MCI_SR_TXBUFE; } } -// printf("CMDR %x ARGR %x with data\n", cmdr, cmd->arg); +#ifdef AT91_MCI_DEBUG + printf("CMDR %x ARGR %x with data\n", cmdr, cmd->arg); +#endif WR4(sc, MCI_ARGR, cmd->arg); if (cmdr & MCI_CMDR_TRCMD_START) { if (cmdr & MCI_CMDR_TRDIR) { @@ -438,6 +455,14 @@ sc->req = NULL; sc->curcmd = NULL; req->done(req); + /* + * Attempted hack-a-round for the DMA bug for multiple reads. 
+ */ + if (req->cmd->opcode == MMC_READ_MULTIPLE_BLOCK) { + at91_mci_fini(sc->dev); + at91_mci_init(sc->dev); + at91_mci_update_ios(sc->dev, NULL); + } } static int @@ -498,7 +523,9 @@ uint32_t *walker; struct mmc_command *cmd; int i, len; - +#ifdef AT91_MCI_DEBUG + char *w2; +#endif cmd = sc->curcmd; bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(sc->dmatag, sc->map); @@ -509,6 +536,15 @@ for (i = 0; i < len; i++) walker[i] = bswap32(walker[i]); } +#ifdef AT91_MCI_DEBUG + printf("Read data\n"); + for (i = 0, w2 = cmd->data->data; i < cmd->data->len; i++) { + if (i % 16 == 0) + printf("%08x ", cmd->arg + i); + printf("%02x%s", w2[i], (i + 1) % 16 ? " " : "\n"); + } + printf("\n"); +#endif // Finish up the sequence... WR4(sc, MCI_IDR, MCI_SR_ENDRX); WR4(sc, MCI_IER, MCI_SR_RXBUFF); @@ -544,14 +580,19 @@ if ((sr & MCI_SR_RCRCE) && (cmd->opcode == MMC_SEND_OP_COND || cmd->opcode == ACMD_SD_SEND_OP_COND)) cmd->error = MMC_ERR_NONE; - else if (sr & (MCI_SR_RTOE | MCI_SR_DTOE)) + else if (sr & (MCI_SR_RTOE | MCI_SR_DTOE)) { + printf("TIMEOUT %#x\n", sr); cmd->error = MMC_ERR_TIMEOUT; - else if (sr & (MCI_SR_RCRCE | MCI_SR_DCRCE)) + } else if (sr & (MCI_SR_RCRCE | MCI_SR_DCRCE)) { + printf("CRC %#x\n", sr); cmd->error = MMC_ERR_BADCRC; - else if (sr & (MCI_SR_OVRE | MCI_SR_UNRE)) + } else if (sr & (MCI_SR_OVRE | MCI_SR_UNRE)) { + printf("FIFO %#x\n", sr); cmd->error = MMC_ERR_FIFO; - else + } else { + printf("FAILED %#x\n", sr); cmd->error = MMC_ERR_FAILED; + } done = 1; if (sc->mapped && cmd->error) { bus_dmamap_unload(sc->dmatag, sc->map); @@ -656,7 +697,7 @@ *(int *)result = sc->host.caps; break; case MMCBR_IVAR_MAX_DATA: - *(int *)result = 1; + *(int *)result = MAX_SIZE; break; } return (0); Index: arm/at91/at91rm92reg.h =================================================================== --- arm/at91/at91rm92reg.h (revision 193873) +++ arm/at91/at91rm92reg.h (working copy) @@ -341,6 +341,10 @@ #define AT91RM92_OHCI_PA_BASE 0x00300000 
#define AT91RM92_OHCI_SIZE 0x00100000 +#define AT91RM92_FLS_BASE 0xdf000000 +#define AT91RM92_FLS_PA_BASE 0x10000000 +#define AT91RM92_FLS_SIZE 0x02000000 /* Support up to 32MB flash */ + #define AT91RM92_CF_BASE 0xdfd00000 #define AT91RM92_CF_PA_BASE 0x51400000 #define AT91RM92_CF_SIZE 0x00100000 Index: powerpc/psim/ata_iobus.c =================================================================== --- powerpc/psim/ata_iobus.c (revision 193873) +++ powerpc/psim/ata_iobus.c (working copy) @@ -114,9 +114,7 @@ * Add a single child per controller. Should be able * to add two */ - device_add_child(dev, "ata", - devclass_find_free_unit(ata_devclass, 0)); - + device_add_child(dev, "ata", -1); return (bus_generic_attach(dev)); } Index: dev/ata/ata-all.c =================================================================== --- dev/ata/ata-all.c (revision 193873) +++ dev/ata/ata-all.c (working copy) @@ -663,7 +663,7 @@ btrim(atacap->serial, sizeof(atacap->serial)); bpack(atacap->serial, atacap->serial, sizeof(atacap->serial)); - if (bootverbose) + if (bootverbose || 1) printf("ata%d-%s: pio=%s wdma=%s udma=%s cable=%s wire\n", device_get_unit(ch->dev), ata_unit2str(atadev), Index: dev/ata/ata-usb.c =================================================================== --- dev/ata/ata-usb.c (revision 193873) +++ dev/ata/ata-usb.c (working copy) @@ -414,11 +414,10 @@ /* ata channels are children to this USB control device */ for (i = 0; i <= sc->maxlun; i++) { - if ((child = device_add_child(sc->dev, "ata", - devclass_find_free_unit(ata_devclass, 2))) == NULL) { + if ((child = device_add_child(sc->dev, "ata", -1)) == NULL) device_printf(sc->dev, "failed to add ata child device\n"); - } else - device_set_ivars(child, (void *)(intptr_t)i); + else + device_set_ivars(child, (void *)(intptr_t)i); } bus_generic_attach(sc->dev); From jhb at freebsd.org Wed Jun 10 17:09:14 2009 From: jhb at freebsd.org (John Baldwin) Date: Wed Jun 10 17:09:21 2009 Subject: devclass_find_free_unit In-Reply-To: 
<20090610.102144.324381338.imp@bsdimp.com> References: <20090609.174249.-1435625969.imp@bsdimp.com> <200906100822.15516.jhb@freebsd.org> <20090610.102144.324381338.imp@bsdimp.com> Message-ID: <200906101302.03211.jhb@freebsd.org> On Wednesday 10 June 2009 12:21:44 pm M. Warner Losh wrote: > In message: <200906100822.15516.jhb@freebsd.org> > John Baldwin writes: > : On Tuesday 09 June 2009 7:42:49 pm M. Warner Losh wrote: > : > What purpose does devclass_find_free_unit serve? I think it can safely be > : > eliminated from the tree. The current design is racy. > : > > : > Comments? > : > > : > It is currently used: > : > > : > ./arm/xscale/ixp425/.svn/text-base/avila_ata.c.svn-base: > : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > : > ./arm/xscale/ixp425/avila_ata.c: device_add_child(dev, "ata", > : devclass_find_free_unit(ata_devclass, 0)); > : > ./arm/at91/.svn/text-base/at91_cfata.c.svn-base: > : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > : > ./arm/at91/at91_cfata.c: device_add_child(dev, "ata", > : devclass_find_free_unit(ata_devclass, 0)); > : > ./powerpc/psim/.svn/text-base/ata_iobus.c.svn-base: > : devclass_find_free_unit(ata_devclass, 0)); > : > > : > # All the above can be replaced with a simple '-1'. > : > > : > ata/ata-pci.c: unit : devclass_find_free_unit(ata_devclass, 2)); > : > ata/ata-usb.c: devclass_find_free_unit(ata_devclass, 2))) == > : NULL) { > : > > : > These can likely be replaced by '2', but that may result in a warning > : > message being printed that likely can be eliminated... > : > : ata does this so it can reserve ata0 and ata1 for the "legacy" ATA channels on > : legacy ATA PCI adapters. That is, if you have both SATA controllers and a > : PATA controller, this allows the two PATA channels to always be ata0 and ata1 > : and the PATA drivers to always be ad0 - ad3. 
You could perhaps implement > : this in 8.x now by a really horrendous hack of having ISA hints for ata0 and > : ata1 and letting bus_hint_device_unit() in the atapci driver claim those > : hints for the channels on PATA controllers. > > I think it already does something akin to this: > > /* attach all channels on this controller */ > for (unit = 0; unit < ctlr->channels; unit++) { > if ((ctlr->ichannels & (1 << unit)) == 0) > continue; > child = device_add_child(dev, "ata", > ((unit == 0 || unit == 1) && ctlr->legacy) ? > unit : devclass_find_free_unit(ata_devclass, 2)); > if (child == NULL) > device_printf(dev, "failed to add ata child device\n"); > else > device_set_ivars(child, (void *)(intptr_t)unit); > } > > Why not just replace devclass_find_free_unit with '2'? Because if you add 'ata2', and 'ata2' exists it will fail, it won't rename it to ata3. And that is what ata is trying to do. It basically wants to "reserve" ata0 and ata1 and then use device_add_child(..., -1). However, device_add_child(..., -1) will not "reserve" ata0 and ata1. > All the other users in the tree aer bogus and should be replaced by > -1. Well, I'm not 100% sure about the ata-usb.c patch, since that > would also be necessary to avoid collision. And the above code really > only applies to x86-based machine, right? There's no need to do that > for non-intel boxes. Or is the assumption on those boxes the > controller would never be in legacy. Any machine that can have a PCI PATA controller or a PCI SATA controller operating in "legacy" mode. That said, the compatability bits probably don't matter as much on non-x86 as there are not older releases to be compatible with (or the impact would be less severe if we renumber people's drives at least). -- John Baldwin From imp at bsdimp.com Wed Jun 10 17:32:13 2009 From: imp at bsdimp.com (M. 
Warner Losh) Date: Wed Jun 10 17:32:20 2009 Subject: devclass_find_free_unit In-Reply-To: <200906101302.03211.jhb@freebsd.org> References: <200906100822.15516.jhb@freebsd.org> <20090610.102144.324381338.imp@bsdimp.com> <200906101302.03211.jhb@freebsd.org> Message-ID: <20090610.112813.623117012.imp@bsdimp.com> In message: <200906101302.03211.jhb@freebsd.org> John Baldwin writes: : On Wednesday 10 June 2009 12:21:44 pm M. Warner Losh wrote: : > In message: <200906100822.15516.jhb@freebsd.org> : > John Baldwin writes: : > : On Tuesday 09 June 2009 7:42:49 pm M. Warner Losh wrote: : > : > What purpose does devclass_find_free_unit serve? I think it can safely : be : > : > eliminated from the tree. The current design is racy. : > : > : > : > Comments? : > : > : > : > It is currently used: : > : > : > : > ./arm/xscale/ixp425/.svn/text-base/avila_ata.c.svn-base: : > : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); : > : > ./arm/xscale/ixp425/avila_ata.c: device_add_child(dev, "ata", : > : devclass_find_free_unit(ata_devclass, 0)); : > : > ./arm/at91/.svn/text-base/at91_cfata.c.svn-base: : > : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); : > : > ./arm/at91/at91_cfata.c: device_add_child(dev, "ata", : > : devclass_find_free_unit(ata_devclass, 0)); : > : > ./powerpc/psim/.svn/text-base/ata_iobus.c.svn-base: : > : devclass_find_free_unit(ata_devclass, 0)); : > : > : > : > # All the above can be replaced with a simple '-1'. : > : > : > : > ata/ata-pci.c: unit : devclass_find_free_unit(ata_devclass, 2)); : > : > ata/ata-usb.c: devclass_find_free_unit(ata_devclass, 2))) : == : > : NULL) { : > : > : > : > These can likely be replaced by '2', but that may result in a warning : > : > message being printed that likely can be eliminated... : > : : > : ata does this so it can reserve ata0 and ata1 for the "legacy" ATA : channels on : > : legacy ATA PCI adapters. 
That is, if you have both SATA controllers and a : > : PATA controller, this allows the two PATA channels to always be ata0 and : ata1 : > : and the PATA drivers to always be ad0 - ad3. You could perhaps implement : > : this in 8.x now by a really horrendous hack of having ISA hints for ata0 : and : > : ata1 and letting bus_hint_device_unit() in the atapci driver claim those : > : hints for the channels on PATA controllers. : > : > I think it already does something akin to this: : > : > /* attach all channels on this controller */ : > for (unit = 0; unit < ctlr->channels; unit++) { : > if ((ctlr->ichannels & (1 << unit)) == 0) : > continue; : > child = device_add_child(dev, "ata", : > ((unit == 0 || unit == 1) && ctlr->legacy) ? : > unit : devclass_find_free_unit(ata_devclass, 2)); : > if (child == NULL) : > device_printf(dev, "failed to add ata child device\n"); : > else : > device_set_ivars(child, (void *)(intptr_t)unit); : > } : > : > Why not just replace devclass_find_free_unit with '2'? : : Because if you add 'ata2', and 'ata2' exists it will fail, it won't rename it : to ata3. And that is what ata is trying to do. It basically wants : to "reserve" ata0 and ata1 and then use device_add_child(..., -1). However, : device_add_child(..., -1) will not "reserve" ata0 and ata1. Ah yes. It does just fail. However, setting the unit here is racy. If we were to make the device tree probe more parallel, then we may have a case where devclass_find_free_unit gets called from two different threads, returning the same number, then the device_child_add works for only one of these threads... : > All the other users in the tree aer bogus and should be replaced by : > -1. Well, I'm not 100% sure about the ata-usb.c patch, since that : > would also be necessary to avoid collision. And the above code really : > only applies to x86-based machine, right? There's no need to do that : > for non-intel boxes. 
Or is the assumption on those boxes the
: > controller would never be in legacy mode?
:
: Any machine that can have a PCI PATA controller or a PCI SATA controller
: operating in "legacy" mode.  That said, the compatibility bits probably don't
: matter as much on non-x86 as there are no older releases to be compatible
: with (or the impact would be less severe if we renumber people's drives at
: least).

Yes.  I guess I was asking if we need an ifdef for this behavior or
not...  I guess not.

I think we need a better way to 'reserve' a unit than we have
today.  It would be better to add that and then retire
devclass_find_free_unit.  I think that only one or two uses in the
tree are legit...

Warner

From jhb at freebsd.org  Wed Jun 10 17:45:14 2009
From: jhb at freebsd.org (John Baldwin)
Date: Wed Jun 10 17:45:21 2009
Subject: devclass_find_free_unit
In-Reply-To: <20090610.112813.623117012.imp@bsdimp.com>
References: <200906100822.15516.jhb@freebsd.org>
    <200906101302.03211.jhb@freebsd.org>
    <20090610.112813.623117012.imp@bsdimp.com>
Message-ID: <200906101343.15311.jhb@freebsd.org>

On Wednesday 10 June 2009 1:28:13 pm M. Warner Losh wrote:
> In message: <200906101302.03211.jhb@freebsd.org>
>             John Baldwin writes:
> : On Wednesday 10 June 2009 12:21:44 pm M. Warner Losh wrote:
> : > In message: <200906100822.15516.jhb@freebsd.org>
> : >             John Baldwin writes:
> : > : On Tuesday 09 June 2009 7:42:49 pm M. Warner Losh wrote:
> : > : > What purpose does devclass_find_free_unit serve?  I think it can safely be
> : > : > eliminated from the tree.  The current design is racy.
> : > : >
> : > : > Comments?
> : > : > > : > : > It is currently used: > : > : > > : > : > ./arm/xscale/ixp425/.svn/text-base/avila_ata.c.svn-base: > : > : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > : > : > ./arm/xscale/ixp425/avila_ata.c: device_add_child(dev, "ata", > : > : devclass_find_free_unit(ata_devclass, 0)); > : > : > ./arm/at91/.svn/text-base/at91_cfata.c.svn-base: > : > : device_add_child(dev, "ata", devclass_find_free_unit(ata_devclass, 0)); > : > : > ./arm/at91/at91_cfata.c: device_add_child(dev, "ata", > : > : devclass_find_free_unit(ata_devclass, 0)); > : > : > ./powerpc/psim/.svn/text-base/ata_iobus.c.svn-base: > : > : devclass_find_free_unit(ata_devclass, 0)); > : > : > > : > : > # All the above can be replaced with a simple '-1'. > : > : > > : > : > ata/ata-pci.c: unit : devclass_find_free_unit(ata_devclass, 2)); > : > : > ata/ata-usb.c: devclass_find_free_unit(ata_devclass, 2))) > : == > : > : NULL) { > : > : > > : > : > These can likely be replaced by '2', but that may result in a warning > : > : > message being printed that likely can be eliminated... > : > : > : > : ata does this so it can reserve ata0 and ata1 for the "legacy" ATA > : channels on > : > : legacy ATA PCI adapters. That is, if you have both SATA controllers and a > : > : PATA controller, this allows the two PATA channels to always be ata0 and > : ata1 > : > : and the PATA drivers to always be ad0 - ad3. You could perhaps implement > : > : this in 8.x now by a really horrendous hack of having ISA hints for ata0 > : and > : > : ata1 and letting bus_hint_device_unit() in the atapci driver claim those > : > : hints for the channels on PATA controllers. > : > > : > I think it already does something akin to this: > : > > : > /* attach all channels on this controller */ > : > for (unit = 0; unit < ctlr->channels; unit++) { > : > if ((ctlr->ichannels & (1 << unit)) == 0) > : > continue; > : > child = device_add_child(dev, "ata", > : > ((unit == 0 || unit == 1) && ctlr->legacy) ? 
> : > unit : devclass_find_free_unit(ata_devclass, 2)); > : > if (child == NULL) > : > device_printf(dev, "failed to add ata child device\n"); > : > else > : > device_set_ivars(child, (void *)(intptr_t)unit); > : > } > : > > : > Why not just replace devclass_find_free_unit with '2'? > : > : Because if you add 'ata2', and 'ata2' exists it will fail, it won't rename it > : to ata3. And that is what ata is trying to do. It basically wants > : to "reserve" ata0 and ata1 and then use device_add_child(..., -1). However, > : device_add_child(..., -1) will not "reserve" ata0 and ata1. > > Ah yes. It does just fail. However, setting the unit here is racy. > If we were to make the device tree probe more parallel, then we may > have a case where devclass_find_free_unit gets called from two > different threads, returning the same number, then the > device_child_add works for only one of these threads... Yes, it is quite racey. > : > All the other users in the tree aer bogus and should be replaced by > : > -1. Well, I'm not 100% sure about the ata-usb.c patch, since that > : > would also be necessary to avoid collision. And the above code really > : > only applies to x86-based machine, right? There's no need to do that > : > for non-intel boxes. Or is the assumption on those boxes the > : > controller would never be in legacy. > : > : Any machine that can have a PCI PATA controller or a PCI SATA controller > : operating in "legacy" mode. That said, the compatability bits probably don't > : matter as much on non-x86 as there are not older releases to be compatible > : with (or the impact would be less severe if we renumber people's drives at > : least). > > Yes. I guess I was asking if we need an ifdef for this behavior or > not... I guess not.. > > I think we need to have a better way to 'reserve' a unit than we have > today. I think this will be better to do that and retire > devclass_find_free_unit. I think that only one or two uses in the > tree are legit... 
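
A minimal sketch of the race described above, and of what folding the
search and the claim into one step would buy.  Nothing here is FreeBSD
code: toy_devclass, toy_find_free_unit, toy_add_child and
toy_add_child_from are hypothetical stand-ins, and C11 atomics are only
an assumption standing in for whatever locking newbus would really use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_MAXUNIT 8

/* Hypothetical stand-in for a devclass: one "taken" flag per unit number. */
struct toy_devclass {
	atomic_bool taken[TOY_MAXUNIT];
};

/* Racy query: reports the first free unit >= start, but claims nothing. */
static int
toy_find_free_unit(struct toy_devclass *dc, int start)
{
	for (int u = start; u < TOY_MAXUNIT; u++)
		if (!atomic_load(&dc->taken[u]))
			return (u);
	return (-1);
}

/* Claim one specific unit; returns the unit, or -1 if it is already taken. */
static int
toy_add_child(struct toy_devclass *dc, int unit)
{
	bool expected = false;

	if (unit < 0 || unit >= TOY_MAXUNIT)
		return (-1);
	if (atomic_compare_exchange_strong(&dc->taken[unit], &expected, true))
		return (unit);
	return (-1);
}

/* Search and claim in one step: the first unit >= start that we win. */
static int
toy_add_child_from(struct toy_devclass *dc, int start)
{
	for (int u = start; u < TOY_MAXUNIT; u++)
		if (toy_add_child(dc, u) == u)
			return (u);
	return (-1);
}

/* Replays the bad interleaving by hand: both callers query before either adds. */
static void
toy_demo(int racy[2], int safe[2])
{
	struct toy_devclass a = { { false } }, b = { { false } };
	int seen0, seen1;

	seen0 = toy_find_free_unit(&a, 2);	/* caller 0 is told unit 2 */
	seen1 = toy_find_free_unit(&a, 2);	/* caller 1 is told unit 2 too */
	assert(seen0 == seen1);			/* the stale answer is shared */
	racy[0] = toy_add_child(&a, seen0);	/* succeeds */
	racy[1] = toy_add_child(&a, seen1);	/* fails: unit 2 is gone */

	safe[0] = toy_add_child_from(&b, 2);	/* units 0 and 1 stay reserved */
	safe[1] = toy_add_child_from(&b, 2);	/* skips 2, wins the next one */
}
```

Calling toy_demo(racy, safe) leaves racy = {2, -1} and safe = {2, 3}:
with a separate query the second caller simply fails, while the
combined call hands out the next unit and never touches units 0 and 1.
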
Well, that was why I suggested possibly depending on the (already-existing)
ata[01] ISA hints and having a bus_hint_device_unit() method for the atapci
driver that lets PATA channels claim ata[01].  Then the ata driver could
always use device_add_child(..., -1) to add "ata" devices.  It is sort of
odd, but it actually matches what the code is trying to do: let the PATA
channels "look like" the old ISA channels.

--
John Baldwin

From jroberson at jroberson.net  Fri Jun 12 13:23:05 2009
From: jroberson at jroberson.net (Jeff Roberson)
Date: Fri Jun 12 13:23:12 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed
In-Reply-To: <4A2F1148.9090706@freebsd.org>
References: <20090609201127.GA50903@alchemy.franken.de>
    <4A2F1148.9090706@freebsd.org>
Message-ID:

On Tue, 9 Jun 2009, Peter Grehan wrote:

>> As for sparc64 allocating the storage for the dynamic area
>> from end probably isn't a good idea as the pmap code assumes
>> that the range from KERNBASE to end is covered by the pages
>> allocated by and locked into the TLB for the kernel by the
>> loader
>
> Ditto for ppc. It's possible to get the additional space from within or
> after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf).
>

OK, I had originally split it into two stages.  It sounds like I should
return to this and revise the patch again.  We worked out the i386
problems.  I'll split it up again and post something that hopefully you
two can try.

Thanks,
Jeff

> later,
>
> Peter.
> From kris at FreeBSD.org Sun Jun 14 13:01:45 2009 From: kris at FreeBSD.org (Kris Kennaway) Date: Sun Jun 14 13:01:51 2009 Subject: [PATCH] Adaptive spinning for lockmgr In-Reply-To: <3bbf2fe10906081342i6ef418e0n75e22d0b9e2543b3@mail.gmail.com> References: <3bbf2fe10906081342i6ef418e0n75e22d0b9e2543b3@mail.gmail.com> Message-ID: <4A34F4B7.5050904@FreeBSD.org> Attilio Rao wrote: > This patch enables adaptive spinning for lockmgr: > http://www.freebsd.org/~attilio/adaptive_lockmgr.diff > > and it should presumably improve performance on disks/vfs/buffer cache > based benchmarks, so, if you want to try out and report any benchmarks > result, I'd love to see it. > Please note that there are some parameters to tune: for example, you > would like to not enable adaptive spinning to default while you just > want that for a class of locks (and in that case you want to apply the > reversed logic for what is living now) or you want to use different > values for retries and loops. Interested developers can refer to such > 3 variables. > Peter Holm alredy tested that patch for about 24hours without any > regression to report. > > Also note that the patch is not 100% yet as long as it needs UPDATES > and manpages updates, but they will be added just in time before to > commit. > The modify is all there. I have a vague memory that we had tested a version of this in the past and found that it caused a performance loss in common cases? Many lockmgr callers are not amenable to adaptive spinning because they have to wait on slow I/O. Testing only with e.g. md backing might give results that are non-representative. 
Kris

From attilio at freebsd.org  Sun Jun 14 14:23:13 2009
From: attilio at freebsd.org (Attilio Rao)
Date: Sun Jun 14 14:23:25 2009
Subject: [PATCH] Adaptive spinning for lockmgr
In-Reply-To: <4A34F4B7.5050904@FreeBSD.org>
References: <3bbf2fe10906081342i6ef418e0n75e22d0b9e2543b3@mail.gmail.com>
    <4A34F4B7.5050904@FreeBSD.org>
Message-ID: <3bbf2fe10906140723y2a99eb8an3488796ac6604134@mail.gmail.com>

2009/6/14 Kris Kennaway :
> Attilio Rao wrote:
>>
>> This patch enables adaptive spinning for lockmgr:
>> http://www.freebsd.org/~attilio/adaptive_lockmgr.diff
>>
>> and it should presumably improve performance on disks/vfs/buffer cache
>> based benchmarks, so, if you want to try out and report any benchmarks
>> result, I'd love to see it.
>> Please note that there are some parameters to tune: for example, you
>> would like to not enable adaptive spinning to default while you just
>> want that for a class of locks (and in that case you want to apply the
>> reversed logic for what is living now) or you want to use different
>> values for retries and loops. Interested developers can refer to such
>> 3 variables.
>> Peter Holm already tested that patch for about 24 hours without any
>> regression to report.
>>
>> Also note that the patch is not 100% yet as long as it needs UPDATES
>> and manpages updates, but they will be added just in time before to
>> commit.
>> The modify is all there.
>
> I have a vague memory that we had tested a version of this in the past and
> found that it caused a performance loss in common cases?  Many lockmgr
> callers are not amenable to adaptive spinning because they have to wait on
> slow I/O.  Testing only with e.g. md backing might give results that are
> non-representative.

I don't think I ever implemented adaptive spinning in lockmgr, so if
somebody else did I don't know about it. That said, probably the best
approach would be to disable it by default and use a LK_ADAPTIVESPIN
flag on a per-instance basis.
Such conditions, though, need to be explored a bit and I have no time to dedicate to this right now. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From kris at FreeBSD.org Sun Jun 14 15:00:43 2009 From: kris at FreeBSD.org (Kris Kennaway) Date: Sun Jun 14 15:00:54 2009 Subject: [PATCH] Adaptive spinning for lockmgr In-Reply-To: <3bbf2fe10906140723y2a99eb8an3488796ac6604134@mail.gmail.com> References: <3bbf2fe10906081342i6ef418e0n75e22d0b9e2543b3@mail.gmail.com> <4A34F4B7.5050904@FreeBSD.org> <3bbf2fe10906140723y2a99eb8an3488796ac6604134@mail.gmail.com> Message-ID: <4A351099.3020407@FreeBSD.org> Attilio Rao wrote: > 2009/6/14 Kris Kennaway : >> Attilio Rao wrote: >>> This patch enables adaptive spinning for lockmgr: >>> http://www.freebsd.org/~attilio/adaptive_lockmgr.diff >>> >>> and it should presumably improve performance on disks/vfs/buffer cache >>> based benchmarks, so, if you want to try out and report any benchmarks >>> result, I'd love to see it. >>> Please note that there are some parameters to tune: for example, you >>> would like to not enable adaptive spinning to default while you just >>> want that for a class of locks (and in that case you want to apply the >>> reversed logic for what is living now) or you want to use different >>> values for retries and loops. Interested developers can refer to such >>> 3 variables. >>> Peter Holm alredy tested that patch for about 24hours without any >>> regression to report. >>> >>> Also note that the patch is not 100% yet as long as it needs UPDATES >>> and manpages updates, but they will be added just in time before to >>> commit. >>> The modify is all there. >> I have a vague memory that we had tested a version of this in the past and >> found that it caused a performance loss in common cases? Many lockmgr >> callers are not amenable to adaptive spinning because they have to wait on >> slow I/O. Testing only with e.g. md backing might give results that are >> non-representative. 
> > I don't think I ever implemented adaptive spinning in lockmgr so if > somebody else did I don't know. Said that, probabilly the best > approach would be to disable it by default ad use a LK_ADAPTIVESPIN > flag on a per instance basis. > Such conditions, though, need to be explored a bit and I have no time > to dedicate to this right now. OK, I am mis-remembering then. Ideally it would be tested in several representative workloads to see where it helps. I can't promise whether I can do this though, for the same reason as you :( Kris From bugmaster at FreeBSD.org Mon Jun 15 11:06:50 2009 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Jun 15 11:07:25 2009 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200906151106.n5FB6nH2076823@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From Daan at vehosting.nl Mon Jun 15 21:53:22 2009 From: Daan at vehosting.nl (Daan Vreeken) Date: Mon Jun 15 21:53:29 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <200906051728.n55HSFf0076644@apollo.backplane.com> References: <4A254B45.8050800@mavhome.dp.ua> <4A294DC3.5010008@mavhome.dp.ua> <200906051728.n55HSFf0076644@apollo.backplane.com> Message-ID: <200906152352.48231.Daan@vehosting.nl> Hi, On Friday 05 June 2009 19:28:15 Matthew Dillon wrote: > :Latest AHCI specifications define feature named FIS Based Switching. It > :allows controller independently track state of every device beyond port > :multiplier. 
It should be quite easy to use it, but actually none of my > :controllers have that capability. > > Damn. The FBSS capability bit is not set on my (AMD) MCP77 based AHCI > SATA controller. That sucks. > > ahci0: ... > ahci0: AHCI 1.2 capabilities > 0xe3229f05, 6 port > > Do you know of any host controllers which support FBS ? Any of the > Intel parts or machines per-chance? ... According to the following link : http://www.siliconimage.com/products/product.aspx?pid=32 the SiI3132 supports FIS based switching. We use them in a storage server prototype. Regards, -- Daan Vreeken VEHosting http://VEHosting.nl tel: +31-(0)40-7113050 / +31-(0)6-46210825 KvK nr: 17174380 From dillon at apollo.backplane.com Mon Jun 15 22:09:53 2009 From: dillon at apollo.backplane.com (Matthew Dillon) Date: Mon Jun 15 22:10:05 2009 Subject: WIP: ATA to CAM integration References: <4A254B45.8050800@mavhome.dp.ua> <4A294DC3.5010008@mavhome.dp.ua> <200906051728.n55HSFf0076644@apollo.backplane.com> <200906152352.48231.Daan@vehosting.nl> Message-ID: <200906152209.n5FM9psY007070@apollo.backplane.com> :Hi, : :According to the following link : : http://www.siliconimage.com/products/product.aspx?pid=32 : :the SiI3132 supports FIS based switching. We use them in a storage server :prototype. : :Regards, :-- :Daan Vreeken :VEHosting Yah, I have a bunch of those. They aren't AHCI parts though (as far as I know) so it doesn't help with the AHCI driver. They have their own custom driver. But thanks for mentioning it :-) (Someone tell me if I'm wrong there, I'm pretty sure all the Sili stuff uses a Sili-specific device driver). 
-Matt From oz at nixil.net Mon Jun 15 22:57:49 2009 From: oz at nixil.net (Phil Oleson) Date: Mon Jun 15 22:57:56 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <200906152209.n5FM9psY007070@apollo.backplane.com> References: <4A254B45.8050800@mavhome.dp.ua> <4A294DC3.5010008@mavhome.dp.ua> <200906051728.n55HSFf0076644@apollo.backplane.com> <200906152352.48231.Daan@vehosting.nl> <200906152209.n5FM9psY007070@apollo.backplane.com> Message-ID: <4A36CEE9.9040101@nixil.net> Matthew Dillon wrote: > :Hi, > : > :According to the following link : > : http://www.siliconimage.com/products/product.aspx?pid=32 > : > :the SiI3132 supports FIS based switching. We use them in a storage server > :prototype. > : > :Regards, > :-- > :Daan Vreeken > :VEHosting > > Yah, I have a bunch of those. They aren't AHCI parts though (as far > as I know) so it doesn't help with the AHCI driver. They have their > own custom driver. But thanks for mentioning it :-) > > (Someone tell me if I'm wrong there, I'm pretty sure all the Sili stuff > uses a Sili-specific device driver). meh.. found this via google: http://www.tomshardware.com/reviews/storage-accessories,1787-2.html The article claims it's AHCI compliant.. though the addonics web page doesn't specifically says so from a cursory glance here: http://www.addonics.com/products/host_controller/extpm.asp and the other form factors. http://www.addonics.com/products/pm/ -Phil. From imp at bsdimp.com Mon Jun 15 23:10:40 2009 From: imp at bsdimp.com (M. 
Warner Losh) Date: Mon Jun 15 23:10:46 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <4A2AF876.1030103@FreeBSD.org> References: <6657.1244328220@critter.freebsd.dk> <4A2AF876.1030103@FreeBSD.org> Message-ID: <20090615.170754.1399854812.imp@bsdimp.com> In message: <4A2AF876.1030103@FreeBSD.org> Alexander Motin writes: : Poul-Henning Kamp wrote: : > In message <4A294DC3.5010008@mavhome.dp.ua>, Alexander Motin writes: : >> I think ATAPI disk device is theoretically possible, but I believe it : >> does not exist in practice, as industry do not need it. : > : > Maxtor ZIP ? : : May be, never had an ATA version. But it is more FDD, then HDD. Also it : existed in Parallel Port and SCSI versions, so it could be done in ATAPI : way just for unification. There was a ata/atapi version too. It attaches to afd. Warner From dillon at apollo.backplane.com Mon Jun 15 23:37:27 2009 From: dillon at apollo.backplane.com (Matthew Dillon) Date: Mon Jun 15 23:37:39 2009 Subject: WIP: ATA to CAM integration References: <4A254B45.8050800@mavhome.dp.ua> <4A294DC3.5010008@mavhome.dp.ua> <200906051728.n55HSFf0076644@apollo.backplane.com> <200906152352.48231.Daan@vehosting.nl> <200906152209.n5FM9psY007070@apollo.backplane.com> <4A36CEE9.9040101@nixil.net> Message-ID: <200906152337.n5FNbQrI008014@apollo.backplane.com> :meh.. found this via google: : :http://www.tomshardware.com/reviews/storage-accessories,1787-2.html : :The article claims it's AHCI compliant.. though the addonics web page :doesn't specifically says so from a cursory glance here: : :http://www.addonics.com/products/host_controller/extpm.asp : :and the other form factors. :http://www.addonics.com/products/pm/ : : -Phil. I think they mis-spoke. They are SATA-compliant and Port Multiplier compliant, and they use FIS-based packets, so they pretty much do away with all the ATA baggage, but they don't use the AHCI device interface so they won't probe as an AHCI driver. I can see why they do it that way, though. 
It looks like they hide most of the complexity behind the chipset, which is nice. AHCI exposes a lot of that complexity. It looks like a reasonable chipset. -Matt Matthew Dillon From james-freebsd-current at jrv.org Tue Jun 16 00:12:47 2009 From: james-freebsd-current at jrv.org (James R. Van Artsdalen) Date: Tue Jun 16 00:12:54 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <200906152209.n5FM9psY007070@apollo.backplane.com> References: <4A254B45.8050800@mavhome.dp.ua> <4A294DC3.5010008@mavhome.dp.ua> <200906051728.n55HSFf0076644@apollo.backplane.com> <200906152352.48231.Daan@vehosting.nl> <200906152209.n5FM9psY007070@apollo.backplane.com> Message-ID: <4A36D8D9.7080104@jrv.org> Matthew Dillon wrote: > (Someone tell me if I'm wrong there, I'm pretty sure all the Sili stuff > uses a Sili-specific device driver). > Silicon Image publishes the 3132 datasheet. http://www.siimage.com/docs/SiI-DS-0138-D.pdf This chip is probably the one most commonly used in add-on cards due to low cost. From mav at FreeBSD.org Tue Jun 16 05:52:49 2009 From: mav at FreeBSD.org (Alexander Motin) Date: Tue Jun 16 05:53:09 2009 Subject: WIP: ATA to CAM integration In-Reply-To: <200906152337.n5FNbQrI008014@apollo.backplane.com> References: <4A254B45.8050800@mavhome.dp.ua> <4A294DC3.5010008@mavhome.dp.ua> <200906051728.n55HSFf0076644@apollo.backplane.com> <200906152352.48231.Daan@vehosting.nl> <200906152209.n5FM9psY007070@apollo.backplane.com> <4A36CEE9.9040101@nixil.net> <200906152337.n5FNbQrI008014@apollo.backplane.com> Message-ID: <4A373318.9000603@FreeBSD.org> Matthew Dillon wrote: > I think they mis-spoke. They are SATA-compliant and Port Multiplier > compliant, and they use FIS-based packets, so they pretty much do away > with all the ATA baggage, but they don't use the AHCI device interface > so they won't probe as an AHCI driver. > > I can see why they do it that way, though. It looks like they hide > most of the complexity behind the chipset, which is nice. 
AHCI > exposes a lot of that complexity. > > It looks like a reasonable chipset. Agree. It's functionally comparable to the latest AHCI specs, but looks more user-friendly. -- Alexander Motin From jroberson at jroberson.net Wed Jun 17 22:55:59 2009 From: jroberson at jroberson.net (Jeff Roberson) Date: Wed Jun 17 22:56:06 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: <4A2F1148.9090706@freebsd.org> References: <20090609201127.GA50903@alchemy.franken.de> <4A2F1148.9090706@freebsd.org> Message-ID: On Tue, 9 Jun 2009, Peter Grehan wrote: >> As for sparc64 allocating the storage for the dynamic area >> from end probably isn't a good idea as the pmap code assumes >> that the range from KERNBASE to end is covered by the pages >> allocated by and locked into the TLB for the kernel by the >> loader > > Ditto for ppc. It's possible to get the additional space from within or > after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf). http://people.freebsd.org/~jeff/dpcpu.diff I have updated this patch based on feedback relating to various architectures md code. I tried to model most architectures after the way msgbuf memory was taken. I have no capacity to test anything other than i386 and amd64. ARM is reported to work with one minor diff. Apparently sparc64 worked with the earlier diff but this should be cleaner. If anyone can report back on sparc64, mips, or powerpc, I'd appreciate it. Thanks, Jeff > > later, > > Peter. > From xcllnt at mac.com Thu Jun 18 02:30:27 2009 From: xcllnt at mac.com (Marcel Moolenaar) Date: Thu Jun 18 02:30:33 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. 
help needed In-Reply-To: References: <20090609201127.GA50903@alchemy.franken.de> <4A2F1148.9090706@freebsd.org> Message-ID: <94B46331-19AB-4174-BEDA-8B4B0A525B45@mac.com> On Jun 17, 2009, at 3:55 PM, Jeff Roberson wrote: > > On Tue, 9 Jun 2009, Peter Grehan wrote: > >>> As for sparc64 allocating the storage for the dynamic area >>> from end probably isn't a good idea as the pmap code assumes >>> that the range from KERNBASE to end is covered by the pages >>> allocated by and locked into the TLB for the kernel by the >>> loader >> >> Ditto for ppc. It's possible to get the additional space from >> within or after return from pmap_bootstrap() (like thread0's >> kstack, or the msgbuf). > > http://people.freebsd.org/~jeff/dpcpu.diff > > I have updated this patch based on feedback relating to various > architectures md code. I tried to model most architectures after > the way msgbuf memory was taken. I have no capacity to test > anything other than i386 and amd64. ARM is reported to work with > one minor diff. Apparently sparc64 worked with the earlier diff but > this should be cleaner. If anyone can report back on sparc64, mips, > or powerpc, I'd appreciate it. Can you fix the ia64 diff by moving the following lines up as well: /* But if the bootstrap tells us otherwise, believe it! */ if (bootinfo.bi_kernend) kernend = round_page(bootinfo.bi_kernend); Otherwise we're using the wrong kernend value for dpcpu_init() and also override what dpcpu_init() did to kernend. Thanks, -- Marcel Moolenaar xcllnt@mac.com From imp at bsdimp.com Thu Jun 18 03:04:05 2009 From: imp at bsdimp.com (M. Warner Losh) Date: Thu Jun 18 03:04:12 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. 
help needed
In-Reply-To:
References: <20090609201127.GA50903@alchemy.franken.de>
    <4A2F1148.9090706@freebsd.org>
Message-ID: <20090617.210318.1878034641.imp@bsdimp.com>

In message:
            Jeff Roberson writes:
:
: On Tue, 9 Jun 2009, Peter Grehan wrote:
:
: >> As for sparc64 allocating the storage for the dynamic area
: >> from end probably isn't a good idea as the pmap code assumes
: >> that the range from KERNBASE to end is covered by the pages
: >> allocated by and locked into the TLB for the kernel by the
: >> loader
: >
: > Ditto for ppc. It's possible to get the additional space from within or
: > after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf).
:
: http://people.freebsd.org/~jeff/dpcpu.diff
:
: I have updated this patch based on feedback relating to various
: architectures md code.  I tried to model most architectures after the way
: msgbuf memory was taken.  I have no capacity to test anything other than
: i386 and amd64.  ARM is reported to work with one minor diff.  Apparently
: sparc64 worked with the earlier diff but this should be cleaner.  If
: anyone can report back on sparc64, mips, or powerpc, I'd appreciate it.

I don't understand this part of the patch:

Index: mips/mips/mp_machdep.c
===================================================================
--- mips/mips/mp_machdep.c	(revision 194275)
+++ mips/mips/mp_machdep.c	(working copy)
@@ -224,12 +224,15 @@ static int
 smp_start_secondary(int cpuid)
 {
 	struct pcpu *pcpu;
+	void *dpcpu;
 	int i;
 
 	if (bootverbose)
 		printf("smp_start_secondary: starting cpu %d\n", cpuid);
 
+	dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE);
 	pcpu_init(&__pcpu[cpuid], cpuid, sizeof(struct pcpu));
+	dpcpu_init(dpcpu, cpuid);
 
 	if (bootverbose)
 		printf("smp_start_secondary: cpu %d started\n", cpuid);

So we're adding a dynamic per-cpu area, in addition to the fixed part?
Warner From jroberson at jroberson.net Thu Jun 18 04:16:10 2009 From: jroberson at jroberson.net (Jeff Roberson) Date: Thu Jun 18 04:16:17 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: <20090617.210318.1878034641.imp@bsdimp.com> References: <20090609201127.GA50903@alchemy.franken.de> <4A2F1148.9090706@freebsd.org> <20090617.210318.1878034641.imp@bsdimp.com> Message-ID: On Wed, 17 Jun 2009, M. Warner Losh wrote: > In message: > Jeff Roberson writes: > : > : On Tue, 9 Jun 2009, Peter Grehan wrote: > : > : >> As for sparc64 allocating the storage for the dynamic area > : >> from end probably isn't a good idea as the pmap code assumes > : >> that the range from KERNBASE to end is covered by the pages > : >> allocated by and locked into the TLB for the kernel by the > : >> loader > : > > : > Ditto for ppc. It's possible to get the additional space from within or > : > after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf). > : > : http://people.freebsd.org/~jeff/dpcpu.diff > : > : I have updated this patch based on feedback relating to various > : architectures md code. I tried to model most architectures after the way > : msgbuf memory was taken. I have no capacity to test anything other than > : i386 and amd64. ARM is reported to work with one minor diff. Apparently > : sparc64 worked with the earlier diff but this should be cleaner. If > : anyone can report back on sparc64, mips, or powerpc, I'd appreciate it. 
> > > I don't understand this part of the patch: > > Index: mips/mips/mp_machdep.c > =================================================================== > --- mips/mips/mp_machdep.c (revision 194275) > +++ mips/mips/mp_machdep.c (working copy) > @@ -224,12 +224,15 @@ static int > smp_start_secondary(int cpuid) > { > struct pcpu *pcpu; > + void *dpcpu; > int i; > > if (bootverbose) > printf("smp_start_secondary: starting cpu %d\n", cpuid); > > + dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE); > pcpu_init(&__pcpu[cpuid], cpuid, sizeof(struct pcpu)); > + dpcpu_init(dpcpu, cpuid); > > if (bootverbose) > printf("smp_start_secondary: cpu %d started\n", cpuid); > > So were adding a dynamic per-cpu area, in addition to the fixed part? Yes, the fixed part is for legacy and very frequently accessed items that need fixed addresses. The dynamic area is for convenience and is slightly more expensive to access. It also has addresses that are not resolved until link time. The fixed area uses a static structure with a size that is known at compile time. The dynamic part is only known at link time and so must be allocated seperately. Jeff > > Warner > From julian at elischer.org Thu Jun 18 04:34:22 2009 From: julian at elischer.org (Julian Elischer) Date: Thu Jun 18 04:34:28 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: References: <20090609201127.GA50903@alchemy.franken.de> <4A2F1148.9090706@freebsd.org> <20090617.210318.1878034641.imp@bsdimp.com> Message-ID: <4A39C3CD.8020909@elischer.org> Jeff Roberson wrote: > On Wed, 17 Jun 2009, M. 
Warner Losh wrote: > >> In message: >> Jeff Roberson writes: >> : >> : On Tue, 9 Jun 2009, Peter Grehan wrote: >> : >> : >> As for sparc64 allocating the storage for the dynamic area >> : >> from end probably isn't a good idea as the pmap code assumes >> : >> that the range from KERNBASE to end is covered by the pages >> : >> allocated by and locked into the TLB for the kernel by the >> : >> loader >> : > >> : > Ditto for ppc. It's possible to get the additional space from >> within or >> : > after return from pmap_bootstrap() (like thread0's kstack, or the >> msgbuf). >> : >> : http://people.freebsd.org/~jeff/dpcpu.diff >> : >> : I have updated this patch based on feedback relating to various >> : architectures md code. I tried to model most architectures after >> the way >> : msgbuf memory was taken. I have no capacity to test anything other >> than >> : i386 and amd64. ARM is reported to work with one minor diff. >> Apparently >> : sparc64 worked with the earlier diff but this should be cleaner. If >> : anyone can report back on sparc64, mips, or powerpc, I'd appreciate it. >> >> >> I don't understand this part of the patch: >> >> Index: mips/mips/mp_machdep.c >> =================================================================== >> --- mips/mips/mp_machdep.c (revision 194275) >> +++ mips/mips/mp_machdep.c (working copy) >> @@ -224,12 +224,15 @@ static int >> smp_start_secondary(int cpuid) >> { >> struct pcpu *pcpu; >> + void *dpcpu; >> int i; >> >> if (bootverbose) >> printf("smp_start_secondary: starting cpu %d\n", cpuid); >> >> + dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE); >> pcpu_init(&__pcpu[cpuid], cpuid, sizeof(struct pcpu)); >> + dpcpu_init(dpcpu, cpuid); >> >> if (bootverbose) >> printf("smp_start_secondary: cpu %d started\n", cpuid); >> >> So were adding a dynamic per-cpu area, in addition to the fixed part? > > Yes, the fixed part is for legacy and very frequently accessed items > that need fixed addresses. 
The dynamic area is for convenience and is > slightly more expensive to access. It also has addresses that are not > resolved until link time. > > The fixed area uses a static structure with a size that is known at > compile time. The dynamic part is only known at link time and so must > be allocated seperately. the compilers know of TLS and it wouldn't take much in the backend code to make the 'thread' keyworkd for TLS generate per-cpu data instead of per-thread data.. basically the register settings for TLS would have to be replaced by per cpu registers but .. wait we do that.. since the per-thread registers in the kernel point to per-cpu data and are kept correct by the scheduler, shouldn't the TLS code "just work" if we put the correct data structures in the right places? > > Jeff > >> >> Warner >> > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From jroberson at jroberson.net Thu Jun 18 05:32:03 2009 From: jroberson at jroberson.net (Jeff Roberson) Date: Thu Jun 18 05:32:09 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: <4A39C3CD.8020909@elischer.org> References: <20090609201127.GA50903@alchemy.franken.de> <4A2F1148.9090706@freebsd.org> <20090617.210318.1878034641.imp@bsdimp.com> <4A39C3CD.8020909@elischer.org> Message-ID: On Wed, 17 Jun 2009, Julian Elischer wrote: > Jeff Roberson wrote: >> On Wed, 17 Jun 2009, M. Warner Losh wrote: >> >>> In message: >>> Jeff Roberson writes: >>> : >>> : On Tue, 9 Jun 2009, Peter Grehan wrote: >>> : >>> : >> As for sparc64 allocating the storage for the dynamic area >>> : >> from end probably isn't a good idea as the pmap code assumes >>> : >> that the range from KERNBASE to end is covered by the pages >>> : >> allocated by and locked into the TLB for the kernel by the >>> : >> loader >>> : > >>> : > Ditto for ppc. 
It's possible to get the additional space from within >>> or >>> : > after return from pmap_bootstrap() (like thread0's kstack, or the >>> msgbuf). >>> : >>> : http://people.freebsd.org/~jeff/dpcpu.diff >>> : >>> : I have updated this patch based on feedback relating to various >>> : architectures md code. I tried to model most architectures after the >>> way >>> : msgbuf memory was taken. I have no capacity to test anything other than >>> : i386 and amd64. ARM is reported to work with one minor diff. >>> Apparently >>> : sparc64 worked with the earlier diff but this should be cleaner. If >>> : anyone can report back on sparc64, mips, or powerpc, I'd appreciate it. >>> >>> >>> I don't understand this part of the patch: >>> >>> Index: mips/mips/mp_machdep.c >>> =================================================================== >>> --- mips/mips/mp_machdep.c (revision 194275) >>> +++ mips/mips/mp_machdep.c (working copy) >>> @@ -224,12 +224,15 @@ static int >>> smp_start_secondary(int cpuid) >>> { >>> struct pcpu *pcpu; >>> + void *dpcpu; >>> int i; >>> >>> if (bootverbose) >>> printf("smp_start_secondary: starting cpu %d\n", cpuid); >>> >>> + dpcpu = (void *)kmem_alloc(kernel_map, DPCPU_SIZE); >>> pcpu_init(&__pcpu[cpuid], cpuid, sizeof(struct pcpu)); >>> + dpcpu_init(dpcpu, cpuid); >>> >>> if (bootverbose) >>> printf("smp_start_secondary: cpu %d started\n", cpuid); >>> >>> So were adding a dynamic per-cpu area, in addition to the fixed part? >> >> Yes, the fixed part is for legacy and very frequently accessed items that >> need fixed addresses. The dynamic area is for convenience and is slightly >> more expensive to access. It also has addresses that are not resolved >> until link time. >> >> The fixed area uses a static structure with a size that is known at compile >> time. The dynamic part is only known at link time and so must be allocated >> seperately. 
> > > the compilers know of TLS and it wouldn't take much in the backend > code to make the 'thread' keyworkd for TLS generate per-cpu data > instead of per-thread data.. basically the register settings for TLS > would have to be replaced by per cpu registers but .. wait we do > that.. > since the per-thread registers in the kernel point to per-cpu data > and are kept correct by the scheduler, shouldn't the TLS code "just > work" if we put the correct data structures in the right places? We discussed that at bsdcan and apparently it's not that simple. dfr seemed to think it would take quite some time to do the kernel linker support. There also may be issues because the compiler is free to cache thread local data but not per-cpu data so there may be a mismatch there. It would be nice ultimately to make this work but at that time DPCPU_ could just become a wrapper around __thread. Thanks, Jeff > >> >> Jeff >> >>> >>> Warner >>> >> _______________________________________________ >> freebsd-arch@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-arch >> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From jilles at stack.nl Fri Jun 19 16:23:56 2009 From: jilles at stack.nl (Jilles Tjoelker) Date: Fri Jun 19 16:24:03 2009 Subject: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks) Message-ID: <20090619162328.GA79975@stack.nl> I have been having trouble with deadlocks with NFS mounts for a while, and I have found at least one way it can deadlock. It seems an issue with the sleep/lock system. NFS sleeps while holding a lockmgr lock, waiting for a reply from the server. When the mount is set intr, this is an interruptible sleep, so that interrupting signals can abort the sleep. However, this also means that SIGSTOP etc will suspend the thread without waking it up first, so it will be suspended with a lock held. 
If it holds the wrong locks, it is possible that the shell will not be able to run, and the process cannot be continued in the normal manner. Due to some other things I do not understand, it is then possible that the process cannot be continued at all (SIGCONT seems ignored), but in simple cases SIGCONT works, and things go back to normal. In any case, this situation is undesirable, as even 'umount -f' doesn't work while the thread is suspended. Of course, this reasoning applies to any code that goes to sleep interruptibly while holding a lock (sx or lockmgr). Is this supposed to be possible (likely useful)? If so, a third type of sleep would be needed, one that is interrupted by signals but not suspended. If not, something should check that it doesn't happen, and NFS intr mounts may need to check for signals periodically or find a way to avoid sleeping with locks held. The td_locks field is only accessible for the current thread, so it cannot be used to check if suspending is safe. Also, making SIGSTOP and the like interrupt/restart syscalls is not acceptable unless you find some way to do it such that userland won't notice. For example, a read of 10 megabytes from a regular file with that much available must not return less than 10 megabytes. -- Jilles Tjoelker From kostikbel at gmail.com Fri Jun 19 20:26:57 2009 From: kostikbel at gmail.com (Kostik Belousov) Date: Fri Jun 19 20:27:04 2009 Subject: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks) In-Reply-To: <20090619162328.GA79975@stack.nl> References: <20090619162328.GA79975@stack.nl> Message-ID: <20090619194654.GC2884@deviant.kiev.zoral.com.ua> On Fri, Jun 19, 2009 at 06:23:28PM +0200, Jilles Tjoelker wrote: > I have been having trouble with deadlocks with NFS mounts for a while, > and I have found at least one way it can deadlock. It seems an issue > with the sleep/lock system. > > NFS sleeps while holding a lockmgr lock, waiting for a reply from the > server.
When the mount is set intr, this is an interruptible sleep, so > that interrupting signals can abort the sleep. However, this also means > that SIGSTOP etc will suspend the thread without waking it up first, so > it will be suspended with a lock held. > > If it holds the wrong locks, it is possible that the shell will not be > able to run, and the process cannot be continued in the normal manner. > > Due to some other things I do not understand, it is then possible that > the process cannot be continued at all (SIGCONT seems ignored), but in > simple cases SIGCONT works, and things go back to normal. > > In any case, this situation is undesirable, as even 'umount -f' doesn't > work while the thread is suspended. > > Of course, this reasoning applies to any code that goes to sleep > interruptibly while holding a lock (sx or lockmgr). Is this supposed to > be possible (likely useful)? If so, a third type of sleep would be > needed that is interrupted by signals but not suspended? If not, > something should check that it doesn't happen and NFS intr mounts may > need to check for signals periodically or find a way to avoid sleeping > with locks held. > > The td_locks field is only accessible for the current thread, so it > cannot be used to check if suspending is safe. > > Also, making SIGSTOP and the like interrupt/restart syscalls is not > acceptable unless you find some way to do it such that userland won't > notice. For example, a read of 10 megabytes from a regular file with > that much available must not return less then 10 megabytes. See http://lists.freebsd.org/pipermail/freebsd-smp/2009-January/001611.html
From brde at optusnet.com.au Sat Jun 20 04:45:23 2009 From: brde at optusnet.com.au (Bruce Evans) Date: Sat Jun 20 04:45:30 2009 Subject: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks) In-Reply-To: <20090619194654.GC2884@deviant.kiev.zoral.com.ua> References: <20090619162328.GA79975@stack.nl> <20090619194654.GC2884@deviant.kiev.zoral.com.ua> Message-ID: <20090620121543.F29239@delplex.bde.org> On Fri, 19 Jun 2009, Kostik Belousov wrote: > On Fri, Jun 19, 2009 at 06:23:28PM +0200, Jilles Tjoelker wrote: >> I have been having trouble with deadlocks with NFS mounts for a while, >> and I have found at least one way it can deadlock. It seems an issue >> with the sleep/lock system. >> >> NFS sleeps while holding a lockmgr lock, waiting for a reply from the >> server. When the mount is set intr, this is an interruptible sleep, so >> that interrupting signals can abort the sleep. However, this also means >> that SIGSTOP etc will suspend the thread without waking it up first, so >> it will be suspended with a lock held. >> >> If it holds the wrong locks, it is possible that the shell will not be >> able to run, and the process cannot be continued in the normal manner. >> >> Due to some other things I do not understand, it is then possible that >> the process cannot be continued at all (SIGCONT seems ignored), but in >> simple cases SIGCONT works, and things go back to normal. >> ... >> Also, making SIGSTOP and the like interrupt/restart syscalls is not >> acceptable unless you find some way to do it such that userland won't >> notice. For example, a read of 10 megabytes from a regular file with >> that much available must not return less then 10 megabytes. > > See > http://lists.freebsd.org/pipermail/freebsd-smp/2009-January/001611.html Have any fixes been applied?
I now remember seeing problems like the first set above on FreeBSD cluster machines (I don't encounter "intr" nfs mounts anywhere else; mount(8) still doesn't show the "intr" option, so I assume that the "intr" specified in fstab is in use on the FreeBSD machines): normal resume after ^Z on a parallel build not working, sometimes hanging the whole file system but other times recoverable after re-logging in and sending suitable SIGCONTs manually.

These problems seemed to go away, but right now the following problem like the second set above occurs consistently (I first noticed this last week):

Script started on Sat Jun 20 02:32:51 2009
pts/0:bde@ref8-i386:~/sys7/i386/compile> sh zm
^Z
[1]+ Stopped sh zm
pts/0:bde@ref8-i386:~/sys7/i386/compile> %
sh zm
*** Stopped -- signal 18
*** Stopped -- signal 18
*** Stopped -- signal 18
*** Signal 1
*** Signal 1
*** Signal 1
`all' not remade because of errors.
linking kernel
^C
pts/0:bde@ref8-i386:~/sys7/i386/compile> exit

Script done on Sat Jun 20 02:34:41 2009

The shell script zm builds 6 kernels in parallel using make -k -j8 for each. Signal 18 is SIGTSTP. Receiving this is normal, but the shell shouldn't print any messages about it. Signal 1 is SIGHUP. This shouldn't occur. On another run, ISTR getting messages about i/o errors or unrestartable processes. Anyway, the messages about signals are associated with failing jobs in the build.

ref7-i386 now behaves normally -- ^Z and resume just work; no messages are printed and the build completes successfully after resuming.
Bruce From kostikbel at gmail.com Sat Jun 20 16:15:50 2009 From: kostikbel at gmail.com (Kostik Belousov) Date: Sat Jun 20 16:15:57 2009 Subject: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks) In-Reply-To: <20090619162328.GA79975@stack.nl> References: <20090619162328.GA79975@stack.nl> Message-ID: <20090620161540.GF2884@deviant.kiev.zoral.com.ua> On Fri, Jun 19, 2009 at 06:23:28PM +0200, Jilles Tjoelker wrote: > I have been having trouble with deadlocks with NFS mounts for a while, > and I have found at least one way it can deadlock. It seems an issue > with the sleep/lock system. > > NFS sleeps while holding a lockmgr lock, waiting for a reply from the > server. When the mount is set intr, this is an interruptible sleep, so > that interrupting signals can abort the sleep. However, this also means > that SIGSTOP etc will suspend the thread without waking it up first, so > it will be suspended with a lock held. > > If it holds the wrong locks, it is possible that the shell will not be > able to run, and the process cannot be continued in the normal manner. > > Due to some other things I do not understand, it is then possible that > the process cannot be continued at all (SIGCONT seems ignored), but in > simple cases SIGCONT works, and things go back to normal. > > In any case, this situation is undesirable, as even 'umount -f' doesn't > work while the thread is suspended. > > Of course, this reasoning applies to any code that goes to sleep > interruptibly while holding a lock (sx or lockmgr). Is this supposed to > be possible (likely useful)? If so, a third type of sleep would be > needed that is interrupted by signals but not suspended? If not, > something should check that it doesn't happen and NFS intr mounts may > need to check for signals periodically or find a way to avoid sleeping > with locks held. > > The td_locks field is only accessible for the current thread, so it > cannot be used to check if suspending is safe. 
> > Also, making SIGSTOP and the like interrupt/restart syscalls is not > acceptable unless you find some way to do it such that userland won't > notice. For example, a read of 10 megabytes from a regular file with > that much available must not return less then 10 megabytes. Note that NFS does check for the signals during i/o, so you may get short reads anyway. I do think that the right solution both there and with SINGLE_NO_EXIT case for thread_single is to stop at the usermode boundary instead of suspending a thread in the interruptible sleep state. I set error code returned from interrupted msleep() to ERESTART, that seems to be the right thing, at least to restart the i/o that transferred no data upon receiving SIGSTOP. My current patch is below. It contains some not strictly related changes, e.g. for wakeup(). diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c index 5c1d553..28f4f4f 100644 --- a/sys/kern/kern_sig.c +++ b/sys/kern/kern_sig.c @@ -2310,18 +2310,22 @@ static void sig_suspend_threads(struct thread *td, struct proc *p, int sending) { struct thread *td2; + int wakeup_swapper; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); + wakeup_swapper = 0; FOREACH_THREAD_IN_PROC(p, td2) { thread_lock(td2); td2->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK; if ((TD_IS_SLEEPING(td2) || TD_IS_SWAPPED(td2)) && - (td2->td_flags & TDF_SINTR) && - !TD_IS_SUSPENDED(td2)) { - thread_suspend_one(td2); - } else { + (td2->td_flags & TDF_SINTR)) { + if (TD_IS_SUSPENDED(td2)) + wakeup_swapper |= thread_unsuspend_one(td2); + if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR)) + wakeup_swapper |= sleepq_abort(td2, ERESTART); + } else if (!TD_IS_SUSPENDED(td2)) { if (sending || td != td2) td2->td_flags |= TDF_ASTPENDING; #ifdef SMP @@ -2331,6 +2335,8 @@ sig_suspend_threads(struct thread *td, struct proc *p, int sending) } thread_unlock(td2); } + if (wakeup_swapper) + kick_proc0(); } int diff --git a/sys/kern/kern_synch.c b/sys/kern/kern_synch.c index 
b91c1a5..d27d027 100644 --- a/sys/kern/kern_synch.c +++ b/sys/kern/kern_synch.c @@ -344,11 +344,16 @@ wakeup(void *ident) { int wakeup_swapper; + repeat: sleepq_lock(ident); wakeup_swapper = sleepq_broadcast(ident, SLEEPQ_SLEEP, 0, 0); sleepq_release(ident); - if (wakeup_swapper) - kick_proc0(); + if (wakeup_swapper) { + if (ident == &proc0) + goto repeat; + else + kick_proc0(); + } } /* @@ -361,11 +366,16 @@ wakeup_one(void *ident) { int wakeup_swapper; + repeat: sleepq_lock(ident); wakeup_swapper = sleepq_signal(ident, SLEEPQ_SLEEP, 0, 0); sleepq_release(ident); - if (wakeup_swapper) - kick_proc0(); + if (wakeup_swapper) { + if (ident == &proc0) + goto repeat; + else + kick_proc0(); + } } static void diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c index bb8779b..800a1d1 100644 --- a/sys/kern/kern_thread.c +++ b/sys/kern/kern_thread.c @@ -504,6 +504,22 @@ thread_unlink(struct thread *td) /* Must NOT clear links to proc! */ } +static int +recalc_remaining(struct proc *p, int mode) +{ + int remaining; + + if (mode == SINGLE_EXIT) + remaining = p->p_numthreads; + else if (mode == SINGLE_BOUNDARY) + remaining = p->p_numthreads - p->p_boundary_count; + else if (mode == SINGLE_NO_EXIT) + remaining = p->p_numthreads - p->p_suspcount; + else + panic("recalc_remaining: wrong mode %d", mode); + return (remaining); +} + /* * Enforce single-threading. 
* @@ -551,12 +567,7 @@ thread_single(int mode) p->p_flag |= P_STOPPED_SINGLE; PROC_SLOCK(p); p->p_singlethread = td; - if (mode == SINGLE_EXIT) - remaining = p->p_numthreads; - else if (mode == SINGLE_BOUNDARY) - remaining = p->p_numthreads - p->p_boundary_count; - else - remaining = p->p_numthreads - p->p_suspcount; + remaining = recalc_remaining(p, mode); while (remaining != 1) { if (P_SHOULDSTOP(p) != P_STOPPED_SINGLE) goto stopme; @@ -587,18 +598,17 @@ thread_single(int mode) wakeup_swapper |= sleepq_abort(td2, ERESTART); break; + case SINGLE_NO_EXIT: + if (TD_IS_SUSPENDED(td2) && + !(td2->td_flags & TDF_BOUNDARY)) + wakeup_swapper |= + thread_unsuspend_one(td2); + if (TD_ON_SLEEPQ(td2) && + (td2->td_flags & TDF_SINTR)) + wakeup_swapper |= + sleepq_abort(td2, ERESTART); + break; default: - if (TD_IS_SUSPENDED(td2)) { - thread_unlock(td2); - continue; - } - /* - * maybe other inhibited states too? - */ - if ((td2->td_flags & TDF_SINTR) && - (td2->td_inhibitors & - (TDI_SLEEPING | TDI_SWAPPED))) - thread_suspend_one(td2); break; } } @@ -611,12 +621,7 @@ thread_single(int mode) } if (wakeup_swapper) kick_proc0(); - if (mode == SINGLE_EXIT) - remaining = p->p_numthreads; - else if (mode == SINGLE_BOUNDARY) - remaining = p->p_numthreads - p->p_boundary_count; - else - remaining = p->p_numthreads - p->p_suspcount; + remaining = recalc_remaining(p, mode); /* * Maybe we suspended some threads.. was it enough? @@ -630,12 +635,7 @@ stopme: * In the mean time we suspend as well. */ thread_suspend_switch(td); - if (mode == SINGLE_EXIT) - remaining = p->p_numthreads; - else if (mode == SINGLE_BOUNDARY) - remaining = p->p_numthreads - p->p_boundary_count; - else - remaining = p->p_numthreads - p->p_suspcount; + remaining = recalc_remaining(p, mode); } if (mode == SINGLE_EXIT) { /*
From kris at FreeBSD.org Sat Jun 20 16:50:57 2009 From: kris at FreeBSD.org (Kris Kennaway) Date: Sat Jun 20 16:51:03 2009 Subject: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks) In-Reply-To: <20090620121543.F29239@delplex.bde.org> References: <20090619162328.GA79975@stack.nl> <20090619194654.GC2884@deviant.kiev.zoral.com.ua> <20090620121543.F29239@delplex.bde.org> Message-ID: <4A3D136F.1090106@FreeBSD.org> Bruce Evans wrote: > These problems seemed to go away, but right now the following problem > like the second set above occurs consistently (I first noticed this > last week): > > Script started on Sat Jun 20 02:32:51 2009 > pts/0:bde@ref8-i386:~/sys7/i386/compile> sh zm > ^Z > [1]+ Stopped sh zm > pts/0:bde@ref8-i386:~/sys7/i386/compile> % > sh zm > *** Stopped -- signal 18 > *** Stopped -- signal 18 > *** Stopped -- signal 18 > *** Signal 1 > *** Signal 1 > *** Signal 1 > `all' not remade because of errors. > linking kernel > ^C > pts/0:bde@ref8-i386:~/sys7/i386/compile> exit > > Script done on Sat Jun 20 02:34:41 2009 > > The shell script zm builds 6 kernels in parallel using make -k -j8 for > each. Signal 18 is SIGTSTP. Receiving this is normal, but the shell > shouldn't print any meesages about it. Signal 1 is SIGHUP. This > shouldn't occur. On another run, ISTR getting messages about i/o > errors or unrestartable processes. Anyway, the messages about signals > are associated with failing jobs in the build. That's a long standing bug that I don't think is limited to NFS. I first started seeing it several years ago after some changes to make(1), but I don't recall if phk disputed that they were to blame.
Kris From jilles at stack.nl Sat Jun 20 20:33:30 2009 From: jilles at stack.nl (Jilles Tjoelker) Date: Sat Jun 20 20:33:37 2009 Subject: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks) In-Reply-To: <20090620161540.GF2884@deviant.kiev.zoral.com.ua> References: <20090619162328.GA79975@stack.nl> <20090620161540.GF2884@deviant.kiev.zoral.com.ua> Message-ID: <20090620203300.GA21763@stack.nl> On Sat, Jun 20, 2009 at 07:15:40PM +0300, Kostik Belousov wrote: > On Fri, Jun 19, 2009 at 06:23:28PM +0200, Jilles Tjoelker wrote: > > I have been having trouble with deadlocks with NFS mounts for a while, > > and I have found at least one way it can deadlock. It seems an issue > > with the sleep/lock system. > > NFS sleeps while holding a lockmgr lock, waiting for a reply from the > > server. When the mount is set intr, this is an interruptible sleep, so > > that interrupting signals can abort the sleep. However, this also means > > that SIGSTOP etc will suspend the thread without waking it up first, so > > it will be suspended with a lock held. > > If it holds the wrong locks, it is possible that the shell will not be > > able to run, and the process cannot be continued in the normal manner. > > Due to some other things I do not understand, it is then possible that > > the process cannot be continued at all (SIGCONT seems ignored), but in > > simple cases SIGCONT works, and things go back to normal. > > In any case, this situation is undesirable, as even 'umount -f' doesn't > > work while the thread is suspended. > > Of course, this reasoning applies to any code that goes to sleep > > interruptibly while holding a lock (sx or lockmgr). Is this supposed to > > be possible (likely useful)? If so, a third type of sleep would be > > needed that is interrupted by signals but not suspended? 
If not, > > something should check that it doesn't happen and NFS intr mounts may > > need to check for signals periodically or find a way to avoid sleeping > > with locks held. > > The td_locks field is only accessible for the current thread, so it > > cannot be used to check if suspending is safe. > > Also, making SIGSTOP and the like interrupt/restart syscalls is not > > acceptable unless you find some way to do it such that userland won't > > notice. For example, a read of 10 megabytes from a regular file with > > that much available must not return less then 10 megabytes. > Note that NFS does check for the signals during i/o, so you may get > short reads anyway. > I do think that the right solution both there and with SINGLE_NO_EXIT > case for thread_single is to stop at the usermode boundary instead of > suspending a thread in the interruptible sleep state. > I set error code returned from interrupted msleep() to ERESTART, > that seems to be the right thing, at least to restart the i/o that > transferred no data upon receiving SIGSTOP. Any such short read on a regular file is wrong. That that badness already occurs in some cases is not an excuse to make it occur more often. Particularly because process suspension is expected not to affect the process and interrupting syscalls would change the behaviour of the debugged program significantly, while the current interruptions only occur with signals that likely terminate the process anyway (note that intr mounts only check for SIGINT, SIGTERM, SIGHUP, SIGKILL, SIGSTOP and SIGQUIT and appear to mask all others; I don't know why SIGTSTP gets through -- possibly a thread/process difference). No matter the SIGSTOP issue, a warning about the interruptions in the mount_nfs(8) man page may be in order; the current language makes the impression that intr is only a good thing, which is not the case. This applies to all NFS versions. 
A better way to deal with nonresponsive NFS servers that will not come back would be forced unmount (does it always work, apart from the case mentioned above? same for the experimental client?). SIGKILL (but not any other signal, not even SIGSTOP) could also be allowed on processes blocked on nointr mounts. Another point (mostly for socket operations and the like) is that the current causes of interrupted system calls are under control of the application: if you do not catch any signals, you will only get short read/writes for reasons related to the underlying object; hence, it is often not necessary to add (ugly) code to handle it: any unexpected short read or write is a problem with the underlying object. Another example which currently works and would be a shame to break: % /usr/bin/time sleep 10 ^Z zsh: suspended /usr/bin/time sleep 10 % fg [1] + continued /usr/bin/time sleep 10 10.00 real 0.00 user 0.00 sys % What's more, the fact that this works is thanks to the kernel. sleep(1) just calls nanosleep(2), and because it doesn't catch any signals, that suffices. I do notice this is already broken for debuggers. Attaching gdb or truss to a running sleep process immediately aborts the nanosleep with EINTR. 
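
For a program that cannot rely on the kernel restarting the sleep (for example, one that may have a debugger attached), the usual defensive pattern is to retry with the time that is left. What follows is only a minimal sketch and is not taken from sleep(1) or from anything posted in this thread: sleep_fully is a hypothetical helper name, and the only thing assumed is the POSIX nanosleep(2) contract that the unslept time is stored through the second argument when the call fails with EINTR.

```c
#include <errno.h>
#include <time.h>

/*
 * Hypothetical helper (not part of sleep(1)): sleep for the whole
 * interval even if nanosleep(2) is cut short with EINTR, e.g. by a
 * caught signal or by a debugger attaching to the process.
 * Returns 0 once the interval has elapsed, or -1 with errno set for
 * any failure other than EINTR.
 */
int
sleep_fully(time_t sec, long nsec)
{
	struct timespec req = { .tv_sec = sec, .tv_nsec = nsec };
	struct timespec rem;

	while (nanosleep(&req, &rem) == -1) {
		if (errno != EINTR)
			return (-1);
		/* Interrupted: go back to sleep for the time still left. */
		req = rem;
	}
	return (0);
}
```

Calling sleep_fully(10, 0) gives the same total sleep as the sleep 10 above, at the price of exactly the sort of extra code that applications currently do not need when they catch no signals.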
-- Jilles Tjoelker From kostikbel at gmail.com Sun Jun 21 11:52:06 2009 From: kostikbel at gmail.com (Kostik Belousov) Date: Sun Jun 21 11:52:13 2009 Subject: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks) In-Reply-To: <20090620203300.GA21763@stack.nl> References: <20090619162328.GA79975@stack.nl> <20090620161540.GF2884@deviant.kiev.zoral.com.ua> <20090620203300.GA21763@stack.nl> Message-ID: <20090621115157.GJ2884@deviant.kiev.zoral.com.ua> On Sat, Jun 20, 2009 at 10:33:00PM +0200, Jilles Tjoelker wrote: > On Sat, Jun 20, 2009 at 07:15:40PM +0300, Kostik Belousov wrote: > > On Fri, Jun 19, 2009 at 06:23:28PM +0200, Jilles Tjoelker wrote: > > > I have been having trouble with deadlocks with NFS mounts for a while, > > > and I have found at least one way it can deadlock. It seems an issue > > > with the sleep/lock system. > > > > NFS sleeps while holding a lockmgr lock, waiting for a reply from the > > > server. When the mount is set intr, this is an interruptible sleep, so > > > that interrupting signals can abort the sleep. However, this also means > > > that SIGSTOP etc will suspend the thread without waking it up first, so > > > it will be suspended with a lock held. > > > > If it holds the wrong locks, it is possible that the shell will not be > > > able to run, and the process cannot be continued in the normal manner. > > > > Due to some other things I do not understand, it is then possible that > > > the process cannot be continued at all (SIGCONT seems ignored), but in > > > simple cases SIGCONT works, and things go back to normal. > > > > In any case, this situation is undesirable, as even 'umount -f' doesn't > > > work while the thread is suspended. > > > > Of course, this reasoning applies to any code that goes to sleep > > > interruptibly while holding a lock (sx or lockmgr). Is this supposed to > > > be possible (likely useful)? If so, a third type of sleep would be > > > needed that is interrupted by signals but not suspended? 
If not, > > > something should check that it doesn't happen and NFS intr mounts may > > > need to check for signals periodically or find a way to avoid sleeping > > > with locks held. > > > > The td_locks field is only accessible for the current thread, so it > > > cannot be used to check if suspending is safe. > > > > Also, making SIGSTOP and the like interrupt/restart syscalls is not > > > acceptable unless you find some way to do it such that userland won't > > > notice. For example, a read of 10 megabytes from a regular file with > > > that much available must not return less then 10 megabytes. > > > Note that NFS does check for the signals during i/o, so you may get > > short reads anyway. > > > I do think that the right solution both there and with SINGLE_NO_EXIT > > case for thread_single is to stop at the usermode boundary instead of > > suspending a thread in the interruptible sleep state. > > > I set error code returned from interrupted msleep() to ERESTART, > > that seems to be the right thing, at least to restart the i/o that > > transferred no data upon receiving SIGSTOP. > > Any such short read on a regular file is wrong. That that badness > already occurs in some cases is not an excuse to make it occur more > often. Particularly because process suspension is expected not to affect > the process and interrupting syscalls would change the behaviour of the > debugged program significantly, while the current interruptions only > occur with signals that likely terminate the process anyway (note that > intr mounts only check for SIGINT, SIGTERM, SIGHUP, SIGKILL, SIGSTOP and > SIGQUIT and appear to mask all others; I don't know why SIGTSTP gets > through -- possibly a thread/process difference). > > No matter the SIGSTOP issue, a warning about the interruptions in the > mount_nfs(8) man page may be in order; the current language makes the > impression that intr is only a good thing, which is not the case. This > applies to all NFS versions. 
A better way to deal with nonresponsive NFS > servers that will not come back would be forced unmount (does it always > work, apart from the case mentioned above? same for the experimental > client?). SIGKILL (but not any other signal, not even SIGSTOP) could > also be allowed on processes blocked on nointr mounts. > > Another point (mostly for socket operations and the like) is that the > current causes of interrupted system calls are under control of the > application: if you do not catch any signals, you will only get short > read/writes for reasons related to the underlying object; hence, it is > often not necessary to add (ugly) code to handle it: any unexpected > short read or write is a problem with the underlying object. > > Another example which currently works and would be a shame to break: > > % /usr/bin/time sleep 10 > ^Z > zsh: suspended /usr/bin/time sleep 10 > % fg > [1] + continued /usr/bin/time sleep 10 > 10.00 real 0.00 user 0.00 sys > % > > What's more, the fact that this works is thanks to the kernel. sleep(1) > just calls nanosleep(2), and because it doesn't catch any signals, that > suffices. > > I do notice this is already broken for debuggers. Attaching gdb or truss > to a running sleep process immediately aborts the nanosleep with EINTR. The point is valid, I updated the patch by adding a special flag for the msleep that indicates that stop is allowed only on usermode boundary. Sleeps from the nfs client where resources are possibly locked are marked with the flag. 
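
The rule just described can be modelled outside the kernel. The following is only an illustrative user-space sketch of the per-thread decision that the updated sig_suspend_threads() hunk in the patch makes; it is not kernel code. The enum, the struct and the function name stop_decision are hypothetical, and each boolean merely stands in for the kernel test named in its comment. The model leaves out details such as the swapper wakeup and the SMP signal forwarding.

```c
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum stop_action {
	STOP_ABORT_SLEEP,	/* abort the sleep with ERESTART; stop at the user boundary */
	STOP_SUSPEND_NOW,	/* suspend the thread inside its interruptible sleep */
	STOP_AT_AST,		/* not in an interruptible sleep: mark it, stop at the next AST */
	STOP_NOTHING		/* thread is already suspended */
};

struct thread_model {
	bool sleeping;		/* stands in for TD_IS_SLEEPING() || TD_IS_SWAPPED() */
	bool interruptible;	/* stands in for TDF_SINTR */
	bool stop_on_bdry;	/* stands in for TDF_SBDRY, the new flag */
	bool suspended;		/* stands in for TD_IS_SUSPENDED() */
};

/* Decide how a stop signal is applied to one thread, per the patch. */
enum stop_action
stop_decision(const struct thread_model *t)
{
	if (t->sleeping && t->interruptible) {
		/* A flagged sleep may hold locks: never suspend in place. */
		if (t->stop_on_bdry)
			return (STOP_ABORT_SLEEP);
		return (t->suspended ? STOP_NOTHING : STOP_SUSPEND_NOW);
	}
	return (t->suspended ? STOP_NOTHING : STOP_AT_AST);
}
```

In this model an unflagged interruptible sleep, such as the nanosleep(2) example quoted above, is still suspended in place, so only the flagged NFS sleeps see ERESTART.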
diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c index 5c1d553..5312ffa 100644 --- a/sys/kern/kern_sig.c +++ b/sys/kern/kern_sig.c @@ -2310,18 +2310,28 @@ static void sig_suspend_threads(struct thread *td, struct proc *p, int sending) { struct thread *td2; + int wakeup_swapper; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); + wakeup_swapper = 0; FOREACH_THREAD_IN_PROC(p, td2) { thread_lock(td2); td2->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK; if ((TD_IS_SLEEPING(td2) || TD_IS_SWAPPED(td2)) && - (td2->td_flags & TDF_SINTR) && - !TD_IS_SUSPENDED(td2)) { - thread_suspend_one(td2); - } else { + (td2->td_flags & TDF_SINTR)) { + if (td2->td_flags & TDF_SBDRY) { + if (TD_IS_SUSPENDED(td2)) + wakeup_swapper |= + thread_unsuspend_one(td2); + if (TD_ON_SLEEPQ(td2)) + wakeup_swapper |= + sleepq_abort(td2, ERESTART); + } else if (!TD_IS_SUSPENDED(td2)) { + thread_suspend_one(td2); + } + } else if (!TD_IS_SUSPENDED(td2)) { if (sending || td != td2) td2->td_flags |= TDF_ASTPENDING; #ifdef SMP @@ -2331,6 +2341,8 @@ sig_suspend_threads(struct thread *td, struct proc *p, int sending) } thread_unlock(td2); } + if (wakeup_swapper) + kick_proc0(); } int diff --git a/sys/kern/kern_synch.c b/sys/kern/kern_synch.c index b91c1a5..58488ac 100644 --- a/sys/kern/kern_synch.c +++ b/sys/kern/kern_synch.c @@ -188,6 +188,8 @@ _sleep(void *ident, struct lock_object *lock, int priority, flags = SLEEPQ_SLEEP; if (catch) flags |= SLEEPQ_INTERRUPTIBLE; + if (priority & PBDRY) + flags |= SLEEPQ_STOP_ON_BDRY; sleepq_lock(ident); CTR5(KTR_PROC, "sleep: thread %ld (pid %ld, %s) on %s (%p)", @@ -344,11 +346,16 @@ wakeup(void *ident) { int wakeup_swapper; + repeat: sleepq_lock(ident); wakeup_swapper = sleepq_broadcast(ident, SLEEPQ_SLEEP, 0, 0); sleepq_release(ident); - if (wakeup_swapper) - kick_proc0(); + if (wakeup_swapper) { + if (ident == &proc0) + goto repeat; + else + kick_proc0(); + } } /* @@ -361,11 +368,16 @@ wakeup_one(void *ident) { int wakeup_swapper; + repeat: 
sleepq_lock(ident); wakeup_swapper = sleepq_signal(ident, SLEEPQ_SLEEP, 0, 0); sleepq_release(ident); - if (wakeup_swapper) - kick_proc0(); + if (wakeup_swapper) { + if (ident == &proc0) + goto repeat; + else + kick_proc0(); + } } static void diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c index bb8779b..800a1d1 100644 --- a/sys/kern/kern_thread.c +++ b/sys/kern/kern_thread.c @@ -504,6 +504,22 @@ thread_unlink(struct thread *td) /* Must NOT clear links to proc! */ } +static int +recalc_remaining(struct proc *p, int mode) +{ + int remaining; + + if (mode == SINGLE_EXIT) + remaining = p->p_numthreads; + else if (mode == SINGLE_BOUNDARY) + remaining = p->p_numthreads - p->p_boundary_count; + else if (mode == SINGLE_NO_EXIT) + remaining = p->p_numthreads - p->p_suspcount; + else + panic("recalc_remaining: wrong mode %d", mode); + return (remaining); +} + /* * Enforce single-threading. * @@ -551,12 +567,7 @@ thread_single(int mode) p->p_flag |= P_STOPPED_SINGLE; PROC_SLOCK(p); p->p_singlethread = td; - if (mode == SINGLE_EXIT) - remaining = p->p_numthreads; - else if (mode == SINGLE_BOUNDARY) - remaining = p->p_numthreads - p->p_boundary_count; - else - remaining = p->p_numthreads - p->p_suspcount; + remaining = recalc_remaining(p, mode); while (remaining != 1) { if (P_SHOULDSTOP(p) != P_STOPPED_SINGLE) goto stopme; @@ -587,18 +598,17 @@ thread_single(int mode) wakeup_swapper |= sleepq_abort(td2, ERESTART); break; + case SINGLE_NO_EXIT: + if (TD_IS_SUSPENDED(td2) && + !(td2->td_flags & TDF_BOUNDARY)) + wakeup_swapper |= + thread_unsuspend_one(td2); + if (TD_ON_SLEEPQ(td2) && + (td2->td_flags & TDF_SINTR)) + wakeup_swapper |= + sleepq_abort(td2, ERESTART); + break; default: - if (TD_IS_SUSPENDED(td2)) { - thread_unlock(td2); - continue; - } - /* - * maybe other inhibited states too? 
- */ - if ((td2->td_flags & TDF_SINTR) && - (td2->td_inhibitors & - (TDI_SLEEPING | TDI_SWAPPED))) - thread_suspend_one(td2); break; } } @@ -611,12 +621,7 @@ thread_single(int mode) } if (wakeup_swapper) kick_proc0(); - if (mode == SINGLE_EXIT) - remaining = p->p_numthreads; - else if (mode == SINGLE_BOUNDARY) - remaining = p->p_numthreads - p->p_boundary_count; - else - remaining = p->p_numthreads - p->p_suspcount; + remaining = recalc_remaining(p, mode); /* * Maybe we suspended some threads.. was it enough? @@ -630,12 +635,7 @@ stopme: * In the mean time we suspend as well. */ thread_suspend_switch(td); - if (mode == SINGLE_EXIT) - remaining = p->p_numthreads; - else if (mode == SINGLE_BOUNDARY) - remaining = p->p_numthreads - p->p_boundary_count; - else - remaining = p->p_numthreads - p->p_suspcount; + remaining = recalc_remaining(p, mode); } if (mode == SINGLE_EXIT) { /* diff --git a/sys/kern/subr_sleepqueue.c b/sys/kern/subr_sleepqueue.c index 01fcc73..781c186 100644 --- a/sys/kern/subr_sleepqueue.c +++ b/sys/kern/subr_sleepqueue.c @@ -341,6 +341,8 @@ sleepq_add(void *wchan, struct lock_object *lock, const char *wmesg, int flags, if (flags & SLEEPQ_INTERRUPTIBLE) { td->td_flags |= TDF_SINTR; td->td_flags &= ~TDF_SLEEPABORT; + if (flags & SLEEPQ_STOP_ON_BDRY) + td->td_flags |= TDF_SBDRY; } thread_unlock(td); } diff --git a/sys/nfsclient/nfs_bio.c b/sys/nfsclient/nfs_bio.c index 22e2a79..d5d426e 100644 --- a/sys/nfsclient/nfs_bio.c +++ b/sys/nfsclient/nfs_bio.c @@ -1255,7 +1255,7 @@ nfs_getcacheblk(struct vnode *vp, daddr_t bn, int size, struct thread *td) sigset_t oldset; nfs_set_sigmask(td, &oldset); - bp = getblk(vp, bn, size, PCATCH, 0, 0); + bp = getblk(vp, bn, size, NFS_PCATCH, 0, 0); nfs_restore_sigmask(td, &oldset); while (bp == NULL) { if (nfs_sigintr(nmp, NULL, td)) @@ -1292,7 +1292,7 @@ nfs_vinvalbuf(struct vnode *vp, int flags, struct thread *td, int intrflg) if ((nmp->nm_flag & NFSMNT_INT) == 0) intrflg = 0; if (intrflg) { - slpflag = PCATCH; + 
slpflag = NFS_PCATCH; slptimeo = 2 * hz; } else { slpflag = 0; @@ -1371,7 +1371,7 @@ nfs_asyncio(struct nfsmount *nmp, struct buf *bp, struct ucred *cred, struct thr } again: if (nmp->nm_flag & NFSMNT_INT) - slpflag = PCATCH; + slpflag = NFS_PCATCH; gotiod = FALSE; /* @@ -1440,7 +1440,7 @@ again: mtx_unlock(&nfs_iod_mtx); return (error2); } - if (slpflag == PCATCH) { + if (slpflag == NFS_PCATCH) { slpflag = 0; slptimeo = 2 * hz; } diff --git a/sys/nfsclient/nfs_socket.c b/sys/nfsclient/nfs_socket.c index 1ae31a5..2398695 100644 --- a/sys/nfsclient/nfs_socket.c +++ b/sys/nfsclient/nfs_socket.c @@ -516,7 +516,7 @@ nfs_reconnect(struct nfsreq *rep) KASSERT(mtx_owned(&nmp->nm_mtx), ("NFS mnt lock not owned !")); if (nmp->nm_flag & NFSMNT_INT) - slpflag = PCATCH; + slpflag = NFS_PCATCH; /* * Wait for any pending writes to this socket to drain (or timeout). */ @@ -768,7 +768,7 @@ tryagain: slpflag = 0; mtx_lock(&nmp->nm_mtx); if (nmp->nm_flag & NFSMNT_INT) - slpflag = PCATCH; + slpflag = NFS_PCATCH; mtx_unlock(&nmp->nm_mtx); mtx_lock(&rep->r_mtx); while ((rep->r_mrep == NULL) && (error == 0) && @@ -1791,7 +1791,7 @@ nfs_connect_lock(struct nfsreq *rep) td = rep->r_td; if (rep->r_nmp->nm_flag & NFSMNT_INT) - slpflag = PCATCH; + slpflag = NFS_PCATCH; while (*statep & NFSSTA_SNDLOCK) { error = nfs_sigintr(rep->r_nmp, rep, td); if (error) { @@ -1800,7 +1800,7 @@ nfs_connect_lock(struct nfsreq *rep) *statep |= NFSSTA_WANTSND; (void) msleep(statep, &rep->r_nmp->nm_mtx, slpflag | (PZERO - 1), "nfsndlck", slptimeo); - if (slpflag == PCATCH) { + if (slpflag & PCATCH) { slpflag = 0; slptimeo = 2 * hz; } diff --git a/sys/nfsclient/nfs_vnops.c b/sys/nfsclient/nfs_vnops.c index 3623fab..a8d098b 100644 --- a/sys/nfsclient/nfs_vnops.c +++ b/sys/nfsclient/nfs_vnops.c @@ -2931,7 +2931,7 @@ nfs_flush(struct vnode *vp, int waitfor, int commit) int bvecsize = 0, bveccount; if (nmp->nm_flag & NFSMNT_INT) - slpflag = PCATCH; + slpflag = NFS_PCATCH; if (!commit) passone = 0; bo = 
&vp->v_bufobj; @@ -3129,7 +3129,7 @@ loop: error = EINTR; goto done; } - if (slpflag == PCATCH) { + if (slpflag & PCATCH) { slpflag = 0; slptimeo = 2 * hz; } @@ -3167,7 +3167,7 @@ loop: error = nfs_sigintr(nmp, NULL, td); if (error) goto done; - if (slpflag == PCATCH) { + if (slpflag & PCATCH) { slpflag = 0; slptimeo = 2 * hz; } diff --git a/sys/nfsclient/nfsmount.h b/sys/nfsclient/nfsmount.h index 85f8501..c98a172 100644 --- a/sys/nfsclient/nfsmount.h +++ b/sys/nfsclient/nfsmount.h @@ -147,6 +147,8 @@ struct nfsmount { #define NFS_TPRINTF_DELAY 30 #endif +#define NFS_PCATCH (PCATCH | PBDRY) + #endif #endif diff --git a/sys/sys/param.h b/sys/sys/param.h index 06745f8..5ee9c16 100644 --- a/sys/sys/param.h +++ b/sys/sys/param.h @@ -186,6 +186,7 @@ #define PRIMASK 0x0ff #define PCATCH 0x100 /* OR'd with pri for tsleep to check signals */ #define PDROP 0x200 /* OR'd with pri to stop re-entry of interlock mutex */ +#define PBDRY 0x400 /* for PCATCH stop is done on the user boundary */ #define NZERO 0 /* default "nice" */ diff --git a/sys/sys/proc.h b/sys/sys/proc.h index 0a4b79c..b65db62 100644 --- a/sys/sys/proc.h +++ b/sys/sys/proc.h @@ -320,7 +320,7 @@ do { \ #define TDF_BOUNDARY 0x00000400 /* Thread suspended at user boundary */ #define TDF_ASTPENDING 0x00000800 /* Thread has some asynchronous events. */ #define TDF_TIMOFAIL 0x00001000 /* Timeout from sleep after we were awake. */ -#define TDF_UNUSED2000 0x00002000 /* --available-- */ +#define TDF_SBDRY 0x00002000 /* Stop only on usermode boundary. */ #define TDF_UPIBLOCKED 0x00004000 /* Thread blocked on user PI mutex. */ #define TDF_NEEDSUSPCHK 0x00008000 /* Thread may need to suspend. */ #define TDF_NEEDRESCHED 0x00010000 /* Thread needs to yield. */ diff --git a/sys/sys/sleepqueue.h b/sys/sys/sleepqueue.h index 0d1f361..362945a 100644 --- a/sys/sys/sleepqueue.h +++ b/sys/sys/sleepqueue.h @@ -93,6 +93,8 @@ struct thread; #define SLEEPQ_SX 0x03 /* Used by an sx lock. */ #define SLEEPQ_LK 0x04 /* Used by a lockmgr. 
*/ #define SLEEPQ_INTERRUPTIBLE 0x100 /* Sleep is interruptible. */ +#define SLEEPQ_STOP_ON_BDRY 0x200 /* Stop sleeping thread on + user mode boundary */ void init_sleepqueues(void); int sleepq_abort(struct thread *td, int intrval); -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 195 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20090621/493668fb/attachment.pgp From marius at alchemy.franken.de Sun Jun 21 14:03:15 2009 From: marius at alchemy.franken.de (Marius Strobl) Date: Sun Jun 21 14:03:22 2009 Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed In-Reply-To: References: <20090609201127.GA50903@alchemy.franken.de> <4A2F1148.9090706@freebsd.org> Message-ID: <20090621140312.GC71667@alchemy.franken.de> On Wed, Jun 17, 2009 at 12:55:52PM -1000, Jeff Roberson wrote: > > On Tue, 9 Jun 2009, Peter Grehan wrote: > > >>As for sparc64 allocating the storage for the dynamic area > >>from end probably isn't a good idea as the pmap code assumes > >>that the range from KERNBASE to end is covered by the pages > >>allocated by and locked into the TLB for the kernel by the > >>loader > > > >Ditto for ppc. It's possible to get the additional space from within or > >after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf). > > http://people.freebsd.org/~jeff/dpcpu.diff > > I have updated this patch based on feedback relating to various > architectures md code. I tried to model most architectures after the way > msgbuf memory was taken. I have no capacity to test anything other than > i386 and amd64. ARM is reported to work with one minor diff. Apparently > sparc64 worked with the earlier diff but this should be cleaner. If > anyone can report back on sparc64, mips, or powerpc, I'd appreciate it. 
>

The earlier patch worked on sparc64 as long as the kernel
happened to leave enough room in the last 4MB page allocated
for it.
The new version unfortunately doesn't compile on sparc64 as
pmap_bootstrap_alloc() is static to its pmap.c (I think it
should also stay that way). Also the memory allocated with
it isn't safe to be used before we've taken over the trap
table. A kernel built with the sparc64 bits replaced with
the following patch boots fine:
http://people.freebsd.org/~marius/sparc64_dpcpu.diff
Do you have some simple test case for DPCPU which can be
used to verify that it actually works?

Marius

From attilio at freebsd.org Sun Jun 21 16:17:29 2009
From: attilio at freebsd.org (Attilio Rao)
Date: Sun Jun 21 16:17:36 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed
In-Reply-To: <20090621140312.GC71667@alchemy.franken.de>
References: <20090609201127.GA50903@alchemy.franken.de>
    <4A2F1148.9090706@freebsd.org>
    <20090621140312.GC71667@alchemy.franken.de>
Message-ID: <3bbf2fe10906210855r6c98568aj7bcc9ec3e057ae01@mail.gmail.com>

2009/6/21 Marius Strobl :
> On Wed, Jun 17, 2009 at 12:55:52PM -1000, Jeff Roberson wrote:
>>
>> On Tue, 9 Jun 2009, Peter Grehan wrote:
>>
>> >>As for sparc64 allocating the storage for the dynamic area
>> >>from end probably isn't a good idea as the pmap code assumes
>> >>that the range from KERNBASE to end is covered by the pages
>> >>allocated by and locked into the TLB for the kernel by the
>> >>loader
>> >
>> >Ditto for ppc. It's possible to get the additional space from within or
>> >after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf).
>>
>> http://people.freebsd.org/~jeff/dpcpu.diff
>>
>> I have updated this patch based on feedback relating to various
>> architectures md code. I tried to model most architectures after the way
>> msgbuf memory was taken. I have no capacity to test anything other than
>> i386 and amd64. ARM is reported to work with one minor diff.
>> Apparently
>> sparc64 worked with the earlier diff but this should be cleaner. If
>> anyone can report back on sparc64, mips, or powerpc, I'd appreciate it.
>>
>
> The earlier patch worked on sparc64 as long as the kernel
> happened to leave enough room in the last 4MB page allocated
> for it.
> The new version unfortunately doesn't compile on sparc64 as
> pmap_bootstrap_alloc() is static to its pmap.c (I think it
> should also stay that way). Also the memory allocated with
> it isn't safe to be used before we've taken over the trap
> table. A kernel built with the sparc64 bits replaced with
> the following patch boots fine:
> http://people.freebsd.org/~marius/sparc64_dpcpu.diff
> Do you have some simple test case for DPCPU which can be
> used to verify that it actually works?

I would suggest converting pc_rm_queue (used by rmlocks) in pcpu into a
dynamic per-CPU variable. It should not be difficult at all.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein

From jroberson at jroberson.net Mon Jun 22 00:10:00 2009
From: jroberson at jroberson.net (Jeff Roberson)
Date: Mon Jun 22 00:10:07 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed
In-Reply-To: <20090621140312.GC71667@alchemy.franken.de>
References: <20090609201127.GA50903@alchemy.franken.de>
    <4A2F1148.9090706@freebsd.org>
    <20090621140312.GC71667@alchemy.franken.de>
Message-ID:

On Sun, 21 Jun 2009, Marius Strobl wrote:

> On Wed, Jun 17, 2009 at 12:55:52PM -1000, Jeff Roberson wrote:
>>
>> On Tue, 9 Jun 2009, Peter Grehan wrote:
>>
>>>> As for sparc64 allocating the storage for the dynamic area
>>>> from end probably isn't a good idea as the pmap code assumes
>>>> that the range from KERNBASE to end is covered by the pages
>>>> allocated by and locked into the TLB for the kernel by the
>>>> loader
>>>
>>> Ditto for ppc. It's possible to get the additional space from within or
>>> after return from pmap_bootstrap() (like thread0's kstack, or the msgbuf).
>>
>> http://people.freebsd.org/~jeff/dpcpu.diff
>>
>> I have updated this patch based on feedback relating to various
>> architectures md code. I tried to model most architectures after the way
>> msgbuf memory was taken. I have no capacity to test anything other than
>> i386 and amd64. ARM is reported to work with one minor diff. Apparently
>> sparc64 worked with the earlier diff but this should be cleaner. If
>> anyone can report back on sparc64, mips, or powerpc, I'd appreciate it.
>>
>
> The earlier patch worked on sparc64 as long as the kernel
> happened to leave enough room in the last 4MB page allocated
> for it.
> The new version unfortunately doesn't compile on sparc64 as
> pmap_bootstrap_alloc() is static to its pmap.c (I think it
> should also stay that way). Also the memory allocated with
> it isn't safe to be used before we've taken over the trap
> table. A kernel built with the sparc64 bits replaced with
> the following patch boots fine:
> http://people.freebsd.org/~marius/sparc64_dpcpu.diff
> Do you have some simple test case for DPCPU which can be
> used to verify that it actually works?

Thanks very much Marius. I have updated the patch at:

http://people.freebsd.org/~jeff/dpcpu.diff

I intend to commit this, minus the kern_synch.c diff tomorrow. There was
an id in the previous patch that caused each area to be accessed as it was
added but you'd have to have done a 'show pcpu' in ddb after boot to access
the area. I added a counter in kern_synch.c as a better test.

new in this diff:

1) I made each access cheaper by one instruction by making the pc_dynamic
   pointer relative to the start of the percpu area.

2) I added two helper functions for sysctl ints and quads that can be used
   for stats. See the temporary kern_synch.c diff for an example.

3) sparc64/sun4v by marius

4) ia64 fixes suggested by marcel.
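
One possible reading of item 1, sketched in userland C (an illustration
only, not the code from dpcpu.diff): if each CPU stores the difference
between its private copy and the shared template as an integer, then
locating a variable is a single addition of that bias to the variable's
address in the template. Only the name pc_dynamic comes from the message;
the template struct, NCPU, and the function names are assumptions made for
the example.

```c
#include <stdint.h>
#include <string.h>

#define NCPU 2	/* assumed CPU count for the sketch */

/*
 * Hypothetical stand-in for the region holding the initial values of
 * all dynamic per-CPU variables.
 */
struct dpcpu_template {
	int	counter;
	long	ticks;
};
struct dpcpu_template template_area = { 7, 100 };

/* One private copy of the template per CPU. */
struct dpcpu_template cpu_area[NCPU];

/* Per-CPU bias: address of the copy minus address of the template. */
uintptr_t pc_dynamic[NCPU];

void
dpcpu_init(void)
{
	for (int cpu = 0; cpu < NCPU; cpu++) {
		memcpy(&cpu_area[cpu], &template_area, sizeof(template_area));
		pc_dynamic[cpu] = (uintptr_t)&cpu_area[cpu] -
		    (uintptr_t)&template_area;
	}
}

/* Access is one add: bias plus the variable's address in the template. */
int *
dpcpu_counter(int cpu)
{
	return ((int *)(pc_dynamic[cpu] +
	    (uintptr_t)&template_area.counter));
}
```

After dpcpu_init(), each CPU's counter starts at the template value and can
be changed without affecting the other copies. The alternative of storing
the copy's base address would need a subtraction of the template start on
every access, which is presumably the instruction being saved.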

Thanks,
Jeff

>
> Marius
>

From bugmaster at FreeBSD.org Mon Jun 22 11:06:51 2009
From: bugmaster at FreeBSD.org (FreeBSD bugmaster)
Date: Mon Jun 22 11:07:25 2009
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-ID: <200906221106.n5MB6oWt017934@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From sam at freebsd.org Mon Jun 22 16:47:34 2009
From: sam at freebsd.org (Sam Leffler)
Date: Mon Jun 22 16:47:49 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed
In-Reply-To:
References: <20090609201127.GA50903@alchemy.franken.de>
    <4A2F1148.9090706@freebsd.org>
    <20090621140312.GC71667@alchemy.franken.de>
Message-ID: <4A3FB2BA.80803@freebsd.org>

Blows up on arm (xscale) during boot. I will try to look at why later.
You might want to boot a UP kernel.

Sam

From sam at freebsd.org Mon Jun 22 16:47:34 2009
From: sam at freebsd.org (Sam Leffler)
Date: Mon Jun 22 16:47:50 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc.
    help needed
In-Reply-To:
References: <20090609201127.GA50903@alchemy.franken.de>
    <4A2F1148.9090706@freebsd.org>
    <20090621140312.GC71667@alchemy.franken.de>
Message-ID: <4A3FB1DF.80006@freebsd.org>

Jeff Roberson wrote:
> On Sun, 21 Jun 2009, Marius Strobl wrote:
>
>> On Wed, Jun 17, 2009 at 12:55:52PM -1000, Jeff Roberson wrote:
>>>
>>> On Tue, 9 Jun 2009, Peter Grehan wrote:
>>>
>>>>> As for sparc64 allocating the storage for the dynamic area
>>>>> from end probably isn't a good idea as the pmap code assumes
>>>>> that the range from KERNBASE to end is covered by the pages
>>>>> allocated by and locked into the TLB for the kernel by the
>>>>> loader
>>>>
>>>> Ditto for ppc. It's possible to get the additional space from
>>>> within or
>>>> after return from pmap_bootstrap() (like thread0's kstack, or the
>>>> msgbuf).
>>>
>>> http://people.freebsd.org/~jeff/dpcpu.diff
>>>
>>> I have updated this patch based on feedback relating to various
>>> architectures md code. I tried to model most architectures after the
>>> way
>>> msgbuf memory was taken. I have no capacity to test anything other than
>>> i386 and amd64. ARM is reported to work with one minor diff. Apparently
>>> sparc64 worked with the earlier diff but this should be cleaner. If
>>> anyone can report back on sparc64, mips, or powerpc, I'd appreciate it.
>>>
>>
>> The earlier patch worked on sparc64 as long as the kernel
>> happened to leave enough room in the last 4MB page allocated
>> for it.
>> The new version unfortunately doesn't compile on sparc64 as
>> pmap_bootstrap_alloc() is static to its pmap.c (I think it
>> should also stay that way). Also the memory allocated with
>> it isn't safe to be used before we've taken over the trap
>> table. A kernel built with the sparc64 bits replaced with
>> the following patch boots fine:
>> http://people.freebsd.org/~marius/sparc64_dpcpu.diff
>> Do you have some simple test case for DPCPU which can be
>> used to verify that it actually works?
>
> Thanks very much Marius. I have updated the patch at:
>
> http://people.freebsd.org/~jeff/dpcpu.diff
>
> I intend to commit this, minus the kern_synch.c diff tomorrow. There
> was an id in the previous patch that caused each area to be accessed
> as it was added but you'd have to have done a 'show pcpu' in ddb after
> boot to access the area. I added a counter in kern_synch.c as a better
> test.
>
> new in this diff:
>
> 1) I made each access cheaper by one instruction by making the
> pc_dynamic pointer relative to the start of the percpu area.
>
> 2) I added two helper functions for sysctl ints and quads that can be
> used for stats. See the temporary kern_synch.c diff for an example.
>
> 3) sparc64/sun4v by marius
>
> 4) ia64 fixes suggested by marcel.

Does not compile on !SMP systems (s/dpcpu_ptr/dcpu_off/ in
sysctl_dpcpu_quad).

Sam

From jroberson at jroberson.net Mon Jun 22 20:00:32 2009
From: jroberson at jroberson.net (Jeff Roberson)
Date: Mon Jun 22 20:01:17 2009
Subject: Dynamic pcpu, arm, mips, powerpc, sun, etc. help needed
In-Reply-To: <4A3FB2BA.80803@freebsd.org>
References: <20090609201127.GA50903@alchemy.franken.de>
    <4A2F1148.9090706@freebsd.org>
    <20090621140312.GC71667@alchemy.franken.de>
    <4A3FB2BA.80803@freebsd.org>
Message-ID:

On Mon, 22 Jun 2009, Sam Leffler wrote:

> Blows up on arm (xscale) during boot. I will try to look at why later. You
> might want to boot a UP kernel.

I had before I did the pcpu_ptr change. I will again. I'm doing the
module stuff now so I'll do another round of testing when that's done.
Which xscale platform? There are 5 or so.

Thanks,
Jeff

>
> Sam
>

From ino-news at spotteswoode.dnsalias.org Mon Jun 22 23:08:11 2009
From: ino-news at spotteswoode.dnsalias.org (clemens fischer)
Date: Mon Jun 22 23:08:24 2009
Subject: On errno
References: <49D1492C.5050101@freebsd.org>
    <200903310620.n2V6Kudd072936@hergotha.csail.mit.edu>
Message-ID: <6p45h6xbmp.ln2@nntp.spotteswoode.dnsalias.org>

On Tue-2009/03/31-08:20 Garrett Wollman wrote in
gmane.os.freebsd.architechture
(MID <200903310620.n2V6Kudd072936@hergotha.csail.mit.edu>):

> [on strings accompanying errno]
>
> But all this is really irrelevant if no other operating system or
> standard adopts the interface. Interfaces which are peculiar to
> FreeBSD are rarely useful.

plan9 already has that: programs return a string instead of a number.
plan9port (user space implementation for unix) takes an empty string to
mean SUCCESS and anything else as FAILURE.

clemens

From jhb at freebsd.org Tue Jun 23 18:01:05 2009
From: jhb at freebsd.org (John Baldwin)
Date: Tue Jun 23 18:01:12 2009
Subject: [PATCH] SYSV IPC ABI rototill
Message-ID: <200906231341.43104.jhb@freebsd.org>

There have been several issues with the existing ABI of the SYSV IPC
structures over the past several years and it has been on the todo list
for at least both 7.0 and 8.0. Rather than putting it off until 9.0 I sat
down and worked on it this week. The patch is not super complex. First,
the ABI changes done to each structure:

- struct ipc_perm
  - The uid/cuid members are now of type uid_t instead of unsigned short.
  - The gid/cgid members are now of type gid_t instead of unsigned short.
  - The mode member is now of type mode_t instead of unsigned short. This
    is just a cosmetic tweak though on current architectures since
    mode_t == uint16_t.
- struct msqid_ds
  - The various pad fields have been removed.
- struct semid_ds
  - The various pad fields have been removed.

The comments suggest that these fields were added to follow the SV/I386 ABI.
However, if FreeBSD ever supports SYSV binaries, I imagine they will use a
separate system call ABI. In that case there is no reason that the FreeBSD
ABI needs the padding, so I have removed them.

- struct shmid_ds and struct shmid_kernel
  - shm_segsz is now a size_t instead of an int. As a result of this fix
    I've retired shm_bsegsz from shmid_kernel as it is no longer needed.
  - shm_nattch is now an int instead of a short. The structure padding was
    such that the space was already there anyway.
  - shm_internal is gone. There is no good reason to be exposing random
    kernel pointers to userland. I've replaced the in-kernel use by moving
    the VM object pointer into a new 'object' field in shmid_kernel.

As far as system calls go, the only functions in the SYSV IPC API that are
affected are msgctl(), semctl(), and shmctl(). As a result, I have made new
versions of __semctl(), msgctl(), and shmctl() and marked the old slots as
COMPAT7. However, another set of system calls provide an old interface to
the SYSV IPC API: msgsys(), semsys(), and shmsys(). Rather than adding
compat shims for these kludgy syscalls, I am simply deprecating them
altogether and they will only exist under COMPAT7. Binaries and libraries
(including libc) have not used the foosys() system calls to implement the
SYSV IPC API since FreeBSD 4.x. They proved problematic on certain
architectures such as sparc64, etc. Given that, I think that they can
safely be relegated to the legacy bin.
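
A brief sketch of why the ipc_perm widening matters, under stated
assumptions: the struct and function names below are invented for
illustration, only a subset of the real members is shown, and
uint32_t/uint16_t stand in for uid_t, gid_t, mode_t and unsigned short.
Copying a full-width ID into a 16-bit member keeps only the low 16 bits,
so two different users can end up reported as the same ID.

```c
#include <stdint.h>

/* Illustrative old-style layout: IDs held in 16-bit members. */
struct ipc_perm_old_sketch {
	uint16_t	cuid;	/* unsigned short in the old ABI */
	uint16_t	cgid;
	uint16_t	uid;
	uint16_t	gid;
	uint16_t	mode;
};

/* Illustrative new-style layout: IDs held at full width. */
struct ipc_perm_new_sketch {
	uint32_t	cuid;	/* stands in for uid_t */
	uint32_t	cgid;	/* stands in for gid_t */
	uint32_t	uid;
	uint32_t	gid;
	uint16_t	mode;	/* stands in for mode_t */
};

/*
 * Hypothetical compat conversion: narrowing each ID to 16 bits
 * silently discards the high bits.
 */
void
perm_new2old_sketch(const struct ipc_perm_new_sketch *n,
    struct ipc_perm_old_sketch *o)
{
	o->cuid = (uint16_t)n->cuid;
	o->cgid = (uint16_t)n->cgid;
	o->uid = (uint16_t)n->uid;
	o->gid = (uint16_t)n->gid;
	o->mode = n->mode;
}
```

For example, a uid of 70000 comes out of this conversion as 4464
(70000 - 65536), which is indistinguishable from a genuine uid 4464.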

The patch is at http://www.FreeBSD.org/~jhb/patches/sysvipc_abi.patch

--
John Baldwin

From jhb at freebsd.org Tue Jun 23 21:06:50 2009
From: jhb at freebsd.org (John Baldwin)
Date: Tue Jun 23 21:07:22 2009
Subject: [PATCH] SYSV IPC ABI rototill
In-Reply-To: <863a9q3c7a.fsf@ds4.des.no>
References: <200906231341.43104.jhb@freebsd.org> <863a9q3c7a.fsf@ds4.des.no>
Message-ID: <200906231706.33465.jhb@freebsd.org>

On Tuesday 23 June 2009 4:52:09 pm Dag-Erling Smørgrav wrote:
> John Baldwin writes:
> > There have been several issues with the existing ABI of the SYSV IPC
> > structures over the past several years and it has been on the todo list for
> > at least both 7.0 and 8.0. Rather than putting it off until 9.0 I sat down
> > and worked on it this week.
>
> Have you given any thought to virtualization, i.e. separate namespaces
> for each jail? Will your patch make this any easier or harder to
> implement?

It likely has zero effect on that. The global variables one would need to
virtualize are unchanged by this.

--
John Baldwin

From des at des.no Tue Jun 23 21:07:56 2009
From: des at des.no (Dag-Erling Smørgrav)
Date: Tue Jun 23 21:08:03 2009
Subject: [PATCH] SYSV IPC ABI rototill
In-Reply-To: <200906231341.43104.jhb@freebsd.org> (John Baldwin's message of
    "Tue, 23 Jun 2009 13:41:42 -0400")
References: <200906231341.43104.jhb@freebsd.org>
Message-ID: <863a9q3c7a.fsf@ds4.des.no>

John Baldwin writes:
> There have been several issues with the existing ABI of the SYSV IPC
> structures over the past several years and it has been on the todo list for
> at least both 7.0 and 8.0. Rather than putting it off until 9.0 I sat down
> and worked on it this week.

Have you given any thought to virtualization, i.e. separate namespaces
for each jail? Will your patch make this any easier or harder to
implement?

DES
--
Dag-Erling Smørgrav - des@des.no

From julian at elischer.org Tue Jun 23 21:15:45 2009
From: julian at elischer.org (Julian Elischer)
Date: Tue Jun 23 21:15:52 2009
Subject: [PATCH] SYSV IPC ABI rototill
In-Reply-To: <200906231706.33465.jhb@freebsd.org>
References: <200906231341.43104.jhb@freebsd.org> <863a9q3c7a.fsf@ds4.des.no>
    <200906231706.33465.jhb@freebsd.org>
Message-ID: <4A414600.4060008@elischer.org>

John Baldwin wrote:
> On Tuesday 23 June 2009 4:52:09 pm Dag-Erling Smørgrav wrote:
>> John Baldwin writes:
>>> There have been several issues with the existing ABI of the SYSV IPC
>>> structures over the past several years and it has been on the todo list for
>>> at least both 7.0 and 8.0. Rather than putting it off until 9.0 I sat down
>>> and worked on it this week.
>> Have you given any thought to virtualization, i.e. separate namespaces
>> for each jail? Will your patch make this any easier or harder to
>> implement?
>
> It likely has zero effect on that. The global variables one would need to
> virtualize are unchanged by this.
>

Marko did this and it is in P4 somewhere...

From alfred at freebsd.org Tue Jun 23 23:23:34 2009
From: alfred at freebsd.org (Alfred Perlstein)
Date: Tue Jun 23 23:23:40 2009
Subject: [PATCH] SYSV IPC ABI rototill
In-Reply-To: <200906231706.33465.jhb@freebsd.org>
References: <200906231341.43104.jhb@freebsd.org> <863a9q3c7a.fsf@ds4.des.no>
    <200906231706.33465.jhb@freebsd.org>
Message-ID: <20090623230501.GH84786@elvis.mu.org>

* John Baldwin [090623 14:07] wrote:
> On Tuesday 23 June 2009 4:52:09 pm Dag-Erling Smørgrav wrote:
> > John Baldwin writes:
> > > There have been several issues with the existing ABI of the SYSV IPC
> > > structures over the past several years and it has been on the todo list for
> > > at least both 7.0 and 8.0. Rather than putting it off until 9.0 I sat down
> > > and worked on it this week.
> >
> > Have you given any thought to virtualization, i.e. separate namespaces
> > for each jail?
> > Will your patch make this any easier or harder to
> > implement?
>
> It likely has zero effect on that. The global variables one would need to
> virtualize are unchanged by this.

John, would it make sense to check for overflow in ipcperm_new2old and
return some error so that callers get back some nasty error so that they
don't make a mistake about permissions when an overflow happens?

A crash/error sounds better than silent truncating of credential
information, but I could be wrong.

-Alfred

From jhb at freebsd.org Wed Jun 24 14:23:25 2009
From: jhb at freebsd.org (John Baldwin)
Date: Wed Jun 24 14:23:39 2009
Subject: [PATCH] SYSV IPC ABI rototill
In-Reply-To: <20090623230501.GH84786@elvis.mu.org>
References: <200906231341.43104.jhb@freebsd.org>
    <200906231706.33465.jhb@freebsd.org>
    <20090623230501.GH84786@elvis.mu.org>
Message-ID: <200906240833.04028.jhb@freebsd.org>

On Tuesday 23 June 2009 7:05:01 pm Alfred Perlstein wrote:
> * John Baldwin [090623 14:07] wrote:
> > On Tuesday 23 June 2009 4:52:09 pm Dag-Erling Smørgrav wrote:
> > > John Baldwin writes:
> > > > There have been several issues with the existing ABI of the SYSV IPC
> > > > structures over the past several years and it has been on the todo list for
> > > > at least both 7.0 and 8.0. Rather than putting it off until 9.0 I sat down
> > > > and worked on it this week.
> > >
> > > Have you given any thought to virtualization, i.e. separate namespaces
> > > for each jail? Will your patch make this any easier or harder to
> > > implement?
> >
> > It likely has zero effect on that. The global variables one would need to
> > virtualize are unchanged by this.
>
> John, would it make sense to check for overflow in ipcperm_new2old and return
> some error so that callers get back some nasty error so that they don't make
> a mistake about permissions when an overflow happens?
>
> A crash/error sounds better than silent truncating of credential information,
> but I could be wrong.

Hmm, well, the truncation is what we have been doing all along for any users
who used UIDs > USHRT_MAX, so adding an error now would change the behavior
for existing binaries. Also, the truncation does not affect the actual
permission checks (those are all done in the kernel), merely the reporting of
the associated IDs to userland.

--
John Baldwin

From alfred at freebsd.org Wed Jun 24 15:32:37 2009
From: alfred at freebsd.org (Alfred Perlstein)
Date: Wed Jun 24 15:32:43 2009
Subject: [PATCH] SYSV IPC ABI rototill
In-Reply-To: <200906240833.04028.jhb@freebsd.org>
References: <200906231341.43104.jhb@freebsd.org>
    <200906231706.33465.jhb@freebsd.org>
    <20090623230501.GH84786@elvis.mu.org>
    <200906240833.04028.jhb@freebsd.org>
Message-ID: <20090624153236.GN84786@elvis.mu.org>

* John Baldwin [090624 07:23] wrote:
> On Tuesday 23 June 2009 7:05:01 pm Alfred Perlstein wrote:
> > * John Baldwin [090623 14:07] wrote:
> > > On Tuesday 23 June 2009 4:52:09 pm Dag-Erling Smørgrav wrote:
> > > > John Baldwin writes:
> > > > > There have been several issues with the existing ABI of the SYSV IPC
> > > > > structures over the past several years and it has been on the todo list for
> > > > > at least both 7.0 and 8.0. Rather than putting it off until 9.0 I sat down
> > > > > and worked on it this week.
> > > >
> > > > Have you given any thought to virtualization, i.e. separate namespaces
> > > > for each jail? Will your patch make this any easier or harder to
> > > > implement?
> > >
> > > It likely has zero effect on that. The global variables one would need to
> > > virtualize are unchanged by this.
> >
> > John, would it make sense to check for overflow in ipcperm_new2old and return
> > some error so that callers get back some nasty error so that they don't make
> > a mistake about permissions when an overflow happens?
> >
> > A crash/error sounds better than silent truncating of credential information,
> > but I could be wrong.
> > Hmm, well, the truncation is what we have been doing all along for any users > who used UIDs > USHRT_MAX, so adding an error now would change the behavior > for existing binaries. Also, the truncation does not affect the actual > permission checks (those are all done in the kernel), merely the reporting of > the associated IDs to userland. OK, thank you for explaining. -- - Alfred Perlstein From kensmith at cse.Buffalo.EDU Thu Jun 25 19:05:06 2009 From: kensmith at cse.Buffalo.EDU (Ken Smith) Date: Thu Jun 25 19:05:13 2009 Subject: Time to drop the warning for uid's bigger than USHRT_MAX? Message-ID: <1245955512.20785.14.camel@bauer.cse.buffalo.edu> John's work on the SYSV IPC stuff removes the last place I'm aware of that had an issue with uid's bigger than what will fit inside an unsigned short. So, a couple questions: 1) Does anyone know of any remaining places I'm not aware of? And, if the answer to that winds up being no... 2) lib/libc/gen/pw_scan.c has some support for providing warnings about there potentially being issues with using uid's larger than a certain value. Should the code for that remain in place "for the next time we have this issue" despite it not really being needed now, or should it all just get ripped out? Given John's work just arrived and it was a pre-requisite to doing anything about this, RE would probably allow for any changes related to this happening after code freeze starts, but it should happen soon if it's going to happen at all. ;-) Personally I'm leaning towards adjusting things so the warnings get triggered at UID_MAX for now and providing a few comments that say "We know this looks unnecessary right now but there was a time when ...". Thanks... -- Ken Smith | kensmith@cse.buffalo.edu - From there to here, from here to there, funny things are everywhere. - Theodore Geisel
From jilles at stack.nl Fri Jun 26 15:58:26 2009 From: jilles at stack.nl (Jilles Tjoelker) Date: Fri Jun 26 15:58:33 2009 Subject: Time to drop the warning for uid's bigger than USHRT_MAX? In-Reply-To: <1245955512.20785.14.camel@bauer.cse.buffalo.edu> References: <1245955512.20785.14.camel@bauer.cse.buffalo.edu> Message-ID: <20090626155757.GA89856@stack.nl> On Thu, Jun 25, 2009 at 02:45:12PM -0400, Ken Smith wrote: > John's work on the SYSV IPC stuff removes the last place I'm aware of > that had an issue with uid's bigger than what will fit inside an > unsigned short. So, a couple questions: > 1) Does anyone know of any remaining places I'm not aware of? sa(8) has a problem with the uid 1380275712 (probably 5653842 on big endian systems). This is because uid_compare() in usrdb.c does not take the version key (#define VERSION_KEY "\0VERSION" in db.c) into account. -- Jilles Tjoelker From bugmaster at FreeBSD.org Mon Jun 29 11:06:54 2009 From: bugmaster at FreeBSD.org (FreeBSD bugmaster) Date: Mon Jun 29 11:07:25 2009 Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org Message-ID: <200906291106.n5TB6rAY046241@freefall.freebsd.org> Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total.