Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGi-F-based system

John Nielsen lists at jnielsen.net
Mon Jan 9 18:16:49 UTC 2012


On Jan 9, 2012, at 12:40 PM, Freddie Cash wrote:

> Just wondering if anyone else has run into a similar issue.
> 
> We have a ZFS storage server that was running 8.2-STABLE (from around
> beginning of Dec 2011) without any issues, that was upgraded to
> 9.0-RELEASE (to consolidate all the ZFS and networking fixes/updates
> and bring it up to version parity with our other ZFS storage server
> running 9.0) last Thursday.  The "svn switch" of the source tree, the
> buildworld, the buildkernel, the installkernel, the reboot with the
> new kernel, the installworld, the reboot into the new world, the
> mergemaster processes all completed successfully.  About half-way
> through the "make delete-old" process, the box locked up.  No messages
> on the console, no log entries of any kind, everything just stopped.
> Had to do a power-cycle.  And then everything went to hell.  :(
> 
> On reboot, the loader complained about not being able to determine
> which disk it was booting from (even though the new loader had already
> booted at least once), and gave strange messages about
> panic/free/something or other (didn't write that error down).
> 
> I was able to boot using a 9.0 install CD, drop to a loader prompt,
> unload the kernel/modules from CD, load the kernel/modules from the
> harddrive, set currdev to the harddrive, and boot.  But no matter what
> I did (gpart bootcode using pmbr/gptboot from CD or from HD; copy
> loader from CD, copy /boot from CD), I could not get the loader on the
> HD to load the kernel; always gave the same error message:  can't
> determine which disk we're booting from.
> 
> After trying for 24 hours to make it work, I just re-installed off the
> 9.0-RELEASE CD.
> 
> Now, this box (alphadrive) will freeze after running for between 3 and
> 10 hours.  Even when left completely idle, it will lock up after about
> 3 hours.  :(
> 
> I have another system (betadrive) that's almost identical hardware
> (chassis, backplane, SATA controllers are different, everything else
> is the same) that went from 8.2-STABLE to 9.0-RC2 to 9.0-RC3 to
> 9.0-RELEASE without any issues.  I've tried copying /boot/loader.conf,
> /etc/make.conf, /etc/src.conf, /etc/sysctl.conf, /etc/rc.conf from
> betadrive to alphadrive, without any change in the freezing behaviour.
> 
> These are ZFS storage systems, with / (UFS) and swap on SSDs, with 16
> or 24 SATA HDs in the pool (3x 5-disk raidz2 + spare and 4x 6-disk
> raidz2 resp).  All of the ZFS settings are identical between the two
> systems (pool name, pool properties, ZFS filesystems, ZFS properties
> per filesystem).  Dedupe and compression (LZJB) are enabled on both
> systems.
> 
> When alphadrive locks up, there are no entries made in any log files;
> there are no log entries on the console; there are no entries in the
> BIOS event log; there are no entries in the IPMI event log; the
> CPU/case temps are below 40C (emergency shutoff is 75C) as shown via
> IPMI; RAM usage is under 20 GB (24 GB per box) with the lowest being
> under 2 GB used (I run top on the console so I can see the stats when
> it locks up, and the time it locks up).  It just ... stops.
> 
> The system will even lock up when running in single-user mode, with
> only / mounted (ZFS not loaded, zpool not imported).
> 
> Hardware (alphadrive):
>  Chenbro 5U rackmount chassis with 24 hot-swap drive bays
>  SuperMicro H8DGi-F motherboard
>  AMD Opteron 2218 CPU (8 cores at 2.0 GHz)
>  24 GB DDR3-SDRAM
>  3x SuperMicro AOC-USAS-L8i SATA controllers (multi-lane break-out cables)
>  8x Seagate 7200.12 1.5 TB SATA harddrives
> 16x WD RE4 1.0 TB SATA harddrives
>  1x Kingston 60 GB SSD (for /, swap, L2ARC)
> 
> Hardware (betadrive):
>  SuperMicro 4U rackmount chassis with 16 hot-swap drive bays
>  SuperMicro H8DGi-F motherboard
>  AMD Opteron 2218 CPU (8 cores at 2.0 GHz)
>  24 GB DDR3-SDRAM
>  2x SuperMicro AOC-USAS2-L8i SATA controllers (multi-lane cables)
> 16x WD RE4 2.0 TB SATA harddrives
>  1x Kingston 60 GB SSD (for /, swap, L2ARC)
> 
> betadrive runs perfectly with FreeBSD 9.0-RELEASE.
> alphadrive locks up with FreeBSD 9.0-RELEASE.
> 
> We're currently investigating hardware firmware revisions to see if
> anything else is different between the two systems.
> 
> Has anyone experienced anything similar?  Does anyone have any ideas on
> what to look for?  Any suggestions on what to try next?

From what you've said, I strongly suspect that you have some kind of hardware issue. Dodgy RAM is my first guess, something cooling-related is my second, and the PSU is my third. It is a little suspicious that you only started having problems after your upgrade, but it could be coincidence, or it could be something about the new software tickling the hardware differently than the old.

Open it up, make sure you don't have dust buildup and that all the fans are spinning, re-seat the RAM and then boot into memtest for a few hours. If you have spare similar hardware you can also try swapping components until you isolate the fault.
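
Since nothing reaches the logs when the box stops, a small heartbeat writer may help bound the time of each lockup while you swap parts. The following is only a rough sketch and not something taken from the report above; the function names, the log path, and the 10-second interval are all assumptions to adjust for your system.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one timestamp line to path and force it to disk.

    `now` is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    with open(path, "a") as log:
        log.write(stamp + "\n")
        log.flush()
        # fsync so the line is not left sitting in a buffer when the box freezes
        os.fsync(log.fileno())
    return stamp


def run(path="/var/log/heartbeat.log", interval=10):
    """Write a heartbeat every `interval` seconds until interrupted.

    The default path and interval are hypothetical values.
    """
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

Calling run() from a console session and reading the last line of the file after the power-cycle should show roughly when the system stopped, to within one interval. A hard freeze can still lose the final entry, so treat the last timestamp as a lower bound.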

Good luck,

JN



More information about the freebsd-stable mailing list