3ware 7506, FreeBSD 4.x, Maxtor Disks & SMART Problems.

Carroll Kong me at carrollkong.com
Tue Sep 14 18:44:09 PDT 2004


Actually, I was using smartmontools improperly in my tests.

It isn't enough to hope that smartd will catch an error.  It is better to run
the long self-tests with smartctl (forget about smartd) and shut down most of
the box's I/O so the test completes within an hour or two.  I wasn't sure how
it worked until I tried it on a local test host.
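
For anyone who wants to try the same thing, here is a rough sketch of
kicking off the long self-test on every disk behind the card.  The
controller node (/dev/twe0), the port list (0-3) and the helper names are
assumptions for illustration, so adjust them for your box.  It only prints
the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: start a long SMART self-test on each disk behind a 3ware card.
# Assumptions (not from the original post): the controller node is
# /dev/twe0 and the disks sit on ports 0-3.  Helper names are made up.
CTRL="${CTRL:-/dev/twe0}"
PORTS="${PORTS:-0 1 2 3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ask each drive to begin its extended (long) self-test.
start_long_tests() {
    for port in $PORTS; do
        run smartctl -d "3ware,$port" -t long "$CTRL"
    done
}

# Once the tests have had time to finish, read back each self-test log.
show_selftest_logs() {
    for port in $PORTS; do
        run smartctl -d "3ware,$port" -l selftest "$CTRL"
    done
}

start_long_tests
```

Run show_selftest_logs after the tests have had a couple of hours on a
quiet box.  A test that stops early with a read failure is the sort of
result I am describing here.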

One of the Maxtor disks at the colocation with the 3ware card was failing to
complete its long tests because of read errors.  This is in line with Jason's
postings about how Maxtor disk errors can cause a total I/O hang.  Note that
this is the start of a disk failure, not a total disk failure.

Mike's errors occurred on a RAID5 system in DEGRADED mode with Western Digital
disks.  I can only assume the 3ware problem with handling failed IDE disks
extends to DEGRADED mode as well as NORMAL mode (but with Maxtor
disks).

I strongly suggest 3ware users check their automated monitoring tools and do
periodic smartctl disk checks to verify a disk's health.  It is important to
remove an ailing disk before a total failure since it can cause a total
system hang.
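
As a starting point for such a periodic check, here is a small sketch that
turns smartctl's exit status into readable warnings.  The exit status is a
bitmask, documented under RETURN VALUES in smartctl(8).  The function
names, the controller node and the port list are assumptions for
illustration only.

```shell
#!/bin/sh
# Sketch: decode the smartctl exit status bitmask into warnings.
# Bit meanings follow the RETURN VALUES section of smartctl(8).
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    if [ $((status & 1)) -ne 0 ]; then echo "command line did not parse"; fi
    if [ $((status & 2)) -ne 0 ]; then echo "device open failed"; fi
    if [ $((status & 4)) -ne 0 ]; then echo "SMART command failed or checksum error"; fi
    if [ $((status & 8)) -ne 0 ]; then echo "SMART status: DISK FAILING"; fi
    if [ $((status & 16)) -ne 0 ]; then echo "prefail attribute at or below threshold"; fi
    if [ $((status & 32)) -ne 0 ]; then echo "attribute was at or below threshold in the past"; fi
    if [ $((status & 64)) -ne 0 ]; then echo "device error log contains errors"; fi
    if [ $((status & 128)) -ne 0 ]; then echo "self-test log contains errors"; fi
    return 0
}

# Hypothetical wrapper: check every port and report what smartctl found.
# Assumes the controller is /dev/twe0 with disks on ports 0-3.
check_all_ports() {
    for port in ${PORTS:-0 1 2 3}; do
        smartctl -q silent -a -d "3ware,$port" "${CTRL:-/dev/twe0}"
        echo "port $port: $(decode_smartctl_status $?)"
    done
}
```

Something like check_all_ports could be run from cron, with the output
mailed to whoever watches the box.  Anything other than "ok" is a reason
to take a closer look at that disk.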

The bus issues do not affect us at all, or at least they should not.  I have
a freak of nature 7450 card, which is a 64-bit/33 MHz card and no longer
available.  The issue only affects 66 MHz and riser card users.

The bus issues were noted and documented in 3ware's advisories.  Whether you
are affected depends on when you purchased your card and whether it is
running at 66 MHz.

http://www.3ware.com/KB/kb.asp
https://www.3ware.com/kbadmin/attachments/TM900-0045-00%20Rev%20A_P.pdf

The general idea is that older production runs of the 3ware card do not
handle high-speed timings well, which was a big problem with riser card +
66 MHz combinations.  You can purchase a special riser card from

www.adexelec.com

PCITX8-3R

This riser card was specially designed by Adex to help resolve 3ware's known
issue.  Any new 3ware cards should not experience this issue.

Unfortunately, this issue is not the same as mine, but I hope this
information helps you.  (Consider purchasing the special PCI riser card with
additional resistors.)



- Carroll Kong
----- Original Message -----
From: "Anton Ivanov" <ai1 at ipaccess.com>
To: <freebsd-stable at freebsd.org>
Cc: "Carroll Kong" <me at carrollkong.com>; "Jason Thomson"
<jason.thomson at mintel.com>
Sent: Monday, September 13, 2004 12:00 PM
Subject: Re: 3ware 7506, FreeBSD 4.x, Maxtor Disks & SMART Problems.


> Hi Carroll,
>
> Pass this to the list as for some reason it times out my posts to
> it.
>
> We had similar issues with 3ware on riser cards. 8506-8 different
> board designs will either hang or not even boot on riser cards
> that are not grounded correctly (see if the holes on the edges of
> the riser cards are connected to the chassis). It never got past a
> 30G transfer.
>
> After grounding the failures became less common dropping to a hang
> after 300-500G or so.
>
> Without riser cards it does not fail.
>
> Dropping the bus to 33 (8506-8 a 66 card) also makes it behave (so
> far).
>
> Basically, it is the most fussy adapter I have seen so far in terms
> of bus noise requirements (we have observed the failures under all
> OSes we use so it is not an OS issue).
>
> Brgds,
>
> A.
>
> On Thursday 09 September 2004 23:04, Carroll Kong wrote:
> > <sniffles>  Well, I tried your command first to see if we could
> > get a quick "test".  It ran for about 30 mins, I re-ran it on
> > another partition, and I even ran another instance of it on yet
> > another partition.  It kept going.
> >
> > Then we ran some very file intensive activity (it updates
> > hundreds of files in a lot of different directories) and it
> > seemed to stall my writes to the /var partition, which in turn
> > broke one of my instances of dnscache, which then snowballed
> > until doing tail /var/log/messages would hang the zsh completely.
> >
> > I am thinking it "might" be because we have not done array
> > integrity checks yet because our firmware is so old and the old
> > 3dm does not support it.
> >
> > Throughout the entire time, smartd was running and reported two
> > more errors on one of the disks, but nothing major.  I am not
> > 100% sure if there was a correlation between those errors and
> > the ultimate meltdown.
> >
> > I retested this again, I was able to get it to hang again.  I am
> > beginning to wonder if the other things I use could play a role
> > in this.
> >
> > - using vnode backed disks/partitions
> > - using jails
> > - using a pci riser card (my 7450 is 64bit/33mhz, the problem
> > does not apply to me)
> > however, there is a slight chance that maybe my riser card isn't
> > seated as well?  Really trying to grasp at straws here.
> >
> > I tried truss on the "file intense" activity, all it showed was
> > after a certain time it hung at a stat call.  Basically all I/O
> > just halted magically.
> >
> > SmartD did not show any error increases when the problem
> > occurred.  Maybe it's not my maxtors then.
> >
> >
> >
> > - Carroll Kong
> > ----- Original Message -----
> > From: "Jason Thomson" <jason.thomson at mintel.com>
> > To: "Carroll Kong" <me at carrollkong.com>
> > Cc: <freebsd-stable at freebsd.org>
> > Sent: Thursday, September 09, 2004 6:12 AM
> > Subject: Re: 3ware 7506, FreeBSD 4.x, Maxtor Disks & SMART
> > Problems.
> >
> > > Hi Carroll,
> > >
> > > I posted the original problem report you referred to earlier.
> > >
> > > 3ware are looking into the problem.  It looks like it's a
> > > problem with 3ware's firmware (perhaps related to some anomaly
> > > in the way that Maxtor disks behave).
> > >
> > > It would appear that it's only a problem when the disk has
> > > errors.
> > >
> > > On one machine,  I can reproduce this problem by dd'ing from
> > > the RAID5 array:
> > >
> > > dd if=/dev/twed0s1h iseek=137510 bs=1m of=/dev/null
> > >
> > > On that machine the lockup will *always* be preceded by
> > > the following message on the console:
> > >
> > > twe0: AEN: <twe0: port 3: sector repair occurred>
> > >
> > > Do you have any error messages on the console?
> > >
> > >
> > > I think that the disk I have on port 3 is flakey.  I could
> > > replace the disk,  but I'm waiting until 3ware get back to me
> > > / issue a fix so I can have some reasonable idea that the
> > > problem is fixed.
> > >
> > > 3ware *have* been looking into this problem,  and I think have
> > > established that it's a firmware rather than a driver issue
> > > (it occurs with other OSes as well apparently).  I don't know
> > > how close they are to being able to fix this.
> > >
> > > We buy all our new machines with Western Digital disks (and
> > > 3ware controllers).  No problems yet (and we have about 10 of
> > > them - more than we have with Maxtor disks).
> > >
> > > (BTW I have established over a period of months that this
> > > problem has existed with various versions of the driver and
> > > firmware dating back to 2003.  It still exists with the latest
> > > FBSD driver and 3ware firmware: FE7X 1.05.00.068)
> > >
> > > Carroll Kong wrote:
> > > > I tried using the SmartD 5.33 (CVS).  It appeared to work, but
> > > > did not pick up anything in the next crash.  I noticed some
> > > > temperature changes, and I plan on running some difference
> > > > tests, but nothing out of the ordinary.
> > > >
> > > > This time the crash hung a lot of httpds and got them stuck
> > > > into the D state.  We had something like this happen before
> > > > ... but now that I think about it, it matches the experience
> > > > of Jason almost perfectly.
> > > >
> > > > Upon lockup, sometimes we still have partial control of the
> > > > system.  The processes waiting on the 3ware card cannot be
> > > > killed.  The web sites that are still in cache are servable.
> > > >
> > > > It occurred when a big I/O request was going through (along
> > > > with the normal web traffic).  The odd thing is, it's not a
> > > > function of raw I/O, since our definition of big I/O was
> > > > simply 3-4MB/sec according to iostat.  It seems over time it
> > > > just... well it just goes kaput if you push it a bit hard
> > > > after a long days run of non-stop I/Os.
> > > >
> > > > The initial fsck we do runs at 17MB/sec at far more
> > > > transactions per second.
> >
> > > > Anyway, I am convinced the problem is somehow related to the
> > > > 3ware system (either the disks, the controller or something).
> > > > Originally I was looking at other possibilities, but seeing
> > > > people's experiences here, and a colleague of mine's
> > > > experience, something fishy is going on.
> > > >
> > > > I am leaning towards a full hdd swap, seems like I will have
> > > > to replace one disk at a time and let it rebuild slowly to
> > > > eventually swap out all the disks.  I am able to get this
> > > > problem to occur faster and faster now, unfortunately it is a
> > > > production box and we would much rather it not.  And I am
> > > > going to switch off to Seagate instead of Maxtor.  Despite
> > > > using 3ware+maxtor on other machines here, (but they have
> > > > considerably less load), it's just too much of a coincidence
> > > > that 3 different people including myself have had problems
> > > > with 3ware+maxtor whereas you can easily find that many and
> > > > more that have it working fine with another vendor.
> > > >
> > > >
> > > >
> > > > - Carroll Kong
> > > > ----- Original Message -----
> > > > From: "Carroll Kong" <me at carrollkong.com>
> > > > To: "Jason Thomson" <jason.thomson at mintel.com>;
> > > > <so14k at so14k.com>
> > > > Cc: <vkayshap at amcc.com>; <freebsd-stable at freebsd.org>
> > > > Sent: Wednesday, September 08, 2004 3:24 PM
> > > > Subject: Re: 3ware 7506, FreeBSD 4.x, Maxtor Disks & SMART
> > > > Problems.
> > > >
> > > >>Hi, in reference to this
> > > > >>http://lists.freebsd.org/pipermail/freebsd-stable/2004-June/007828.html
> > > >>
> > > >>I have a FreeBSD 4.10-p2 system, using a 7450 with 4xMAXTOR
> > > >> 6L080J4  (80 gig) disks.
> > > >>
> > > >>Raid 5 setup.
> > > >>
> > > >>      Monitor version: ME7X 1.01.00.035
> > > >>      Firmware version: FE7X 1.05.00.036
> > > >>      BIOS version: BE7X 1.08.00.044
> > > >>
> > > >>
> > > >>(Firmware 7.5.3 basically).
> > > >>
> > > > >>I am also having the same problems you are having.  Randomly
> > > > >>under heavy I/O the system will just halt I/O requests.  No
> > > > >>error messages on the console, it would just start to hang
> > > > >>and halt completely.  (no kernel panics at all).
> > > >
> > > > >>I believe I have the same problem you do.  Were you able to
> > > > >>resolve the issue or narrow it down?  The machine is not
> > > > >>local, but I am curious if you did resolve it, what version
> > > > >>of FreeBSD did you have?  What firmware?  And did you have to
> > > > >>do the powermax testing on all the disks or not?
> > > >>
> > > > >>I cannot easily do the powermax testing yet, and my firmware
> > > > >>is older and I am still running into this problem (which
> > > > >>should have all the twe driver fixes).
> > > >>
> > > > >>I tried using "Smartmontools" to verify if the Maxtor disks
> > > > >>are okay since they only work for Linux + 3Ware.
> > > >>
> > > >>Thanks in advance!
> > > >>
> > > >>
> > > >>
> > > >>- Carroll Kong
> > > >>
> > > >>
> > > >>
> > > >>_______________________________________________
> > > >>freebsd-stable at freebsd.org mailing list
> > > >>http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> > > > >>To unsubscribe, send any mail to
> > > > >>"freebsd-stable-unsubscribe at freebsd.org"


