Status of support for 4KB disk sectors

Jeremy Chadwick freebsd at jdc.parodius.com
Tue Jul 19 04:05:48 UTC 2011


On Mon, Jul 18, 2011 at 11:38:00PM -0400, Glen Barber wrote:
> On 7/18/11 7:41 PM, Jeremy Chadwick wrote:
> > On Mon, Jul 18, 2011 at 03:50:15PM -0700, Kevin Oberman wrote:
> >> I just want to check on the status of 4K sector support in FreeBSD.  I read
> >> a long thread on the topic from a while back and it looks like I might hit some
> >> issues if I'm not REALLY careful. Since I will be keeping the existing Windows
> >> installation, I need to be sure that I can set up the disk correctly without
> >> screwing up Windows 7.
> >>
> >> I was planning on just DDing the W7 slice over, but I am not sure how
> >> well this would play with GPT.  Or should I not try to use GPT at all?
> >> I'd like to, as this laptop spreads Windows 7 over two slices and adds
> >> a third for the recovery system, leaving only one for FreeBSD, and I'd
> >> like to put my files in a separate slice.  GPT would offer that fifth
> >> slice.
> >>
> >> I have read the handbook and don't see any reference to 4K sectors,
> >> and only a one-liner about gpart(8) and GPT.  Once I get this all
> >> figured out, I'll see about writing an update about this, as GPT looks
> >> like the way to go in the future.
> > 
> > When you say "4KB sector support", what do you mean by this?  All
> > drives on the market as of this writing, that I've seen, claim a
> > physical/logical sector size of 512 bytes -- yes, even SSDs, and EARS
> > drives which we know use 4KB sectors.  They do this to guarantee full
> > compatibility with existing software.
> > 
> > Since you're talking about gpart and "4KB sector support", did you mean
> > to ask "what's the state of FreeBSD and aligned partition support to
> > ensure decent performance with 4KB-sector drives?"
> > 
> > If so: there have been some commits in recent days to RELENG_8 to help
> > try to address the shortcomings of the existing utilities and GEOM
> > infrastructure.  Read the most recent commit text carefully:
> > 
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sbin/geom/class/part/geom_part.c
> > 
> > But the currently "known method" is to use gnop(8).  Here's an example:
> > 
> > http://www.leidinger.net/blog/2011/05/03/another-root-on-zfs-howto-optimized-for-4k-sector-drives/
> > 
> 
> Notice: I'm reading this as "how badly do 'green drives' suck?"

It's important to note that not all WD Caviar Green drives use 4KB
sectors.  As of this writing, WD uses the four-letter string "EARS" in
the drive model number to denote drives that use 4KB sectors.
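
A quick way to check what model string a drive reports is smartctl(8)
from sysutils/smartmontools (the device name below is only an example;
substitute whatever your disk attaches as):

  smartctl -i /dev/ada0 | grep -i 'device model'

If the model string contains "EARS", it's one of the 4KB "Advanced
Format" drives.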

The Green series does have other problems that people have experienced,
such as firmware bugs/quirks causing the drive to repeatedly park its
heads in the landing zone (witnessed as either really bad drive
performance, or the drive falling off the bus and reattaching).  You can
detect this situation by looking at SMART attribute 193
(Load_Cycle_Count).  A very high number (in the tens or hundreds of
thousands for a drive that has only been in use for a week or so) is an
indicator of the problem.
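
For example (again, the device name is just a placeholder):

  smartctl -A /dev/ada0 | grep -i load_cycle

Watch the RAW_VALUE column; if it's climbing into the tens of thousands
after only a week or two of uptime, the drive is doing the aggressive
idle head-parking.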

WD has apparently given people firmware updates to fix the issue.
However, the drive's firmware version number does not change after the
microcode is updated, even though the update does fix the problem.  (For
what it's worth, Samsung pulled this same manoeuvre with the firmware
updates for a catastrophic bug in their SpinPoint F4 drives.)  What I'm
saying is there's no way to tell whether or not your drive is running
the fixed firmware, other than watching said SMART attribute.

I do have references for this issue, but it will take me some time to
dig up the URLs and so on.

> FWIW, I've recently done the gnop(8) trick to two "green" drives in one
> of my machines because I was seeing horrifying performance problems with
> what I consider to be basic stuff, like 'portsnap extract', but more
> severely with copying large data (file-backed bacula files to be exact)
> into said datasets.  I have yet to retry my read/write tests with drives
> I have not converted with gnop(8).

I imagine this would have a tremendous effect on performance.  With
SSDs, the estimated performance impact is between 30% and 50% depending
on the workload -- meaning SSDs with aligned partitions perform 30-50%
better.  When you read about how NAND cells and NAND flash pages work
(look it up on Wikipedia; look for FTL, the flash translation layer), it
makes sense.  With mechanical HDDs, I'm not sure what the performance
hit is, but I imagine it's large.
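
For anyone who hasn't seen the gnop(8) trick Glen mentions, it goes
roughly like this -- a sketch only, done at pool creation time, with
the disk and pool names as placeholders:

  # make a .nop device that advertises 4096-byte sectors
  gnop create -S 4096 /dev/ada1
  # build the pool on the .nop device so ZFS picks ashift=12
  zpool create tank /dev/ada1.nop
  # the ashift is now baked in; drop the gnop layer
  zpool export tank
  gnop destroy /dev/ada1.nop
  zpool import tank

Once the pool exists, the ashift value is permanent, so the gnop device
can be thrown away and the pool imported from the real disk.  Something
like 'zdb tank | grep ashift' should then show 12.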

Furthermore, talking about SSDs again: I want to make folks aware of the
fact that Intel SSDs use an 8KB NAND flash page (not 4KB!).  NAND pages
are erased 256 pages at a time (8*256=2MByte).  When it comes to
alignment, flash page size is what's of concern.  So for Intel SSDs (X25
series, 320 series, and 510 series), 8KByte-aligned is the way to go.
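
Getting the partition itself onto a friendly boundary is the other half
of it.  With GPT the simplest approach is to start the data partition at
the 1MByte mark, which divides evenly by both 4KB and 8KB.  A sketch
(disk name is a placeholder; older gpart may also want an explicit -s
size):

  gpart create -s gpt ada1
  gpart add -b 2048 -t freebsd-zfs ada1

2048 logical sectors * 512 bytes = 1MByte, so filesystem blocks won't
end up straddling a physical 4KB sector or an 8KB flash page.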

> I have not conclusively tested all possible combinations of
> configurations, nor reverted the changes to the drives to retest, but if
> it is of any interest, here's what I'm seeing.
> 
> I have comparisons between WD "green" and "black" drives.
> Unfortunately, the machines are not completely similar - one is a
> Core2Quad, the other Core2Duo; one has 6GB RAM, the other 8GB RAM; also,
> 'orion' is running a month-old 8-STABLE; 'kaos' is running a 2-week-old
> -CURRENT.  Both machines are using ZFSv28:
> 
> orion % sysctl -n hw.ncpu; sysctl -n hw.physmem
> 4
> 6353416192
> 
> kaos % sysctl -n hw.ncpu; sysctl -n hw.physmem
> 2
> 8534401024
> 
> The drives in 'orion' are 1TB WD green drives in a ZFS mirror; the
> drives in 'kaos' are 1TB WD black drives in a raidz1 (3 drives).
> 
> First the read test:
> 
> kaos % sh -c 'time find /usr/src -type f -name \*.\[1-9\] >/dev/null'
> 	12.94 real         0.60 user        11.95 sys
> 
> orion % sh -c 'time find /usr/src -type f -name \*.\[1-9\] >/dev/null'
> 	118.02 real         0.46 user         8.74 sys
> 
> I guess no real surprise here.  'kaos' has more spindles to read from,
> on top of faster seek times.
> 
> Next the write test:
> 
> The 'compressed' and 'dedup' datasets referenced below are 'lzjb' and
> 'sha256,verify', respectively.  I'd wait for the 'compressed+dedup'
> tests to finish, but I have to wake up tomorrow morning.
> 
> orion# sh -c 'time portsnap extract -p /zstore/perftest >/dev/null'
> 	306.71 real        44.37 user       110.28 sys
> 
> orion# sh -c 'time portsnap extract -p /zstore/perftest_compress >/dev/null'
> 	166.62 real        43.87 user       109.49 sys
> 
> orion# sh -c 'time portsnap extract -p /zstore/perftest_dedup >/dev/null'
> 	3576.43 real        44.98 user       109.12 sys
> 
> kaos# sh -c 'time portsnap extract -p /perftest >/dev/null'
> 	311.31 real        51.23 user       193.37 sys
> 
> kaos# sh -c 'time portsnap extract -p /perftest_compress >/dev/null'
> 	269.85 real        49.55 user       191.56 sys
> 
> kaos# sh -c 'time portsnap extract -p /perftest_dedup >/dev/null'
> 	4655.73 real        51.86 user       196.22 sys
> 
> Like I said, I have not yet had the time to retest this on drives
> without the gnop(8) fix (another similar zpool with 2 drives), so maybe
> the data I'm providing isn't relevant, but since the gnop(8) fix for 4K
> sector drives was mentioned, I thought it might be relevant to a point.

The problem with what you're testing here is that it's not really
"testing the drive" -- it's testing multiple drives with ZFS in the
middle.  Using dd against the raw device would address that.  For
testing "non-aligned" offsets (for the EARS drives), use the seek=
parameter.  I would also recommend picking an awkwardly-sized bs= value,
such as 61340.
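
Something along these lines, with ada1 standing in for a scratch disk
(these write to the raw device, so they destroy whatever is on it).
Note that FreeBSD's disk devices only accept I/O in multiples of the
logical sector size, so the "awkward" bs= value has to be rounded to a
multiple of 512 that still isn't a multiple of 4096:

  # aligned: 1MByte blocks starting at offset 0
  dd if=/dev/zero of=/dev/ada1 bs=1m count=2048

  # misaligned: 60928 = 512 * 119, and seek=1 starts the writes at an
  # offset that isn't on a 4KB boundary
  dd if=/dev/zero of=/dev/ada1 bs=60928 seek=1 count=16384

On a true 4KB-sector drive the second run should be noticeably slower,
since most of those writes force the drive into read-modify-write
cycles.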

> > Now, that's for ZFS, but I'm under the impression the exact same is
> > needed for FFS/UFS.
> > 
> > <rant> Do I bother doing this with my SSDs?  No.  Am I suffering in
> > performance?  Probably.  Why do I not care?  Because the level of
> > annoyance is extremely high -- remember, all of this has to be done from
> > within the installer environment (referring to "Emergency Shell"), which
> > on FreeBSD lacks an incredible amount of usability, and is even worse to
> > deal with when doing a remote install via PXE/serial.  Fixit is the only
> > decent environment.  Given that floppies are more or less gone, I don't
> > understand why the Fixit environment doesn't replace the "Emergency
> > Shell". </rant>
> > 
> 
> Not that it necessarily helps in a PXE environment, but a memstick of
> 9-CURRENT has helped me recover minor "oops" situations a few times over
> the past few months.  What are these "floppies" you speak of, again?  :)

Sure, USB flash drives work great.  But it's a little hard to plug in a
USB flash drive when you're 3000 miles away.  :-)  mm's mfsBSD is also
useful for recovery situations:

http://mfsbsd.vx.sk/

My point, though, was this: Fixit was separate from Emergency Shell
because of space concerns on floppy disks (Fixit wouldn't fit).  Since
floppies really aren't used much any more, that decision should be
revisited.  IMHO Fixit should be removed and Emergency Shell should
provide the same environment/utilities/etc. as Fixit.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |


