svn commit: r265864 - head/sys/dev/vt/hw/ofwfb

Sun May 11 15:34:00 UTC 2014

On 05/10/14 23:51, Bruce Evans wrote:
> On Sun, 11 May 2014, Nathan Whitehorn wrote:
>
>> Log:
>>  Make ofwfb not be painfully slow. This reduces the time for a verbose
>>  boot on my G4 iBook by more than half. Still 10% slower than syscons,
>>  but that's much better than a factor of 2.
>>
>>  The slowness had to do with pathological write performance on 8-bit
>>  framebuffers, which are almost universally used on Open Firmware
>>  systems. Writing 1 byte at a time, potentially nonconsecutively,
>>  resulted in many extra PCI write cycles. This patch, in the common case
>>  where it's writing one or several characters in an 8x8 font, gangs the
>>  writes together into a set of 32-bit writes. This is a port of r143830
>>  to vt(4).
>
> Only 10% slower?  Bitmapped mode with 256 colors is inherently 4 times
> slower for an 8x8 font (8 bytes/char instead of 2) and 8 times slower for
> an 8x16 font.  That's without any I/O pathology.  Perhaps you are
> comparing with a syscons that is already very slow due to the hardware
> not supporting text mode.
>
> However, syscons has buffering that should limit this problem.

This is indeed a comparison with syscons in bitmap mode. PowerPC has no VGA 
text mode, so that's the best we could do. However, it was totally 
unreasonable that using newcons's bitmap console instead of syscons's 
bitmap console almost tripled my boot time, and that needed fixing. 
Whatever buffering syscons may have beyond what newcons has is at most a 
10% thing.

>>  The EFI framebuffer is also extremely slow, probably for the same
>>  reason, and the same patch will likely help there.
>>
>> Modified:
>>  head/sys/dev/vt/hw/ofwfb/ofwfb.c
>>
>> Modified: head/sys/dev/vt/hw/ofwfb/ofwfb.c
>> ==============================================================================
>> --- head/sys/dev/vt/hw/ofwfb/ofwfb.c    Sun May 11 01:44:11 2014    (r265863)
>> +++ head/sys/dev/vt/hw/ofwfb/ofwfb.c    Sun May 11 01:58:56 2014    (r265864)
>> @@ -136,6 +136,10 @@ ofwfb_bitbltchr(struct vt_device *vd, co
>>     uint32_t fgc, bgc;
>>     int c;
>>     uint8_t b, m;
>> +    union {
>> +        uint32_t l;
>> +        uint8_t     c[4];
>> +    } ch1, ch2;
>>
>>     fgc = sc->sc_colormap[fg];
>>     bgc = sc->sc_colormap[bg];
>> @@ -147,36 +151,70 @@ ofwfb_bitbltchr(struct vt_device *vd, co
>>         return;
>>
>>     line = (sc->sc_stride * top) + left * sc->sc_depth/8;
>> -    for (; height > 0; height--) {
>> -        for (c = 0; c < width; c++) {
>> -            if (c % 8 == 0)
>> +    if (mask == NULL && sc->sc_depth == 8 && (width % 8 == 0)) {
>> +        for (; height > 0; height--) {
>> +            for (c = 0; c < width; c += 8) {
>>                 b = *src++;
>> -            else
>> -                b <<= 1;
>> -            if (mask != NULL) {
>> +
>
> Style bug (extra newline).

This newline is an artifact of the weird way SVN has chosen to make a 
diff. There is no actual newline inside this loop.

>> +                /*
>> +                 * Assume that there is more background than
>> +                 * foreground in characters and init accordingly
>> +                 */
>> +                ch1.l = ch2.l = (bg << 24) | (bg << 16) |
>> +                    (bg << 8) | bg;
>> +
>> +                /*
>> +                 * Calculate 2 x 4-chars at a time, and then
>> +                 * write these out.
>> +                 */
>> +                if (b & 0x80) ch1.c[0] = fg;
>> +                if (b & 0x40) ch1.c[1] = fg;
>> +                if (b & 0x20) ch1.c[2] = fg;
>> +                if (b & 0x10) ch1.c[3] = fg;
>> +
>> +                if (b & 0x08) ch2.c[0] = fg;
>> +                if (b & 0x04) ch2.c[1] = fg;
>> +                if (b & 0x02) ch2.c[2] = fg;
>> +                if (b & 0x01) ch2.c[3] = fg;
>
> Style bugs (missing newlines).

This is copied and pasted from the syscons driver. I'd prefer to keep it 
the same while we have both in the tree and are trying to keep them in sync.
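
For readers who want to see the ganged-write idea outside the diff, here is 
a minimal standalone sketch in C. It is not the driver code: expand_row8() 
and store_row8() are hypothetical names, the frame buffer is an ordinary 
byte array, and memcpy() stands in for the two 32-bit stores that the patch 
issues to device memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Expand one 8-pixel font row into eight 8-bit pixels.
 * Bit 0x80 of 'row' is the leftmost pixel.  The result is packed in
 * memory order into two 32-bit words, so it can be stored with two
 * 32-bit writes instead of eight 8-bit writes.
 */
static void
expand_row8(uint8_t row, uint8_t fg, uint8_t bg, uint32_t out[2])
{
	uint8_t px[8];
	int i;

	for (i = 0; i < 8; i++)
		px[i] = (row & (0x80 >> i)) ? fg : bg;

	/* memcpy keeps memory order and avoids aliasing problems. */
	memcpy(&out[0], &px[0], 4);
	memcpy(&out[1], &px[4], 4);
}

/*
 * Store one expanded row at 'offset' in a hypothetical 8-bit frame
 * buffer.  A real driver would issue two 32-bit stores to device
 * memory here; this sketch only copies into ordinary memory.
 */
static void
store_row8(uint8_t *fb, size_t offset, uint8_t row, uint8_t fg, uint8_t bg)
{
	uint32_t words[2];

	expand_row8(row, fg, bg, words);
	memcpy(fb + offset, words, sizeof(words));
}
```

Calling store_row8(fb, line + c, b, fg, bg) once per font byte would 
correspond to the inner loop body of the fast path, under the same 
restrictions the patch checks (no mask, 8-bit depth, width a multiple of 8).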

>> +
>> +                *(uint32_t *)(sc->sc_addr + line + c) = ch1.l;
>> +                *(uint32_t *)(sc->sc_addr + line + c + 4) =
>> +                    ch2.l;
>> +            }
>> +            line += sc->sc_stride;
>> +        }
>> +    } else {
>> +        for (; height > 0; height--) {
>> +            for (c = 0; c < width; c++) {
>>                 if (c % 8 == 0)
>> -                    m = *mask++;
>> +                    b = *src++;
>>                 else
>> -                    m <<= 1;
>> -                /* Skip pixel write, if mask has no bit set. */
>> -                if ((m & 0x80) == 0)
>> -                    continue;
>> -            }
>> -            switch(sc->sc_depth) {
>> -            case 8:
>> -                *(uint8_t *)(sc->sc_addr + line + c) =
>> -                    b & 0x80 ? fg : bg;
>> -                break;
>> -            case 32:
>> -                *(uint32_t *)(sc->sc_addr + line + 4*c) =
>> -                    (b & 0x80) ? fgc : bgc;
>> -                break;
>> -            default:
>> -                /* panic? */
>> -                break;
>> +                    b <<= 1;
>> +                if (mask != NULL) {
>> +                    if (c % 8 == 0)
>> +                        m = *mask++;
>> +                    else
>> +                        m <<= 1;
>> +                    /* Skip pixel write, if mask not set. */
>> +                    if ((m & 0x80) == 0)
>> +                        continue;
>> +                }
>> +                switch(sc->sc_depth) {
>> +                case 8:
>> +                    *(uint8_t *)(sc->sc_addr + line + c) =
>> +                        b & 0x80 ? fg : bg;
>> +                    break;
>> +                case 32:
>> +                    *(uint32_t *)(sc->sc_addr + line + 4*c)
>> +                        = (b & 0x80) ? fgc : bgc;
>> +                    break;
>> +                default:
>> +                    /* panic? */
>> +                    break;
>> +                }
>>             }
>> +            line += sc->sc_stride;
>>         }
>> -        line += sc->sc_stride;
>>     }
>> }
>
> A correctly-implemented console driver doesn't have itty-bitty hardware
> i/o like the old version of this or itty-bitty buffering like the changed
> version.

There are many deficiencies in the general approach being used here. I'm 
trying to patch it just to work for the time being so that it isn't a 
huge regression in console performance compared to syscons. Hopefully, 
the general architectural issues -- which you outline well below -- get 
solved in due course. This patch at least fixes the immediate problem.
-Nathan

> I thought that syscons always had correct buffering. Actually, it
> uses a hybrid scheme where, at least in text mode, the initial i/o is
> itty-bitty 1 character+attribute at a time (16-bit i/o), but scrolling
> and screen refresh are done by bcopy(), bcopy_io(), bcopy_fromio() and
> bcopy_toio() and a couple of other functions (bzero_io(), fill*())
> from/to a properly cached buffer in normal memory.  It used to use
> only bcopy() and a couple of others (bzero(), fill*()), so it
> automatically did 64-bit i/o's on 64-bit systems, except for fillw*()
> which was intentionally 16 bits for compatibility (but it didn't use
> bcopy() which is needed for even more compatibility).  It is unclear
> which old systems break with frame buffer i/o's larger (or smaller)
> than 16 bits.  I never had any (x86) hardware that didn't work with any
> size.  The video card might be 16-bit only, but then it should just
> tell the CPU this so that the CPU reduces to 16 bits using standard
> x86 mechanisms.  Video cards have been PCI or better for about 20
> years.  PCI should support precisely 32-bits, but 64-bit frame buffer
> accesses to PCI and AGP video cards always worked for me.
>
> bcopy*io() is more technically correct, but is very badly implemented
> and much slower than bcopy() on most systems.  Its misimplementation
> includes not even using bus-space on x86.  All bcopy*io() functions
> use copyw() on x86, and copyw() is just a dumb 16-bit memcpy() written
> in C.  Writing it in C doesn't lose anything when it is used for a
> slow i/o memory, but doing 16-bit i/o's does.  And doing 16-bit i/o's
> doesn't even give compatibility, since bzero_io() is just bzero() on
> x86, so it always does wider i/o's.  syscons has always used fillw*()
> and never plain fill() since it doesn't want the corresponding 32-bit
> writes that might be given by fill().  fill() actually does 8-bit
> writes.  fb also uses the badly named and implemented filll_io().
> This doesn't actually support longs, but only u_int32_t. fill_io()
> is at least ifdefed on ${ARCH}, so its access size is not completely
> hard-coded.  On arm and mips, all the ifdefed "io" functions except
> fill_io use plain memcpy() or memset() so they get a maximum access
> size and minimum hardware compatibility.  fillw() is 16 bits on these
> arches since the access size is hard-coded in the API (and conversion
> to memset() is not done).
>
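To make the access-width point concrete, the following sketch counts how 
many memory accesses a fixed 16-bit copy loop issues compared with a copy 
in 64-bit units. Both functions are hypothetical stand-ins written for 
illustration (they are not the kernel's copyw() or bcopy()), and they 
operate on ordinary memory rather than a real frame buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical copyw()-style loop: every access is exactly 16 bits.
 * Returns the number of store accesses issued.
 */
static size_t
copy16_sketch(const uint16_t *src, uint16_t *dst, size_t nbytes)
{
	size_t n = nbytes / 2;
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
	return n;
}

/*
 * The same copy done in 64-bit units where possible, with a byte loop
 * for any tail.  Returns the number of store accesses issued.
 */
static size_t
copy64_sketch(const void *src, void *dst, size_t nbytes)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	size_t accesses = 0;
	uint64_t w;

	while (nbytes >= 8) {
		memcpy(&w, s, 8);	/* one 64-bit load */
		memcpy(d, &w, 8);	/* one 64-bit store */
		s += 8;
		d += 8;
		nbytes -= 8;
		accesses++;
	}
	while (nbytes > 0) {
		*d++ = *s++;
		nbytes--;
		accesses++;
	}
	return accesses;
}
```

For an 80x25 text screen of 4000 bytes, the 16-bit loop issues 2000 stores 
and the 64-bit version 500. On slow i/o memory the cost is roughly per 
access, which is why the narrower loop loses.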
> Pessimizations in syscons have made it about twice as slow as in
> FreeBSD-5.  This is probably mostly due to switching from bcopy() to
> copyw().  There is a lot of bloat in upper layers, but with 2GHz CPUs it
> would take a factor of about 10 pessimizations there to be comparable
> with i/o pessimizations.
>
> A correctly-implemented console driver assembles an image of the frame
> buffer in fast memory and copies from there to the frame buffer in
> large chunks.  It is tricky to keep track of changed regions so as to
> not copy unchanged regions.  Copying everything at a refresh rate of
> not much slower than 20 Hz works well.  200 Hz for animation, but that
> is rarely needed.  The bandwidth for 80x25 text mode at 20 Hz is
> 80 kB/second.  That was easy in 1982.  I aimed for 100 Hz refresh on 2 MHz
> 6809 systems in 1987.  PC hardware at 5 MHz was about twice as slow,
> especially for frame buffers.  But it could do 80 kB/second.  The
> bandwidth for 80x25 8x16 256 color bitmapped mode is 640kB/second.
> This was difficult in 1982, but very easy now.  Yet the WindowsXP
> safe mode with command prompt console is about as slow at scrolling
> as a 1982 system in graphics mode.  It uses similar techniques to
> implement the slowness:
> - a large bitmapped screen.  640x200 8 colors in 1982.  Quite
>   a bit larger (something like 1024x768 256 colors) in 20XX.
> - write to the screen very slowly.  Use 8-bit writes with i/o artifacts
>   if possible.  The 1982 system had to do 8-bit writes to 3 color planes.
>   256-color mode is simpler than most.  Writes can also be done very
>   slowly by using another mode and misaligning text so that every
>   character written needs merging with pixels from adjacent characters.
> - do scrolling in software by copying 1 pixel at a time, using
>   read-modify-write
> - I only tested this on 5-10 year old hardware, with a 1920x1080 screen
>   but not all of it used for the console window, and with a laptop
>   1024x768 screen.  A good way to be slow, one that has been portable to
>   PC systems since 1982, is to use the BIOS for video.  The console was
>   about twice as fast on the laptop.  This might be due to a combination
>   of fewer pixels and a less well pessimized BIOS.
>
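The scheme described above (assemble the screen image in fast memory, track 
what changed, and copy to the frame buffer in large chunks) can be sketched 
as follows. This is only an illustration under simplifying assumptions: an 
80x25 screen of 16-bit cells, dirty tracking at the granularity of a row 
range, and a frame buffer that is just a pointer to ordinary memory. The 
struct and function names are hypothetical and do not come from syscons or 
vt(4).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROWS 25
#define COLS 80

struct shadow_console {
	uint16_t shadow[ROWS][COLS];	/* image in ordinary cached memory */
	uint16_t *fb;			/* hypothetical frame buffer */
	int dirty_lo;			/* first dirty row, ROWS if clean */
	int dirty_hi;			/* last dirty row, -1 if clean */
};

static void
sc_init(struct shadow_console *sc, uint16_t *fb)
{
	memset(sc->shadow, 0, sizeof(sc->shadow));
	sc->fb = fb;
	sc->dirty_lo = ROWS;
	sc->dirty_hi = -1;
}

/* Write one cell to the shadow image only, and note the row as dirty. */
static void
sc_putc(struct shadow_console *sc, int row, int col, uint16_t cell)
{
	sc->shadow[row][col] = cell;
	if (row < sc->dirty_lo)
		sc->dirty_lo = row;
	if (row > sc->dirty_hi)
		sc->dirty_hi = row;
}

/*
 * Copy the dirty row range to the frame buffer in one large chunk.
 * Intended to be called at the refresh rate.  Returns rows copied.
 */
static int
sc_flush(struct shadow_console *sc)
{
	int nrows;

	if (sc->dirty_hi < sc->dirty_lo)
		return 0;
	nrows = sc->dirty_hi - sc->dirty_lo + 1;
	memcpy(sc->fb + (size_t)sc->dirty_lo * COLS,
	    sc->shadow[sc->dirty_lo],
	    (size_t)nrows * COLS * sizeof(uint16_t));
	sc->dirty_lo = ROWS;
	sc->dirty_hi = -1;
	return nrows;
}
```

A single row range is the crudest form of change tracking; it copies 
unchanged rows that lie between two changed ones, which is the trade-off 
the text calls tricky.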
> Some old screen benchmarks.  The benchmark is basically to write lines
> of the screen width and scroll.  I stopped updating this often about 15
> years ago when frame buffers and CPUs became fast enough.  But it appears
> that software bloat and design errors have caught up.
>
> % ISA ET4000: 2.4MB/sec read, 5.9MB/sec write
> % VLB ET4000/W32i: 6.8MB/sec read, 25.5MB/sec write
> % PCI S3/868: 3.5MB/sec read, 23.1MB/sec write
> % PCI S3/Virge: 4.1MB/sec read, 40.0MB/sec write
> % PCI S3/Savage: 3.3MB/sec read, 25.8MB/sec write
> % PCI Xpert: 5.3MB/sec read, 21.8MB/sec write
> % PCI R9200SE: 5.8MB/sec read, 60.2MB/sec write (but 120MB/sec fpu, 250/sec sfpu)
> % -o means stty flag -opost
> % % No-scroll:
>
> Scrolling is avoided by repositioning the cursor after every screenful.
>
> % % machine     video        O/S              where      real user  sys    speed
> % ---------   -------      --------------   ---------  ----- ----  -----  -----
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen-o  .026 0.00   .026 76.9
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen-o .026 0.00   .026 76.9
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen    .031 0.00   .031 64.5
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen   .031 0.00   .031 64.5
>
> An 11 year old system.
>
> 'onscreen' means output to an active vty, 'offscreen' to an inactive vty.
> The mere existence of vtys requires full buffering to fast memory for
> inactive vtys, since there is no hardware frame buffer memory to write
> to for the inactive vtys.  You have to buffer the writes in a form that
> can be replayed when an inactive vty becomes active, and converting
> immediately to the final form is a good method (it does take more memory
> and limits history to a raw form).  'offscreen' is potentially much
> faster, but in most cases it is only slightly faster, due to delayed
> refreshes for 'onscreen' and relatively fast frame buffer memory.
>
> -opost is tested separately because the Linux console driver was
> amazingly slow without it.  This shows that it is possible for the
> software bloat to be so large that it dominates hardware slowness.
> FreeBSD also has lots of bloat in the tty and syscons layers near opost,
> but it is in the noise compared with the old Linux console driver.
>
> I forget the units for these measurements, except that the speed column
> gives a bandwidth in MB/sec.  (I don't remember if this is for write(2)
> bandwidth or is related to frame buffer bandwidth.)  Interpret them as
> relative.
>
> On a system similar to the above, syscons scrolls at 50000 lines/sec.
> Non-virtually, this would require a frame buffer bandwidth of 200MB/sec,
> which is several times faster than possible.  Since syscons only does
> a direct update for bytes written, it needs only about 1/25 of this
> bandwidth or 8MB/sec.  This is not quite in the noise compared with
> a frame buffer bandwidth of 60.2MB/sec.
>
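The arithmetic behind these figures can be checked with a few lines of C. 
The sketch assumes an 80x25 text screen with 2 bytes per character cell 
(character plus attribute), as in the text-mode discussion earlier; the 
function names are made up for illustration.

```c
#include <stdint.h>

#define TEXT_COLS	80
#define TEXT_ROWS	25
#define BYTES_PER_CELL	2	/* character + attribute */

/*
 * Bytes per second if every scrolled line rewrote the whole screen
 * in the frame buffer (the "non-virtual" case).
 */
static uint64_t
full_redraw_bandwidth(uint64_t lines_per_sec)
{
	return lines_per_sec * TEXT_COLS * TEXT_ROWS * BYTES_PER_CELL;
}

/*
 * Bytes per second if only the newly written line reaches the frame
 * buffer, i.e. 1/TEXT_ROWS of the full-redraw figure.
 */
static uint64_t
written_bytes_bandwidth(uint64_t lines_per_sec)
{
	return lines_per_sec * TEXT_COLS * BYTES_PER_CELL;
}
```

At 50000 lines/sec the first function gives 200,000,000 bytes/sec (200 
MB/sec) and the second 8,000,000 bytes/sec, one twenty-fifth of that.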
> % K6/233 PCI  S3/Virge     minix-1.6.25++   offscreen    0.2 0.00   0.12 16.0
> % K6/233 PCI  S3/Virge     minix-1.6.25++   onscreen     0.2 0.00   0.12 16.0
>
> The Minix driver from 1990 (rewritten to support virtual consoles and to
> be efficient) is faster than syscons of course.  It is smarter about
> buffering, so the onscreen case goes at almost the same speed as the
> offscreen case.
>
> % K6/233 PCI  S3/Virge     FreeBSD-current  onscreen-o  0.23 0.00   0.23  8.85
> % K6/233 PCI  S3/Virge     FreeBSD-current  offscreen-o 0.23 0.00   0.23  8.85
>
> syscons is just slightly slower for the offscreen case.  -current was
> only current in ~2004.
>
> % K6/233 PCI  S3/Virge     FreeBSD-current  onscreen    0.34 0.00   0.34  5.83
> % K6/233 PCI  S3/Virge     FreeBSD-current  offscreen   0.34 0.00   0.34  5.81
>
> But in the onscreen case, syscons is more than 50% slower, due to less
> virtualization.  This slowness became slower with faster frame buffers,
> but is still noticeable in benchmarks with the S3/Virge's write bandwidth
> of 40.0MB/sec.
>
> % P5/133 PCI  S3/868       FreeBSD-current  onscreen-o  0.39 0.00   0.39  5.10
> % P5/133 PCI  S3/868       FreeBSD-current  offscreen-o 0.40 0.00   0.40  5.00
> % P5/133 PCI  S3/868       FreeBSD-current  onscreen    0.51 0.00   0.50  3.92
> % P5/133 PCI  S3/868       FreeBSD-current  offscreen   0.51 0.00   0.51  3.92
> % K6/233 PCI  S3/Virge     linux-2.1.63     offscreen-o 0.97 0.00   0.97  2.06
> % K6/233 PCI  S3/Virge     linux-2.1.63     onscreen-o  1.03 0.00   1.03  1.93
> % K6/233 PCI  S3/Virge     linux-2.1.63     offscreen   1.18 0.00   1.18  1.69
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen-o 1.18 0.00   1.16  1.69
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen-o  1.27 0.02   1.23  1.57
> % K6/233 PCI  S3/Virge     linux-2.1.63     onscreen    1.38 0.00   1.38  1.45
> % 486/33 ISA  ET4000       minix-1.6.25++   offscreen      2 0.01   1.45  1.37
> % 486/33 ISA  ET4000       minix-1.6.25++   onscreen       2 0.01   1.60  1.24
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen   1.60 0.00   1.59  1.25
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen    1.70 0.01   1.66  1.18
> % 486/33 ISA  ET4000       FreeBSD-current  offscreen-o 2.30 0.01   2.28  0.87
> % 486/33 ISA  ET4000       FreeBSD-current  onscreen-o  2.39 0.02   2.32  0.84
> % 486/33 ISA  ET4000       FreeBSD-current  offscreen   3.15 0.03   3.10  0.63
> % 486/33 ISA  ET4000       FreeBSD-current  onscreen    3.27 0.00   3.21  0.61
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen-o 3.63 0.01   3.62  0.15
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen-o  3.65 0.01   3.63  0.55
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen  12.48 0.01  12.47  0.16
> % 486/33 ISA  ET4000       linux-1.1.36     offscreen  20.80 0.00  20.80  0.10
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen   26.98 0.01  26.95  0.07
> % 486/33 ISA  ET4000       linux-1.1.36     onscreen   38.34 0.02  38.38  0.05
>
> The speedup from the worst case (old Linux on old hardware) to the best
> case (old Minix on new hardware) is a factor of 38.34/0.026 = 1475.
> Hardware speeds only increased by a factor of about 2223/33 = 67.  Minix
> was only 1.5 times faster than syscons and 10-20 times faster than Linux
> on old hardware.
>
> % % Scroll:
> % % machine     video        O/S              where      real user  sys    speed
> % ---------   -------      --------------   ---------  ----- ----  -----  -----
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen-o  .047 0.00   .047 42.6
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen-o .047 0.00   .047 42.6
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen    .051 0.00   .051 39.2
> % A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen   .051 0.00   .051 39.2
> % K6/233 PCI  S3/Virge     minix-1.6.25++   offscreen    0.2 0.00   0.14 14.0
> % K6/233 PCI  S3/Virge     minix-1.6.25++   onscreen     0.2 0.00   0.14 14.0
> % K6/233 PCI  S3/Virge     FreeBSD-current  onscreen-o  0.36 0.00   0.36  5.54
> % K6/233 PCI  S3/Virge     FreeBSD-current  offscreen-o 0.40 0.00   0.40  5.01
> % K6/233 PCI  S3/Virge     FreeBSD-current  onscreen    0.47 0.00   0.47  4.22
> % K6/233 PCI  S3/Virge     FreeBSD-current  offscreen   0.51 0.00   0.51  3.92
>
> Scrolling makes no difference for Minix due to the better virtualization.
> It slows down syscons by about 50%.  Strangely, the onscreen case is now
> faster?!
>
> % P5/133 PCI  S3/868       FreeBSD-current  onscreen-o  1.24 0.00   1.23  1.61
> % P5/133 PCI  S3/868       FreeBSD-current  offscreen-o 1.28 0.00   1.27  1.56
> % P5/133 PCI  S3/868       FreeBSD-current  onscreen    1.35 0.00   1.34  1.48
> % P5/133 PCI  S3/868       FreeBSD-current  offscreen   1.39 0.00   1.38  1.44
> % K6/233 PCI  S3/Virge     linux-2.1.63     onscreen-o  1.49 0.00   1.49  1.34
> % 486/33 ISA  ET4000       minix-1.6.25++   offscreen      2 0.00   1.70  1.18
> % 486/33 ISA  ET4000       minix-1.6.25++   onscreen       2 0.00   1.81  1.10
> % K6/233 PCI  S3/Virge     linux-2.1.63     onscreen    1.85 0.00   1.85  1.08
> % K6/233 PCI  S3/Virge     linux-2.1.63     offscreen-o 2.88 0.00   2.88  0.69
> % K6/233 PCI  S3/Virge     linux-2.1.63     offscreen   3.10 0.00   3.10  0.65
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen-o 3.39 0.02   3.36  0.59
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen-o  3.67 0.02   3.63  0.54
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen   3.82 0.00   3.81  0.52
> % DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen    4.14 0.03   4.06  0.48
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen-o  4.34 0.01   4.32  0.46
> % 486/33 ISA  ET4000       FreeBSD-current  offscreen-o 5.54 0.03   5.48  0.36
> % 486/33 ISA  ET4000       FreeBSD-current  onscreen-o  5.73 0.00   5.61  0.35
> % 486/33 ISA  ET4000       FreeBSD-current  offscreen   6.41 0.03   6.34  0.31
> % 486/33 ISA  ET4000       FreeBSD-current  onscreen    6.62 0.01   6.45  0.30
>
> The old systems didn't have the CPU or frame buffer bandwidth to scroll
> at 50000 lines/sec.  Rescaling 50000 by this 6.62 divided by the above
> 0.026 gives only 196 lines/sec.  That was usable, but since you can see
> the scroll move it is not very good.  Rescaling Minix's 2.0 gives 650
> lines/sec, or a full screen refresh rate of 26 Hz.  You can probably see
> the scroll flicker but not move at this rate.  Of course, the
> implementation does delayed refresh to reach this rate, so most of the
> scrolling steps are virtual and you can only see the screen flicker for
> other reasons.  syscons' scrolling is also virtual.
>
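The rescaling used here is simple proportionality: a machine that takes t 
seconds for the benchmark, where the reference machine takes t_ref seconds 
and scrolls at r_ref lines/sec, scrolls at about r_ref * t_ref / t 
lines/sec. A small sketch, with hypothetical function names and the table 
values passed in as plain numbers:

```c
/*
 * Scale a reference scroll rate by the ratio of benchmark times.
 * ref_time and time are the "real" column values from the tables.
 */
static double
rescale_lines_per_sec(double ref_lines_per_sec, double ref_time, double time)
{
	return ref_lines_per_sec * ref_time / time;
}

/* Full-screen refreshes per second for a screen of 'rows' text lines. */
static double
full_screen_rate_hz(double lines_per_sec, int rows)
{
	return lines_per_sec / rows;
}
```

With the reference values 50000 lines/sec and 0.026 seconds, a time of 6.62 
gives roughly 196 lines/sec, and Minix's 2.0 gives 650 lines/sec, which is 
26 full 25-line screens per second.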
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen-o 13.48 0.01  13.47  0.15
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen  22.60 0.01  22.42  0.09
> % 486/33 ISA  ET4000       linux-1.1.36     offscreen  23.56 0.03  23.60  0.08
> % DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen   27.73 0.01  27.72  0.08
> % 486/33 ISA  ET4000       linux-1.1.36     onscreen   40.26 0.00  40.27  0.05
>
> Rescaling 50000 by this 40.26 divided by the above 0.026 gives about 32
> lines/sec.  That is only a bit better than 1982 pixel mode quality.  But
> this is for text mode.
>
> Bruce
>