i386 4/4 change

Sun Apr 1 07:05:23 UTC 2018

On Sun, 1 Apr 2018, Dimitry Andric wrote:

> On 31 Mar 2018, at 17:57, Bruce Evans <brde at optusnet.com.au> wrote:
>>
>> On Sat, 31 Mar 2018, Konstantin Belousov wrote:
>>
>>> the change to provide full 4G of address space for both kernel and
>>> user on i386 is ready to land.  The motivation for the work was to both
>>> mitigate Meltdown on i386, and to give more breathing space for still
>>> used 32bit architecture.  The patch was tested by Peter Holm, and I am
>>> satisfied with the code.
>>>
>>> If you use i386 with HEAD, I recommend you to apply the patch from
>>> https://reviews.freebsd.org/D14633
>>> and report any regressions before the commit, not after.  Unless
>>> a significant issue is reported, I plan to commit the change somewhere
>>> at Wed/Thu next week.
>>>
>>> Also I welcome patch comments and reviews.
>>
>> It crashes at boot time in getmemsize() unless booted with loader which
>> I don't want to use.

> For me, it at least compiles and boots OK, but I'm one of those crazy
> people who use the default boot loader. ;)

I found a quick fix and sent it to kib.  (2 crashes in vm86 code for memory
sizing.  This is not called if loader is used && the system has smap.  Old
systems don't have smap, so they crash even if loader is used.)

> I haven't yet run any performance tests, I'll try building world and a
> few large ports tomorrow.  General operation from the command line does
> not feel "sluggish" in any way, however.

Further performance tests:
- reading /dev/zero using tinygrams is 6 times slower
- read/write of a pipe using tinygrams is 25 times slower.  It also gives
   unexpected values in wait statuses on exit, hopefully just because a
   bug in the test program is exposed by the changed timing (but later
   it also gave SIGBUS errors).  This does a context switch or 2 for every
   read/write.  It now runs 7 times slower using 2 4 GHz CPUs than in
   FreeBSD-5 using 1 2.0 GHz CPU.  The faster CPUs and 2 of them used to
   make it run 4 times faster.  It shows another slowdown since FreeBSD-5,
   and much larger slowdowns since FreeBSD-1:

   1996 FreeBSD on P1  133MHz:   72k/s
   1997 FreeBSD on P1  133MHz:   44k/s (after dyson's opts for large sizes)
   1997 Linux   on P1  133MHz:   93k/s (simpler is faster for small sizes)
   1999 FreeBSD on K6  266MHz:  129k/s
   2018 FBSD-~5 on AthXP 2GHz:  696k/s
   2018 FreeBSD on i7  2x4GHz: 2900k/s
   2018 FBSD4+4 on i7  2x4GHz:  113k/s (faster than Linux on a P1 133MHz!!)
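
The ratios quoted in the list can be checked against the table.  A small
sketch of that arithmetic follows; the dictionary keys are labels made up
here for the table rows, and the rates are copied from the table as-is:

```python
# Pipe benchmark rates in k/s, copied from the table above.
# The key names are hypothetical labels, not anything from the benchmark.
rates = {
    "linux_p1_1997": 93,
    "fbsd5_athxp_2018": 696,
    "freebsd_i7_2018": 2900,
    "fbsd_4plus4_i7_2018": 113,
}


def ratio(faster, slower):
    """How many times larger the first rate is than the second."""
    return rates[faster] / rates[slower]


# Slowdown from 4+4 on the same i7 hardware.
slowdown_4plus4 = ratio("freebsd_i7_2018", "fbsd_4plus4_i7_2018")
# Speedup from the newer, dual CPUs without 4+4.
speedup_i7 = ratio("freebsd_i7_2018", "fbsd5_athxp_2018")
# 4+4 on the i7 compared with FreeBSD-~5 on the Athlon XP.
slowdown_vs_fbsd5 = ratio("fbsd5_athxp_2018", "fbsd_4plus4_i7_2018")
# 4+4 on the i7 compared with Linux on the P1.
vs_linux_p1 = ratio("fbsd_4plus4_i7_2018", "linux_p1_1997")

for name, value in [
    ("4+4 slowdown on i7", slowdown_4plus4),
    ("i7 speedup over AthXP", speedup_i7),
    ("AthXP over 4+4 on i7", slowdown_vs_fbsd5),
    ("4+4 on i7 over Linux on P1", vs_linux_p1),
]:
    print(f"{name}: {value:.1f}x")
```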

Netblast to localhost has much the same 6 times slowness as reading
/dev/zero using tinygrams.  This is the slowdown for syscalls.
Tinygrams are hard to avoid for UDP.  Even 1500 bytes is a tinygram
for /dev/zero.  Without 4+4, localhost is very slow because it does
a context switch or 2 for every packet (even with 2 CPUs when there is
no need to switch).  Without 4+4 this used to cost much the same as the
context switches for the pipe benchmark.  Now it costs relatively much
less since (for netblast to localhost) all of the context switches are
between kernel threads.
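
A rough sketch of a tinygram read test of this kind, not the program used
for the numbers above.  The function name, default sizes and counts are
assumptions chosen for illustration.  It is written in Python, so
interpreter overhead inflates the per-call cost; only the relative change
between two kernels on the same machine would mean anything:

```python
import os
import time


def tinygram_read_rate(path="/dev/zero", size=1, count=100000):
    """Return read() calls per second for `count` reads of `size` bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(count):
            data = os.read(fd, size)  # one syscall per tinygram
            if len(data) != size:
                raise RuntimeError("short read from " + path)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed


if __name__ == "__main__":
    for size in (1, 64, 1500):
        print(size, "bytes:", int(tinygram_read_rate(size=size)), "reads/s")
```

Since almost all of the time for such small reads goes into entering and
leaving the kernel, the rate stays nearly flat as the size grows from 1 to
1500 bytes, which is the sense in which 1500 bytes is still a tinygram.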

The pipe benchmark uses select() to avoid busy-waiting.  That was good
for UP.  But for SMP with just 2 CPUs, it is better to busy-wait and
poll in the reader and writer.
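
The two reader strategies can be sketched as follows.  This is not the
pipe benchmark used for the numbers above; the function name, the
`busy_wait` flag and the default counts are assumptions.  The parent
writes tinygrams to a pipe and the child reads them, either sleeping in
select() before each read or polling a non-blocking descriptor:

```python
import os
import select
import time


def pipe_tinygram_rate(count=20000, size=1, busy_wait=False):
    """Return tinygrams per second passed from parent to child over a pipe."""
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the reader.  Always leave via os._exit().
        status = 1
        try:
            os.close(wfd)
            if busy_wait:
                os.set_blocking(rfd, False)
            remaining = count * size
            while remaining > 0:
                if not busy_wait:
                    # Sleep in the kernel until the pipe is readable.
                    select.select([rfd], [], [])
                try:
                    data = os.read(rfd, size)
                except BlockingIOError:
                    continue  # nothing there yet, so poll again
                if not data:
                    break  # writer closed the pipe early
                remaining -= len(data)
            status = 0 if remaining == 0 else 1
        finally:
            os._exit(status)

    # Parent: the writer.
    os.close(rfd)
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(count):
        os.write(wfd, payload)
    os.close(wfd)
    _, wait_status = os.waitpid(pid, 0)
    elapsed = time.perf_counter() - start
    if not os.WIFEXITED(wait_status) or os.WEXITSTATUS(wait_status) != 0:
        raise RuntimeError("reader did not receive all of the data")
    return count / elapsed


if __name__ == "__main__":
    print("select():  ", int(pipe_tinygram_rate()), "tinygrams/s")
    print("busy-wait: ", int(pipe_tinygram_rate(busy_wait=True)), "tinygrams/s")
```

The busy-waiting variant only has a chance of winning when a second CPU
is free to run the reader; on UP it just burns the writer's time slice.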

netblast already uses busy-waiting.  It used to be a bug that select()
doesn't work on sockets, at least for UDP, so blasting using busy-waiting
is the only possible method (timeouts are usually too coarse-grained to
go as fast as blasting, and if they are fine-grained enough to go fast
then they are not much better than busy-waiting with time wasted for
setting up timeouts).  SMP makes this a feature.  It forces use of
busy-waiting, which is best if you have a CPU free to run it and this
method doesn't take too much power.

Context switches to task queues give similar slowness.  This won't be
affected by 4+4 since task queues are in the kernel.  I don't like
networking in userland since it has large syscall and context switch
costs.  Increasing these by factors of 6 and 25 doesn't help.  It
can only be better by combining i/o in a way that the kernel neglects
to do or which is imposed by per-packet APIs.  Slowdown factors of 6
or 25 require the combined i/o to be 6 or 25 times larger to amortise the costs.
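
The amortisation argument can be written out with a simple cost model,
assuming a fixed overhead per call that is spread over the bytes moved by
that call.  The function names and the unit cost are hypothetical:

```python
import math


def per_byte_overhead(call_cost, bytes_per_call):
    """Fixed per-call cost spread over the bytes moved by one call."""
    return call_cost / bytes_per_call


def batch_needed(old_batch, slowdown):
    """Batch size that keeps the per-byte overhead unchanged."""
    return old_batch * slowdown


base_cost = 1.0   # arbitrary unit for one syscall or context switch
base_batch = 1500  # bytes per call before the slowdown

for slowdown in (6, 25):
    new_batch = batch_needed(base_batch, slowdown)
    before = per_byte_overhead(base_cost, base_batch)
    after = per_byte_overhead(base_cost * slowdown, new_batch)
    # The per-byte overhead only matches once the batch grows by the
    # same factor as the per-call cost.
    assert math.isclose(before, after)
    print(f"{slowdown}x slower calls need {new_batch} bytes per call")
```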

Bruce