linimon at lonesome.com
Tue Dec 8 20:17:11 UTC 2009
On Tue, Dec 08, 2009 at 09:15:13AM -0300, oren.almog at gmail.com wrote:
> For the last couple of days I have been following the pointyhat build
> statistics provided at
Brave man :-) That's one I set up.
> As seen on that page, the building process started on Dec 3rd but had
> not been completed yet.
Apparently the build for www/p5-Gtk2-WebKit is now hanging on all buildenvs.
Pav already marked it so on amd64.
It will continue to run until a reaper process kills it off (or one of
us portmgrs does so manually). I'd like to see the error log, so I'm
going to let it run for now. The reaper timeout is, IIRC, 24 hours.
> Why is there such a large difference between the build times on amd64
> and i386? Are the i386 machines really that underpowered?
Two data points: first, it looks like Pav's marking of
www/p5-Gtk2-WebKit as broken had already been taken into account for
the amd64 build, so it didn't hit that problem. And second, some of
our i386 machines are indeed
underpowered. We've added several new, more modern, ones this year that
were donated to us: these are dual 2.4 or 2.8GHz machines, mostly with
2G of RAM. (One of my background tasks is to try to characterize
performance on the nodes with various setups; my intuition is that 4G
would allow us to raise throughput, but I need to make a 'use case' for
that before I go ask for funding.)
fwiw, I continually look for new ways to scrounge more package building
nodes (I seem to have inherited the task of looking after them).
> Next I found this page which keeps track on the upload status of
> packages to the various ftp sites
That's mine too :-)
> If the statistics on that page are correct then it seems to me that
> there is a lot of inefficiency in the build and upload process.
With 11 active buildenvs, we have saturated the bandwidth available
for uploading to the sites. We've discussed the matter before, but no
one has come up with a solution. As a workaround, we try not to
upload different package sets at the same time.
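The "one package set at a time" workaround amounts to serializing
uploads. One simple way to do that is an exclusive lock file, sketched
below; the lock-file path and the idea of wrapping the transfer in it
are my illustration, not what the actual upload scripts do:

```python
import contextlib
import fcntl


@contextlib.contextmanager
def upload_lock(path):
    """Hold an exclusive flock on `path` for the duration of the block.

    A second uploader entering this block will sleep until the first
    one finishes, so two package sets never upload at once.
    """
    with open(path, "w") as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)  # blocks while another holds it
        try:
            yield
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)
```

An upload run would then do its rsync (or whatever transfer it uses)
inside `with upload_lock("/var/run/pkg-upload.lock"):`.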
> Some sites are rarely updated
Not all of the sites carry all of the buildenvs, and some of those that
do can run days behind.
Also, I don't have up-to-date contact information for the various sites.
If anyone has that, please let me know.
> and some pointyhat build runs are never uploaded.
Hmm, they should be. I'll forward this on to pav. (The way we have the
work divided up is that pav does amd64; erwin does i386; I do sparc64 and
the nascent ia64; and various portmgrs, including miwi, do the *-exp runs
which are intentionally not uploaded, but do constitute a load on both
pointyhat and the nodes.)
> I simply want to make sense of all this and understand how it all fits
I've been trying to understand it for several years, so don't worry :-)
And I'm one of the people "in charge".
pointyhat throughput depends on a lot of factors, some of which I am in
the early stages of understanding.
- if a node hangs (but only in certain ways), the dispatch scheduler
can get into a state where it still tries to schedule builds on that
node, over and over. This causes an overall slowdown in build
dispatch. I'm not exactly sure of the root cause of the hangs, but
one of them is likely to be swap exhaustion which leads to sshd being
killed. (The most recent -CURRENT fixes this.) Since the failures
are intermittent, they are hard to catch. I have added some error
logging code to try to figure this out. As for the scheduler, there
is some missing functionality there. The code is complex so it's not
trivial to fix.
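The missing scheduler functionality described above is essentially
"stop dispatching to a node that keeps failing." A minimal sketch of
that idea, with an invented failure threshold and node names (the real
scheduler code is far more involved):

```python
FAIL_LIMIT = 3  # assumed threshold; tune to taste


class NodeTracker:
    """Track consecutive build failures per node so the dispatcher can
    skip nodes that appear hung, instead of retrying them forever."""

    def __init__(self):
        self.failures = {}

    def record_failure(self, node):
        self.failures[node] = self.failures.get(node, 0) + 1

    def record_success(self, node):
        self.failures[node] = 0  # a success clears the strike count

    def dispatchable(self, nodes):
        """Return only the nodes that have not hit the failure limit."""
        return [n for n in nodes
                if self.failures.get(n, 0) < FAIL_LIMIT]
```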
- pointyhat itself is a very heavily loaded machine. The most recent
problems we have been chasing are a) disk space exhaustion, and b)
disk controller saturation. For the former, we keep finding things
to evict. OTOH, with 16 buildenvs (counting the *-exp ones) there is
only so much we can do. When space is low, the rate of builds slows
down significantly, for reasons I do not understand yet. For the latter,
there are two processes that busy the controller: 1) compression of
saved logfiles, and 2) the ZFS backup process. I think I may have an
idea of how to fix 1); for 2), I will have to learn more about the way
ZFS is set up on pointyhat.
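For item 1), one plausible approach (my speculation, not the fix
actually planned for pointyhat) is to compress saved logfiles one at a
time at lowered priority, rather than letting many compressors hit the
controller at once. The directory layout here is invented:

```python
import glob
import gzip
import os
import shutil


def compress_logs(logdir):
    """Gzip every *.log file in logdir in place, one file at a time.

    Running serially at lowered scheduling priority spreads the I/O
    load out instead of hammering the disk controller all at once.
    """
    os.nice(10)  # be gentle to the rest of the machine
    for log in sorted(glob.glob(os.path.join(logdir, "*.log"))):
        with open(log, "rb") as src, gzip.open(log + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(log)  # keep only the compressed copy
```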
- pointyhat can get into situations where timeouts on NFS-mounted
filesystems (such as /home) crash the system. I don't know much about
this. Once that happens, we have to restart all the builds. Sometimes
it can take a little while for one of us to notice the crash.
- the scheduler has a bug where it occasionally crashes. I am actively
investigating this and have added a bunch of debug code to catch it
in the act. Again, when this happens, all the builds have to be
restarted. This was happening a lot in the first few days of December,
but seems to have settled down now.
The code that runs pointyhat is hundreds of lines of sh, awk, perl, and
python, and quite complex. Although these days I understand most of it
in a static sense, I'm still learning about its dynamic characteristics.
But now you know the contents of (part of) my todo list.
More information about the freebsd-ports mailing list