Poudriere question

Tue May 10 14:25:44 UTC 2016

On Tue, 10 May 2016 14:35:35 +0200
Guido Falsi wrote:

> On 05/10/16 13:35, RW via freebsd-ports wrote:
> > On Mon, 9 May 2016 20:15:12 +0200
> > Guido Falsi wrote:
> >   
> >> On 05/09/16 19:52, Fernando Apesteguía wrote:  
> >>> Hi all,
> >>>
> >>> Is it safe to use different invocations of poudriere concurrently
> >>> for different jails but using the same ports collection?
> >>>     
> >>
> >> Yes it is, or at least should be.
> >>
> >> The ports trees are mounted read only in the jails, the wrkdir is
> >> defined at a different path.  
> > 
> > What about the distfiles directory? 
> > 
> > Having two "make checksums" running on the same file used to work
> > fairly well, but not any more because the target now deletes an
> > incomplete file rather than trying to resume it.
> > 
> > This won't damage packages, but it can cause two "make checksums" to
> > get locked in a cycle of deleting each other's files and end with
> > one getting a failed checksum.
> 
> Yes, it happens. I have even used the same distfiles over NFS with more
> than one machine/poudriere accessing it.
> 
> The various instances do overwrite each other and checksums do fail
> but usually in the end one of them "wins" and the correct file ends
> up being completed, with other instances reading that one. I agree
> this happens just by chance and not due to good design.

Only the last process will terminate with a complete file and without
error. When another process runs out of retries, the file with the
directory entry is still a download in progress, which will fail the
checksum.

If it commonly ends up working in poudriere, that's probably a property
of how poudriere orders things. But you still have the problem of
wasted time and bandwidth. This problem is most likely with large
distfiles, and there's at least one that's 1 GB.

The way this used to work is that the second process would try to
resume the download, which presumably involved getting a lock on the
file. For smaller files it would just work. The worst case was that the
second process would fail after a timeout.

I think the change came in to delete possible re-rolled distfiles
automatically (a relatively minor problem), but in the process it
created this problem and also broke resuming downloads. 

I don't see the reason for checking and deleting the file before
attempting to resume it.

> As far as I understand Unix Filesystem semantics each download
> actually creates a new file, with only the last one to start
> referencing the actual file visible on the filesystem. So the last
> one starting to download is the one which will "win" creating the
> correct file on the FS, then checksumming it and going on. The other
> files have actually been deleted and are simply removed from disk as
> soon as the download ends; if at that point the "winning" one has
> finished the download, they will checksum that file.
> 
> There is a chance of the losing download ending before the winning
> one does and overwriting it again, but in my experience with at most
> 3-4 instances over NFS it usually fixes itself in the long run.
> 
> IMHO best solution is to make sure you already have distfiles on disk
> for what you are going to build.
>
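
The filesystem behaviour described in the quoted text above (a deleted
file stays alive for the process that has it open, while the name now
points at a new file) can be seen with a small sketch. It assumes a
POSIX system and a local temporary directory; the distfile name and the
function name are made up for illustration, and two file handles in one
process stand in for two separate fetchers.

```python
import os
import tempfile


def simulate_competing_downloads():
    """Two 'fetchers' write the same path; the second deletes the
    first one's partial file before starting over."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "example-1.0.tar.gz")  # hypothetical name

    # Fetcher A starts its download.
    a = open(path, "wb")
    a.write(b"AAAA")
    a.flush()

    # Fetcher B sees an incomplete file, deletes it and starts again.
    os.unlink(path)
    b = open(path, "wb")
    b.write(b"BBBB")
    b.flush()

    # A keeps writing, but only to its own, now nameless, file.
    a.write(b"AAAA")
    a.flush()
    a_size = os.fstat(a.fileno()).st_size
    a.close()  # A's data is discarded here

    # B finishes; its file is the one visible under the name.
    b.write(b"BBBB")
    b.close()
    with open(path, "rb") as f:
        visible = f.read()

    os.unlink(path)
    os.rmdir(workdir)
    return a_size, visible
```

A downloaded all of its bytes, yet none of them end up in the visible
file, which is the wasted time and bandwidth mentioned earlier in the
thread.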