Re: PortsDB: a program that imports the ports tree into an SQLite database

From: Peter Jeremy <peterj_at_freebsd.org>
Date: Sun, 15 Jan 2023 09:48:40 UTC
[Freshly updated key - please update your keyring]

On 2023-Jan-04 14:30:59 -0800, Yuri <yuri@FreeBSD.org> wrote:
>I developed the PortsDB project that imports FreeBSD ports into an 
>SQLite database: https://github.com/yurivict/freebsd-portsdb
>The port is ports-mgmt/portsdb.

Thanks.  That looks quite useful.

>The database can be fully rebuilt in ~20 minutes, after which in can be 
>quickly (in seconds) updated with new commits.

That seems a long time but it seems that most of it is running various
"make describe" commands across all ports.

Unfortunately, "ports-import" failed for me, reporting:
...
actions that will be performed:
 - import the ports tree into the database /home/peter/ports.sqlite
memory                                                             
Parse error near line 135079: near "net": syntax error
  _PHASE,CHILD_PKGORIGIN,CHILD_FLAVOR,KIND) VALUES('net/rubygem-net-protocol',''
                                      error here ---^
portsdb-import  1042.79s user 2088.03s system 479% cpu 10:53.06 total

Looking at the intermediate SQL, the problem is that the "INSERT INTO Port"
statement for print/ghostscript7-base gets interleaved with several
"INSERT OR IGNORE INTO Depends" statements at 2048 character boundaries.
Eliding most of the relevant lines, I get:
INSERT INTO Port(...) VALUES ('print/ghostscript7-base',...GS_t4693d8 GS_tekINSERT OR IGNORE INTO Depends(...) VALUES('net/rubygem-net-protocol','',null,'mail/rubygem-net-smtp','','R');
4696 GS_bmpmono GS_bmpgray ... GS_ljet4d GS_ljetplus GSINSERT OR IGNORE INTO Depends(...) VALUES('graphics/gd','',null,'graphics/libwmf','','L');
_lp1800 GS_lp1900 GS_lp2000 ... GS_vgalib',null);

It looks like several other ports (www/apache24, math/octave-forge,
www/nginx-full, print/ghostscript8-base, print/ghostscript9-base) also
result in "INSERT INTO Port" statements exceeding 2048 characters, at
least for me, only 1 port actually failed the race with another
process.

Looking at the actual code, the problem is that the "echo" in
/usr/local/libexec/portsdb/include/functions-sql.sh doesn't do atomic
writes of arbitrary length arguments.  The actual size isn't clear but
when I tried ktrace(1)'ing a script that contained
'echo "very long string" >> somefile'
it did a series of 1024 character writes.

I can't explain why I've only seen truncation at 2KiB and only in one
case but race condition behaviour isn't obvious.

My suggestion is that you use something other than sh(1) for writing
the intermediate file.  Based on a quick check, awk(1) can write at
least 5358 characters in a single write(1) call but the write(2) man
page doesn't document any atomicity requirements so it seems possible
that large writes could be interleaved at the filesystem level in any
case. Ideally, this tool would be written with explicit serialisation
operations around cases where write(1) operations are being issued
by multiple processes/threads.

-- 
Peter Jeremy