FreeBSD ports USE_XZ critical issue on low-RAM computers

Ion-Mihai Tetcu itetcu at FreeBSD.org
Sun Jun 20 17:02:11 UTC 2010


On Sun, 20 Jun 2010 18:23:03 +0300
Lasse Collin <lasse.collin at tukaani.org> wrote:

> On 2010-06-20 Matthias Andree wrote:
> > Am 19.06.2010 15:41, schrieb Lasse Collin:
> > > Perhaps FreeBSD provides a good working way to limit the amount of
> > > memory that a process actually can use. I don't see such a way
> > > e.g. in Linux, so having some method in the application to limit
> > > memory usage is definitely nice. It's even more useful in the
> > > compression library, because a virtual-memory-hog application on
> > > a busy server doesn't necessarily want to use tons of RAM for
> > > decompressing data from untrusted sources.
> > 
> > Even there the default should be "max", and the library SHOULD NOT
> > second-guess what trust level of data the application might process
> > with libxz's help.
> 
> There is no default value for the memory limit in liblzma (not libxz, 
> for historical reasons). You can specify UINT64_MAX if you want.
> Please don't complain about how the library sucks without looking at
> its API first. Don't confuse the limiter _feature_ with its _default
> value_; there is a default value only in the command line tools.

Personally, I'd suggest keeping memory limiting available as an
option, but not enabled by default.
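
For reference, a minimal sketch of what that looks like on the
application side (only the init call matters here; liblzma's stream
decoder takes the limit as an explicit argument, and UINT64_MAX
effectively disables it):

  #include <stdint.h>
  #include <stdio.h>
  #include <lzma.h>

  int main(void)
  {
      lzma_stream strm = LZMA_STREAM_INIT;

      /* The memory limit is a mandatory argument; there is no
       * built-in default.  UINT64_MAX means "no practical limit". */
      lzma_ret ret = lzma_stream_decoder(&strm, UINT64_MAX,
                                         LZMA_CONCATENATED);
      if (ret != LZMA_OK) {
          fprintf(stderr, "decoder init failed: %d\n", (int)ret);
          return 1;
      }

      /* ... feed strm.next_in/avail_in and call lzma_code() ... */

      lzma_end(&strm);
      return 0;
  }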

One thing I would really love to see go away is the default of
deleting the archive on decompression. I see no reason for it, and it
makes it very easy to shoot yourself in the foot. (And no, I never
understood why some other compression tools default to deleting the
archive file either.)

> > Expose the limiter interface in the API if you want, but for the
> > library in particular, any default other than "unlimited memory"
> > is a nuisance.  And there's still an application; unlike the xz
> > library, the application should know what kind of data from what
> > sources it is processing, and whether - for instance - a virus
> > inspector wants to impose memory limits and quarantine an
> > attachment that looks like a zip bomb.
> 
> Yes, this is exactly what I have done in liblzma, except that there
> is no default value (typing UINT64_MAX isn't too much to ask).
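
(As an aside, here is a sketch of the capped case for untrusted input.
The 100 MiB figure and the "quarantine" policy are made up for
illustration; lzma_code() reports LZMA_MEMLIMIT_ERROR when the file
would need more memory than allowed:)

  #include <stdint.h>
  #include <stdio.h>
  #include <lzma.h>

  int main(void)
  {
      lzma_stream strm = LZMA_STREAM_INIT;
      uint8_t inbuf[4096], outbuf[4096];
      lzma_action action = LZMA_RUN;
      lzma_ret ret = LZMA_OK;

      /* Refuse to spend more than 100 MiB on this one stream. */
      if (lzma_stream_decoder(&strm, UINT64_C(100) << 20, 0) != LZMA_OK)
          return 1;

      strm.next_in = inbuf;
      strm.avail_in = 0;

      while (ret == LZMA_OK) {
          if (strm.avail_in == 0 && !feof(stdin)) {
              strm.next_in = inbuf;
              strm.avail_in = fread(inbuf, 1, sizeof(inbuf), stdin);
              if (feof(stdin))
                  action = LZMA_FINISH;
          }
          strm.next_out = outbuf;
          strm.avail_out = sizeof(outbuf);

          ret = lzma_code(&strm, action);
          fwrite(outbuf, 1, sizeof(outbuf) - strm.avail_out, stdout);
      }

      lzma_end(&strm);

      if (ret == LZMA_MEMLIMIT_ERROR) {
          /* Too hungry; a scanner could quarantine the input here
           * instead of raising the limit. */
          fprintf(stderr, "memory limit hit, input rejected\n");
          return 2;
      }
      return ret == LZMA_STREAM_END ? 0 : 1;
  }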
> 
> > >> For compression, it's less critical because service is degraded,
> > >> not denied, but I'd still think -M max would be the better
> > >> default. I can always put "export XZ_OPT=-3" in
> > >> /etc/profile.d/local.sh or wherever it belongs on the OS of the
> > >> day.
> > > 
> > > If a script has "xz -9", it overrides XZ_OPT=-3.
> > 
> > I know. This isn't a surprise to me. The memory limiting, however,
> > is. And the memory limiting overrides xz -9 down to something
> > lesser, which may not be what I want either.
> 
> I have only one computer with over 512 MiB RAM (this one has 8 GiB).
> Thus "xz -9" is usable on only one of my computers. I cannot go and
> fix all scripts so that they first check how much RAM I have and then
> pick a reasonable compression level. Lowering "xz -9" so far that it
> would be usable on all systems with e.g. 256 MiB RAM or more doesn't
> look good either (settings higher than the current "xz -9" are
> possible, they just usually aren't very useful; even -9 is not always
> much better than slightly lower settings).
> 
> What do you think is the best solution to the above problem without 
> putting a default memory usage limit in xz? Setting something in
> XZ_OPT might work in many cases, but sometimes scripts set it
> themselves e.g. to pass compression settings to some other script
> calling xz. Maybe xz should support a config file? Or maybe another
> environment variable, one that scripts can be assumed not to touch?
> These are honest questions, and answering them would help much
> more than long descriptions of how the current method is bad.

Generally, I think programs should support all three, with later
sources overriding earlier ones: .conf -> env -> command line
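
A sketch of that precedence (the config-file part is hypothetical --
xz has no config file today, and ~/.xzrc is a made-up name; XZ_OPT is
real):

  #include <stdio.h>
  #include <stdlib.h>

  /* Resolve one setting: built-in default, then config file, then
   * environment, then command line.  Later sources win. */
  static const char *pick_setting(const char *conf_value,
                                  const char *env_name,
                                  const char *cli_value,
                                  const char *builtin_default)
  {
      const char *result = builtin_default;
      const char *env;

      if (conf_value != NULL)        /* e.g. parsed from ~/.xzrc */
          result = conf_value;
      env = getenv(env_name);
      if (env != NULL && env[0] != '\0')
          result = env;
      if (cli_value != NULL)         /* highest priority */
          result = cli_value;
      return result;
  }

  int main(int argc, char **argv)
  {
      /* Pretend the config file said "-6"; a command-line argument,
       * if present, overrides both it and $XZ_OPT. */
      const char *level = pick_setting("-6", "XZ_OPT",
                                       argc > 1 ? argv[1] : NULL,
                                       "-9");
      printf("effective setting: %s\n", level);
      return 0;
  }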
 
> > >> I still think utilities and applications should /not/ impose
> > >> arbitrarily lower limits by default though.
> > > 
> > > There's no multithreading in xz yet, but when there is, do you
> > > want xz to use as many threads as there are CPU cores _by
> > > default_? If so, do you mind if compressing with "xz -9" used
> > > around 3.5 GiB of memory on a four-core system no matter how much
> > > RAM it has?
> > 
> > Multithreading in xz is worth discussing if the tasks can be
> > parallelized, which is apparently not the case.  You would be
> > duplicating effort, because we have tools to run several xz
> > processes on distinct files at the same time, for instance BSD
> > portable make or GNU make with a "-j" option.
> 
> That's a nice way to avoid answering the question. xargs works too
> when you have multiple small files (there's even an example in the
> recent xz man page). Please explain how any of these help with a
> multigigabyte file. That's where people want xz to use threads. There
> is more than one way to parallelize the compression, and some of them
> increase encoder memory usage quite a lot.

At the moment, what are the plans for and advantages of multithreading
(for both compression and decompression)?
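
(One way I could imagine compression going, sketched sequentially here
-- real code would hand each chunk to a worker thread, and the chunk
size is arbitrary.  Concatenated .xz streams are valid decoder input,
which is what makes the split legal:)

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <lzma.h>

  #define CHUNK ((size_t)8 << 20)   /* 8 MiB per chunk, arbitrary */

  int main(void)
  {
      uint8_t *in = malloc(CHUNK);
      size_t out_cap = lzma_stream_buffer_bound(CHUNK);
      uint8_t *out = malloc(out_cap);
      size_t n;

      if (in == NULL || out == NULL)
          return 1;

      /* Each chunk becomes an independent .xz stream; a threaded
       * version would encode several chunks concurrently and write
       * the results in order.  The cost: the dictionary restarts at
       * every chunk boundary, so the ratio drops a little. */
      while ((n = fread(in, 1, CHUNK, stdin)) > 0) {
          size_t out_pos = 0;
          if (lzma_easy_buffer_encode(6, LZMA_CHECK_CRC64, NULL,
                                      in, n, out, &out_pos,
                                      out_cap) != LZMA_OK)
              return 1;
          fwrite(out, 1, out_pos, stdout);
      }

      free(in);
      free(out);
      return 0;
  }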
 
> > > I think it is quite obvious that you want the number of threads to
> > > be limited so that xz won't accidentally exceed the total amount
> > > of physical RAM, because then it is much slower than using fewer
> > > threads.
> > 
> > This tells me xz cannot fully parallelize its effort across the
> > CPUs, and should be single-threaded so as not to waste effort on
> > parallelization overhead.
> 
> Sure, it cannot "fully" parallelize, whatever that means. But the
> amount of parallelization that is possible is welcomed by many others
> (you are the very first person to think it's useless). For example,
> 7-Zip can use any number of threads with .xz files and there are some
> liblzma-based experimental tools too.
> 
> The next question would be how to determine how many threads are OK
> for multithreaded decompression. It doesn't "fully" parallelize
> either, and would be possible only in certain situations. There too
> the memory usage grows quickly when threads are added. To me, a
> memory usage limit together with a limit on number of threads looks
> good; with no limits, the decompressor could end up reading the whole
> file into RAM (and swap). Threaded decompression isn't so important
> though, so I'm not even sure if I will ever implement it.

I'd say offer it as an option, if you want.
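
(And if it happens, something like this back-off is what I would
expect as the default thread count -- a sketch only; lzma_physmem()
and lzma_easy_encoder_memusage() are real liblzma calls, while
_SC_NPROCESSORS_ONLN is a common POSIX extension:)

  #include <inttypes.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <lzma.h>

  int main(void)
  {
      /* Per-thread encoder memory usage at preset -9. */
      uint64_t per_thread = lzma_easy_encoder_memusage(9);
      uint64_t ram = lzma_physmem();  /* total physical RAM, 0 if unknown */
      long cores = sysconf(_SC_NPROCESSORS_ONLN);
      uint64_t threads;

      if (ram == 0 || cores < 1)
          return 1;

      /* Start with one thread per core, then back off until the
       * estimated total fits in physical RAM. */
      threads = (uint64_t)cores;
      while (threads > 1 && threads * per_thread > ram)
          --threads;

      printf("per thread: %" PRIu64 " MiB, RAM: %" PRIu64 " MiB, "
             "threads: %" PRIu64 "\n",
             per_thread >> 20, ram >> 20, threads);
      return 0;
  }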

> > If I specify -9 or --best, but no memory option, that means
> > "compress as hard as you can".
> 
> With xz it isn't and never will be. The "compress as hard as you
> can" option would currently use a 1.5 GiB dictionary, which is a
> waste of memory when compressing files that are a lot smaller than
> that. The
> format of the LZMA2 algorithm used in .xz supports dictionaries up to
> 4 GiB. The decoder supports all dictionary sizes, but the current
> encoder is limited to 1.5 GiB for implementation reasons.
> 
> You would need a little over 16 GiB of memory to compress with a
> 1.5 GiB dictionary using the BT4 match finder. I don't think you
> honestly want -9 to be that. Instead, -9 is set to an arbitrary
> point of a 64
> MiB dictionary, which still can make sense in many common situations.
> That currently uses 674 MiB of memory to compress and a little more
> than the dictionary size to decompress, so I round it up to 65 MiB.
> 
> The dictionary size is only one thing to get high compression. It 
> depends on the file. Some files benefit a lot when dictionary size 
> increases while others benefit mostly from spending more CPU cycles. 
> That's why there is the --extreme option. It allows improving the 
> compression ratio by spending more time without requiring so much RAM.
> 
> The existence of --extreme (-e) naturally makes things slightly more 
> complicated for a user than using only a linear single-digit scale
> for compression levels, but makes it easier to specify what is wanted 
> without requiring the user to read about the advanced options. Note
> that I plan to revise exactly which settings are bound to the
> different compression levels before the 5.0.0 release.
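
(The per-preset figures quoted above can be queried from liblzma
directly -- a small sketch, link with -llzma:)

  #include <inttypes.h>
  #include <stdio.h>
  #include <lzma.h>

  int main(void)
  {
      uint32_t preset;

      /* Encoder and decoder memory requirements for -0 ... -9,
       * plus the --extreme encoder variants. */
      for (preset = 0; preset <= 9; ++preset) {
          printf("-%" PRIu32 ":  enc %5" PRIu64 " MiB"
                 "  dec %4" PRIu64 " MiB"
                 "  enc -e %5" PRIu64 " MiB\n",
                 preset,
                 lzma_easy_encoder_memusage(preset) >> 20,
                 lzma_easy_decoder_memusage(preset) >> 20,
                 lzma_easy_encoder_memusage(
                         preset | LZMA_PRESET_EXTREME) >> 20);
      }
      return 0;
  }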

We've pondered switching our packages from .tbz to .xz or .tar.xz.
Given that a package is made once, but downloaded and decompressed by
a lot of users a lot of times, it would probably make sense to go for
the smallest possible size; however, if this would mean that some
users won't be able to decompress the packages, then xz probably isn't
the tool for us.

Speaking of sizes, do you have any statistical data regarding source
size, compression options, compression speed, and decompression speed
(and memory usage, since we're talking about it)?


Thanks,

-- 
IOnut - Un^d^dregistered ;) FreeBSD "user"
  "Intellectual Property" is   nowhere near as valuable   as "Intellect"
FreeBSD committer -> itetcu at FreeBSD.org, PGP Key ID 057E9F8B493A297B