FreeBSD ports USE_XZ critical issue on low-RAM computers
lasse.collin at tukaani.org
Sun Jun 20 15:23:07 UTC 2010
On 2010-06-20 Matthias Andree wrote:
> On 19.06.2010 15:41, Lasse Collin wrote:
> > Perhaps FreeBSD provides a good working way to limit the amount of
> > memory that a process actually can use. I don't see such a way e.g.
> > in Linux, so having some method in the application to limit memory
> > usage is definitely nice. It's even more useful in the compression
> > library, because a virtual-memory-hog application on a busy server
> > doesn't necessarily want to use tons of RAM for decompressing data
> > from untrusted sources.
> Even there the default should be "max", and the library SHOULD NOT
> second-guess what trust level of data the application might want to
> process with libxz's help.
There is no default value for the memory limit in liblzma (not libxz,
for historical reasons). You can specify UINT64_MAX if you want. Please
don't complain about how the library sucks without looking at its API
first.
Don't confuse the limiter _feature_ with its _default value_; there is a
default value only in the command line tools.
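To illustrate the limiter as a feature with no built-in default, here is a
minimal sketch using Python's standard lzma module, which wraps liblzma.
Python, the sample data and the helper names unpack() and fits_in() are
assumptions made for illustration; the post itself is about the C API. The
caller either passes no limit at all or chooses the number itself.

```python
import lzma

# Sample input (assumed). It is compressed with the module's default preset,
# whose LZMA2 dictionary is several MiB, so decoding needs more than 1 MiB.
data = b"freebsd-ports " * 10000
packed = lzma.compress(data)


def unpack(blob, memlimit=None):
    """Decompress blob, optionally under a caller-chosen memory limit."""
    decomp = lzma.LZMADecompressor(memlimit=memlimit)
    return decomp.decompress(blob)


def fits_in(blob, memlimit):
    """Return True if blob can be decoded within memlimit bytes."""
    try:
        unpack(blob, memlimit)
        return True
    except lzma.LZMAError:
        return False


print(unpack(packed) == data)      # no limit requested by the caller
print(fits_in(packed, 1 << 20))    # 1 MiB limit chosen by the caller
print(fits_in(packed, 64 << 20))   # 64 MiB limit chosen by the caller
```

An application handling untrusted input would pick its own limit here, while
one that trusts its data simply leaves memlimit out.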
> Expose the limiter interface in the API if you want, but for the
> library in particular, any other default than "unlimited
> memory" is a nuisance. And there's still an application, and unlike
> the xz library, the application should know what kind of data from
> what sources it is processing, and if - for instance - a virus
> inspector wants to impose memory limits and quarantine an attachment
> with what looks like a zip bomb.
Yes, this is exactly what I have done in liblzma, except that there is
no default value (typing UINT64_MAX isn't too much to ask).
> >> For compression, it's less critical because service is degraded,
> >> not denied, but I'd still think -M max would be the better
> >> default. I can always put "export XZ_OPT=-3" in
> >> /etc/profile.d/local.sh or wherever it belongs on the OS of the
> >> day.
> > If a script has "xz -9", it overrides XZ_OPT=-3.
> I know. This isn't a surprise for me. The memory limiting however is.
> And the memory limiting overrides xz -9 to something lesser, which
> may not be what I want either.
I have only one computer with over 512 MiB of RAM (this one has 8 GiB).
Thus "xz -9" is usable on only one of my computers. I cannot go and fix
all scripts so that they first check how much RAM I have and then pick a
reasonable compression level. Making "xz -9" so low that it would be
usable on all systems with e.g. 256 MiB of RAM or more doesn't look good
either (settings higher than the current "xz -9" are possible; they
usually just aren't that useful, and even -9 is not always much better
than slightly lower settings).
What do you think is the best solution to the above problem without
putting a default memory usage limit in xz? Setting something in XZ_OPT
might work in many cases, but sometimes scripts set it themselves e.g.
to pass compression settings to some other script calling xz. Maybe xz
should support a config file? Or maybe another environment variable,
which one could assume that scripts won't touch? These are honest
questions and answering them would help much more than long descriptions
of how the current method is bad.
> >> I still think utilities and applications should /not/ impose
> >> arbitrarily lower limits by default though.
> > There's no multithreading in xz yet, but when there is, do you want
> > xz to use as many threads as there are CPU cores _by default_? If
> > so, do you mind if compressing with "xz -9" used around 3.5 GiB of
> > memory on a four-core system no matter how much RAM it has?
> Multithreading in xz is worth discussion if the tasks can be
> parallelized, which is apparently not the case. You would be
> duplicating effort, because we have tools to run several xz on
> distinct files at the same time, for instance BSD portable make or
> GNU make with a "-j" option.
That's a nice way to avoid answering the question. xargs works too when
you have multiple small files (there's even an example in the recent xz
man page). Please explain how any of these help with a multigigabyte
file. That's where people want xz to use threads. There is more than one
way to parallelize the compression, and some of them increase encoder
memory usage quite a lot.
> > I think it is quite obvious that you want the number of threads to
> > be limited so that xz won't accidentally exceed the total amount
> > of physical RAM, because then it is much slower than using fewer
> > threads.
> This tells me xz cannot fully parallelize its effort on the CPUs, and
> should be single-threaded so as not to waste the parallelization
Sure, it cannot "fully" parallelize, whatever that means. But the amount
of parallelization that is possible is welcomed by many others (you are
the very first person to think it's useless). For example, 7-Zip can use
any number of threads with .xz files and there are some liblzma-based
experimental tools too.
The next question could be how to determine how many threads would be OK
for multithreaded decompression. It doesn't "fully" parallelize either,
and would be possible only in certain situations. There too the memory
usage grows quickly when threads are added. To me, a memory usage limit
together with a limit on the number of threads looks good; with no
limits, the decompressor could end up reading the whole file into RAM
(and swap). Threaded decompression isn't so important though, so I'm not
even sure if I will ever implement it.
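The idea of bounding the thread count by a memory limit could be sketched
roughly as follows. This is not how xz does or will do it; the function
pick_thread_count() and its fall-back-to-one-thread policy are hypothetical,
and the per-thread figure is simply the 3.5 GiB mentioned earlier in this
thread divided by four cores.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def pick_thread_count(cpu_cores, mem_limit, mem_per_thread):
    """Largest thread count whose total memory stays within mem_limit.

    Hypothetical policy: never return less than 1, so the job still runs
    single-threaded even if one thread alone would exceed the limit.
    """
    affordable = mem_limit // mem_per_thread
    return max(1, min(cpu_cores, affordable))


# About 3.5 GiB for "xz -9" on four cores is roughly 896 MiB per thread
# (an estimate derived from the figure in the thread, not a measurement).
per_thread = int(3.5 * GIB) // 4

print(pick_thread_count(4, 8 * GIB, per_thread))    # limit allows all cores
print(pick_thread_count(4, 2 * GIB, per_thread))    # limit allows two threads
print(pick_thread_count(4, 512 * MIB, per_thread))  # falls back to one thread
```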
> If I specify -9 or --best, but no memory option, that means "compress
> as hard as you can".
With xz it doesn't mean that and never will. The "compress as hard as
you can" option would currently use a 1.5 GiB dictionary, which is a
waste of memory when compressing files that are a lot smaller than that.
The LZMA2 format used in .xz supports dictionaries up to 4 GiB. The
decoder supports all dictionary sizes, but the current encoder is
limited to 1.5 GiB for implementation reasons.
You would need a little over 16 GiB of memory to compress with a 1.5 GiB
dictionary using the BT4 match finder. I don't think you honestly want
-9 to be that. Instead, -9 is set to an arbitrary point, a 64 MiB
dictionary, which still can make sense in many common situations. That
currently uses 674 MiB of memory to compress and a little more than the
dictionary size to decompress, so I round it up to 65 MiB.
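The figures above can be approximated with a small calculation. The factor
used here is an assumption that is not stated in the post: the xz man page
gives roughly 10.5 times the dictionary size as the BT4 encoder memory
usage for dictionaries above 32 MiB. The estimate ignores fixed overhead,
so it lands slightly below the numbers quoted in the post.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# Assumed rule of thumb (not from the post): BT4 encoder memory is roughly
# 10.5 times the dictionary size for dictionaries above 32 MiB.
BT4_FACTOR_LARGE_DICT = 10.5


def estimate_bt4_encoder_memory(dict_size):
    """Rough encoder memory estimate in bytes for a large BT4 dictionary."""
    return int(dict_size * BT4_FACTOR_LARGE_DICT)


# 64 MiB dictionary: compare with the 674 MiB quoted for -9.
print(estimate_bt4_encoder_memory(64 * MIB) / MIB)

# 1.5 GiB dictionary: compare with the roughly 16 GiB quoted in the post.
print(estimate_bt4_encoder_memory(1536 * MIB) / GIB)
```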
The dictionary size is only one factor in getting high compression, and
how much it matters depends on the file. Some files benefit a lot when
the dictionary size increases while others benefit mostly from spending
more CPU cycles.
That's why there is the --extreme option. It allows improving the
compression ratio by spending more time without requiring so much RAM.
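For a rough feel of what --extreme does, Python's standard lzma module
exposes the same flag as lzma.PRESET_EXTREME. The sketch below is
illustrative only: the sample data and the 32 MiB decoder limit are
assumptions, and it relies on my understanding that the extreme flag leaves
the preset's dictionary size unchanged, so the same limit should suit both
streams.

```python
import lzma

data = bytes(range(256)) * 800  # about 200 KiB of sample input (assumed)

normal = lzma.compress(data, preset=6)
extreme = lzma.compress(data, preset=6 | lzma.PRESET_EXTREME)

# Decode both streams under the same modest, caller-chosen memory limit.
LIMIT = 32 * 1024 * 1024
for label, blob in (("-6", normal), ("-6e", extreme)):
    out = lzma.LZMADecompressor(memlimit=LIMIT).decompress(blob)
    print(label, len(blob), out == data)
```

Whether the extreme variant actually comes out smaller depends on the input,
which is the point made above.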
The existence of --extreme (-e) naturally makes things slightly more
complicated for a user than using only a linear single-digit scale for
compression levels, but makes it easier to specify what is wanted
without requiring the user to read about the advanced options. Note that
I plan to revise what settings exactly are bound to different
compression levels before the 5.0.0 release.
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode