SMP Version of tar

Tue Oct 2 05:17:02 UTC 2012

On Oct 1, 2012, at 9:51 AM, Brandon Falk wrote:

> I would be willing to work on a SMP version of tar (initially just gzip or something).
> 
> I don't have the best experience in compression, and how to multi-thread it, but I think I would be able to learn and help out.
> 
> Note: I would like to make this for *BSD under the BSD license. I am aware that there are already tools to do this (under GPL), but I would really like to see this existent in the FreeBSD base.
> 
> Anyone interested?

Great!

First rule:  be skeptical.  In particular, tar is so heavily disk-bound that many performance optimizations have no measurable impact.  If you don't do a lot of testing, you can end up wasting a lot of time.

There are a few different parallel command-line compressors and decompressors in ports; experiment a lot (with large files being read from and/or written to disk) and see what the real effect is.  In particular, some decompression algorithms are actually faster than memcpy() when run on a single processor.  Parallelizing such algorithms is not likely to help much in the real world.
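As a starting point for that kind of experiment, here is a minimal sketch that times single-threaded compression and decompression in memory and compares them with a plain buffer copy.  It uses Python's standard zlib, bz2, and lzma modules as stand-ins for the C libraries; the helper names and the sample data are made up for illustration.  It measures only the CPU side, so the numbers still have to be compared against what the disk can actually deliver.

```python
import bz2
import lzma
import os
import time
import zlib


def throughput_mb_s(func, payload, plain_size):
    """Time one call of func(payload); return (result, MB/s of plain data)."""
    start = time.perf_counter()
    result = func(payload)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return result, plain_size / elapsed / 1e6


def benchmark(data):
    """Return {name: (compress MB/s, decompress MB/s)} for three codecs."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        packed, c_rate = throughput_mb_s(compress, data, len(data))
        unpacked, d_rate = throughput_mb_s(decompress, packed, len(data))
        if unpacked != data:
            raise RuntimeError(f"{name} round trip failed")
        results[name] = (c_rate, d_rate)
    # Baseline: a plain in-memory copy of the same data (roughly memcpy()).
    _, copy_rate = throughput_mb_s(bytes, bytearray(data), len(data))
    results["copy"] = (copy_rate, copy_rate)
    return results


if __name__ == "__main__":
    # Hypothetical sample: mildly compressible data; use real files in practice.
    sample = (os.urandom(64) * 64 + b"tar payload " * 512) * 64
    for name, (c_rate, d_rate) in benchmark(sample).items():
        print(f"{name:6s} compress {c_rate:9.1f} MB/s   decompress {d_rate:9.1f} MB/s")
```

If a codec's single-threaded rate is already well above the disk's sustained throughput, extra threads are unlikely to show up in wall-clock time.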

The two popular algorithms I would expect to benefit most are bzip2 compression and lzma compression (targeting xz or lzip format).  For decompression, bzip2 is block-oriented, so it fits SMP pretty naturally.  Other popular algorithms are stream-oriented and less amenable to parallelization.

Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2 implementation that's already under a BSD license.  You should be able to get a lot of ideas about how to implement a parallel compression algorithm.  Better yet, you might be able to reuse a lot of the existing pbzip2 code.
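The basic idea can be sketched in a few lines: cut the input into chunks, compress each chunk independently on a worker thread, and concatenate the resulting bzip2 streams in order.  This is a simplified illustration in Python, not pbzip2's actual code; the function name, the chunk size, and the worker count are assumptions.  It relies on standard decompressors accepting concatenated bzip2 streams and, to my understanding, on CPython's bz2 module releasing the interpreter lock while it compresses.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 900 * 1000  # assumption: one chunk per 900k bzip2 block


def parallel_bzip2(data, workers=4, chunk_size=CHUNK_SIZE):
    """Compress fixed-size chunks independently and concatenate the streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid (empty) stream
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the streams stay in sequence.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = bytes(range(256)) * 10000
    packed = parallel_bzip2(payload)
    # An ordinary bzip2 decompressor reads the concatenated streams back.
    print(len(payload), "->", len(packed), bz2.decompress(packed) == payload)
```

Because every chunk is a complete stream, the output stays readable by a plain bunzip2, at the cost of a slightly worse ratio than one continuous stream.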

Mark Adler's pigz is also worth studying.  It's also license-friendly, and is built on top of regular zlib, which is a nice technique when it's feasible.
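A rough sketch in the spirit of pigz, using only stock zlib calls through Python's zlib module: each chunk is deflated on its own thread, primed with the preceding 32 KiB of input as a dictionary, and ended with a sync flush so that the pieces can be appended into a single gzip member.  This is an illustration under simplifying assumptions, not pigz's real implementation; the names and chunk size are invented, and the CRC is computed serially here rather than per chunk.

```python
import gzip
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 128 * 1024  # assumption: arbitrary chunk size for this sketch
WINDOW = 32 * 1024       # deflate's maximum back-reference distance


def _deflate_chunk(job):
    """Deflate one chunk as raw deflate data (no zlib or gzip header)."""
    chunk, dictionary, is_last = job
    if dictionary:
        # Prime the compressor with the input that precedes this chunk.
        comp = zlib.compressobj(6, zlib.DEFLATED, -15, zdict=dictionary)
    else:
        comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    # Z_SYNC_FLUSH ends on a byte boundary without marking the final block,
    # so the next chunk's output can simply be appended.
    mode = zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH
    return comp.compress(chunk) + comp.flush(mode)


def parallel_gzip(data, workers=4, chunk_size=CHUNK_SIZE):
    """Return a single-member gzip file built from chunks deflated in parallel."""
    offsets = list(range(0, max(len(data), 1), chunk_size))
    jobs = [
        (data[o:o + chunk_size], data[max(0, o - WINDOW):o], o == offsets[-1])
        for o in offsets
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        body = b"".join(pool.map(_deflate_chunk, jobs))
    # Minimal gzip header: magic, deflate, no flags, no mtime, unknown OS.
    header = b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff"
    # Trailer: CRC-32 and length (mod 2**32) of the uncompressed data.
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + body + trailer


if __name__ == "__main__":
    payload = b"libarchive and bsdtar " * 50000
    packed = parallel_gzip(payload)
    print(len(payload), "->", len(packed), gzip.decompress(packed) == payload)
```

The dictionary lets each chunk refer back to the data before it, so the compression ratio stays close to a single-threaded run even though the chunks are compressed independently.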

There are three fundamentally different implementation approaches with different complexity/performance issues:

  * Implement as a stand-alone executable similar to pbzip2.  This makes your code a lot simpler and makes it reasonably easy for people to reuse your work.  This could work with tar, though it could be slightly slower than an in-process implementation due to the additional data-copying and process-switch overhead.

  * Implement within libarchive directly.  This would benefit tar and a handful of other programs that use libarchive, but may not be worth the complexity.

  * Implement as a standalone library with an interface similar to zlib or libbz2 or liblzma.

The last would be my personal preference, though it's probably the most complex of all.  A library like that could easily be used from libarchive, and you could create a simple stand-alone wrapper around it as well, giving you the best of all worlds.

If you could extend the pigz technique, you might be able to build a multi-threaded compression library where the actual compression was handled by an existing single-threaded library.  Since zlib, libbz2, and liblzma already have similar interfaces, your layer might require only a thin adapter to handle any of those three.  *That* would be very interesting, indeed.
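To make the shape of such a layer concrete, here is a hypothetical sketch in Python: a streaming object with a zlib-like compress()/flush() interface, where the only codec-specific part is a table of single-threaded one-shot compressors.  The class name, the table, and the parameters are all invented for illustration.  It takes the simpler route of emitting independent concatenated streams, which the gzip, bzip2, and xz formats all permit, rather than pigz's dictionary priming.

```python
import bz2
import gzip
import lzma
from concurrent.futures import ThreadPoolExecutor

# The only codec-specific part: existing single-threaded one-shot compressors.
CODECS = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "xz": lzma.compress,
}


class ParallelCompressor:
    """Hypothetical streaming wrapper with a zlib-like compress()/flush() API."""

    def __init__(self, codec, workers=4, chunk_size=1 << 20):
        self._compress = CODECS[codec]
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._chunk_size = chunk_size
        self._buffer = bytearray()
        self._pending = []      # futures, kept in submission order
        self._started = False   # True once any chunk has been submitted

    def _submit(self, chunk):
        self._started = True
        self._pending.append(self._pool.submit(self._compress, chunk))

    def _drain(self, wait):
        """Collect finished chunks from the head of the queue, in order."""
        out = []
        while self._pending and (wait or self._pending[0].done()):
            out.append(self._pending.pop(0).result())
        return b"".join(out)

    def compress(self, data):
        """Queue data for compression; return whatever output is ready."""
        self._buffer += data
        while len(self._buffer) >= self._chunk_size:
            self._submit(bytes(self._buffer[:self._chunk_size]))
            del self._buffer[:self._chunk_size]
        return self._drain(wait=False)

    def flush(self):
        """Compress the buffered tail, wait for all workers, return the rest."""
        if self._buffer or not self._started:
            self._submit(bytes(self._buffer))
            self._buffer.clear()
        out = self._drain(wait=True)
        self._pool.shutdown()
        return out


if __name__ == "__main__":
    pc = ParallelCompressor("xz", workers=4)
    packed = pc.compress(b"FreeBSD base system " * 200000) + pc.flush()
    print(len(packed), "bytes of xz output")
```

A caller such as libarchive would feed it data the same way it feeds a single-threaded compressor.  A real library would also need to bound the queue of pending chunks so that memory use stays predictable.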

Sounds like a fun project.  I wish I had time to work on something like this.

Cheers,

Tim