bsdtar and locale

Tim Kientzle tim at kientzle.com
Thu Dec 9 07:17:42 UTC 2010


On Dec 8, 2010, at 12:43 PM, Gennady Proskurin wrote:
> bsdtar ... if you archive some file with a UTF-8 name
> in the "C" locale (env LC_ALL=C tar -c ...), and then extract it in some UTF-8
> locale, its name will be corrupted. Such behaviour is somewhat documented in the
> archive_entry(3) and bsdtar(1) manpages, so this is not a bug but a feature.
> 
> I agree, such conversions can be useful in some cases, but they should be disabled
> by default (we are Unix, filenames are just binary data).
> It is very annoying; it makes you always think about locales while creating
> and extracting archives.

The extended tar format used by bsdtar comes from the POSIX standard:

http://www.opengroup.org/onlinepubs/9699919799/utilities/pax.html

The issue you mention is discussed in the standard:

> Translating filenames and other attributes from a locale's encoding to UTF-8 and then back again can lose information, as the resulting filename might not be byte-for-byte equivalent to the original. To avoid this problem, users can specify the -o hdrcharset=binary option, which will cause the resulting archive to use binary format for all names and attributes. Such archives are not portable among hosts that use different native encodings (e.g., EBCDIC versus ASCII-based encodings), but they will allow interchange among the vast majority of POSIX file systems in practical use. Also, the -o hdrcharset=binary option will cause pax in copy mode to behave more like other standard utilities such as cp.
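
To make the round-trip loss concrete, here is a rough Python sketch. It is not bsdtar or libarchive code: the file name is made up, ASCII stands in for the "C" locale's encoding, and replacing unconvertible bytes is just one possible lossy behaviour, so the exact corruption a real tool produces may differ.

```python
# Bytes of a UTF-8 file name as stored on disk (hypothetical example name).
raw_name = "caf\u00e9.txt".encode("utf-8")

def to_utf8_header(name_bytes, locale_encoding):
    """Convert a name from the locale's encoding to UTF-8.

    Bytes that are not valid in the locale's encoding are replaced,
    which is where information is lost (assumed behaviour for this sketch).
    """
    text = name_bytes.decode(locale_encoding, errors="replace")
    return text.encode("utf-8")

def from_utf8_header(header_bytes, locale_encoding):
    """Convert a UTF-8 header value back to the extracting locale's encoding."""
    text = header_bytes.decode("utf-8")
    return text.encode(locale_encoding, errors="replace")

# Archive in an ASCII ("C"-like) locale, then extract in a UTF-8 locale.
stored = to_utf8_header(raw_name, "ascii")
restored = from_utf8_header(stored, "utf-8")
print(restored == raw_name)       # False: the name did not survive

# Archive and extract in the same UTF-8 locale.
stored_same = to_utf8_header(raw_name, "utf-8")
restored_same = from_utf8_header(stored_same, "utf-8")
print(restored_same == raw_name)  # True: byte-for-byte identical
```

Storing the raw bytes untouched, as hdrcharset=binary does, avoids the conversion step entirely, at the cost of portability between hosts with different native encodings.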

bsdtar does not yet implement an option equivalent to -o hdrcharset=binary, but most of the logic is already implemented in libarchive. Libarchive's write support for pax format automatically switches to hdrcharset=binary for any entry whose name cannot be translated to UTF-8. It should be easy to add a way to explicitly request this handling for all entries.
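
The per-entry fallback described above might look roughly like the following Python sketch. This is only an illustration of the idea, not libarchive's actual implementation: the function name, the returned tuple, and the "binary" label are placeholders chosen for the example.

```python
def pax_name_attributes(name_bytes, locale_encoding):
    """Return (hdrcharset, stored_bytes) for one entry name.

    Hypothetical helper: names that decode cleanly in the locale's
    encoding are stored as UTF-8; anything else is stored unchanged
    and flagged as binary.
    """
    try:
        # Strict decoding raises if the bytes are not valid in this encoding.
        text = name_bytes.decode(locale_encoding)
    except UnicodeDecodeError:
        return ("binary", name_bytes)
    # None means "no hdrcharset override", i.e. the default UTF-8 handling.
    return (None, text.encode("utf-8"))

# A UTF-8 name seen from an ASCII ("C"-like) locale cannot be translated,
# so the original bytes are kept as-is.
fallback = pax_name_attributes(b"caf\xc3\xa9.txt", "ascii")

# A Latin-1 name translates cleanly and is stored as UTF-8.
converted = pax_name_attributes(b"caf\xe9.txt", "latin-1")
```

Explicitly requesting binary handling for all entries would amount to always taking the first return path, regardless of whether the name could be converted.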

Cheers,

Tim



More information about the freebsd-arch mailing list