[Bug 268189] BSD tar incorrectly encodes UTF-8 sequences

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 06 Dec 2022 08:22:58 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268189

            Bug ID: 268189
           Summary: BSD tar incorrectly encodes UTF-8 sequences
           Product: Base System
           Version: 13.1-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: bin
          Assignee: bugs@FreeBSD.org
          Reporter: aeder@list.ru

BSD tar incorrectly encodes UTF-8 sequences.

How to repeat:
Create two directories whose names are the following UTF-8 byte sequences
(shown in hex):

d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b8 cc 86
d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b9

("полевой" and "полевой"). They look exactly the same, but they are actually
different names.

The difference is that the sequence 'd0 b9' encodes the single Cyrillic
character 'й' (U+0439), whereas 'd0 b8 cc 86' encodes two code points: Cyrillic
'и' (U+0438) followed by a combining diacritic (U+0306, COMBINING BREVE), which
I can't enter here.
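
The difference between the two byte sequences can be checked with a short
Python sketch. It is only an illustration and not part of the original report.
The variable names are made up for the example, and it assumes nothing beyond
Python's standard unicodedata module.

```python
import unicodedata

# The two names from the report, built from their exact UTF-8 bytes.
decomposed = bytes.fromhex(
    "d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b8 cc 86"
).decode("utf-8")
precomposed = bytes.fromhex(
    "d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b9"
).decode("utf-8")

# Show the code points that make up each name.
for label, name in (("decomposed", decomposed), ("precomposed", precomposed)):
    points = " ".join(f"U+{ord(ch):04X}" for ch in name)
    print(f"{label}: {len(name)} code points: {points}")

# The strings differ byte for byte and code point for code point ...
print("equal as strings:", decomposed == precomposed)

# ... but they are canonically equivalent: Unicode normalization maps one
# onto the other (NFC composes, NFD decomposes).
print("equal after NFC:",
      unicodedata.normalize("NFC", decomposed) == precomposed)
print("equal after NFD:",
      unicodedata.normalize("NFD", precomposed) == decomposed)
```

Any tool that applies such a normalization to file names will map both names to
the same string, which would match the behaviour described in this report.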

You can create such directories or files, but when they are archived with BSD
tar, the second name is replaced by the first name.
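
A possible way to script the reproduction is sketched below in Python. The
helper names make_dirs and archive_and_list are hypothetical and chosen only
for this example. The sketch assumes that the tar command on PATH is the
implementation under test (bsdtar on FreeBSD) and that the file system stores
file names as raw bytes without normalizing them.

```python
import os
import subprocess
import tempfile

# Byte-exact directory names from the report.
NAME_DECOMPOSED = bytes.fromhex("d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b8 cc 86")
NAME_PRECOMPOSED = bytes.fromhex("d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b9")


def make_dirs(root):
    """Create both directories under root using raw byte names.

    Returns the sorted directory listing of root as bytes objects.
    """
    root_b = os.fsencode(root)
    for name in (NAME_DECOMPOSED, NAME_PRECOMPOSED):
        os.mkdir(os.path.join(root_b, name))
    return sorted(os.listdir(root_b))


def archive_and_list(root, tar_cmd="tar"):
    """Archive root with tar_cmd and return the lines printed by `tar -tf`.

    tar_cmd is an assumption: pass the name of whichever tar binary
    should be tested. Lines are returned as bytes, undecoded.
    """
    archive = os.path.join(tempfile.mkdtemp(), "test.tar")
    subprocess.run([tar_cmd, "-cf", archive, "-C", root, "."], check=True)
    listing = subprocess.run(
        [tar_cmd, "-tf", archive], check=True, capture_output=True
    )
    return listing.stdout.splitlines()
```

Calling make_dirs(tempfile.mkdtemp()) should return two distinct names. Passing
the same directory to archive_and_list and counting the distinct lines then
shows whether the tar under test kept the two entries apart. GNU tar, if
installed (on FreeBSD it is often available as gtar), can be compared by
passing its name as tar_cmd.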

Adding the --posix option or setting LC_ALL=C doesn't help.

GNU tar handles such files correctly, as separate files/directories.

I think at least --posix (or some other option) should make it possible to
COMPLETELY disable all filename encoding/decoding operations.

The problem also occurs in 12.3-RELEASE, but seems to be absent in the
10-RELEASE versions.
