Problems submitting patch containing UTF-8 characters

Michael Gmelin freebsd at grem.de
Sun Sep 30 12:55:53 UTC 2012



On Sun, 30 Sep 2012 05:08:03 +0200
Michael Gmelin <freebsd at grem.de> wrote:

> Hi,
> 
> I recently ran into a problem submitting a PR containing UTF-8
> characters: they ended up garbled, so the maintainer couldn't apply
> the patch cleanly.
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645
> 
> The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two
> three-byte characters). The affected code tests UTF-8 handling, so
> the characters are required. Even if they were not, a patch removing
> them would still have to contain them.
> 
> The original e-mail was created using porttools and therefore had no
> character set specification, which usually shouldn't be a problem. The
> patch was just inline as part of the body.
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1
> 
> The character sequence had been recoded to
> 0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd
> 
> It seems it was interpreted as Latin-1 on receipt and then
> re-encoded as UTF-8:
> 0xe4 => 0xc3 0xa4
> 0xb8 => 0xc2 0xb8
> 0xad => 0xc2 0xad
> 0xe5 => 0xc3 0xa5
> 0x9b => 0xc2 0x9b
> 0xbd => 0xc2 0xbd
> 
> That is obviously not what should happen. The recipient shouldn't
> make any assumptions about the character set used.
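
A minimal Python sketch of the suspected round trip. It assumes the
receiving side decodes the raw body as Latin-1 and re-encodes it as
UTF-8; that is inferred from the byte patterns, not confirmed from the
PR system's code.

```python
# Original UTF-8 bytes of the two characters in the patch.
original = bytes([0xE4, 0xB8, 0xAD, 0xE5, 0x9B, 0xBD])

# Assumed failure mode: each byte is read as a Latin-1 code point and
# then written out again as UTF-8 (one byte becomes two).
garbled = original.decode("latin-1").encode("utf-8")
print(garbled.hex(" "))  # c3 a4 c2 b8 c2 ad c3 a5 c2 9b c2 bd

# Reversing the two steps recovers the original bytes, provided
# nothing was truncated in between.
repaired = garbled.decode("utf-8").encode("latin-1")
print(repaired == original)  # True
```

If the assumption holds, the damage to an inline patch is lossless and
a maintainer could undo it by reversing the two steps.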
> 
> The next attempt was sending the patch as a bug-followup through a
> graphical MUA. The patch was attached and had been encoded as
> quoted-printable (no specific charset specification):
> 
> +-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
> ++configPath =3D
> u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=")
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2
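
A small sketch using Python's standard quopri module to show that the
quoted-printable layer itself carries the bytes intact. The ambiguity
only appears afterwards, when the decoded bytes have to be read under
some character set. The variable names are illustrative.

```python
import quopri

# First changed line of the attached patch, as it appears in the PR.
qp_line = b'+-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"'

# Undo the transfer encoding only; the result is raw bytes.
raw = quopri.decodestring(qp_line)

as_utf8 = raw.decode("utf-8")      # what the sender meant
as_latin1 = raw.decode("latin-1")  # what a Latin-1 default would yield

print(as_utf8)
print(as_latin1 == as_utf8)  # False
```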
> 
> Unfortunately the results are the same. I did not try forcing a
> charset by manually modifying the email (not sure if this will work,
> I'm willing to test, but I don't want to further litter that PR).
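
For reference, a sketch of what forcing a charset could look like,
built with Python's standard email package. The subject line and the
patch line are placeholders, and whether the PR system honours the
declared charset is exactly the open question, so this only shows how
such a message would be constructed.

```python
from email.message import EmailMessage

# Placeholder patch line; \u4e2d\u56fd are the two characters at issue.
patch_text = (
    '+configPath = u"./config/\u4e2d\u56fd_client.config"'
    '.encode("utf-8")\n'
)

msg = EmailMessage()
msg["Subject"] = "Re: PR 171645 (placeholder subject)"

# Declare the charset explicitly and use base64 so that the message
# on the wire is pure 7-bit ASCII.
msg.set_content(patch_text, charset="utf-8", cte="base64")

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
```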
> 
> At this point I figured that sending the patch in gzipped format
> might help. No sooner said than done: the patch shows up as base64 in
> the PR. When I copy and paste the base64 text and decode it, the
> resulting .gz decompresses correctly and the content is what I
> expected. The download link, however, behaves differently:
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3
> 
> The resulting .gz file has the correct file size, but is corrupted.
> In a hex editor it looks as if it has been re-encoded as UTF-8 (and
> then truncated at the expected file size):
> 
> Hex of the original file (first 16 bytes):
> 1f 8b 08 08 ad 79 65 50  00 03 70 79 32 37 2d 49
> 
> Hex of the file downloaded by using the link:
> 1f c2 8b 08 08 c2 ad 79  65 50 00 03 70 79 32 37
> 
> As you can see, every byte above 0x7f has been UTF-8 encoded, which
> corrupts a binary file.
> 
> 0x8b => 0xc2 0x8b
> 0xad => 0xc2 0xad
> ...
> 
> As a result, the truncated, UTF-8-encoded gzip file cannot be
> decompressed.
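
The same assumed transformation (decode as Latin-1, encode as UTF-8,
then cut to the original length) reproduces the downloaded header from
the original one. A short Python sketch, using only the 16 bytes
quoted here:

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file

original_head = bytes.fromhex(
    "1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49"
)

# Assumed failure mode: bytes >= 0x80 grow to two bytes each, then the
# result is truncated to the original size.
reencoded = original_head.decode("latin-1").encode("utf-8")
downloaded_head = reencoded[:len(original_head)]

print(downloaded_head.hex(" "))
# 1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37

# The magic number no longer matches, so gzip rejects the file.
print(downloaded_head.startswith(GZIP_MAGIC))  # False
```

Unlike the inline patch, this damage is not recoverable from the
download, because the truncation discards the tail of the file.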
> 
> I'm relatively certain that this has worked at some point in the past.
> 
> Ideas anyone?
> 
> Thanks,
> 

By the way, the two three-byte sequences together mean "China"; see
also http://goo.gl/4muUF

-- 
Michael Gmelin

