www/172195: PR database corrupts patches
Michael Gmelin
freebsd at grem.de
Sun Sep 30 23:20:11 UTC 2012
The following reply was made to PR www/172195; it has been noted by GNATS.
From: Michael Gmelin <freebsd at grem.de>
To: bug-followup at FreeBSD.org
Cc:
Subject: Re: www/172195: PR database corrupts patches
Date: Mon, 1 Oct 2012 01:12:00 +0200
Analysis:
1. The PR system assumes an encoding other than UTF-8 as the default.
This means:
a) Patches uploaded through the web form get corrupted
b) Patches mailed as attachments without an explicit charset
specification get corrupted
c) Standard send-pr patches break - adding a UTF-8 charset header
manually will probably work, but is too easy to forget. It also
won't fix the download option.
2. The PR system can handle binary attachments correctly in its base64
view
3. Downloaded patches are corrupted in all cases!
a) File attached via webform:
fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=1" | hd
c3 a4 c2 b8 c2 ad
(should have been: e4 b8 ad e5 9b bd)
This looks like the input has been assumed to be latin1,
transcoded to UTF-8 and truncated.
b) File sent as follow up attachment without UTF-8 charset:
fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=2" | hd
c3 a4 c2 b8 c2 ad c3 a5
(should have been: e4 b8 ad e5 9b bd)
This looks like the input has been assumed to be latin1 and
transcoded to UTF-8.
c) File sent as follow up attachment WITH UTF-8 charset:
(this one shows up correctly on the web page, the download is
still broken though):
fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=3" | hd
e4 b8 ad e5
(should have been: e4 b8 ad e5 9b bd)
This looks like it got the encoding right, but can't handle
three-byte characters (string length calculation problem?!)
d) Gzipped version of the patch:
The base64 encoded version shown on the PR webpage is correct:
md5 china.txt.gz
MD5 (china.txt.gz) = 29009c79690c58b0762274da0e3ad80d
echo "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA==" \
| openssl enc -d -a | md5
29009c79690c58b0762274da0e3ad80d
Downloading through the download link fails though:
fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=4" | md5
ae9f2f3531871be8c4af662863eb542e
A closer look at the gzip file shows that there has been an attempt
to UTF-8 encode the binary content:
Original:
00000000 1f 8b 08 08 81 bb 68 50 00 03 63 68 69 6e 61 2e
00000010 74 78 74 00 7b b2 63 ed d3 d9 7b b9 00 b5 48 f4
00000020 b5 07 00 00 00
00000025
File as downloaded from the PR website:
00000000 1f c2 8b 08 08 c2 81 c2 bb 68 50 00 03 63 68 69
00000010 6e 61 2e 74 78 74 00 7b c2 b2 63 c3 ad c3 93 c3
00000020 99 7b c2 b9 00
00000025
As you can see, all 8-bit bytes have been UTF-8 encoded, and the
resulting file was truncated at the original file size.
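
The latin1 hypothesis from 3a) and 3d) can be checked offline. The
following is a minimal Python sketch, not the PR system's actual code:
the helper name reencode_and_truncate is made up for illustration, and
the byte values are copied from the dumps above. It treats every input
byte as a latin1 character, encodes the result as UTF-8, and cuts the
output at the original byte count. Case 3c) is not modeled here.

```python
import base64


def reencode_and_truncate(data: bytes) -> bytes:
    """Hypothetical model of the corruption: read the bytes as latin1,
    encode them as UTF-8, then cut at the original length."""
    transcoded = data.decode("latin-1").encode("utf-8")
    return transcoded[:len(data)]


# Text file: expected content and the downloads seen in 3a) and 3b).
expected_text = bytes.fromhex("e4 b8 ad e5 9b bd")
observed_3a = bytes.fromhex("c3 a4 c2 b8 c2 ad")
observed_3b = bytes.fromhex("c3 a4 c2 b8 c2 ad c3 a5")

# Gzip file: base64 view from the PR page and the download seen in 3d).
original_gz = base64.b64decode(
    "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA=="
)
observed_gz = bytes.fromhex(
    "1f c2 8b 08 08 c2 81 c2 bb 68 50 00 03 63 68 69 "
    "6e 61 2e 74 78 74 00 7b c2 b2 63 c3 ad c3 93 c3 "
    "99 7b c2 b9 00 "
)

# 3a) and 3d): transcoding plus truncation reproduces the downloads.
print(reencode_and_truncate(expected_text) == observed_3a)
print(reencode_and_truncate(original_gz) == observed_gz)

# 3b): the download is a prefix of the fully transcoded sequence.
full_text = expected_text.decode("latin-1").encode("utf-8")
print(full_text.startswith(observed_3b))
```

All three comparisons print True, which is consistent with the
explanation given for 3a) and 3d). For 3b) the download is only the
first 8 of the 12 bytes that a full transcoding would produce.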
Conclusion:
There is no simple way of submitting a patch through the PR system so
that it can be downloaded using the download link. Right now the
options are:
1. Send the file as an email attachment, making sure that the character
encoding in the MIME header is set to UTF-8 (not all email clients
will do this automatically). This way a patch can be acquired by
using copy and paste. The download link will still not work
correctly and will yield surprising results: a patch downloaded
that way might actually apply, but cause unintended behavior.
2. Send the file gzipped and make people use base64 decode to get the
gzip. This way when the download link is used people will at least
realize something went wrong.
3. Base64-encode the patch before sending it. This way everything
stays us-ascii and cannot be messed with by the PR system. It requires
users to base64-decode on their own and makes it hard to argue about
the patch in a way that's transparent to users of the web page.
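
Option 3 can be illustrated with a short Python sketch. The patch line
used here is a made-up sample, not taken from a real port. It shows
that the base64 form contains only us-ascii bytes, so a latin1 to
UTF-8 transcoding like the one observed above leaves it unchanged, and
that decoding restores the original bytes.

```python
import base64

# Made-up sample patch line containing non-ASCII text, stored as UTF-8.
patch = '+greeting = "\u4e2d\u56fd"\n'.encode("utf-8")

encoded = base64.b64encode(patch)

# Every byte is below 0x80, so reading it as latin1 and writing it as
# UTF-8 does not change a single byte.
only_ascii = all(b < 0x80 for b in encoded)
survived = encoded.decode("latin-1").encode("utf-8")

restored = base64.b64decode(survived)
print(only_ascii, survived == encoded, restored == patch)
```

The cost is the one named above: readers of the web page see only the
base64 text and have to decode it before they can review the patch.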
None of these options seems very appealing, especially since they
make it easy for people to get it wrong and hard to get it right. Also,
various tools used by port maintainers (porttools, send-pr etc.) might
not be prepared to help the user get it right. There will be more and
more UTF-8 encoded patches in the future, so I think this should be
fixed.
Suggested fixes:
- Change the default encoding (the encoding assumed when no encoding is
specified) to UTF-8. This might not be practical in all cases, but
should be discussed.
- Make sure that the download option provides correct files (it should
treat all files as binary and not try to alter them in any way).
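
As a rough illustration of both suggestions, here is a Python sketch.
It is not based on the GNATS or query-pr.cgi sources, and the function
names attachment_text and attachment_download are hypothetical. The
display path falls back to UTF-8 when no charset was declared, while
the download path returns the stored bytes untouched.

```python
from typing import Optional


def attachment_text(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode an attachment for display on the web page.
    Falls back to UTF-8 when the sender declared no charset."""
    return raw.decode(charset or "utf-8", errors="replace")


def attachment_download(raw: bytes) -> bytes:
    """Serve an attachment exactly as stored, with no transcoding."""
    return raw


# Byte values copied from the dumps in the analysis above.
text_patch = bytes.fromhex("e4 b8 ad e5 9b bd")
gzip_head = bytes.fromhex("1f 8b 08 08 81 bb 68 50")

print(attachment_text(text_patch))
print(attachment_download(text_patch) == text_patch)
print(attachment_download(gzip_head) == gzip_head)
```

Keeping the two paths separate means a wrong charset guess can only
affect what is shown on the page, never the file that gets downloaded.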
I hope all of this makes sense.
--
Michael Gmelin