printf(1) and UTF-8 multi-byte chars
Polytropon
freebsd at edvax.de
Mon Oct 19 15:33:09 UTC 2020
On 18 Oct 2020 14:05:46 -0400, John R. Levine wrote:
> > There are good reasons for using all three levels, here are some:
> >
> > Bytes: Content length headers, malloc calls - storage related
>
> Sure.
>
> > Glyphs: Truncation, apparent length, sorting - appearance related
>
> Not so much. I suppose it's preferable to truncate at a glyph boundary,
> [...]
Depends. Some glyphs depicting ligatures decay to different
single characters upon truncation / hyphenation.
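
To illustrate the ligature case, here is a minimal Python sketch. It
assumes the ligature is stored as a single code point (U+FB01, LATIN
SMALL LIGATURE FI) and uses NFKD compatibility decomposition as one
possible way to split it before cutting. The helper name
truncate_compat is made up for this example.

```python
import unicodedata

def truncate_compat(text, n):
    """Decompose compatibility characters (e.g. ligatures), then keep n code points."""
    return unicodedata.normalize("NFKD", text)[:n]

word = "\ufb01nal"  # "final" written with the single-code-point "fi" ligature

# Naive truncation keeps the whole ligature, i.e. two visible letters.
print(repr(word[:1]))

# Decomposing first lets the cut fall between "f" and "i".
print(repr(truncate_compat(word, 1)))
```

Whether splitting a ligature this way is acceptable depends on the
script and on the font, so take it as a demonstration of the problem
rather than a recommended truncation routine.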
> [...]
> but sorting UTF-8 bytes gives you the same order as sorting the glyphs,
> and for useful sorting you need to deal with issues like normalized forms
> and case folding. Not sure what use apparent length would be since the
> number of glyphs tells you neither the number of visible characters nor
> how wide they are.
Exactly that is the main problem: "byte length != string
length" is a general issue with non-ASCII text. Processing
and even displaying it can be quite tricky, and printf() is not
trivial anymore... ;-)
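
A small Python sketch of how the three lengths drift apart. The
function name measure and the sample strings are invented for this
example, and the column count is only an approximation: it skips
combining marks and counts East Asian wide and fullwidth characters
as two columns, which is not a complete width algorithm.

```python
import unicodedata

def measure(text):
    """Return (UTF-8 bytes, code points, approximate terminal columns)."""
    n_bytes = len(text.encode("utf-8"))
    n_points = len(text)
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks take no column of their own
        # Wide (W) and fullwidth (F) characters usually occupy two columns.
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return n_bytes, n_points, columns

# Plain ASCII, precomposed umlaut, "e" + combining acute, two CJK characters.
samples = ["abc", "M\u00fcller", "e\u0301", "\u65e5\u672c"]
for s in samples:
    print(repr(s), measure(s))
```

Only for the ASCII sample do all three numbers agree, so any padding
or truncation has to decide which of them it actually means.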
--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...