printf(1) and UTF-8 multi-byte chars

Mon Oct 19 15:33:09 UTC 2020

On 18 Oct 2020 14:05:46 -0400, John R. Levine wrote:
> > 	There are good reasons for using all three levels, here are some:
> >
> > Bytes: Content length headers, malloc calls - storage related
> 
> Sure.
> 
> > Glyphs: Truncation, apparent length, sorting - appearance related
> 
> Not so much.  I suppose it's preferable to truncate at a glyph boundary, 
> [...]

Depends. Some glyphs depicting ligatures decay into separate
single characters upon truncation / hyphenation.
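
A small illustration of that decay, as a sketch only. It assumes the
ligature is stored as one of the Unicode compatibility code points
(here U+FB03, the "ffi" ligature) and that NFKC normalization is an
acceptable way to expand it. Ligatures that a font forms at rendering
time never appear in the string, so this does not cover them. The
function name is made up for the example.

```python
import unicodedata

def truncate_expanding_ligatures(text, limit):
    """Expand compatibility ligatures, then keep at most `limit` code points.

    NFKC replaces e.g. U+FB03 with the three letters "ffi", so the cut
    can fall inside what was displayed as a single ligature glyph.
    """
    expanded = unicodedata.normalize("NFKC", text)
    return expanded[:limit]

word = "o\ufb03ce"                            # "office" written with the ffi ligature
naive = word[:3]                              # cuts after the ligature: "o", "ffi", "c"
careful = truncate_expanding_ligatures(word, 3)   # "off"
```

Cutting the raw string at three code points keeps the whole ligature
plus one more letter, so the result shows five letters. Expanding
first gives the three letters one would expect.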

> [...]
> but sorting UTF-8 bytes gives you the same order as sorting the glyphs, 
> and for useful sorting you need to deal with issues like normalized forms 
> and case folding.  Not sure what use apparent length would be since the 
> number of glyphs tells you neither the number of visible characters nor 
> how wide they are.

That is exactly the main problem: "byte length != string
length" is a general issue with non-ASCII text. Processing
and even displaying it can be quite tricky, and printf() is
not trivial anymore... ;-)
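
To make the three lengths concrete, here is a rough sketch in Python.
The column estimate is only an approximation built on
unicodedata.east_asian_width() and unicodedata.combining(); it ignores
other zero-width characters and whatever the terminal actually does.
The helper names are made up for the example, and pad_by_bytes() just
imitates a "%-Ns" conversion that counts the field width in bytes.

```python
import unicodedata

def lengths(text):
    """Return (UTF-8 bytes, code points, estimated terminal columns)."""
    n_bytes = len(text.encode("utf-8"))
    n_points = len(text)
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the column of their base character
        # Wide and Fullwidth characters usually occupy two columns
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return n_bytes, n_points, columns

def pad_by_bytes(text, width):
    """Left-justify by counting bytes, like a byte-based %-Ns would."""
    return text + " " * max(0, width - lengths(text)[0])

def pad_by_columns(text, width):
    """Left-justify by counting estimated display columns."""
    return text + " " * max(0, width - lengths(text)[2])

precomposed = "M\u00fcller"     # u-umlaut as a single code point
decomposed = "Mu\u0308ller"     # "u" followed by a combining diaeresis
cjk = "\u65e5\u672c"            # two wide characters
```

Both spellings of the name take six columns on screen, yet they differ
in bytes and in code points, so a byte-counted field width pads them
too short and by different amounts.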

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...