printf(1) and UTF-8 multi-byte chars

Sun Oct 18 18:05:49 UTC 2020

> There are good reasons for using all three levels, here are some:
>
> Bytes: Content length headers, malloc calls - storage related

Sure.

> Glyphs: Truncation, apparent length, sorting - appearance related

Not so much.  I suppose it's preferable to truncate at a glyph boundary, 
but sorting UTF-8 bytes gives you the same order as sorting the glyphs, 
and for useful sorting you need to deal with issues like normalized forms 
and case folding.  Not sure what use apparent length would be since the 
number of glyphs tells you neither the number of visible characters nor 
how wide they are.
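
A rough sketch of both points, in Python. It reads "glyphs" above as
code points, which is an assumption on my part, and the sample strings
are made up for illustration. It checks that bytewise UTF-8 order
matches code point order, and that a code point count gives neither
the visible character count nor the display width.

```python
import unicodedata

# Illustrative sample, including a character outside the BMP.
words = ["zebra", "\u00e9cole", "Z\u00fcrich", "\u65e5\u672c", "apple", "\U0001F600"]

# Python compares str values code point by code point.
by_code_point = sorted(words)
# Compare the encoded bytes instead.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
same_order = by_code_point == by_utf8_bytes

# One visible character, two spellings: precomposed vs. base + combining mark.
composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT
lengths = (len(composed), len(decomposed))
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Width is a separate property again: CJK ideographs are "wide".
wide = unicodedata.east_asian_width("\u65e5")
narrow = unicodedata.east_asian_width("a")

print(same_order, lengths, equal_raw, equal_nfc, wide, narrow)
```

The two spellings of the accented e compare unequal until normalized,
which is why a plain sort at any of the three levels is not yet a
useful sort.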

> Unicode Characters: UTF-8/16/32 conversions - encoding related

That and a lot of composition and display issues.

Regards,
John Levine, johnl at taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly