"Unprintable" 8-bit characters

Polytropon freebsd at edvax.de
Wed Nov 9 02:10:27 UTC 2011


On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> On 09.11.2011 at 01:42, Conrad J. Sabatier <conrads at cox.net> wrote:
> 
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> > with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls" command
> > to properly display filenames containing characters in this extended
> > set.  I have some MP3 files, for instance, whose names contain certain
> > European characters, such as the lowercase "u" with umlaut (code 0xfc
> > in the Latin set, according to gucharmap), that I just can't get ls to
> > display properly.  These characters seem to be considered by ls as
> > "unprintable", and the best I've been able to produce in the ls
> > output is backslash interpretations of the characters using either the
> > -B or -b options, otherwise the default "?" is displayed in their place.
> 
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.
> 
> Assuming you want \0xfc displayed as "ü",
> 
> > cat test.py && python test.py && ls -l
> 
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
> 
> f=open('\xfc','w')
> f.close()
> total 2
> 
> -rw-r--r--  1 michael  wheel  29  9 Nov 02:43 test.py
> -rw-r--r--  1 michael  wheel   0  9 Nov 02:44 ü
> 
> 
> here is what works for me:
> 
> in my login class in /etc/login.conf:
> 
>          :charset=ISO-8859-1:\
>          :lang=de_DE.ISO8859-1:\
> 
> ``cap_mkdb /etc/login.conf'' after changes

Ah, thanks - that seems to be the proper way to have
the environment variables set - instead of my (ab)use
of setenv's in the csh config file. :-)

Note the precedence of $LANG vs. $LC_*: a category-specific
$LC_* variable overrides $LANG, so the $LC_* variables can
be used to configure things more precisely, e.g. regarding
system messages or date formats; see example following.
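
To make the lookup order concrete, here is a minimal sketch
of how a POSIX system picks the locale for one category:
$LC_ALL first, then the category's own $LC_* variable, then
$LANG. The helper name effective_locale() and the fallback
to "C" are my own assumptions for illustration, not part of
any libc or Python API.

```python
# Sketch of the POSIX lookup order for a single locale category.
# effective_locale() is a hypothetical helper, not a real API.

def effective_locale(category, env):
    """Return the locale name that applies to `category` (e.g. "LC_TIME")."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are skipped
            return value
    return "C"  # assumed default; the real default is implementation-defined


env = {
    "LANG": "de_DE.ISO8859-1",
    "LC_MESSAGES": "en_US.ISO8859-1",
}
print(effective_locale("LC_MESSAGES", env))  # taken from LC_MESSAGES
print(effective_locale("LC_TIME", env))      # falls back to LANG

env["LC_ALL"] = "en_US.UTF-8"
print(effective_locale("LC_TIME", env))      # LC_ALL wins over everything
```

One consequence of this order: as soon as $LC_ALL is set to a
non-empty value, the individual $LC_* variables no longer
have any effect.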



> in /etc/rc.conf:
> 
> 	scrnmap="iso-8859-1_to_cp437"

Hm? CP437? Codepage? Isn't that some MS-DOS thing?
I've never needed a screenmap to make "extended
characters" (everything beyond US-ASCII) work.



> 	font8x8="cp850-8x8"
> 	font8x14="cp850-8x14"
> 	font8x16="cp850-8x16"
> 
> 
> and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work
with UTF-8 encoded characters. If I want to use them, I
have to change some environment variables, from

	#-------GERMAN/ENGLISH------------------------ <=== DEFAULT
	setenv	LC_ALL		en_US.ISO8859-1
	setenv	LC_MESSAGES	en_US.ISO8859-1
	setenv	LC_COLLATE	de_DE.ISO8859-1
	setenv	LC_CTYPE	de_DE.ISO8859-1
	setenv	LC_MONETARY	de_DE.ISO8859-1
	setenv	LC_NUMERIC	de_DE.ISO8859-1
	setenv	LC_TIME		de_DE.ISO8859-1
	unsetenv LANG

to

	#-------INTERNATIONAL-------------------------
	setenv	LC_ALL		en_US.UTF-8
	setenv	LC_MESSAGES	en_US.UTF-8
	setenv	LC_COLLATE	de_DE.UTF-8
	setenv	LC_CTYPE	de_DE.UTF-8
	setenv	LC_MONETARY	de_DE.UTF-8
	setenv	LC_NUMERIC	de_DE.UTF-8
	setenv	LC_TIME		de_DE.UTF-8
	setenv	LANG		de_DE.UTF-8

Then I can use UTF-8 characters inside rxvt-unicode. Of
course, text mode console is limited to the first set
of configuration, using the ISO 8859-1 character set.

This worked long before UTF-8 arrived with the glorious
idea that I should have 2 bytes where one is sufficient,
to describe our (German) 6 umlauts and the Eszett ligature. :-)
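
The size difference is easy to check. The following is a
small sketch, assuming Python 3; encoded_sizes() is a helper
name made up for this example. The characters are written as
escapes so the result does not depend on how this text itself
is encoded.

```python
# The six German umlauts plus the Eszett, as Unicode escapes:
# a-umlaut, o-umlaut, u-umlaut, A-umlaut, O-umlaut, U-umlaut, sharp s
GERMAN_SPECIALS = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df"


def encoded_sizes(text):
    """Return (character, ISO 8859-1 byte count, UTF-8 byte count) tuples."""
    return [
        (ch, len(ch.encode("iso-8859-1")), len(ch.encode("utf-8")))
        for ch in text
    ]


for ch, latin1_len, utf8_len in encoded_sizes(GERMAN_SPECIALS):
    # Print the code point rather than the character itself, so the
    # output is plain ASCII on any terminal.
    print("U+%04X  ISO 8859-1: %d  UTF-8: %d" % (ord(ch), latin1_len, utf8_len))
```

Each of the seven characters takes one byte in ISO 8859-1 and
two bytes in UTF-8.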

Improper settings will result in [][] or A-tilde three
quarters upside-down question mark, depending on editor
or terminal used.
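
Both failure modes can be reproduced by encoding a character
one way and decoding the bytes the other way. This is only a
sketch, assuming Python 3; misread() is a made-up helper, and
a real terminal or editor may substitute a different
placeholder than Python's replacement character.

```python
def misread(text, written_as, read_as):
    """Encode text with one charset and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


u_umlaut = "\u00fc"  # lowercase u with umlaut

# UTF-8 bytes (C3 BC) shown by an ISO 8859-1 terminal: two characters,
# A-tilde (U+00C3) followed by the one-quarter sign (U+00BC).
garbled = misread(u_umlaut, "utf-8", "iso-8859-1")
print([hex(ord(c)) for c in garbled])

# ISO 8859-1 byte (FC) shown by a UTF-8 terminal: not a valid UTF-8
# sequence, so Python substitutes the replacement character U+FFFD.
broken = misread(u_umlaut, "iso-8859-1", "utf-8")
print([hex(ord(c)) for c in broken])
```

The first character of the garbled pair is always A-tilde for
this range; the second one varies with the original character.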


But returning to the original question, I think Robert
did explain it very well: There is no real consensus
about what the different encodings should mean. They
were meant to unify the representation of a very large
set of characters, but basically there are many
interpretations now, and how they show up to the user
depends on the font in use, _if_ it has this mapping or
that, or none.

For running ls, -w (raw printing of non-printable
characters) is the right option to use - but IN
COMBINATION with correct settings for the terminal
emulation AND the presence of a font that will do.

Again a fine demonstration why file names should be
limited to printable ASCII and no spaces if you want
them to work everywhere. :-)
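
If you want to find the offenders in a directory, something
like the following sketch would do, assuming Python 3. Both
function names are invented for this example, and "portable"
here just means my rule above: printable ASCII, no spaces.

```python
import os


def is_portable_name(raw_name):
    """True if every byte is printable ASCII other than space (0x21-0x7E)."""
    return all(0x21 <= byte <= 0x7E for byte in raw_name)


def non_portable_names(directory=b"."):
    """Return the raw names in `directory` that break the rule, sorted."""
    # A bytes argument makes os.listdir() return undecoded byte names,
    # so the check does not depend on the current locale.
    return sorted(n for n in os.listdir(directory) if not is_portable_name(n))


for name in non_portable_names():
    print(repr(name))  # repr() shows offending bytes as \x.. escapes
```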



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...

