any shortcuts to doc to ascii?

Nikos Vassiliadis nvass9573 at gmx.com
Fri May 28 20:46:06 UTC 2010


Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline <kline at thought.org> wrote:
>> 	i don't see any ascii suffix [for OOo].  i saved as .txt.
> 
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
> 
> 
> 
>> 	same krap.  the \x94, x9d, \x9c...  same with catdoc.  i'll
>> 	try antiword.  [forgot about that.  ]
> 
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales such as all other systems
> do, but uses an arbitrary setting.
> 

It is valid UTF-8 encoded text:
[nik at moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text
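
To see which character those three bytes actually are, a small Python 3 sketch can help (the one-liner above is Python 2 syntax). The function name describe_bytes is only an illustration, not part of any tool mentioned in this thread:

```python
import unicodedata


def describe_bytes(raw):
    """Decode raw as UTF-8 and list (code point, name) for each non-ASCII character."""
    text = raw.decode("utf-8")
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# The same bytes as in the python -c one-liner: 0xe2 0x80 0x99 between "Don" and "t".
print(describe_bytes(b"Don\xe2\x80\x99t"))
# prints [('U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```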

You'll be able to see the character if you fire up a UTF-8 capable 
terminal with proper locale settings.
[nik at moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8

After that, just print the char:
python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'
and use copy & paste to pass it to tr to translate it to something else, 
for example:
tr '’' "'" < "$file" > "$output"
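
If copy & paste into tr is awkward, the same translation can be sketched in Python 3. This is only a rough sketch: PUNCT_MAP and to_ascii_punct are made-up names, and the mapping assumes that the \x94, \x9d and \x9c bytes reported earlier are the tails of the UTF-8 sequences for an em dash and curly double quotes, which I have not checked against the actual file.

```python
# Hypothetical mapping from "typographic" punctuation to plain ASCII.
# Keys are Unicode code points; values are the ASCII replacements.
PUNCT_MAP = {
    0x2018: "'",   # left single quotation mark   (UTF-8: e2 80 98)
    0x2019: "'",   # right single quotation mark  (UTF-8: e2 80 99)
    0x201C: '"',   # left double quotation mark   (UTF-8: e2 80 9c)
    0x201D: '"',   # right double quotation mark  (UTF-8: e2 80 9d)
    0x2014: "--",  # em dash                      (UTF-8: e2 80 94)
}


def to_ascii_punct(text):
    """Return text with the characters listed in PUNCT_MAP replaced by ASCII."""
    return text.translate(PUNCT_MAP)


raw = b"\xe2\x80\x9cDon\xe2\x80\x99t\xe2\x80\x9d"
print(to_ascii_punct(raw.decode("utf-8")))
# prints "Don't"
```

Any character not listed in the map passes through unchanged, so the output is pure ASCII only if the map covers everything in the input.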

> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often do see this in german
> texts with misplaces apostrophes that are in fact accent grave
> or accent acute, or a character from UTF-8 that just looks like
> an apostrophe. For example, if the original document contains
> 
> 	We don`t
> 
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.

Indeed, the standard tool for encoding translations, iconv, chokes on 
this. Yet, it worked when I tried to convert from UTF-8 to the Greek 
encoding ('iconv -f utf-8 -t iso-8859-7'). Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm
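
A likely reason for the difference is that a conversion can only succeed when the target encoding has a slot for the character. The Python 3 sketch below checks this with Python's own codecs rather than iconv, so take it as an illustration of the idea and not as a description of iconv's internals; can_encode is a hypothetical helper name.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


word = "Don\u2019t"  # U+2019 RIGHT SINGLE QUOTATION MARK

print(can_encode(word, "ascii"))      # False: ASCII has no such character
print(can_encode(word, "iso8859_7"))  # True: the Greek set includes U+2019
print(word.encode("iso8859_7"))       # b'Don\xa2t'
```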

HTH, Nikos


More information about the freebsd-questions mailing list