any shortcuts to doc to ascii?
Nikos Vassiliadis
nvass9573 at gmx.com
Fri May 28 20:46:06 UTC 2010
Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline <kline at thought.org> wrote:
>> i don't see any ascii suffix [for OOo]. i saved as .txt.
>
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
>
>> same krap. the \x94, \x9d, \x9c... same with catdoc. i'll
>> try antiword. [forgot about that. ]
>
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales such as all other systems
> do, but uses an arbitrary setting.
>
It is valid UTF-8 encoded text:
[nik at moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text
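The one-liner above relies on the Python 2 print statement. As a rough
Python 3 sketch of the same check (the helper name describe_non_ascii
is made up for illustration), the bytes can be decoded and the
resulting character looked up by name:

```python
import unicodedata

# The same three bytes as in the command above, wrapped in "Don" and "t".
raw = b"Don\xe2\x80\x99t"

# decode() raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
text = raw.decode("utf-8")


def describe_non_ascii(s):
    """Return (code point, Unicode name) for every non-ASCII character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in s
        if ord(ch) > 127
    ]


print(describe_non_ascii(text))
```

If the decode succeeds, the three bytes form a single character rather
than three stray ones.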
You'll be able to see the character if you fire up a UTF-8 capable
terminal with proper locale settings.
[nik at moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8
After that, just print the char:
python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'
and use copy & paste to pass it to tr to translate it to something else,
for example:
tr '’' "'" < $file > $output
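Not every tr handles multibyte characters, so the same translation
could also be sketched in Python 3. The mapping below is only an
assumption about which characters are in the file. The \x9c, \x9d and
\x94 mentioned earlier would match the last bytes of the UTF-8 forms of
U+201C, U+201D and U+2014, but that is a guess. The names
PUNCT_TO_ASCII, to_ascii_punct and convert_file are made up for this
example.

```python
# Hypothetical mapping; extend it with whatever shows up in your files.
PUNCT_TO_ASCII = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2014: "--",  # em dash
}


def to_ascii_punct(text):
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(PUNCT_TO_ASCII)


def convert_file(src, dst):
    """Read src as UTF-8 and write dst as ASCII.

    Characters that are not in the mapping and not ASCII are written
    as backslash escapes instead of aborting the conversion.
    """
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="ascii", errors="backslashreplace") as fout:
        for line in fin:
            fout.write(to_ascii_punct(line))


print(to_ascii_punct("Don\u2019t \u201cquote\u201d me"))
```

Calling convert_file("in.txt", "out.txt") would then do roughly what
the tr command does, for all five characters in one pass.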
> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often do see this in german
> texts with misplaces apostrophes that are in fact accent grave
> or accent acute, or a character from UTF-8 that just looks like
> an apostrophe. For example, if the original document contains
>
> We don`t
>
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.
Indeed, the standard tool for encoding translations, iconv, chokes on
this. Yet, it worked when I tried to convert from UTF-8 to a Greek
encoding ('iconv -f utf-8 -t iso-8859-7'). Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm
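A likely reason for the difference is that a conversion only succeeds
when the target character set has a slot for U+2019. ASCII does not,
while ISO-8859-7 appears to, at least going by Python's codec tables,
which may not match every iconv implementation. A small Python 3
sketch (can_encode is a made-up helper) to check a few targets:

```python
# The character discussed above: U+2019.
ch = "\u2019"


def can_encode(s, codec):
    """Return True if every character of s exists in the target codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for codec in ("ascii", "iso-8859-1", "iso-8859-7"):
    print(codec, can_encode(ch, codec))
```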
HTH, Nikos