Removing BOM from UTF-8

Gerard Seibert gerard at seibercom.net
Sat Feb 18 08:28:19 PST 2006


I have a large number of text files created in MS Word and saved in
UTF-8 format. Unfortunately, MS Word adds a BOM (byte order mark) to
the start of each file. I need to remove it.

Information regarding BOM and UTF-8 can be found here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.w3.org/International/questions/qa-utf8-bom

A brief excerpt:

It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
as a signature to mark the beginning of a UTF-8 file. This practice
should definitely not be used on POSIX systems for several reasons:

    * On POSIX systems, the locale and not magic file type codes define
      the encoding of plain text files. Mixing the two concepts would add
      a lot of complexity and break existing functionality.

    * Adding a UTF-8 signature at the start of a file would interfere
      with many established conventions such as the kernel looking for
      “#!” at the beginning of a plaintext executable to locate the
      appropriate interpreter.

    * Handling BOMs properly would add undesirable complexity even to
      simple programs like cat or grep that mix contents of several files
      into one.

It has been suggested that a script could be written to strip the BOM
from one or more files. My script-writing skills suck. I have been
unable to locate one using Google, so I was hoping that someone might
know where I could find such a program, or could give me an idea of how
to script one.
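
For illustration, here is a minimal sketch of what such a script might
look like. Python is an assumption (no language is named above), and
the function names strip_bom and strip_bom_in_file are hypothetical.
It checks whether a file starts with the three bytes 0xEF 0xBB 0xBF
quoted in the excerpt and, if so, rewrites the file without them. It
reads each file fully into memory and overwrites it in place, so it
assumes the files are reasonably small and that backups exist.

```python
import codecs
import sys
from pathlib import Path


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: Path) -> bool:
    """Rewrite the file in place without its BOM; return True if it was changed."""
    data = path.read_bytes()
    cleaned = strip_bom(data)
    if cleaned == data:
        # No BOM at the start, so leave the file untouched.
        return False
    path.write_bytes(cleaned)
    return True


if __name__ == "__main__":
    # Every command-line argument is treated as a file to clean.
    for name in sys.argv[1:]:
        changed = strip_bom_in_file(Path(name))
        print(f"{name}: {'BOM removed' if changed else 'no BOM found'}")
```

Saved under a name such as strip_bom.py (hypothetical), it could be run
as "python strip_bom.py *.txt". Files without a BOM are not rewritten,
so running it twice on the same files should be harmless.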

Thanks!

-- 
Gerard Seibert
gerard at seibercom.net


     I'm interested in the fact that the less secure a man is, the more
     likely he is to have extreme prejudice.

          Clint Eastwood


More information about the freebsd-questions mailing list