How can you tell if a file is UTF-8 encoded or not?
Say you want to know if a particular file is encoded using UTF-8 [1]. On a UNIX box, you could just use the file command:
$ file *.txt
Housman - Loveliest of trees.txt: ASCII English text
Millay - First fig.txt: UTF-8 Unicode English text
Yeats - When You Are Old.txt: ASCII English text
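Now, that’s not what I expected. I created the Housman & the Yeats files using vim, & vim is set to use UTF-8 [2]. Strictly speaking, file isn’t lying: ASCII is a subset of UTF-8, so a file containing nothing but ASCII characters is also perfectly valid UTF-8, & file just reports the narrower label. Still, that doesn’t answer the question I actually asked.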
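If you’re curious whether a file holds anything beyond plain ASCII, counting the bytes above 127 settles it: zero such bytes means the file is pure ASCII, which also makes it valid UTF-8. Here’s a minimal sketch, assuming a POSIX shell with tr & wc available (the function name count_non_ascii_bytes is made up for illustration, not a standard tool):

```shell
# Hypothetical helper: print how many bytes in a file fall outside 7-bit ASCII.
# A result of 0 means the file is plain ASCII, and therefore also valid UTF-8.
count_non_ascii_bytes() {
    # LC_ALL=C makes tr operate on raw bytes rather than locale characters.
    # Delete every byte in the range 0-127, then count whatever is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Running count_non_ascii_bytes on a file that file labels as ASCII should print 0. A file with accented letters or curly quotes in it should print something larger.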
In poking around to try to figure out a better method to find out if a file is UTF-8 or not, I discovered just the command I needed: isutf8. Yes, the name of the command is “is UTF8” all crammed together & lowercased, which certainly makes it easy to remember. It’s part of the moreutils package that you can download & install. Here’s how I did it.
On my Linux box running Debian:
# apt-get install moreutils
…
Need to get 53.3 kB of archives.
After this operation, 188 kB of additional disk space will be used.
Get:1 http://http.us.debian.org/debian/ squeeze/main moreutils amd64 0.41 [53.3 kB]
Fetched 53.3 kB in 0s (163 kB/s)
…
On my Mac, using Homebrew [3]:
$ brew install moreutils
==> Downloading http://mirrors.kernel.org/debian/pool/main/m/moreutils/moreutils_0.45.tar.gz
######################################################################## 100.0%
==> make isutf8 ifne pee sponge mispipe lckdo parallel
/usr/local/Cellar/moreutils/0.45: 15 files, 148K, built in 3 seconds
Now that isutf8 was installed, I tried again to see if those text files were UTF-8:
$ isutf8 *.txt
$
That’s right—nothing. As it should be. In typical UNIX fashion, no news is good news, & means that the command did NOT find any files that were NOT UTF-8. Or, to put it another way, all three text files were in fact UTF-8, so the command did nothing.
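Silence is fine at the prompt, but in a script you’d lean on the exit status instead. The isutf8 man page describes an exit status of 0 when the input is valid UTF-8 & non-zero otherwise; treat that as an assumption & check it against your installed version. Here’s a rough sketch, where the function names is_valid_utf8 & report_utf8 are made up for illustration, & the iconv fallback is an addition for machines without moreutils rather than anything isutf8 itself does:

```shell
# Hypothetical helper: exit 0 if the file is valid UTF-8, non-zero otherwise.
is_valid_utf8() {
    if command -v isutf8 >/dev/null 2>&1; then
        # Assumes isutf8 signals invalid input through its exit status.
        isutf8 "$1" >/dev/null 2>&1
    else
        # Fallback, not part of moreutils: iconv fails on invalid UTF-8 input.
        iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
    fi
}

# Hypothetical helper: print one verdict line per file given as an argument.
report_utf8() {
    for f in "$@"; do
        if is_valid_utf8 "$f"; then
            printf '%s: valid UTF-8\n' "$f"
        else
            printf '%s: NOT valid UTF-8\n' "$f"
        fi
    done
}
```

You’d call it as report_utf8 *.txt. Unlike bare isutf8, it says something about every file, which is handy when you want a positive confirmation rather than an empty prompt.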
Let’s see what happens with some other files:
$ isutf8 *
Messenger Bags.numbers: line 1, char 1, byte offset 12: invalid UTF-8 code
Student Paper.doc: line 1, char 1, byte offset 1: invalid UTF-8 code
Tix.jpg: line 1, char 1, byte offset 1: invalid UTF-8 code
Yep. Those were definitely not UTF-8 encoded.
I don’t think I’ll be using isutf8 constantly, but it’s sure a handy little tool to have around. [4]
[1] If you don’t know what UTF-8 is, read the Wikipedia article. Here’s the upshot: you want all your text editors & operating systems & web browsers to support & use UTF-8 by default. It makes life a lot easier.

[2] By putting set enc=utf-8 in my .vimrc file, of course.

[3] What? You’re not using Homebrew? Head over to https://github.com/mxcl/homebrew & get that sucker installed! It’s far better than fink or MacPorts. More on Homebrew some other time.

[4] Eagle-eyed readers might have noticed a list of software packages that were installed along with isutf8 when I gave the Homebrew listing. Looking over the list at the moreutils site, I think I’m going to have a lot to play with & write about over the coming months:
- chronic: runs a command quietly unless it fails
- combine: combine the lines in two files using boolean operations
- ifdata: get network interface info without parsing ifconfig output
- ifne: run a program if the standard input is not empty
- isutf8: check if a file or standard input is utf-8
- lckdo: execute a program with a lock held
- mispipe: pipe two commands, returning the exit status of the first
- parallel: run multiple jobs at once
- pee: tee standard input to pipes
- sponge: soak up standard input and write to a file
- ts: timestamp standard input
- vidir: edit a directory in your text editor
- vipe: insert a text editor into a pipe
- zrun: automatically uncompress arguments to command