How can you tell if a file is UTF-8 encoded or not?
Say you want to know if a particular file is encoded using UTF-8[1]. On a UNIX box, you could just use the file command:
Now, I know that’s not right. I created the Housman & the Yeats files using vim, & vim is set to use UTF-8[2], so something is funny somewhere.
In poking around to try to figure out a better method to find out if a file is UTF-8 or not, I discovered just the command I needed: isutf8. Yes, the name of the command is “is UTF8” all crammed together & lowercased, which certainly makes it easy to remember. It’s part of the moreutils package that you can download & install. Here’s how I did it.
On my Linux box running Debian:
On my Mac, using Homebrew[3]:
Now that isutf8 was installed, I tried again to see if those text files were UTF-8:
That’s right—nothing. As it should be. In typical UNIX fashion, no news is good news, & means that the command did NOT find any files that were NOT UTF-8. Or, to put it another way, all three text files were in fact UTF-8, so the command did nothing.
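If you’re curious what a check like this boils down to, here’s a minimal sketch in Python. It is not how isutf8 itself is implemented, & the function names (first_invalid_offset, report) are hypothetical ones chosen for illustration. It simply assumes that “is UTF-8” means “the bytes decode cleanly with a strict UTF-8 decoder,” & it follows the same no-news-is-good-news convention: silence when every file passes.

```python
from typing import Iterable, Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence,
    or None if the data decodes cleanly.

    Hypothetical helper for illustration; not part of moreutils.
    """
    try:
        data.decode("utf-8")  # strict mode: raises on any malformed sequence
    except UnicodeDecodeError as err:
        return err.start  # index of the first offending byte
    return None


def report(paths: Iterable[str]) -> int:
    """Print one line per file that is NOT valid UTF-8; print nothing otherwise.

    Returns 0 if every file passed & 1 if at least one did not
    (an assumed convention for this sketch).
    """
    status = 0
    for path in paths:
        with open(path, "rb") as handle:  # read raw bytes, never text mode
            offset = first_invalid_offset(handle.read())
        if offset is not None:
            print(f"{path}: invalid UTF-8 at byte {offset}")
            status = 1
    return status
```

Calling report() with a list of file names prints nothing for files that are valid UTF-8. Plain ASCII files pass too, because every ASCII file is also valid UTF-8.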
Let’s see what happens with some other files:
Yep. Those were definitely not UTF-8 encoded.
I don’t think I’ll be using isutf8 constantly, but it’s sure a handy little tool to have around.[4]
1. If you don’t know what UTF-8 is, read the Wikipedia article. Here’s the upshot: you want all your text editors & operating systems & web browsers to support & use UTF-8 by default. It makes life a lot easier.
2. By putting set enc=utf-8 in my .vimrc file, of course.
3. What? You’re not using Homebrew? Head over to https://github.com/mxcl/homebrew & get that sucker installed! It’s far better than fink or MacPorts. More on Homebrew some other time.
4. Eagle-eyed readers might have noticed a list of software packages that were installed along with isutf8 when I gave the Homebrew listing. Looking over the list at the moreutils site, I think I’m going to have a lot to play with & write about over the coming months:
- chronic: runs a command quietly unless it fails
- combine: combine the lines in two files using boolean operations
- ifdata: get network interface info without parsing ifconfig output
- ifne: run a program if the standard input is not empty
- isutf8: check if a file or standard input is utf-8
- lckdo: execute a program with a lock held
- mispipe: pipe two commands, returning the exit status of the first
- parallel: run multiple jobs at once
- pee: tee standard input to pipes
- sponge: soak up standard input and write to a file
- ts: timestamp standard input
- vidir: edit a directory in your text editor
- vipe: insert a text editor into a pipe
- zrun: automatically uncompress arguments to command