PHP and the UTF-8 Byte Order Mark
May 18th, 2009
Another day, another curious problem to be solved…
Today I found that a recently developed project wasn’t rendering too well in (every web developer’s favourite) Internet Explorer. There were a few tell-tale signs that pointed to IE not understanding the doctype declaration at the beginning of the file. The first few lines of the generated file were:
<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Well, there’s nothing obviously wrong with that. As is usually the case, Firefox rendered everything correctly. The HTML output was fully XHTML strict compliant, and the CSS validated too. To add to my confusion, the exact same code hosted on my development server was rendering correctly in all browsers. This gave me a chance to compare the output of the two installations. Here’s what I found:
For my development server:
wget -O- http://dev.server/index.php | hexdump -C | head -n1
gave the following output:
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version="1|
For the live server:
wget -O- http://live.server/index.php | hexdump -C | head -n1
resulted in:
00000000 ef bb bf ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 |......<?xml vers|
Aha! The dreaded double byte order mark! UTF-8 files don't need a byte order mark (the bytes ef bb bf), but somehow at least two had slipped into my source files on the live server, and were causing IE to get all confused. PHP sends anything that precedes the opening <?php tag straight to the output, so every file saved with a BOM adds its three bytes to the page. But which files? You can't just view the files, because a byte order mark is normally not displayed in a text editor. I managed to track down the offending source files using the following script:
find . -type f -name '*.php' | while IFS= read -r i; do hexdump -C "$i" | head -n1 | grep -i '^00000000 *ef bb bf' && echo "$i"; done
This nice bit of bash script finds all files with a “.php” extension, does a hex dump of each and searches for the UTF-8 BOM. If the BOM is found, the first line of the hex dump is displayed and the filename printed below.
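Once the offending files are known, the BOM still has to come out. The following is only a rough sketch of one way to do that, not the method I used on the day: has_bom and strip_boms are hypothetical helper names made up for this example, and it assumes the usual head, tail, od, tr and mktemp utilities are available. Back up the files before trying anything like it.

```shell
# Hypothetical helpers for illustration; the names are not from any tool.

# has_bom FILE: succeeds if FILE starts with the UTF-8 BOM (ef bb bf).
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_boms FILE: rewrites FILE in place without its leading BOM.
# It loops in case a file somehow carries more than one BOM in a row.
strip_boms() {
    while has_bom "$1"; do
        _tmp=$(mktemp) || return 1
        # tail -c +4 copies everything from the 4th byte onwards;
        # writing back with cat keeps the original file's permissions.
        tail -c +4 "$1" > "$_tmp" && cat "$_tmp" > "$1"
        rm -f "$_tmp"
    done
}
```

With those two functions loaded into the shell, running strip_boms on each filename printed by the search above should leave the files starting directly with the opening PHP tag. Files without a BOM are left untouched, since the loop body never runs for them.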