PHP: UTF-16 to UTF-8

16th March 2006

Recently I’ve been doing some work on a PHP script that has to process a bunch of XML files (in this case they’re imsmanifest files). However, a few of them weren’t being parsed successfully.

The problem was soon quite clear: some of the files had been encoded using UTF-16, which wasn’t playing nicely with PHP. To solve this I’ve written a function that attempts to detect if a string is encoded using UTF-16 (little endian or big endian) and then converts it to the slightly more PHP-friendly UTF-8. All the complicated stuff is copied from these JavaScript functions for converting between UTF-8 and UTF-16.

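function utf16_to_utf8($str) {
    // A BOM takes two bytes, so anything shorter cannot be detected as UTF-16
    if (strlen($str) < 2) {
        return $str;
    }
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);

    if ($c0 == 0xFE && $c1 == 0xFF) {
        $be = true;   // FE FF: big endian
    } else if ($c0 == 0xFF && $c1 == 0xFE) {
        $be = false;  // FF FE: little endian
    } else {
        return $str;
    }

    $str = substr($str, 2);
    $len = strlen($str) & ~1;  // whole 16-bit units only; a stray last byte is dropped
    $dec = '';
    for ($i = 0; $i < $len; $i += 2) {
        $c = ($be) ? ord($str[$i]) << 8 | ord($str[$i + 1]) :
                ord($str[$i + 1]) << 8 | ord($str[$i]);
        if ($c >= 0xD800 && $c <= 0xDFFF) {
            // Surrogates: a high/low pair encodes one code point above U+FFFF
            $d = 0;
            if ($c <= 0xDBFF && $i + 3 < $len) {
                $d = ($be) ? ord($str[$i + 2]) << 8 | ord($str[$i + 3]) :
                        ord($str[$i + 3]) << 8 | ord($str[$i + 2]);
            }
            if ($d >= 0xDC00 && $d <= 0xDFFF) {
                $c = 0x10000 + (($c - 0xD800) << 10) + ($d - 0xDC00);
                $i += 2;
            } else {
                $c = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
            }
        }
        if ($c <= 0x007F) {
            $dec .= chr($c);
        } else if ($c <= 0x07FF) {
            $dec .= chr(0xC0 | (($c >>  6) & 0x1F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        } else if ($c <= 0xFFFF) {
            $dec .= chr(0xE0 | (($c >> 12) & 0x0F));
            $dec .= chr(0x80 | (($c >>  6) & 0x3F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        } else {
            $dec .= chr(0xF0 | (($c >> 18) & 0x07));
            $dec .= chr(0x80 | (($c >> 12) & 0x3F));
            $dec .= chr(0x80 | (($c >>  6) & 0x3F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        }
    }
    return $dec;
}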
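
To see the bit shuffling at work on a single character, here is a small sketch that packs one UTF-16 code unit, the euro sign U+20AC, into UTF-8 by hand. The helper name encode_bmp_unit is made up for this illustration, and it only covers code points from U+0800 to U+FFFF, which is the three-byte branch of the function above.

```php
<?php
// Hypothetical helper for illustration only: encode one code unit in the
// range U+0800..U+FFFF as three UTF-8 bytes.
function encode_bmp_unit($c) {
    return chr(0xE0 | (($c >> 12) & 0x0F))   // lead byte carries the top 4 bits
         . chr(0x80 | (($c >>  6) & 0x3F))   // continuation byte, middle 6 bits
         . chr(0x80 | ($c & 0x3F));          // continuation byte, low 6 bits
}

// The euro sign is stored as the bytes 0x20 0xAC in big-endian UTF-16.
$unit = (0x20 << 8) | 0xAC;
echo bin2hex(encode_bmp_unit($unit)), "\n";  // prints e282ac
```

The sixteen bits of the code unit are split 4 + 6 + 6 and each group is tucked behind a fixed marker prefix, which is all the shifting and masking in the loop is doing.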

Note that this only does something if the string starts with a BOM (byte order mark); otherwise it is assumed that the string isn’t UTF-16 and it is returned unmodified.
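
As a cross-check, the same BOM-based behaviour can probably be had from PHP’s mbstring extension, assuming it is installed. This is only a sketch: the function name utf16_bom_to_utf8_mb is invented here, and mb_convert_encoding does the actual conversion.

```php
<?php
// Hypothetical alternative, assuming the mbstring extension is loaded.
function utf16_bom_to_utf8_mb($str) {
    $bom = substr($str, 0, 2);
    if ($bom === "\xFE\xFF") {
        $from = 'UTF-16BE';
    } else if ($bom === "\xFF\xFE") {
        $from = 'UTF-16LE';
    } else {
        return $str;  // no BOM: returned unmodified
    }
    // Strip the BOM, then let mbstring handle the conversion.
    return mb_convert_encoding(substr($str, 2), 'UTF-8', $from);
}

// "hi" in little-endian UTF-16, preceded by the FF FE BOM.
echo utf16_bom_to_utf8_mb("\xFF\xFEh\x00i\x00"), "\n";  // prints hi
```

Running both versions over the same files and comparing the output is a cheap way to spot any mistakes in the hand-written one.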

Hopefully someone will find this useful. If anyone can see any problems with it, please point them out; at the moment it seems to be working for me.

Posted on 16th March 2006 in PHP, Unicode, XML.

Comments

  1. Thanks for this. Worked flawlessly.
    I am using this to make a PHP processor/editor/organizer for text files created with my Nokia 6630 phone, and they are saved as UTF-16.
    Wonderful :)

    # Posted by Ricardo on 16th January 2007.
