Filesystem encoding and PHP

Many PHP applications save files to a local filesystem. Most of the time you'll likely only ever store files with US-ASCII names, either because your filenames are simply based on database fields (as you should try in most cases), or simply because most of your users never need non-English characters.

When you do though, it's important to know how operating systems cope with these characters. Unsurprisingly, all of them do this differently.

To illustrate the differences, I'm going to do some tests on Ubuntu, OS/X 10.6.3 and Windows XP and 7.

Linux

In Linux filenames are binary. Linux does not care what encoding your filenames are in, and it will accept any byte besides 0x00 and the slash (/), which is the path separator. This means filenames can contain newlines (\n), tabs (\t) or even a bell (ASCII code 07).

To illustrate this, I'm going to make a tiny file using a PHP script:

    <?php
    file_put_contents("saved by the \x07.txt","contents");
    ?>

After running this I simply get a question mark when viewing the file using 'ls', but when I auto-complete it, it expands to ^G (which is bell). In Nautilus, this is displayed:

[Screenshot: fsencoding_gnome.png]

If I run this script:

    <?php
    print_r(glob('saved*'));
    ?>

The output is simply missing my bell character, and I get a short beep.

This doesn't mean it's a good idea to do this. Even though the underlying filesystem is binary-safe, applications that list filenames still have to decide on an encoding to display the characters to the user. You can't even show this character in any PHP page, and firewalls might even block it if you used it in a URL.
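
If you do need to print such a name, one option is to escape control characters first. This is only a minimal sketch: the helper name printableFilename is made up for this example, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: make control characters in a filename visible
// before printing it, so a bell (0x07) shows up as \x07 instead of beeping.
function printableFilename(string $name): string
{
    return preg_replace_callback(
        '/[\x00-\x1F\x7F]/',                      // ASCII control characters and DEL
        function (array $m): string {
            return sprintf('\\x%02X', ord($m[0])); // e.g. bell becomes the text \x07
        },
        $name
    );
}

echo printableFilename("saved by the \x07.txt"), "\n";
// prints: saved by the \x07.txt
```

This only changes what is displayed; the name on disk stays untouched.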

This also applies to the applications on your Linux machine. Most of them, such as Gnome Terminal and Nautilus, default to UTF-8. However, I believe PuTTY defaulted to ISO-8859-1 (latin1) for the longest time. A symptom of this is that non-ASCII characters look different when you read them in PuTTY vs. Nautilus.

The other thing I wanted to test on Linux is how it behaves if I create a file in the file manager using a special character. For this example I'm using ü, because it's a bit ambiguous: there are multiple ways to encode it using Unicode (more on this later) and it also appears in ISO-8859-1.

Back to the test. I'm now creating a new file from the Nautilus interface, and want to see how it shows up for PHP. I'm creating a file called test_ü.txt and listing it with the following script:

    <?php
    list($file) = glob('test_*');
    echo urlencode($file) . "\n";
    ?>

Output:

    test_%C3%BC.txt

%C3%BC is the UTF-8 encoding of codepoint U+00FC, which is the most common way to encode ü. Great!
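
To see where those two bytes come from, here is a small sketch that encodes a codepoint as UTF-8 by hand. The function name utf8FromCodepoint is my own invention for this illustration; in real code you would normally let a library do this.

```php
<?php
// Illustrative only: encode a single Unicode codepoint as UTF-8 by hand.
function utf8FromCodepoint(int $cp): string
{
    if ($cp < 0x80) {            // 1 byte: plain ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo urlencode(utf8FromCodepoint(0x00FC)), "\n"; // %C3%BC (ü in UTF-8)
echo urlencode(chr(0xFC)), "\n";                 // %FC (ü in latin1)
```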

The last test is to create this file using ISO-8859-1/latin1 encoding. The latin1 representation of ü is 0xFC. The script for this:

    <?php
    file_put_contents("uumlaut_\xFC.txt","contents");
    ?>

Linux stores the file with that exact byte sequence. 'ls' shows the question mark again, and this time in Gnome I'm getting the typical 'incorrect encoding' question mark.
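
Since Linux will store whatever bytes you hand it, it can be worth checking a name before writing it. A minimal sketch, assuming PHP 7 or later (isValidUtf8 is a made-up helper name): it relies on the fact that preg_match() with the /u modifier fails on subjects that are not valid UTF-8.

```php
<?php
// Hypothetical helper: true if $name is a well-formed UTF-8 string.
function isValidUtf8(string $name): bool
{
    // With /u, PCRE validates the subject; preg_match() returns false
    // (not 0) when the subject contains invalid UTF-8.
    return preg_match('//u', $name) === 1;
}

var_dump(isValidUtf8("test_\xC3\xBC.txt")); // bool(true)
var_dump(isValidUtf8("uumlaut_\xFC.txt"));  // bool(false)
```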

OS/X

On OS/X all filenames are encoded as UTF-16. You don't have to know about this, because the APIs PHP uses are UTF-8, and names are transparently translated for you.

We'll start with the bell test. The result is the same as on Linux. The bell character is represented by ?. When checking it out in Finder, the character is missing altogether. It's definitely still there though, as the following script illustrates:

    <?php
    list($filename) = glob('saved*');
    echo urlencode($filename) . "\n";
    ?>

Output:

    saved+by+the+%07.txt

Next, we're going to do the ü test. First, I'll encode it as latin-1, which would be invalid for this UTF-8 filesystem.

    <?php
    file_put_contents("uumlaut_\xFC.txt","contents");
    ?>

This one is weird. If I now do 'ls', the result is this:

    drwxr-xr-x 10 evert2 staff 340 16 Apr 17:08 .
    drwxr-xr-x 32 evert2 staff 1088 16 Apr 16:53 ..
    -rw-r--r-- 1 evert2 staff 8 16 Apr 16:54 saved by the ?.txt
    -rw-r--r-- 1 evert2 staff 121 16 Apr 16:54 test1.php
    -rw-r--r-- 1 evert2 staff 8 16 Apr 16:54 test2.php
    -rw-r--r-- 1 evert2 staff 101 16 Apr 17:07 test3.php
    -rw-r--r-- 1 evert2 staff 57 16 Apr 17:08 test4.php
    -rw-r--r-- 1 evert2 staff 8 16 Apr 17:08 uumlaut_%FC.txt

Instead of taking the literal bytes, OS/X url-encoded them and stored those sequences instead. This translation is transparent, but it might be confusing if you ever try to store latin1 filenames from your users.
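
To get a feel for what ends up on disk, the following sketch mimics the result I observed: bytes that are not part of a well-formed UTF-8 sequence are replaced by %XX. This is purely an illustration of the observed outcome, not OS/X's actual algorithm, and escapeInvalidUtf8Bytes is a made-up name.

```php
<?php
// Illustration only: percent-escape every byte that is not part of a
// well-formed UTF-8 sequence. Mimics the observed result, nothing more.
function escapeInvalidUtf8Bytes(string $name): string
{
    // Group 1: a run of well-formed UTF-8 sequences.
    // Group 2: any other single byte.
    $pattern = '/((?:[\x00-\x7F]'
             . '|[\xC2-\xDF][\x80-\xBF]'
             . '|\xE0[\xA0-\xBF][\x80-\xBF]'
             . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
             . '|\xED[\x80-\x9F][\x80-\xBF]'
             . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
             . '|[\xF1-\xF3][\x80-\xBF]{3}'
             . '|\xF4[\x80-\x8F][\x80-\xBF]{2})+)|(.)/s';

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[2]) && $m[2] !== '') {
            return sprintf('%%%02X', ord($m[2])); // stray byte becomes %XX
        }
        return $m[0];                             // valid UTF-8 is kept as-is
    }, $name);
}

echo escapeInvalidUtf8Bytes("uumlaut_\xFC.txt"), "\n"; // uumlaut_%FC.txt
```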

The last test is to store the umlaut again, but this time using the correct UTF-8 sequence:

    <?php
    file_put_contents("uumlaut2_\xC3\xBC.txt","contents");
    ?>

Upon first sight this seems to have worked as expected, but it gets weird when we check out how this was actually stored:

    <?php
    list($file) = glob('uumlaut2_*');
    echo urlencode($file) . "\n";
    ?>

Output:

    uumlaut2_u%CC%88.txt

OS/X stored u followed by the bytes 0xCC 0x88 instead of 0xC3 0xBC. Note that the u is not a typo. OS/X uses a different way to store the ü. The encoding we used is the single Unicode codepoint U+00FC, which is ü. OS/X instead stores the u and the two little dots (U+0308, the combining diaeresis, which is 0xCC 0x88 in UTF-8) as separate characters, taking up 3 bytes instead of 2.

Converting between these representations is called normalization. Unicode defines a few different normalization forms which dictate how these combinations of characters are stored. So even though they are different byte sequences and different codepoints, they are still considered equivalent.

The PHP intl extension includes a class that allows you to do the Unicode normalization yourself, namely the Normalizer class. The documentation also includes a short description of what the 4 different normalization forms are. OS/X uses a slightly modified version of Normalization Form D (yes, nobody can ever standardize on anything).

This is how you would do this conversion yourself:

    <?php

    $before = "\xC3\xBC";
    $after = Normalizer::normalize($before, Normalizer::FORM_D);

    echo 'Before: ', urlencode($before), "\n";
    echo 'After: ', urlencode($after), "\n";
    ?>

Output:

    Before: %C3%BC
    After: u%CC%88

This normalization process for OS/X is also transparent. Whenever you try to open a file with the wrong normalization form, OS/X will put it in form D before opening.
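
One practical consequence: a name you wrote and the name glob() gives back may no longer be byte-identical. A minimal sketch of a comparison that ignores the normalization form, assuming the intl extension may or may not be loaded (sameFilename is a made-up helper; without intl it falls back to a plain byte comparison):

```php
<?php
// Hypothetical helper: compare two filenames regardless of whether they
// are stored in normalization form C or D.
function sameFilename(string $a, string $b): bool
{
    if (class_exists('Normalizer')) {
        // normalize() returns false when the input is not valid UTF-8
        $na = Normalizer::normalize($a, Normalizer::FORM_C);
        $nb = Normalizer::normalize($b, Normalizer::FORM_C);
        if ($na !== false && $nb !== false) {
            return $na === $nb;
        }
    }
    // No intl, or invalid UTF-8: only an exact byte match counts
    return $a === $b;
}

var_dump(sameFilename("uumlaut2_\xC3\xBC.txt", "uumlaut2_u\xCC\x88.txt"));
```

With intl loaded the last line should report the two names as equal; without it, the fallback treats them as different.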

Windows

Windows also uses UTF-16 to store filenames (using NTFS). Just like OS/X, this translation is done automatically by the filesystem APIs PHP uses. We'll start with the bell test:

    <?php
    file_put_contents("saved by the \x07.txt","contents");
    ?>

Output:

    Warning: file_put_contents(saved by the .txt): failed to open stream: Invalid argument in C:\Documents and Settings\Administrator\test\test.php on line 2

Indeed, Windows does not allow control characters such as bell. The second thing we'll try is the latin-1 encoded ü:

    <?php
    file_put_contents("uumlaut_\xFC.txt","contents");
    list($file) = glob('uumlaut_*');
    echo urlencode($file) . "\n";
    ?>

Output:

    uumlaut_%FC.txt

Not only did Windows accept this encoding, it also displayed correctly in both cmd.exe and Windows Explorer. So it appears that Windows and PHP actually translate from and to ISO-8859-1/latin1 (strictly speaking the system's ANSI codepage, which was CP-1252 on my machine) instead of UTF-8. When trying this with the UTF-8 encoding of ü this gets confirmed.

    <?php
    file_put_contents("uumlaut2_\xC3\xBC.txt","contents");
    list($file) = glob('uumlaut2_*');
    echo urlencode($file) . "\n";
    ?>

Output:

    uumlaut2_%C3%BC.txt

While Windows stores this correctly, the filename is now garbled in cmd.exe and Windows Explorer. Here it looks like ü. This is pretty bad. I do know that Windows does support Unicode, so I can't help but wonder what would happen if I do the exact opposite: making a file containing non-ASCII characters in Windows Explorer, and reading out the filename in PHP.
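
The garbled name is exactly what you get when the two UTF-8 bytes are read as two separate single-byte characters. A small sketch that reproduces this (latin1BytesAsUtf8 is a made-up name; it treats every byte as ISO-8859-1, which matches CP-1252 for the two bytes used here):

```php
<?php
// Illustration: interpret every byte of $bytes as one ISO-8859-1 character
// and return the result as UTF-8, so the misreading becomes visible.
function latin1BytesAsUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $cp = ord($byte);
        // Bytes below 0x80 are ASCII; the rest become a 2-byte UTF-8 sequence
        $out .= $cp < 0x80 ? $byte : chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return $out;
}

// The UTF-8 encoding of ü, misread one byte at a time:
echo urlencode(latin1BytesAsUtf8("\xC3\xBC")), "\n"; // %C3%83%C2%BC, which displays as Ã¼
```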

The results were interesting. I used the ü again, and 한글, which is the name of the Korean writing system, hangul. With 2 files in this directory, I simply did:

    <?php
    $files = glob('*');
    foreach ($files as $file) {
        echo urlencode($file), "\n";
    }
    echo "total: " . count($files) . "\n";
    ?>

Output:

    test.php
    uumlaut%FC.txt
    total: 2

My Korean file was completely missing. Just to make sure I did the same with scandir:

    <?php
    $files = scandir('.');
    foreach ($files as $file) {
        echo urlencode($file), "\n";
    }
    echo "total: " . count($files) . "\n";
    ?>

Output:

    .
    ..
    hangul_%3F%3F.txt
    test.php
    uumlaut%FC.txt
    total: 5

Oddly enough it did show up here. This time however, the Korean characters were replaced by %3F, which is, surprise: the question mark. We've seen characters replaced by question marks before, but this is the first time it ends up in a literal string.
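
A name that came back with literal question marks presumably can't be used to open the file again. A rough sketch for spotting such entries (unreachableEntries is a made-up helper, and the assumption that file_exists() fails for the mangled name is mine; note that broken symlinks would be reported too):

```php
<?php
// Hypothetical helper: list directory entries that scandir() returns
// but that can't be found again under the returned name.
function unreachableEntries(string $dir): array
{
    $entries = scandir($dir);
    if ($entries === false) {
        return []; // directory could not be read at all
    }

    $missing = [];
    foreach ($entries as $entry) {
        if (!file_exists($dir . DIRECTORY_SEPARATOR . $entry)) {
            $missing[] = $entry;
        }
    }
    return $missing;
}

print_r(unreachableEntries('.'));
```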

Conclusion

Using non-ASCII characters in filenames is messy. It would be possible to provide a consistent experience, if it weren't for Windows. Windows does have all the proper APIs to deal with international filenames, but I can only assume PHP simply does not support them. I do believe this was scheduled for PHP6, but now that's off the table. I hope the filesystem APIs are replaced even before the entire language is Unicode-based.

While the Linux solution (treat everything as binary, allow everything besides 0x00 and the slash) might seem like the most straightforward, in the end filenames are meant to be written or read by people, which means they have to be interpreted using some encoding.

The best system in this case really is OS/X, which not only treats everything as UTF-8, it also handles incorrect sequences well and makes sure that characters with an identical meaning are also always stored the same way (normalization).

Here's what I recommend:

If you want to support all characters on all operating systems in a consistent manner, you have no other option than to use an intermediate encoding. You could for instance simply urlencode all your filenames before writing them to disk.

URL-encoding does not mean you can forget about the encoding though. URL-encoding means that a different way is used to store certain bytes, but the characters they represent remain the same. Therefore, you should always make sure that the filenames you're using are valid UTF-8 sequences. UTF-8 is today's encoding of choice.
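
Put together, that recommendation could look roughly like this. It's a sketch, not a complete solution: encodeFilename and decodeFilename are made-up names, and things like maximum filename length (every non-ASCII byte grows to three characters) are not handled.

```php
<?php
// Hypothetical helpers: store filenames url-encoded, after making sure
// they are valid UTF-8 in the first place.
function encodeFilename(string $name): string
{
    if (preg_match('//u', $name) !== 1) {
        throw new InvalidArgumentException('Filename is not valid UTF-8');
    }
    return urlencode($name); // only letters, digits, "-", "_", "." and "+" remain
}

function decodeFilename(string $stored): string
{
    return urldecode($stored);
}

$stored = encodeFilename("\xED\x95\x9C\xEA\xB8\x80.txt"); // 한글.txt
echo $stored, "\n";                                       // %ED%95%9C%EA%B8%80.txt
echo decodeFilename($stored), "\n";
```

Because the stored name is pure ASCII, it should behave the same on all three operating systems tested above.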

If you are absolutely sure you will only use characters in the ISO-8859-1/latin-1 character set, the following table applies:

| Operating system | Recommended encoding |
|---|---|
| Windows | Encode using ISO-8859-1 |
| Linux | Encode using UTF-8 (will accept other encodings, but not recommended). |
| OS/X | Encode using UTF-8. Will transparently encode to normalization form D |
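
For the Windows row, the conversion from UTF-8 to ISO-8859-1 can be sketched by hand as follows. The name utf8ToLatin1 is made up, and returning null for anything that doesn't fit is just one possible design; an extension such as mbstring or iconv would be the more usual route.

```php
<?php
// Hypothetical helper: convert a UTF-8 filename to ISO-8859-1, or return
// null if it is invalid UTF-8 or contains a character above U+00FF.
function utf8ToLatin1(string $name): ?string
{
    // Split into characters; preg_match_all() returns false on invalid UTF-8
    if (preg_match_all('/./su', $name, $m) === false) {
        return null;
    }

    $out = '';
    foreach ($m[0] as $char) {
        if (strlen($char) === 1) {           // ASCII stays as it is
            $out .= $char;
            continue;
        }
        if (strlen($char) === 2) {           // 2-byte sequence: U+0080..U+07FF
            $cp = ((ord($char[0]) & 0x1F) << 6) | (ord($char[1]) & 0x3F);
            if ($cp <= 0xFF) {
                $out .= chr($cp);
                continue;
            }
        }
        return null;                         // not representable in latin1
    }
    return $out;
}

echo urlencode(utf8ToLatin1("uumlaut_\xC3\xBC.txt")), "\n"; // uumlaut_%FC.txt
var_dump(utf8ToLatin1("\xED\x95\x9C\xEA\xB8\x80.txt"));     // NULL (한글 does not fit)
```

Note that a name in normalization form D (u followed by U+0308) also returns null here, so it would have to be normalized to form C first.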

Here's a table of sequences and what happens on specific operating systems:

| url-encoded filename | description | Linux | OS/X | Windows |
|---|---|---|---|---|
| %07 | bell | %07 on disk | %07 on disk | throws error and doesn't save |
| %FC | ü in ISO-8859-1 | %FC on disk, question marks in UI's | %25FC on disk (%25 = %, so the literal string %FC on disk). | %FC on disk, correct in UI |
| %C3%BC | ü in UTF-8 normalization form C | %C3%BC on disk. correct in UI | u%CC%88 on disk, correct in UI | %C3%BC on disk, shows up as ü in UI's |
| u%CC%88 | ü in UTF-8 normalization form D | u%CC%88 on disk, correct in UI | u%CC%88 on disk, correct in UI | untested, but assumed to be similar to the last testcase. |

Configuration list

Lastly, the list of relevant software I used for this:

  • Windows
    • Tested on XP SP3 and 7
    • PHP 5.3.2 VC9 x86 build from windows.php.net
    • NTFS filesystem
  • Linux
    • Ubuntu 9.10
    • PHP 5.2.10 from ubuntu package repository
    • ext3 filesystem
  • OS/X
    • v10.6.3
    • PHP 5.3.1 as shipped with OS/X
    • HFS+ filesystem

22 Responses to Filesystem encoding and PHP

  1. 1918 Pierre 2010-04-20 6:08 pm

    hi,

    It is important to keep in mind that most POSIX (and affiliated) operating systems accept ASCII, and UTF-8 for non-ASCII file names. I'm talking about the APIs here. How it behaves in a shell is a different story and is dependent on the user settings.

    For windows, we have to use the Unicode API to support Unicode file names (among other things). But the current PHP implementation uses the ANSI API. To work around this limitation, you can convert the file name to the desired windows encoding. I do that with mbstring for example. But it would be easier if we had a function to fetch the encoding used for the current process, to help do the conversion correctly. I could add it in trunk already.

  2. 1935 Evert 2010-04-21 3:21 am

    Pierre,

    It would be nice if there's an API for this. Add it! =)

    I understand there are many different ANSI codepages, CP-1252 being the most common. CP-1252 is a superset of ISO-8859-1, which is what I assumed throughout my blog.

    So even though you can do some conversions for windows ANSI api in your applications, you will never be able to use the full unicode character spectrum.

    Is this correct?

    Evert

  3. 1937 Steve Clay 2010-04-21 3:25 am

    More caveats about Windows in my StackOverflow answer.

    1) If any individual byte matches an invalid Windows filesystem character in ISO-8859-1, you're probably out of luck.
    2) I'm betting Windows uses encodings other than ISO-8859-1 in non-English locales.

    Sending normalized UTF-8 through urlencode() is certainly the way to go on Windows IMO.

  4. 1941 Nicolas Grekas 2010-04-21 7:15 am

    Thank you for this article, this is the current painful truth, especially on Windows...

    Btw, on Windows, I've managed to deal with UTF-8 partially, using a combo of COM Scripting.FileSystemObject object and so called 8.3 ShortPath.
    Here is the code :
    http://github.com/nicolas-grekas/Patchwork/blob/lab/windows/class/WIN.php

    It is not bullet proof, as for example ShortPath support can be disabled on NTFS, but it should work quite well for experimenting at least.

  5. 1942 Pierre 2010-04-21 11:13 am

    Urlencode is not the way to go sorry. I cannot imagine having to deal with such file names :)

    The only clean way right now is to know the actual code page used by the process
    (easy if you are in control of the box). Then it is possible to safely convert the file name to this codepage using mbstring. But I'd have to say that it is not very user friendly.

    I was thinking about accepting UTF-8 by default as well on windows on PHP-next (whatever it will be). But that's something we have to do carefully :)

  6. 1943 Evert 2010-04-21 12:16 pm

    Pierre,

    I think urlencoding is a decent suggestion if preservation of all characters is desired, using for example a file manager.

    No doubt that it will be annoying to work with outside of a PHP application, but there really isn't another solution.

    Attempting to detect the current codepage, cumbersome or not does not solve the problem. What would I do with my Korean filename if I found out the codepage is CP-1252?

    Detecting the codepage only has use if you also only plan to store files with names that fit in that codepage.

    Evert

  7. 1944 Johannes 2010-04-21 12:21 pm

    POSIX Standard, section 3.170 specifies a filename as "A name consisting of 1 to {NAME_MAX} bytes used to name a file. The characters composing the name may be selected from the set of all character values excluding the <slash> character and the null byte. The filenames dot and dot-dot have special meaning. A filename is sometimes referred to as a "pathname component". " So all bytes are explicitly allowed. PHP has to follow this to allow access to all files.

  8. 1946 Evert 2010-04-21 1:11 pm

    @Johannes: what about windows? Clearly windows is not Posix based nor will it allow all bytes.

  9. 1956 Evert Pot’s Blog: Filesystem encoding and PHP | Development Blog With Code Updates : Developercast.com 2010-04-21 5:58 pm

    ...w post to his blog about working with files in your applications, more specifically in dealing with filesystem encodings other than some of the defaults. Many PHP applications save files to a local filesystem. Most of th...

  10. 6161 Philippe Verdy 2010-09-26 8:53 am

    The output from your PHP installation on Windows is easy to explain : you installed the wrong version of PHP, and used a version not compiled to use the Unicode version of the Win32 API. For this reason, the filesystem calls used by PHP will use the legacy "ANSI" API and so the C/C++ libraries linked with this version of PHP will first try to convert your UTF-8-encoded PHP string into the local "ANSI" codepage selected in the running environment (see the CHCP command before starting PHP from a command line window).

    Your version of Windows is MOST PROBABLY NOT responsible of this weird thing. Actually, this is YOUR version of PHP which is not compiled correctly, and that uses the legacy ANSI version of the Win32 API (for compatibility with the legacy 16-bit versions of Windows 95/98 whose filesystem support in the kernel actually had no direct support for Unicode, but used an internal conversion layer to convert Unicode to the local ANSI codepage before using the actual ANSI version of the API).

    Recompile PHP using the compiler option to use the UNICODE version of the Win32 API (which should be the default today, and anyway always the default for PHP installed on a server that will NEVER be Windows 95 or Windows 98...)

    Then Windows will be able to store UTF-16 encoded filenames (including on FAT32 volumes, even if, on these volumes, it will also generate an aliased short name in 8.3 format using the filesystem's default codepage, something that can be avoided in NTFS volumes).

    All what you describe are problems of PHP (incorrect porting to Windows, or incorrect system version identification at runtime) : reread the README files coming with PHP sources explaining the compilation flags. I really think that the makefile on Windows should be able to configure and autodetect if it really needs to use ONLY the ANSI version of the API. If you are compiling it for a server, make sure that the Configure script will effectively detect the full support of the UNICODE version of the Win32 aPI and will use it when compiling PHP and when selecting the runtime libraries to link.

    I use PHP on Windows, correctly compiled, and I absolutely DON'T know the problems you cite in your article.

    Let's forget now ***forever*** these non-UNICODE versions of the Win32 API (which are using inconsistantly the local ANSI codepage for the Windows graphical UI, and the OEM codepage for the filesystem APIs, the DOS/BIOS-compatible APIs, the Console APIs) : these non-Unicode versions of the APIs are even MUCH slower and more costly than the Unicode versions of the APIs, because they are actually translating the codepage to Unicode before using the core Unicode APIs (the situation on Windows NT-based kernels is exactly the reverse from the situation on versions of Windows based on a virtual DOS extender, such as Windows 95/98/ME).

    When you don't use the native version of the API, your API call will pass through a thunking layer that will transcode the strings between Unicode and one of the legacy ANSI or CHCP-selected OEM codepages, or the OEM codepage hinted on the filesystem: this requires additional temporary memory allocation within the non-native version of the Win32 API. This takes additional time to convert things before doing the actual work by calling the native API.

    In summary: the PHP binary you install on Windows MUST be different depending on if you compiled it for Windows 95/98/SE (or the old Win16s emulation layer for Windows 3.x, which had a very minimum support of UTF-8, only to support the Unicode subsets of Unicode used by the ANSI and OEM codepages selected when starting Windows from a DOS extender) or if it was compiled for any other version of Windows based on the NT kernel.

    The best proof that this is a problem of PHP and not Windows, is that your weird results will NOT occur in other languages like C#, Javascript, VB, Perl, Ruby... PHP has a very bad history in tracking versions (and too many historical source code quirks and wrong assumptions that should be disabled today, and an inconsistant library that has inherited all those quirks initially made in old versions of PHP for old versions of Windows that are even no longer officially supported, by Microsoft or even by PHP itself !).

    In other words : RTFM ! Or download and install a binary version of PHP for Windows precompiled with the correct settings : I really think that PHP should distribute Windows binaries already compiled by default for the Unicode version of the Win32 API, and using the Unicode version of the C/C++ libraries : internally the PHP code will convert its UTF-8 strings to UTF-16 before calling the Win32 API, and back from UTF-16 to UTF-8 when retrieving Win32 results, instead of converting PHP's internal UTF-8 strings back/to the local OEM codepage (for the filesystem calls) or the local ANSI codepage (for all other Win32 APIs, including the registry or process).

    Another thing in your conclusion: you say that on Windows, PHP used ISO-8859-1, this is plain wrong ! It used your local OEM codepage (probably 437 if you're in US, or 850 if you're in Western Europe). This is determined at runtime in the C/C++ library linked to PHP.

  11. 6164 Evert 2010-09-26 9:49 am

    Big wall of text!

    1. If other languages can use the Unicode APIs, this does not imply that PHP does as well.. and this is certainly not proof.
    2. I used a binary from the PHP site, so I didn't build from source. Why would I want to use a binary compatible for a 12 year old operating system (Windows 98, really?).
    3. The default 'codepage' for me was 1252, 437/850 are older codepages, which are now replaced. cp-1252 is a superset of ISO-8859-1. My advice was to use ISO-8859-1 if you need to encode paths on windows. This still standards, as valid ISO-8859-1 strings will also be valid CP-1252 (just not the other way around).

    Now take a deep breath, and try again.

  12. 6165 Evert 2010-09-26 9:49 am

    s/standards/stands.

  13. 7148 Robert Johnson 2010-11-15 4:49 pm

    I found this a very useful article, thank you.

    In reply to Philippe's rant, I can confirm that only filenames encoded with the local language are returned by both glob() and SPL classes like DirectoryIterator. This is on Windows XP with PHP 5.3.3 compiled with VC9, from the production binaries. Everything in the main article is factually correct.

    It is a great shame because as noted by everyone, Windows stores files in Unicode and files in all languages can sit side by side. I first noticed it when I tried to store the file name returned by DirectoryIterator in a MySQL UTF8 field, and MySQL rejected it.

    Until PHP6 is out, COM is the best solution, as pointed out by Nicolas. When a COM class is created, add 'UTF-8' as the third argument to COM() - it will convert all Unicode UCS2 strings returned from the COM object to UTF-8 in PHP, which is exactly what is needed in the directory functions.

  14. 7149 Robert Johnson 2010-11-15 5:01 pm

    I'd also like to add that Philippe is correct that it is not the fault of Windows - it is a PHP coding problem on Windows. One of Windows' biggest strengths since NT has been its excellent multilingual support.

  15. 7431 Philippe Verdy 2010-12-31 10:38 am

    My « rant » was absolutely not a rant.

    And you absolutely don't understand encoding issues when you say « only filenames encoded with the local language are returned by... ». There's NO SUCH thing like encoding by language ! And the SPL is completely unrelated here. DirectoryIterator is not the issue.

    You have absolutely not understood how PHP interacts with Windows : PHP only stores strings as sequences of 8-bit bytes (one for each of its "characters"). It requires an internal encoding, but it does not say how the directory iterator is implemented; At the root, it uses the Win32 API which exists in TWO distinct forms : One works with Unicode but only supports UTF-16 (code units are WCHAR), the other one is 8 bit only and uses the local system's code page for the File I/O Win32 API (another API also exists to emulate MSDOS and will convert one of these to the codepage of the user's current codepage selected in the user's environment, i.e. the OEMCP codepage).
    The OEMCP codepage may change dynamically, there's also the ANSICP codepage which never varies on a system, but is used for compatibility with the old Win16 API : the exact codepage depends on the localization of Windows.
    Independently of these settings, the filesystems (FAT/FAT32/exFAT or NTFS or others) MAY store filenames in some 8-bit encoding (for short filenames in the 8.3 format, which is the only one exposed by the Win16 API and the MSDOS API), as well as in a Unicode-encoded string (compatible with long filenames, and preferably used as the reference on Win32 kernels). This means that Windows will store multiple filenames for the same file, and the one selected will preferably be th

  16. 7432 Philippe Verdy 2010-12-31 10:49 am

    will be preferably this Unicode (UTF-16) encoded string (if it's present and stored, for example on NTFS, or FAT32 or CDROM with Joliet extensions). Otherwise Windows will use the 8-bit encoded filename and will have to "guess" (or infer) in which 8-bit encoding it was stored (FAT filesystems do not explicitly state in a property of the volume which encoding it uses ; the same is true for SMBFS for remote directories), in order to convert it to Unicode for internal handling.

    Then windows will reencode this name into the encoding supported by the API used to enumerate the directory contents.

    What you don't understand is that PHP (independently of the SPL which is built on top of it) uses the WIN32 API, but does not explicitly state which version (Unicode or ANSICP) it uses : each of these APIs actually have two names, and a compiler macro (in C/C++) maps the generic name to one or the other API. When PHP is compiled, it will use the definition of the "_UNICODE" macro to indicate to the compiler's declaration headers of the Win32 APIs, which one it will use.

    If PHP is not compiled with _UNICODE defined, it will use the ANSICP API, so the Windows kernel will convert ANY unicode string read from (or converted from) the stored filesystem into the ANSICP codepage (which depends on installation parameters of Windows): this means that lots of Unicode characters will not be mapped and lots of differences will be lost. PHP cannot recover those differences, independently of how it will store its internal strings and expose them to the SPL.

  17. 7433 Philippe Verdy 2010-12-31 10:51 am

    Only one person ranted here : Evert, who simply does not understand what he says.

  18. 7434 Philippe Verdy 2010-12-31 11:07 am

    In conclusion, the problem is not Windows, but the incorrect port of PHP on Windows, with incorrect usage of its API, and incorrect management/tracking of which encoding is used (and assumed) by each API.
    Yes the PHP binary you download from the PHP.net site has been compiled incorrectly and incorrectly ported, because it is still compiled to be binary compatible with Windows 95/98 (which did not have Unicode support in its kernel). So this API assumes the ANSICP codepage (which varies across Windows installations).

    And no, there's no such support for ISO 8859-1. The ANSICP codepage MAY be Windows-1252 or something else (for example in Greece, Russia, Asia, India), but never ISO 8859-1 ! And if you don't use the Unicode version for the Win32 API when porting/compiling PHP, you will NEVER get the benefits of the native Unicode support in the Win32 kernel (independantly of the stored filesystem format and its encoding capabilities).

    Moral : your PHP installation is bogus, not Windows itself. Recompile PHP with Unicode support as the default for the Win32 API (just #define the _UNICODE macro in the build file for the C/C++ compiler), and don't use the binary you find on PHP.net which was not compiled with it (and does not correctly manage the Win32 API for File I/Os), because it is definitely NOT compatible with Unicode-enabled filesystems.

    And DON'T assume that even ISO 8859-1 is safe (the article above just demonstrates that this does not even work for the U WITH DIAERESIS, despite it being fully present in Unicode (UTF-8 or UTF-16), in ISO 8859-1, in the "Windows-1252" ANSICP codepage, and in the "CP437" or "CP850" codepages for OEMCP).

    You can even enable a C/C++ compiler option that will display compiler warnings if you ever use a non-Unicode Windows API instead of the Unicode-enabled API. Read the compiler manual ! PHP developers have completely forgotten to do that, only because they only tested their code for ASCII and did not care about anything else.

  19. 7437 Evert 2011-01-02 1:22 am

    Thanks for the insights Philippe, not so much for the insults.

  20. 7441 Pierre 2011-01-02 1:17 pm

    @Philippe Verdy

    I suppose by wrongly compiled you mean something like not using the unicode Flag, right? That has nothing to do with wrong compilation or builds but a matter of compatibility. Windows has Wide chars APIs or ANSI APIs. We use (and 99% of the libraries PHP relies on) the ANSI version.

    It can work perfectly with unicode filesystems as well, just like it does on all the servers where PHP is used (NTFS is unicode). However it has issues to deal with non ANSI filenames as you have to guess the runtime encoding.

    Also a side note I forgot to mention earlier, windows do not use UTF-16 but UCS-2.

  21. 7443 Bruce Weirdan 2011-01-02 4:02 pm

    To confirm Philippe's observation that iso-8859-1 is not safe here's another post (in Russian): http://community.livejournal.com/ru_php/1319838.html
    That guy was having a problem reading filenames containing umlauts (created from outside the php); the umlauts were dropped when reading a directory via scandir on Russian WinXP, turning 'Hän' to 'Han'.
    The post is a bit old, but I don't think much has changed since then.

    @Pierre: now that Microsoft and PHP Team started collaborating do you think we can expect official unicode php builds anytime soon?

  22. 7444 Pierre 2011-01-02 8:47 pm

    @Bruce Weirdan
    I'm working (being alone working on the core btw, from a win pov) on a patch to make PHP IO functions accept UTF-8 as it does on other platforms (converting to UCS-2, aka Unicode on windows, behind the scenes).

    We can't and won't provide builds with the UNICODE flag enabled. The reasons are all the unknown places (underlying libs) with no unicode support. I will provide patches (or patch them for our binaries distributions) to each of them. The critical ones support custom IO layers (like libxml, gd, curl), so that's less of a problem :)

About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.
