basename() is locale-aware

For years I assumed that:

  $baseName = basename('dir/file');

was just an easy way to do:

  $file = 'dir/file';
  $baseName = substr($file, strrpos($file, '/') + 1);

It turns out basename() does a bit more than just split the string at the last slash, because it is locale-aware. In my case I was dealing with a multi-byte UTF-8 string. It took me quite some time to figure out what was going on, because I was testing from the console, which had the en_US.UTF-8 locale, while the bug only appeared under Apache, which defaults to the C locale.
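When behaviour differs between the console and the web server like this, it can help to print the locale each environment is actually running under. A minimal sketch, assuming nothing beyond PHP's own setlocale(); the helper name currentCtypeLocale() is made up for illustration:

```php
<?php

// Hypothetical helper: report the active LC_CTYPE locale.
// Passing "0" asks setlocale() for the current setting without changing it.
function currentCtypeLocale(): string
{
    $locale = setlocale(LC_CTYPE, '0');

    // setlocale() returns false when the locale cannot be queried.
    return $locale === false ? 'unknown' : $locale;
}

// PHP_SAPI tells you whether this is running as "cli", under Apache, etc.
echo PHP_SAPI . ': ' . currentCtypeLocale() . "\n";
```

Running the same script from the console and through the web server should show whether the two really use different locales.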

Example:

  <?php

  // "àfoó" encoded as UTF-8
  $str = urldecode('%C3%A0fo%C3%B3');

  // Apache's default locale
  setlocale(LC_ALL, 'C');
  echo urlencode(basename($str)) . "\n";

  // The locale my console was using
  setlocale(LC_ALL, 'en_US.UTF-8');
  echo urlencode(basename($str)) . "\n";

  ?>

Output:

  fo%C3%B3
  %C3%A0fo%C3%B3

What bugs me about this is that there was no way for me to know that basename() operates on anything other than bytes. The PHP manual doesn't point this out either. It makes me wonder how many other string functions change their behaviour based on the current locale.
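If all you need is the part after the last slash, one way around the problem is to do the split on bytes yourself. The following is only a sketch, not a drop-in replacement for basename(): the name byteBasename() is made up, it only treats '/' as a separator, and it has no suffix argument. It is safe for UTF-8 input because the byte 0x2F ('/') never occurs inside a multi-byte UTF-8 sequence.

```php
<?php

// Hypothetical helper: return everything after the last '/' byte,
// regardless of the current locale.
function byteBasename(string $path): string
{
    // Like basename(), ignore trailing slashes ("dir/" gives "dir").
    $path = rtrim($path, '/');

    $pos = strrpos($path, '/');

    // No slash at all: the whole string is the base name.
    return $pos === false ? $path : substr($path, $pos + 1);
}

$str = urldecode('%C3%A0fo%C3%B3');

setlocale(LC_ALL, 'C');
echo urlencode(byteBasename($str)) . "\n"; // %C3%A0fo%C3%B3
```

Note the explicit check for false: the substr()/strrpos() one-liner from the top of this post drops the first character when the path contains no slash at all.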

About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.
