Creating streams from strings in PHP

I'm in the process of writing an API that relies on (file-)streams to be passed around.

There are situations where a string needs to be used instead, and for these purposes the data: stream wrapper is used. Initially I thought it was only possible to encode the actual string in base64, which I didn't like because of the added footprint.

  <?php

  $string = "I should have really done some laundry tonight.";

  // Wrap the string in a base64-encoded data: URI and open it as a stream
  $stream = fopen('data://text/plain;base64,' . base64_encode($string), 'r');

  echo stream_get_contents($stream);

  ?>

Quickly checking the RFC (RFC 2397), it turns out that ';base64' can be omitted to just pass along the raw data, which makes a lot more sense in the context of PHP.

Thankfully, PHP gladly supports it:

  <?php

  $string = "I tried, honestly!";

  // Same idea, but without ';base64' the data is passed along as-is
  $stream = fopen('data://text/plain,' . $string, 'r');

  echo stream_get_contents($stream);

  ?>
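
One caveat worth illustrating: as far as I can tell, PHP URL-decodes the payload of a non-base64 data: URI, so a string containing a percent sign or a plus may not come back unchanged. The sketch below shows two ways around that. Both function names are made up for this example, and php://temp is a different technique from the data: wrapper used above.

```php
<?php

// Hypothetical helper: wrap a string in a data: URI without base64.
// rawurlencode() protects characters such as '%' and '+' that the
// wrapper would otherwise (as far as I know) URL-decode.
function stringToDataStream(string $string)
{
    return fopen('data://text/plain,' . rawurlencode($string), 'r');
}

// Hypothetical alternative: write the string to a temporary stream.
// No encoding is involved at all.
function stringToTempStream(string $string)
{
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $string);
    rewind($stream); // move back to the start so readers see the data

    return $stream;
}

$string = "100% done + some laundry";

echo stream_get_contents(stringToDataStream($string)), "\n";
echo stream_get_contents(stringToTempStream($string)), "\n";
```

Note that rawurlencode() triples the size of every character it escapes, so for binary data the base64 form may well end up smaller.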

Apache speed and reverse proxies

In our environment we use Apache everywhere. Its PHP integration has so far proven superior. Now we're dealing with higher loads and we've hit some limitations.

One of the problems we had is Apache's heaviness. Our apache2 worker processes eat up around 20 megabytes of memory each, which with 3 GB of memory brings us to a MaxClients setting of around 150. Rasmus seems to think that's a pretty high setting, but based on the easy calculation (memory available for Apache / size of an Apache process) it works out for us.
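
The calculation can be sketched in a few lines. The function name is hypothetical, the inputs are the approximate figures mentioned above, and the sketch assumes all 3 GB is available to Apache, which will rarely be exactly true.

```php
<?php

// Rough MaxClients estimate: memory available to Apache divided by the
// size of one worker process. Measure your own numbers before relying
// on the result.
function estimateMaxClients(int $availableMemoryMb, int $processSizeMb): int
{
    return intdiv($availableMemoryMb, $processSizeMb);
}

$availableMemoryMb = 3 * 1024; // 3 GB, assuming all of it goes to Apache
$processSizeMb     = 20;       // approximate size of one apache2 worker

echo estimateMaxClients($availableMemoryMb, $processSizeMb), "\n"; // 153
```

Rounding that down to 150 leaves a little headroom for everything else running on the machine.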

Effectively this means we can serve approximately that many parallel requests on this machine. It is therefore in our best interest to get every response out as quickly as possible, increasing the number of requests we can handle per second.

Going beyond this 150 number could cause Linux to start using swap. This is bad, because it will add latency to the response, which in turn will result in connections staying open longer.

Since we're sending everything over the web, there is a baseline latency. Information traveling to the other side of the globe will take at least 67 ms, because we're restricted to the speed of light. This doesn't even take indirect routes or other hardware latency into account. According to Till, this all adds to the time a single Apache process is tied up before it can work on the next request.
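
For the curious, the 67 ms figure can be reproduced with a quick calculation. This is only a sketch: it assumes "the other side of the globe" means half of Earth's circumference, roughly 20,000 km, and the function name is made up.

```php
<?php

// Lower bound for one-way latency: distance divided by the speed of light.
const SPEED_OF_LIGHT_KM_PER_S = 299792.458;

function minimumLatencyMs(float $distanceKm): float
{
    return $distanceKm / SPEED_OF_LIGHT_KM_PER_S * 1000;
}

// Assumption: half of Earth's circumference is roughly 20,000 km.
$oneWay = minimumLatencyMs(20000.0);

// Prints "one way: 67 ms, round trip: 133 ms"
printf("one way: %.0f ms, round trip: %.0f ms\n", $oneWay, $oneWay * 2);
```

A request and its response each have to make the trip, so the floor for a full exchange is about twice the one-way number.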

The reverse proxy

There are a couple of webservers which seem to be optimized for serving lots of clients. Lighttpd got a lot of traction earlier, but the project seems to have slowed down a lot, as the much anticipated 1.5 release has been under development for almost 2 years. nginx seems to have taken its place in terms of disruptiveness. These servers are much more lightweight, and are supposed to be faster at delivering static files.

Much like Till, we've had issues hooking PHP directly into these servers. Till suggests the solution of actually placing nginx in front of Apache (on the same machine) as a reverse proxy. Nginx takes care of serving static files and proxies any PHP request to Apache. The concept is that Apache can push out the response as quickly as possible, and while Nginx is working on delivering it to the (slow) client Apache can take on other work.
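
To make the idea concrete, here is a minimal sketch of what such an nginx configuration could look like. It is not Till's actual setup: the hostname, document root and the port Apache listens on (8080) are all assumptions for the example.

```nginx
server {
    listen 80;
    server_name example.org;          # hypothetical hostname
    root /var/www/example/public;     # hypothetical document root

    # Static files are served by nginx directly; anything that does not
    # exist on disk is handed to Apache.
    location / {
        try_files $uri @apache;
    }

    # PHP requests always go to Apache.
    location ~ \.php$ {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location @apache {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Because nginx buffers the proxied response by default, Apache can hand off its output and move on while nginx deals with the slow client.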

The thing that bothers me with this setup is the need for 2 webserver products to achieve a single task. This implies that neither of them is adequate on its own to do the job.

On the other hand, this type of setup is also what a lot of people seem to be doing by placing Squid in front of their webservers, although that tends to happen on separate hardware.

HTTP/1.1 100 Continue

All of a sudden we noticed a problem we saw earlier with Lighttpd (Bug #1017) was also an issue in nginx (couldn't find bug or bug tracker at all). Neither of them seems to support the Expect: 100-continue header. While no browser actually sends these headers, we have webservices running which are directly accessed by other types of HTTP clients. Losing support for this HTTP functionality would instantly break their applications, which is unacceptable.
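
For those unfamiliar with the mechanism: a client sends only its request headers, including Expect: 100-continue, and holds back the body until the server answers with an interim "HTTP/1.1 100 Continue" status line. The sketch below covers just those two pieces; the function names, host and path are invented for the example, and a real client would also need a timeout and error handling.

```php
<?php

// Build the request head for a PUT that asks permission before sending
// its body.
function buildExpectContinueHead(string $host, string $path, int $bodyLength): string
{
    return "PUT {$path} HTTP/1.1\r\n"
        . "Host: {$host}\r\n"
        . "Content-Length: {$bodyLength}\r\n"
        . "Expect: 100-continue\r\n"
        . "Connection: close\r\n"
        . "\r\n";
}

// True when the server's interim status line invites the client to go
// ahead and send the body.
function isContinueResponse(string $statusLine): bool
{
    return preg_match('#^HTTP/1\.[01] 100\b#', $statusLine) === 1;
}

// Intended flow on an open socket (not executed here):
//   fwrite($socket, buildExpectContinueHead('example.org', '/upload', strlen($body)));
//   if (isContinueResponse(fgets($socket))) { fwrite($socket, $body); }

echo buildExpectContinueHead('example.org', '/upload', 1024);
```

A proxy that does not understand the header never sends the interim response, so a client like this one sits waiting until it times out or gives up.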

So now we're actually looking at Squid for performing that task. Squid is powerful and well tested. We're going to start load testing this reasonably soon, and I have no problem reporting back here if people are interested in numbers. I'm wondering if there are other people who have tried a similar setup, or if there are better ways to approach this problem.

A case against pagers

Rant warning. Pagers seem to be such a common interface element on many websites. I personally have some issues with them, as I feel they serve to solve a technical problem, and not so much a usability problem. In fact, I would argue that a pager works against usability.

The number one means to properly organize large amounts of content in any desktop application is fortunately built right into your browser. In fact, you might see one right now on the far right of this screen: it's a scrollbar!

The biggest argument against pagers is that I cannot use Option-F or Apple-F when I quickly want to search through a big table of information. I'd need to either rely on the web application to provide a search option, or manually go through the pages to find what I want. Besides that, scrolling is built straight into my mouse, and I don't have to look for a pager and click it.

So I would argue that pagers mostly solve a technical issue. It is unacceptable to download large amounts of data in a browser, as this would increase bandwidth use (cost! speed!). It makes me wonder whether there are better ways to present this.

Worse!

A lot of high profile news sites even utilize pagers within their articles. We're really just talking about textual content here.

I have a fair idea where these pagers stem from, which is actually another point of annoyance for me: the outdated perception that ad value directly relates to page views.

Conclusively

I've needed them quite a bit myself, especially since we're doing a lot with media and galleries, which are notoriously heavy. Even when bandwidth isn't the issue, I've definitely seen a large number of images slow my browser down quite a bit. The fewer pagers, the better though.

PHPUnit: A second look

Somewhere in 2007 I had a deep dive into PHPUnit, and there were a couple of things that bugged me.

Looking into it again, it turns out that since then everything has been fixed, making it perfect for integrating into SabreDAV. Most of the protocol-level WebDAV stuff is tested with litmus, but having good unit tests will help ensure a high quality of the inner business logic. Already it has identified 2 spelling mistakes :).

My highlights:

  • No need to write 'TestSuites' anymore.
  • A 'bootstrap' setting. This file contains all the application-specific
    logic to set up the tests.
  • The code coverage analysis is perfected.
  • The ability to specify which directories to include in the code coverage.
  • As a bonus, all these settings can be specified through a single XML
    configuration file. This way I can simply tell users to run
    'phpunit --configuration=config.xml' and they're off.

SabreDAV 0.5

A new year, a new SabreDAV. This version adds some new features and fixes some bugs.

SabreDAV is a library, intended to easily create WebDAV frontends for existing PHP applications. At the office we're using it to access certain parts of our web application directly from our filesystem.

Download it here.

Most notable changes:

  • Added: a simple example for implementing a mapping to PHP file streams.
    This should allow easy implementation of, for example, a WebDAV to FTP proxy.
  • Added: HTTP Basic Authentication helper class.
  • Updated: Backwards compatibility break: all require_once() statements are removed
    from all the files. It is now recommended to use autoloading of
    classes, or just including lib/Sabre.includes.php. This fix was made
    to allow easier integration into applications not using this standard
    inclusion model.
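
For anyone who hasn't set up autoloading before, a minimal sketch follows. It assumes the library's classes follow the underscore naming convention (a class such as Sabre_DAV_Server living in lib/Sabre/DAV/Server.php) and that the library sits in a lib/ directory next to this script. The helper function name is made up for the example.

```php
<?php

// Hypothetical mapping: Sabre_DAV_Server -> <libDir>/Sabre/DAV/Server.php
function sabreClassToPath(string $className, string $libDir): string
{
    return $libDir . '/' . str_replace('_', '/', $className) . '.php';
}

spl_autoload_register(function (string $className): void {
    // Only handle classes from this library; leave the rest to other loaders.
    if (strpos($className, 'Sabre_') !== 0) {
        return;
    }

    // Assumption: the library lives in a lib/ directory next to this file.
    $path = sabreClassToPath($className, __DIR__ . '/lib');
    if (is_file($path)) {
        require $path;
    }
});
```

With something like this registered, classes are loaded the first time they are used, and applications with their own loading scheme can swap in whatever mapping they prefer.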

Full Changelog.

BBC drops microformats from programmes section.

A few days ago, I read on the BBC Radio Labs Blog they are removing microformats from their programmes section.

Although I can't watch the BBC from where I'm living, I find this very interesting news. The reason they want to drop microformats mainly seems to be related to misuse of HTML tags, causing problems for people with disabilities. Although they clearly mention this is mainly causing problems for the formats using the so-called Abbr design pattern, I think this brings up a bigger problem.

I don't think the semantics of HTML should be extended this way. Although this might have been the original intention while developing the HTML standard, I think in this day and age it is very difficult to add meaning to standard HTML tags without affecting the user experience. This will work fine if you look at HTML as a transport mechanism for data, but not if the document also needs to be opened by the user, in a browser.

I totally get having a separate HTML document, one for just carrying microformat-standardized data, but what is the benefit of mixing it with the HTML document the standard browsers get served? And if anything, why not base it on a separate XML namespace and embed it in XHTML files? After all, XML was intended to be easily extendable. This is also how RDFa works.

Using RDFa still doesn't fix the browser-compatibility issues though. I think it would be perfectly acceptable to serve the machine-readable data from different URLs. That way it's possible to use the full XML and HTTP (REST) stack. Atom has done this really well and is still detectable by a browser, because of its integration using the <link> tag. The other benefit is that the same degree of liberty can be retained in writing the presentation HTML, while you can demand higher strictness in the XML format.

One of the things Atom does not cover is a correlation between a specific section within a browser-readable HTML document and semantic data, but I'm sure this could be solved by referencing ids, or using XSLT like Mozilla's Microsummaries standard does.

The last argument for microformats is that semantically only one URI should represent an entity (piece of data). The HTTP standard also has a solution for this, as a client could simply use the Accept header to ask for the representation it wants.
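
As a rough illustration of that last point, the sketch below picks a representation for a single URI based on the Accept header. It is deliberately simplistic (it ignores q-values and wildcards), and the function name and the two representation labels are invented for the example.

```php
<?php

// Pick a representation for one URI based on the request's Accept header.
// Simplistic on purpose: q-values and wildcards such as */* are ignored,
// and the first recognised media type wins.
function chooseRepresentation(string $acceptHeader): string
{
    $accepted = array_map(
        function (string $range): string {
            // Drop parameters such as ";q=0.9" and normalise case.
            return strtolower(trim(explode(';', $range)[0]));
        },
        explode(',', $acceptHeader)
    );

    foreach ($accepted as $type) {
        if ($type === 'application/atom+xml' || $type === 'application/xml') {
            return 'xml';
        }
        if ($type === 'text/html' || $type === 'application/xhtml+xml') {
            return 'html';
        }
    }

    return 'html'; // default to the browser-readable document
}

echo chooseRepresentation($_SERVER['HTTP_ACCEPT'] ?? 'text/html'), "\n";
```

A response chosen this way should also carry a "Vary: Accept" header, so caches don't hand the XML version to a browser.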

So I guess the question I'm posing is: what is the benefit of embedding machine-readable data in HTML over serving it as a separate document, given that it seems to make implementation more difficult?

Devshed article about SQL Injection (or why security related articles should only be written by experienced people)

Through PHPDeveloper I came across a Devshed article related to SQL Injection.

The one major flaw in the article is that it suggests input validation is enough protection. This is not the case. Let's start with this example (literal copy-paste):

  <?php

  //Validate text input
  if (! preg_match('/^[-a-z.-@,'s]*$/i',$_POST['name']))
  {
  } else
  if ($empty==0)
  {
  }
  else
  {
  }
  ?>

Ignoring the syntax errors, I believe the author here actually intended to allow the usage of the single quote character ('). This will allow SQL injection in a lot of queries.
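
To show why that matters, here is a small sketch. The pattern is my guess at what the article meant to allow (letters, dots, hyphens, '@', commas, single quotes and whitespace), and the users table is hypothetical.

```php
<?php

// Assumption: roughly what the quoted pattern was meant to allow.
$pattern = '/^[a-z.@,\'\s-]*$/i';

// Only letters, spaces and single quotes, so it passes the whitelist.
$name = "x' OR 'a' LIKE 'a";

var_dump(preg_match($pattern, $name) === 1); // bool(true)

// Building the query by concatenation lets the quote end the string literal.
$query = "SELECT * FROM users WHERE name = '" . $name . "'";

// Prints: SELECT * FROM users WHERE name = 'x' OR 'a' LIKE 'a'
echo $query, "\n";
```

The input is perfectly "valid" according to the check, yet the resulting WHERE clause matches every row.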

The worst part is the following:

Want stronger protection? If you need stronger protection you can validate the user input using the above scripts and mysql_real_escape_string; this will offer secondary protection in case the above validation scripts fail due to some reason. Discussing this feature is beyond the scope of this article and you can read useful resources on: http://www.php.net/mysql_real_escape_string

However, before you can use this feature, you must be connected to a MySQL database, or else it will return an error. Some really talented hackers can play around with mysql_real_escape_string, which is why it is highly recommended to have a double filter in your scripts (validating scripts + mysql_real_escape_string) to make hacking much more difficult.

Here it is suggested that mysql_real_escape_string should be used only in the event somebody feels they need 'more' protection. Also, I would like to meet these talented hackers.

What you really should do

  • Always use mysql_real_escape_string. Escaping and validating serve two (important) purposes, and validating does not take away the need to escape. mysql_real_escape_string escapes many more 'bad' characters and it works with different collations.
  • Blacklisting 'bad' things is in most cases a bad idea. Always try to use whitelists for acceptable input.
  • Don't write security related articles if you don't really know the subject. You are potentially placing people at risk, and you promote bad habits.
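
A note for readers on current PHP versions: mysql_real_escape_string belongs to the old mysql extension, which was removed in PHP 7.0. The sketch below uses a different technique from the one recommended in the list, namely a prepared statement through PDO, which keeps the value out of the SQL text altogether. The in-memory SQLite database, the users table and the function name are all invented to keep the example self-contained, and it assumes the pdo_sqlite extension is available.

```php
<?php

// In-memory SQLite keeps the sketch self-contained; the users table and
// its contents are made up for the example.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE users (name TEXT)');
$db->exec("INSERT INTO users (name) VALUES ('alice'), ('bob')");

function findUsersByName(PDO $db, string $name): array
{
    // The placeholder keeps the value out of the SQL text entirely.
    $statement = $db->prepare('SELECT name FROM users WHERE name = ?');
    $statement->execute([$name]);

    return $statement->fetchAll(PDO::FETCH_COLUMN);
}

var_dump(findUsersByName($db, 'alice'));             // one match
var_dump(findUsersByName($db, "x' OR 'a' LIKE 'a")); // no match, no injection
```

Validation still has its place here, for the same reason given in the first point: it serves a different purpose than escaping or binding.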

About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.
