Browsing all posts tagged php

Even though the site aggravated me at first, I still occasionally troll Stack Overflow. One of the leading problems I see in questions pertaining to PHP & MySQL, is people's use of the MySQL extension in PHP. This extension, it turns out, is being deprecated. But does the documentation reflect this fact? Yes and no.

Certain function pages, such as mysql_real_escape_string, have big red boxes at the top indicating that the extension is being deprecated. "Don't use this", they seem to shout. Other function pages, however, such as the mysql_result page, don't have these warnings. Likewise, the top-level MySQL Drivers and Plugins page lists the MySQL extension first, with no indication whatsoever that the extension is being deprecated.

At the very least, every single documentation page that deals with the MySQL extension in any form or fashion, needs to include information about its intended deprecation. Otherwise, thousands upon thousands of programmers will write code using a plugin that is quickly nearing it's end-of-life. Which, based on what I see at Stack Overflow, already seems to be the case.

A couple of years ago, I blogged about two helper functions I wrote to get HTML form data in PHP: getGet and getPost. These functions do a pretty good job, but I have since replaced them with a single function: getData. Seeing as I haven't discussed it yet, I thought I would do so today. First, here's the function in its entirety:

/**
 * Obtains the specified field from either the $_GET or $_POST arrays
 * ($_GET always has higher priority using this function). If the value
 * is a simple scalar, HTML tags are stripped and whitespace is trimmed.
 * Otherwise, nothing is done, and the array reference is passed back.
 * 
 * @return The value from the superglobal array, or null if it's not present
 * 
 * @param $key (Required) The associative array key to query in either
 * the $_GET or $_POST superglobal
 */
function getData($key)
{
    if(isset($_GET[$key]))
    {
        if(is_array($_GET[$key]))
            return $_GET[$key];
        else
            return (strip_tags(trim($_GET[$key])));
    }
    else if(isset($_POST[$key]))
    {
        if(is_array($_POST[$key]))
            return $_POST[$key];
        else
            return (strip_tags(trim($_POST[$key])));
    }
    else
        return null;
}

Using this function prevents me from having to do two checks for data, one in $_GET and one in $_POST, and so reduces my code's footprint. I made the decision to make $_GET the tightest binding search location, but feel free to change that if you like.

As you can see, I first test to see if the given key points to an array in each location. If it is an array, I do nothing but pass the reference along. This is very important to note. I've thought about building in functionality to trim and strip tags on the array's values, but I figure it should be left up to the user of this function to do that work. Be sure to sanitize any arrays that this function passes back (I've been bitten before by forgetting to do this).

If the given key isn't found in either the $_GET or $_POST superglobals, I return null. Thus, a simple if(empty()) test can determine whether or not a value has been provided, which is generally all you care about with form submissions. An is_null() test could also be performed if you so desire. This function has made handling form submissions way easier in my various work with PHP, and it's one tool that's worth having in your toolbox.

I ran into an interesting phenomenon with PHP and MySQL this morning while working on a web application I've been developing at work. Late last week, I noted that page loads in this application had gotten noticeably slower. With the help of Firebug, I was able to determine that a 1-second delay was consistently showing up on each PHP page load. Digging a little deeper, it became clear that the delay was a result of a change I recently made to the application's MySQL connection logic.

Previously, I was using the IP address 127.0.0.1 as the connection host for the MySQL server:

$db = new mysqli("127.0.0.1", "myUserName", "myPassword", "myDatabase");

I recently changed the string to localhost (for reasons I don't recall):

$db = new mysqli("localhost", "myUserName", "myPassword", "myDatabase");

This change yielded the aforementioned 1-second delay. But why? The hostname localhost simply resolves to 127.0.0.1, so where is the delay coming from? The answer, as it turns out, is that IPv6 handling is getting in the way and slowing us down.

I should mention that I'm running this application on a Windows Server 2008 system, which uses IIS 7 as the web server. By default, in the Windows Server 2008 hosts file, you're given two hostname entries:

127.0.0.1 localhost
::1 localhost

I found that if I commented out the IPV6 hostname (the second line), things sped up dramatically. PHP bug #45150, which has since been marked "bogus," helped point me in the right direction to understanding the root cause. A comment in that bug pointed me to an article describing MySQL connection problems with PHP 5.3. The article dealt with the failure to connect, which happily wasn't my problem, but it provided one useful nugget: namely that the MySQL driver is partially responsible for determining which protocol to use. Using this information in my search, I found a helpful comment in MySQL bug #6348:

The driver will now loop through all possible IP addresses for a given host, accepting the first one that works.

So, long story short, it seems as though the PHP MySQL driver searches for the appropriate protocol to use every time (it's amazing that this doesn't get cached). Apparently, Windows Server 2008 uses IPV6 routing by default, even though the IPV4 entry appears first in the hosts file. So, either the initial IPV6 lookup fails and it then tries the IPV4 entry, or the IPV6 route invokes additional overhead; in either case, we get an additional delay.

The easiest solution, therefore, is to continue using 127.0.0.1 as the connection address for the database server. Disabling IPV6, while a potential solution, isn't very elegant and it doesn't embrace our IPV6 future. Perhaps future MySQL drivers will correct this delay, and it might go away entirely once the world switches to IPV6 for good.

As an additional interesting note, the PHP documentation indicates that a local socket gets used when the MySQL server name is localhost, while the TCP/IP protocol gets used in all other cases. But this is only true in *NIX environments. In Windows, TCP/IP gets used regardless of your connection method (unless you have previously enabled named pipes, in which case it will use that instead).

It's incredible to me that in 2011, programming languages still have problems with files larger than 2GB in size. We've had files that size for years, and yet overflow problems in this arena still persist. At work, I ran into this problem trying to get the file size of very large files (between 3 and 4 GB in size). The typical filesize() call, as shown below, would return an overflowed result on a very large file:

$size = filesize($someLargeFile);

Because PHP uses signed 32-bit integers to represent some file function return types, and because a 64-bit version of PHP is not officially available, you have to resort to farming the job out to the OS. In Windows, the most elegant way I've found so far is to use a COM object:

$fsobj = new COM("Scripting.FileSystemObject");
$f = $fsobj->GetFile($file);
$size = $file->Size;

Uglier hacks involve capturing the output of the dir command from the command line. There are two bug reports filed on this very issue: 27792 and 34750. The newest of these was filed in late 2005; a little more than 5 years ago! It's sad to see a language as prolific as PHP struggling with a problem so basic. Perhaps this issue will finally get fixed in PHP 6.

It's been quite a while since my last programming tips grab bag article, and it's high time for another. As promised, I'm discussing PHP this time around. Although simple, each of these tips is geared towards writing cleaner code, which is always a good thing.

1. Use Helper Functions to Get Incoming Data

Data is typically passed to a given web page through either GET or POST requests. To make things easy, PHP give us two superglobal arrays for each of these request types: $_GET and $_POST, respectively. I prefer to use helper functions to poke around in these superglobal arrays; it results in cleaner looking code. Here are the helper functions I typically use:

// Helper function for getting $_GET data
function getGet($key)
{
    if(isset($_GET[$key]))
    {
        if(is_array($_GET[$key]))
            return $_GET[$key];
        else
            return (trim($_GET[$key]));
    }
    else
        return null;
}

// Helper function for getting $_POST data
function getPost($key)
{
    if(isset($_POST[$key]))
    {
        if(is_array($_POST[$key]))
            return $_POST[$key];
        else
            return (trim($_POST[$key]));
    }
    else
        return null;
}

Calling these functions is super simple:

$someValue = getGet('some_value');

If the some_value parameter is set, the variable will get the appropriate value. If it's not set, the variable gets assigned null. So, all that's needed after calling getGet or getPost, is a test to make sure the variable is non-null:

if(! is_null($someValue))
{
    // ... do something
}

Note that these functions also handle the case where the incoming data may be an array (useful when processing lots of similar data fields at once). If the data is simply a scalar value, I run it through the trim function to make sure there's no stray whitespace on either side of the incoming value.

2. Write Your Own SQL Sanitizer

The first and most important rule when accepting data from a user is: never trust the user, even if that user is you! When incoming data is going to be put into a database, you need to sanitize the input to avoid SQL injection attacks. Like the superglobal arrays above, I like using a helper function for this task:

function dbSafe($string)
{
    global $db; // MySQLi extension instance
    return "'" . $db->escape_string($string) . "'";
}

In this example, I'm making use of the MySQLi extension. The $db variable is an instance of this extension, which gets created in another file. Here's an example of creating that instance, minus all the error checking (which you should do); the constants used as parameters should be self explanatory, and are defined elsewhere in my code:

$db = new mysqli(DB_HOST, DB_USER, DB_PASSWORD, DB_NAME);

Back to our dbSafe function, all I do is create a string value: a single quote, followed by the escaped version of the incoming data, followed by another single quote. Let's assume that my test data is the following:

$string = dbSafe("Isn't this the greatest?");

The resulting value of $string becomes 'Isn\'t this the greatest?'. Nice and clean for insertion into a database! Again, this helper makes writing code faster and cleaner.

3. Make a Simple Output Sanitizer

If you work with an application that displays user-generated content (and after all, isn't that what PHP is for?), you have to deal with cross-site scripting (XSS) attacks as well. All such data that is to be rendered to the screen must be sanitized. The htmlentities and htmlspecialchars functions provide us with the capability to encode HTML entities, thus making our output safe. I prefer using the latter, since it's a little safer when working with UTF-8 encoded data (see my article Unicode and the Web: Part 1 for more on that topic). As before, I wrap the call to this function in a helper to save me some typing:

function safeString($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8', FALSE);
}

Everything here should be self explanatory (see the htmlspecialchars manual entry for explanations on the parameters to that function). I make sure to use this any time I display user-generated content; even content that I myself generate! Not only is it important from an XSS point of view, but it helps keep your HTML validation compliant.

4. Use Alternate Conditional Syntax for Cleaner Code

Displaying HTML based on a certain condition is incredibly handy when working with any web application. I used to write this kind of code like this:

<?php
if($someCondition)
{
    echo "\t<div class=\"myclass\">Some element to insert</div>\n";
}
else
{
    echo "\t<div class=\"myclass\"></div>\n"; // Empty element
}
?>

Not only do the backslashed double quotes look bad, the whole thing is generally messy. Instead, I now make use of PHP's alternative syntax for control structures. Using this alternative syntax, the above code is modified to become:

<?php if($someCondition): ?>
    <div class="myclass">Some element to insert</div>
<?php else: ?>
    <div></div>
<?php endif; ?>

Isn't that better? The second form is much easier to read, arguably making things much easier to maintain down the road. And no more backslashes!

A PHP Include Pitfall

Feb 22, 2009

I ran into an interesting problem with the PHP include mechanism last night (specifically, with the require_once variant, but this discussion applies to all of the include-style functions). Suppose I have the following folder structure in my web application:

myapp/
 |-- includes.php
 +-- admin/
      |-- admin_includes.php
      +-- ajax/
           +-- my_ajax.php

Let's take a look at the individual PHP files in reverse order. These examples are bare bones, but will illustrate the problem. First, my_ajax.php:

// my_ajax.php
<?php
require_once("../admin_includes.php");

some_generic_function();
?>

Here's the code for admin_includes.php:

// admin_includes.php
<?php
require_once("../includes.php");
?>

And finally, includes.php:

// includes.php
<?php
function some_generic_function()
{
    // Do something here
}
?>

When I go to access the my_ajax.php file, I'll get a "no such file or directory" PHP error. This immediately doesn't make much sense, but a quick glance at the PHP manual clears things up:

Files for including are first looked for in each include_path entry relative to the current working directory, and then in the directory of the current script. If the file name begins with ./ or ../, it is looked for only in the current working directory.

The important part is in that last sentence: if your include or require statement starts with a ./ or ../, PHP will only look in the current working directory. So, in our example above, our working directory when accessing the AJAX script is "/myapp/admin/ajax." The require_once within the admin_functions.php file will therefore fail, since there's no '../includes.php' in the current working directory.

This is surprising behavior and should be kept in mind when chaining includes. A simple workaround is to use the following code in your include statements:

require_once(dirname(__FILE__) . "../../some/relative/path.php");

It's not the most elegant solution in the world, but it gets around this PHP annoyance.

In my previous article on Unicode, I discussed a little bit of background on Unicode, how to prep PHP to serve UTF-8 encoded content, and how to handle displaying Unicode characters. There's still a bit more we need to talk about, however, before we can truly claim internationalization support.

Prepping MySQL for Unicode

MySQL allows you to specify a character encoding at four different levels: server, database, table, and column. This flexibility becomes quite useful when working on a shared host (like I do at DreamHost). In my particular case, I do not have control over either the server or database setting (and both are unfortunately set to latin1). As a result, I set my desired character encoding at the table level.

To see what your current system and database settings are, issue the following SQL commands at the MySQL command prompt:

SHOW VARIABLES LIKE 'character_set_system';
SHOW VARIABLES LIKE 'character_set_database';

To see what character set a table is using, issue the following command:

SHOW CREATE TABLE myTable;

If you are fortunate enough to have control over the database-level character set, you can set it using the following command:

(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8;

The table-specific commands are similar:

(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8;

Column level character encoding can be specified when creating a table or by altering the desired column:

CREATE TABLE MyTable ( column1 TEXT CHARACTER SET utf8 );
ALTER TABLE MyTable MODIFY column1 TEXT CHARACTER SET utf8;

I personally recommend setting the character encoding as high up as you have the capability to. That way, you won't have to remember to set it on any new tables or columns (or even databases).

If you have existing tables that do not use the utf8 character encoding, you can convert them with a simple command:

ALTER TABLE ... CONVERT TO CHARACTER SET utf8;

Be very careful when attempting to convert your data. The convert command assumes that the existing data is encoded as latin1. Any Unicode characters that already exist will become corrupted during the conversion process. There are some ways to get around this limitation, which may be helpful if you've already got some Unicode data stored in your database.

Communicating with MySQL

Once our tables are ready to accept Unicode data, we need to make some minor changes in the way we connect our application to the database. Essentially, we will be specifying the character encoding that our connection should use. This call needs to be made very early in the order of operations. I personally make this call immediately after creating my database connection. There are several ways we can set the character encoding, depending on the version of PHP and the programming paradigms in use. The first method involves a call to the mysql_query() function:

mysql_query("SET NAMES 'utf8'");

An alternative to this in PHP version 5.2 or later involves a call to the mysql_set_charset() function:

mysql_set_charset('utf8',$conn);

And yet another alternative, if you're using the MySQL Improved extension, comes via the set_charset() function. Here's an example from my code:

// Change the character set to UTF-8 (have to do it early)
if(! $db->set_charset("utf8"))
{
    printf("Error loading character set utf8: %s\n", $db->error);
}

Once you have specified the character encoding for your database connection, your database queries (both setting and retrieving data) will be able to handle international characters.

Accepting Unicode Input

The final hurdle in adding internationalization support to our web application is accepting unicode input from the user. This is pretty easy to do, thanks to the accept-charset attribute on the form element:

<form accept-charset="utf8" ... >

Explicitly setting the character encoding on each form that can accept extended characters from your users will solve all kinds of potential problems (see the "Form submission and i18n" link in the Resources section below for much more on this topic).

Potential Pitfalls

Since PHP (prior to version 6) considers a character just one byte long, there are some potential coding problems that you might run into in your application:

Checking String Length

Using the strlen function to check the length of a given string can cause problems with strings containing international characters. For example, a string comprising 10 characters of a double-byte alphabet would return a length of 20. This might cause problems if you are expecting the string to be no longer than 10 characters. Thankfully, there's an elegant hack that we can use to get around this:

function utf8_strlen($string) {
    return strlen(utf8_decode($string));
}

The utf8_decode function will turn anything outside of the standard ISO-8859-1 encoding into a question mark, which gets counted as a single character in the strlen function (which is exactly what we wanted). Pretty slick!

Case Conversions

Forcing a particular case for string comparisons can be problematic with international character sets. In some languages, case has no meaning. So there's not a whole lot that one can do short of creating a lookup table. One example of such a lookup table comes from the mbstring extension. The Dokuwiki project implemented this solution in their conversion to UTF-8.

Using Regular Expressions

The Perl-Compatible Regular Expression (PCRE) functions in PHP support the UTF-8 encoding, through use of the /u pattern modifier. If you are making use of regular expressions in your application, you'll definitely want to look into this modifier.

Additional Resources

In learning about how to add internationalization support to web applications, I gathered a number of excellent resources that I highly recommend bookmarking. Without further ado, here's the list I've created:

Dustin and his wife recently uncovered an interesting limitation of my Monkey Album software: characters outside of the ISO-8859-1 (Latin 1) character set don't render properly. This comes as no surprise, seeing as I didn't design for Unicode. Being a rather egregious display error, I decided to set out and fix the problem. In the process, I learned quite a lot about Unicode, and how it affects web applications. This post will be the first of two detailing how to add Unicode support to a web application. I will only be exposing a tip of the Unicode iceberg in these posts. The ideas and practices behind Unicode support can (and do) fill the pages of many books. That said, let's jump in.

Brief Background

For the uninitiated, Unicode is a coded character set. That is, it maps a unique scalar value (a code point) to each character in a character set. ASCII is another example of a coded character set. Each character in a coded character set is intended to be encoded using a character encoding scheme. ISO-8859-1 is an example of a character encoding scheme.

It is important to note that ISO-8859-1 is the default encoding for documents on the web served via HTTP with a MIME type beginning with "text/". So, if you're not set up to specifically serve another encoding, your web pages are most likely using ISO-8859-1. This works just fine if you speak English or a subset of European languages. But because the ISO-8859-1 character encoding uses only 8 bits for its encoding scheme, it is limited to 256 possible characters. It turns out that 256 characters isn't enough for international text representation (the Chinese and Japanese languages come to mind). What can we do?

Thankfully, we have a solution in Unicode. A number of Unicode encoding schemes are available for us to use: UTF-7, UTF-8, UTF-16, and UTF-32. Each has its merits and detractors, but it turns out that UTF-8 is the preferred encoding of choice in the computing world (it's a nice trade off between space allocation and capability). As a bonus, UTF-8 works nicely with ASCII, which makes migrating English-based websites much easier.

Unfortunately, we have another major problem to deal with. All PHP releases (prior to the upcoming PHP 6) internally represent a character with 8-bits. That's right: PHP has no native support for international characters (yet)! This means that we have to be extra vigilant in our pursuit of internationalization support. So how do we do it?

Prepping Our PHP for Unicode

In order for our PHP application to properly display Unicode characters, we need to do some preparatory work. This involves setting the appropriate character encoding in a few places. We'll first set the encoding in the header:

header('Content-Type: text/html; charset=UTF-8');

Remember that the header() function must be called before we output any HTML, so it needs to appear early in the chain of events. Note also that the header call incorrectly labels the encoding as a 'charset,' making the naming conventions even more confusing.

We can also specify the encoding through the use of a meta tag (I recommend setting this even if you set the header):

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

If you take this route, make sure this tag is placed near the top of the <head> element in your HTML (before your <title> element, in fact). Otherwise, the browser may select an incorrect encoding.

To verify that that the appropriate encoding is being used, you can use the View Page Info feature in Firefox (just right click the page and you'll see it in the context menu). Here's an example:

Page Info Dialog in Firefox Showing UTF-8 Encoding
Displaying Unicode Text

One of the primary functions that PHP provides to convert characters into their HTML entity equivalents is the aptly named htmlentities() function. However, since we're converting our application to support UTF-8, we don't need to make use of this function. Why is this? First, HTML entities are generally only understood by web browsers. By converting special characters into HTML entity equivalents, it becomes much harder to move data between the web application and other data sources (RSS feeds, for example). Second, and most importantly, UTF-8 allows us to display extended characters directly. To quote Harry Fuecks [PDF], with UTF-8 "we don't need no stinkin' entities." Instead, we should only worry about the "special five":

  • Ampersand (&)
  • Double Quote (")
  • Single Quote (')
  • Less Than (<)
  • Greater Than (>)

Thankfully, PHP gives us the htmlspecialchars() function to handle these five special characters. One very important thing to note is that this function allows you to specify the character encoding to use when parsing the supplied text. For example:

htmlspecialchars($incomingString, ENT_QUOTES, "UTF-8");

Specifying the character encoding is very important when using this function! Otherwise, you open yourself up to to a rather nasty cross-site-scripting vulnerability, something that even Google was susceptible to a while back. In short, the character encoding specified in your htmlspecialchars() call should match the encoding being served by the page.

What Next?

In the next article, I'll cover the following topics:

  • Prepping MySQL databases for Unicode
  • Accepting Unicode characters from the user
  • Potential PHP pitfalls
  • Useful resources (loads of helpful links)

As always, if you have suggestions or questions, feel free to post them.