An article entitled A Brief, Incomplete, and Mostly Wrong History of Programming Languages offers a very humorous glimpse into the world of programming. My absolute favorite snippet from the article:

1987 - Larry Wall falls asleep and hits Larry Wall's forehead on the keyboard. Upon waking Larry Wall decides that the string of characters on Larry Wall's monitor isn't random but an example program in a programming language that God wants His prophet, Larry Wall, to design. Perl is born.

It's funny because it's true. (Hat tip to Dustin for the pointer to this article.)

One of my Perl scripts here at work used the Add_Delta_Days subroutine from the Date::Calc module to do some calendar date arithmetic. I'm in the process of building a new machine on which this script will run, and I don't have access to an external network. Unfortunately, the install process for Date::Calc is fairly difficult. The module relies on a C library which must be compiled with the same compiler as was used to build the local Perl install. To make matters worse, the modules that Date::Calc is dependent on have similar requirements. As a result, I decided to skip installing this non-standard module, and instead use a home-brew replacement. It turns out that Add_Delta_Days is fairly straightforward to replace:

use Time::Local; # Standard module

sub addDaysToDate
{
    my ($y, $m, $d, $offset) = @_;

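    # Convert the incoming date to epoch seconds. Noon is used rather than
    # midnight so a daylight saving shift can't push us across a day boundary.
    my $TIME = timelocal(0, 0, 12, $d, $m-1, $y-1900);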

    # Convert the offset from days to seconds and add
    # to our epoch seconds value
    $TIME += 60 * 60 * 24 * $offset;

    # Convert the epoch seconds back to a legal 'calendar date'
    # and return the date pieces
    my @values = localtime($TIME);
    return ($values[5] + 1900, $values[4] + 1, $values[3]);
}

You call this subroutine like this:

my $year = 2009;
my $month = 4;
my $day = 22;

my ($nYear, $nMonth, $nDay) = addDaysToDate($year, $month, $day, 30);

This subroutine obviously isn't a one-to-one replacement. Unlike Date::Calc, my home-brew subroutine suffers from the Year 2038 problem (at least on systems where epoch seconds are stored in 32 bits). It likewise can't go very far back in time (I'm bound to the range of dates that can be represented around the epoch). However, this workaround saves me a bunch of setup time, and works just as well for the dates my script deals with.
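
For comparison, here's a minimal sketch of the same calculation done with pure calendar arithmetic, which never touches epoch seconds and so sidesteps both limitations. It's written in Python rather than Perl purely for illustration, and the add_days_to_date name is my own invention, not something from Date::Calc:

```python
from datetime import date, timedelta


def add_days_to_date(y, m, d, offset):
    """Return (year, month, day) for the given date shifted by offset days."""
    shifted = date(y, m, d) + timedelta(days=offset)
    return (shifted.year, shifted.month, shifted.day)


# Same inputs as the Perl example above: 30 days after April 22, 2009
print(add_days_to_date(2009, 4, 22, 30))  # (2009, 5, 22)
```

Negative offsets work the same way, so one routine covers both directions.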

It's been quite a while since my last programming tips grab bag article, and it's high time for another. As promised, I'm discussing PHP this time around. Although simple, each of these tips is geared towards writing cleaner code, which is always a good thing.

1. Use Helper Functions to Get Incoming Data

Data is typically passed to a given web page through either GET or POST requests. To make things easy, PHP gives us a superglobal array for each of these request types: $_GET and $_POST, respectively. I prefer to use helper functions to poke around in these superglobal arrays; it results in cleaner looking code. Here are the helper functions I typically use:

// Helper function for getting $_GET data
function getGet($key)
{
    if(isset($_GET[$key]))
    {
        if(is_array($_GET[$key]))
            return $_GET[$key];
        else
            return (trim($_GET[$key]));
    }
    else
        return null;
}

// Helper function for getting $_POST data
function getPost($key)
{
    if(isset($_POST[$key]))
    {
        if(is_array($_POST[$key]))
            return $_POST[$key];
        else
            return (trim($_POST[$key]));
    }
    else
        return null;
}

Calling these functions is super simple:

$someValue = getGet('some_value');

If the some_value parameter is set, the variable will get the appropriate value. If it's not set, the variable gets assigned null. So all that's needed after calling getGet or getPost is a test to make sure the variable is non-null:

if(! is_null($someValue))
{
    // ... do something
}

Note that these functions also handle the case where the incoming data may be an array (useful when processing lots of similar data fields at once). If the data is simply a scalar value, I run it through the trim function to make sure there's no stray whitespace on either side of the incoming value.
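
The two helpers differ only in which array they read, so the same logic could be folded into one routine that takes the array as a parameter. Here's a rough sketch of that idea, in Python for brevity; get_param and the request_data dictionary are hypothetical stand-ins for the PHP superglobals, not part of my actual code:

```python
def get_param(source, key):
    """Return the trimmed value for key, the raw list for multi-value fields, or None."""
    if key not in source:
        return None
    value = source[key]
    if isinstance(value, list):
        return value          # multi-value field: hand it back untouched
    return value.strip()      # scalar: remove stray whitespace on either side


# Stand-in for incoming request data
request_data = {"some_value": "  hello  ", "ids": ["1", "2"]}

some_value = get_param(request_data, "some_value")
if some_value is not None:
    print(some_value)  # hello
```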

2. Write Your Own SQL Sanitizer

The first and most important rule when accepting data from a user is: never trust the user, even if that user is you! When incoming data is going to be put into a database, you need to sanitize the input to avoid SQL injection attacks. Like the superglobal arrays above, I like using a helper function for this task:

function dbSafe($string)
{
    global $db; // MySQLi extension instance
    return "'" . $db->escape_string($string) . "'";
}

In this example, I'm making use of the MySQLi extension. The $db variable is an instance of this extension, which gets created in another file. Here's an example of creating that instance, minus all the error checking (which you should do); the constants used as parameters should be self explanatory, and are defined elsewhere in my code:

$db = new mysqli(DB_HOST, DB_USER, DB_PASSWORD, DB_NAME);

Back to our dbSafe function, all I do is create a string value: a single quote, followed by the escaped version of the incoming data, followed by another single quote. Let's assume that my test data is the following:

$string = dbSafe("Isn't this the greatest?");

The resulting value of $string becomes 'Isn\'t this the greatest?'. Nice and clean for insertion into a database! Again, this helper makes writing code faster and cleaner.
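
It's worth noting that escaping isn't the only way to keep user data from being interpreted as SQL. Bound parameters are a different technique from dbSafe: the value travels separately from the query text, so no quoting is needed at all. A minimal sketch, assuming Python's built-in sqlite3 module and a hypothetical comments table (neither is part of my PHP code):

```python
import sqlite3

# In-memory database with a throwaway table, purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

text = "Isn't this the greatest?"

# The ? placeholder is filled in by the driver; the apostrophe needs no escaping
conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))

stored = conn.execute("SELECT body FROM comments").fetchone()[0]
print(stored)  # Isn't this the greatest?
```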

3. Make a Simple Output Sanitizer

If you work with an application that displays user-generated content (and after all, isn't that what PHP is for?), you have to deal with cross-site scripting (XSS) attacks as well. All such data that is to be rendered to the screen must be sanitized. The htmlentities and htmlspecialchars functions provide us with the capability to encode HTML entities, thus making our output safe. I prefer using the latter, since it's a little safer when working with UTF-8 encoded data (see my article Unicode and the Web: Part 1 for more on that topic). As before, I wrap the call to this function in a helper to save me some typing:

function safeString($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8', FALSE);
}

Everything here should be self explanatory (see the htmlspecialchars manual entry for explanations on the parameters to that function). I make sure to use this any time I display user-generated content; even content that I myself generate! Not only is it important from an XSS point of view, but it helps keep your HTML validation compliant.
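
The same wrapper idea carries over to other languages. As a sketch only, here's roughly what it might look like in Python using the standard html module. One difference to be aware of: html.escape always encodes ampersands, so it doesn't mirror the "don't double encode" behavior requested by the FALSE parameter above. The safe_string name is just my own helper name carried over:

```python
import html


def safe_string(text):
    """Encode <, >, &, and both quote characters for safe output in HTML."""
    return html.escape(text, quote=True)


print(safe_string('<a href="x">Tom & Jerry</a>'))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```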

4. Use Alternate Conditional Syntax for Cleaner Code

Displaying HTML based on a certain condition is incredibly handy when working with any web application. I used to write this kind of code like this:

<?php
if($someCondition)
{
    echo "\t<div class=\"myclass\">Some element to insert</div>\n";
}
else
{
    echo "\t<div class=\"myclass\"></div>\n"; // Empty element
}
?>

Not only do the backslashed double quotes look bad, the whole thing is generally messy. Instead, I now make use of PHP's alternative syntax for control structures. Using this alternative syntax, the above code is modified to become:

<?php if($someCondition): ?>
    <div class="myclass">Some element to insert</div>
<?php else: ?>
    <div class="myclass"></div>
<?php endif; ?>

Isn't that better? The second form is much easier to read, arguably making things much easier to maintain down the road. And no more backslashes!

Back in the spring of 2005, after graduating from college, I went looking for a job. I got the chance to interview with Microsoft, though I'm not sure what I would have ended up doing had I gotten the job (they never really told me). My interview was conducted entirely over the phone, and consisted of the typical "brain teaser" type questions that Microsoft is famous for. Needless to say, I performed very poorly and was instantly rejected. The guy on the phone said he'd let me know and, 10 minutes later via email, I knew.

One of the questions they asked me stumped me beyond belief, and I butchered my answer terribly. Not only was I embarrassed for myself, I was embarrassed for the interviewer, who had to patiently listen to me. Anyway, here's a retelling of the question I was asked:

Given a large NxN tic-tac-toe board (instead of the regular 3x3 board), design a function to determine whether any player is winning in the current round, given the current board state.

I realize now that I misinterpreted the question horribly. The interviewer stated the question quite differently than I have it written above; I believe he used something along the lines of "given a tic-tac-toe board of N dimensions ..." I assumed that the bit about dimensionality meant delving into the realm of 3 or more physical dimensions; essentially something like 3-D tic-tac-toe. Obviously, solving such a problem is much more difficult than solving on an NxN 2-D board.

Tonight, for whatever reason, I recalled this question and the fact that I never found an answer for myself. Happily, I subsequently stumbled upon someone else's answer (see question 4), which is quite clever. It's good to finally resolve this problem.
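
For the curious, here's a straightforward sketch of one way to approach the 2-D version. This is not necessarily the clever answer I linked to; it simply checks every row, every column, and both diagonals. It's in Python, and it assumes a win means filling an entire line of N cells, with empty cells stored as None (both of those are my assumptions, not part of the interview question):

```python
def winner(board):
    """Return the mark that fills a complete row, column, or diagonal, else None."""
    n = len(board)
    lines = []
    lines.extend(board)                                                 # rows
    lines.extend([[board[r][c] for r in range(n)] for c in range(n)])   # columns
    lines.append([board[i][i] for i in range(n)])                       # main diagonal
    lines.append([board[i][n - 1 - i] for i in range(n)])               # anti-diagonal

    for line in lines:
        first = line[0]
        if first is not None and all(cell == first for cell in line):
            return first
    return None


board = [
    ["X", "O", None],
    ["O", "X", None],
    [None, "O", "X"],
]
print(winner(board))  # X
```

This looks at each cell a constant number of times, so the work grows with the number of cells on the board.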

I know interviewing candidates for a job can be tricky, but asking these kinds of questions is silly. Does someone's ability to answer this kind of question really prove they are a better programmer than someone who can't? In the end, I'm eternally glad I didn't get hired by Microsoft; I now realize they are one of the companies I would least like to work for. My current employer seemed much more concerned with real-world problems, my previous employment experience, and the (increasingly rare) ability to program in C++. For that, I am oh-so-grateful.

When I added the favorite photos feature to my photo album software, I wanted a way to randomly show a subset of said favorites on the albums display page. I initially thought about implementing my own means of doing this through PHP. Ultimately, I wanted random selection without replacement, so that viewers would not see multiple copies of the same image in the 'Favorites Preview' section. Thankfully, MySQL saved the day!

When sorting a MySQL query, you can opt to sort randomly:

SELECT {some columns} FROM {some tables}
WHERE {some condition} ORDER BY rand()

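The rand() function in MySQL essentially gives you random selection without replacement for free: each row appears in the shuffled result exactly once, so adding a LIMIT clause yields a set of distinct rows. How great is that? It was an easy solution to a not-so-simple problem, and saved me a lot of programming time.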

Update: I have since learned that the ORDER BY rand() call is horribly inefficient for large data sets. As such, it should ideally be avoided. There's a great article describing ways to work around these performance limitations.
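
I won't try to summarize that article here, but one simple option (when the list of favorite IDs is small enough to fetch in one go) is to do the random pick in application code instead. Here's a sketch of that idea in Python; pick_preview and the sample IDs are hypothetical, and my actual album software is written in PHP:

```python
import random


def pick_preview(favorite_ids, count):
    """Pick up to count distinct IDs at random (selection without replacement)."""
    count = min(count, len(favorite_ids))
    return random.sample(favorite_ids, count)


favorite_ids = [3, 8, 15, 16, 23, 42]   # pretend these came from the database
preview = pick_preview(favorite_ids, 4)
print(preview)
```

The chosen IDs can then be fetched with an ordinary WHERE ... IN (...) query, which avoids sorting the whole table randomly.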

I ran into a weird problem in one of our build scripts at work today. We compile our tools across a number of platforms and architectures, and I ran across this issue on one of our oldest boxes, running RedHat 9. Here's the horrible error that I got when linking:

/usr/bin/ld: myFile.so: undefined versioned symbol name std::basic_string<char, std::char_traits<char>, std::allocator<char> >& std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace_safe<char const*>(__gnu_cxx::__normal_iterator<char*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char const*, char const*)@@GLIBCPP_3.2
/usr/bin/ld: failed to set dynamic section sizes: Bad value

It seems as if the standard C++ libraries on this system were compiled with gcc 3.2, while the version we're using to build our tools is 3.2.3. Unfortunately, the 3.2 compiler isn't installed on the system, and I'm not sure where we would find it for RH9 anyway. Thankfully, I found a workaround for this problem. Our link step originally looked like this:

gcc -shared -fPIC -lstdc++ -lrt -lpthread -o myFile.so {list_of_object_files}

I found out that by moving the standard libraries to the end of the line, the problem disappeared. Here's the new link step:

gcc -shared -fPIC -o myFile.so {list_of_object_files} -lstdc++ -lrt -lpthread

I don't fully understand why ordering should matter during the link step, but by putting the standard libraries last, we were able to get rid of this error. If you understand the root cause of this, please leave a comment explaining. I'd love to know more about why changing the order makes a difference.

A PHP Include Pitfall

Feb 22, 2009

I ran into an interesting problem with the PHP include mechanism last night (specifically, with the require_once variant, but this discussion applies to all of the include-style functions). Suppose I have the following folder structure in my web application:

myapp/
 |-- includes.php
 +-- admin/
      |-- admin_includes.php
      +-- ajax/
           +-- my_ajax.php

Let's take a look at the individual PHP files in reverse order. These examples are bare bones, but will illustrate the problem. First, my_ajax.php:

<?php
// my_ajax.php
require_once("../admin_includes.php");

some_generic_function();
?>

Here's the code for admin_includes.php:

<?php
// admin_includes.php
require_once("../includes.php");
?>

And finally, includes.php:

<?php
// includes.php
function some_generic_function()
{
    // Do something here
}
?>

When I go to access the my_ajax.php file, I'll get a "no such file or directory" PHP error. At first glance this doesn't make much sense, but a quick look at the PHP manual clears things up:

Files for including are first looked for in each include_path entry relative to the current working directory, and then in the directory of the current script. If the file name begins with ./ or ../, it is looked for only in the current working directory.

The important part is in that last sentence: if your include or require statement starts with a ./ or ../, PHP will only look in the current working directory. So, in our example above, our working directory when accessing the AJAX script is "/myapp/admin/ajax". The require_once within the admin_includes.php file will therefore fail: relative to that working directory, '../includes.php' points at /myapp/admin/includes.php, which doesn't exist.

This is surprising behavior and should be kept in mind when chaining includes. A simple workaround is to use the following code in your include statements:

require_once(dirname(__FILE__) . "/../../some/relative/path.php");

It's not the most elegant solution in the world, but it gets around this PHP annoyance.
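
To make the difference concrete, here's a small sketch that resolves the same '../includes.php' path both ways, using the folder layout from the example above. It's plain path arithmetic in Python (no PHP involved), and both helper names are hypothetical:

```python
import posixpath


def resolve_from_cwd(cwd, relative_path):
    """Resolve a relative path against the working directory."""
    return posixpath.normpath(posixpath.join(cwd, relative_path))


def resolve_from_file(current_file, relative_path):
    """Resolve a relative path against the directory of the including file."""
    base = posixpath.dirname(current_file)
    return posixpath.normpath(posixpath.join(base, relative_path))


# Relative to the working directory of the AJAX request: the wrong place
print(resolve_from_cwd("/myapp/admin/ajax", "../includes.php"))
# /myapp/admin/includes.php

# Relative to the file doing the including: what I actually wanted
print(resolve_from_file("/myapp/admin/admin_includes.php", "../includes.php"))
# /myapp/includes.php
```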

I ran into a strange problem with a Perl CGI script yesterday. Upon script execution, I received the following error message from IIS:

CGI Error
The specified CGI application misbehaved by not returning a complete set of HTTP headers.

A quick Google search of this error message turned up a number of discussions mentioning bugs in IIS, server configuration problems, etc. However, I suspected that my scripts were to blame (I had been hacking on them on Friday). But how could I determine whether I was at fault or if the server was to blame? Thankfully, the solution comes through one of the Perl CGI modules (here's the Perl tip):

use CGI::Carp qw(fatalsToBrowser warningsToBrowser);

The Carp module (and where does that name come from?) gives us the fatalsToBrowser and warningsToBrowser subroutines. When included in your script, any resulting Perl execution errors will be output into the browser window (very handy). After turning on these features, I immediately found my error. It resided in this line (here's the gotcha):

$safeProductName =~ s/\$/\\$/g;

It was my intent to replace any instances of the dollar sign character ($) with a backslash-dollar sign pair (\$). At first glance, this substitution rule may look alright. But it's not! The replacement portion of a substitution is treated as a double quoted string. So, the interpreter was escaping the backslash just fine, but then hits a naked dollar sign, indicating a variable (of which I didn't provide a name). And so it chokes! The line should have read:

$safeProductName =~ s/\$/\\\$/g;

Note the three backslashes in the replacement string. Two to print an actual backslash character, and one to print the actual dollar sign. Subtle? You bet.
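
Backslash counting in replacement strings trips people up in other languages too, though the rules differ. As a point of comparison, here's the same substitution sketched in Python, where the dollar sign has no special meaning in the replacement but the backslash still does. The escape_dollars name is hypothetical:

```python
import re


def escape_dollars(text):
    # Pattern: a literal dollar sign.
    # Replacement: the two characters \\ in the raw string collapse to a
    # single backslash, followed by a plain dollar sign.
    return re.sub(r"\$", r"\\$", text)


print(escape_dollars("Widget $5"))  # Widget \$5
```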

Visual Studio 2005 introduced support for doing parallel builds in solutions that contain more than one project. This is a great idea, especially on systems equipped with multi-core processors. Unfortunately, the developers at Microsoft apparently don't know how to program a multi-threaded application.

Suppose we're building two projects within one solution, call them Project A and Project B. If A and B exist in completely different folders, and are completely independent of one another, the parallel build option is quite handy (improved build performance). However, if projects A and B share any code, any code at all, you run the risk of build failures. It seems as though Visual Studio doesn't lock files appropriately during the build process. So, if each instance of the compiler tries to build the same file at the same time, one of them will fall over and die, complaining that "no class instances were found."

It's shocking to me that something so seemingly simple could be broken in an application of this caliber.

I ran across another weird and subtle bug in Visual Studio 2005. If you've got a solution with many projects in it, you can set one of those projects to be the default project at startup (i.e. when you open the solution file). But this setting apparently resides in the user options file (.suo), which is something we don't keep in our code repository (since it differs for every user). So how can you set a default startup project that affects anyone working with your code? Simple: hack the solution file.

Thankfully, the solution file is just plain text. Apparently, if there's no user options file for a given solution, Visual Studio 2005 simply selects the first project it comes across in the solution file. Here's a quick example of what a solution file looks like (wrapped lines marked with »):

Microsoft Visual Studio Solution File, Format Version 9.00
# Visual Studio 2005
Project("{3853E850-5CD7-11DD-AD8B-0800200C9A66}") = "ProjectA", »
"projecta.vcproj", "{D9BA97DE-0D09-4C35-99D6-CC4C30A6279C}"
EndProject
Project("{3853E850-5CD7-11DD-AD8B-0800200C9A66}") = "ProjectB", »
"projectb.vcproj", "{E1D73B44-57D9-4202-A92A-0296E3583AC4}"
EndProject
Global
{ ... a bunch of junk goes here ... }
EndGlobal

In this case, Project A will be the default startup project. To make Project B the default, simply move its associated lines above Project A in the file, like so:

Microsoft Visual Studio Solution File, Format Version 9.00
# Visual Studio 2005
Project("{3853E850-5CD7-11DD-AD8B-0800200C9A66}") = "ProjectB", »
"projectb.vcproj", "{E1D73B44-57D9-4202-A92A-0296E3583AC4}"
EndProject
Project("{3853E850-5CD7-11DD-AD8B-0800200C9A66}") = "ProjectA", »
"projecta.vcproj", "{D9BA97DE-0D09-4C35-99D6-CC4C30A6279C}"
EndProject
Global
{ ... a bunch of junk goes here ... }
EndGlobal

Don't forget to grab the end tags of each project (and any child content that may live between them).

Recently at work, I spent a fair amount of time debugging some strange run-time errors in one of our test tools (after having ported it from Visual Studio 2003 to VS 2005). When starting up a debug build of the tool, I would get the following error message:

An application has made an attempt to load the C runtime library incorrectly. Please contact the application's support team for more information.

This error message turned out to be a red herring, though it pointed me in the direction of the actual culprit: a circular dependency chain of debug and release versions of various Microsoft DLLs. In trying to figure out what was going wrong, I ran across an incredibly helpful article on troubleshooting these kinds of issues. The author presents seven different scenarios that can arise with executables built in Visual Studio 2005, along with solutions for each one. It's a great resource to have if you run into these kinds of problems.

While working on a Windows batch script earlier today, I ran across an interesting side effect of the call and exit commands. Let's take this simple example, which we'll name script_a.bat:

@echo off
SETLOCAL

call :function
cd %SOME_PATH%

goto :functionEnd
:function
    set foobar=1
    if "%foobar%" == "1" exit /B 1
    goto :EOF
:functionEnd

Unlike Bash, Windows batch files have no function capabilities. Clever hacks like the above can be used to fake out functions, but these hacks hide some subtle quirks. You see that exit call within the 'function'? It only gets called if the %foobar% variable is equal to 1 (which is always the case, in our example). Also note that we exit with an error code of 1. So, in short, this script should always return an exit code of 1. Now, let's create another batch script which we'll name script_b.bat:

@echo off

call script_a.bat
echo Exit Code = %ERRORLEVEL%

This second script is very simple. All we do is call script_a.bat, and then print its resulting return code. What do you expect the return code to be? One would expect it to be 1, but it's not! Our second script will actually print out Exit Code = 0. Why is this?

The answer lies in the call command. Again, unlike Bash scripts, stand-alone batch files do not create their own context when executed. But if you use the call command, the thing you call does get its own context. How weird is that? So, let's trace the first script we wrote to figure out where the error code gets changed.

After some initial setup, we call our function (call :function). Inside our function, we create a variable, initialize it to 1, then test to see if the value is 1. Since the value is indeed 1, the if test succeeds, and the exit command is called. But we don't exit the script; instead, we exit the context that was created when we called our function. Note that immediately after we call our function, we perform a cd operation. This line of code gets executed, succeeds, and sets the %ERRORLEVEL% global to 0.

In order to exit properly, we have to exit our initial script twice, like this:

@echo off
SETLOCAL

call :function
if "%ERRORLEVEL%" == "1" exit /B 1

cd %SOME_PATH%

goto :functionEnd
:function
    set foobar=1
    if "%foobar%" == "1" exit /B 1
    goto :EOF
:functionEnd

See the new exit call after our initial function call? Then, and only then, will our second script print out what we expected. This subtle behavior stymied me for several hours today; hopefully this short post will help someone else avoid this frustration.

I ran into an interesting side-effect with the foreach loop in Perl today. I'm surprised that I haven't hit this before, but it may be a subtle enough issue that it only pops up under the right circumstances. Here's a sample program that we'll use as an example:

#!/usr/bin/perl
use strict;
use warnings;

my @array = ("Test NUM", "Line NUM", "Part NUM");

for (my $i=0; $i < 3; $i++)
{
    foreach (@array)
    {
        s/NUM/$i/;
        print "$_\n";
    }
    print "------\n";
}

What should the output for this little script look like? Here's what I assumed it would be:

Test 0
Line 0
Part 0
------
Test 1
Line 1
Part 1
------
Test 2
Line 2
Part 2
------

But here's the actual output:

Test 0
Line 0
Part 0
------
Test 0
Line 0
Part 0
------
Test 0
Line 0
Part 0
------

So what's going on here? Well, it turns out that the foreach construct doesn't act quite like I thought it did. Let's isolate just that loop:

foreach (@array)
{
    s/NUM/$i/;
    print "$_\n";
}

We simply loop over each element of the array, we do a substitution, and we print the result. Pretty simple. Pay attention to the fact that each element we visit in the loop is stored in Perl's global $_. The point here is that $_ doesn't represent a copy of the array element; it represents the actual array element. So the first pass through the outer loop permanently replaces NUM in every element, and later passes find nothing left to substitute. From the Programming Perl book (which I highly recommend):

foreach VAR (LIST) {
    ...
}
If LIST consists entirely of assignable values (meaning variables, generally, not enumerated constants), you can modify each of those variables by modifying VAR inside the loop. That's because the foreach loop index variable is an implicit alias for each item in the list that you're looping over.

This is an interesting side effect, which can be unwanted in some cases. As a workaround, I simply created a temporary buffer to operate on in my substitution call:

foreach (@array)
{
    my $temp = $_;
    $temp =~ s/NUM/$i/;
    print "$temp\n";
}

An easy fix to a not-so-obvious problem.

A little over a year ago, I inherited a productivity tool at work that allows users to enter weekly status reports for various products in our division. The tool is web-based and is written entirely in Perl. One of the managers who uses this tool recently suggested a new feature, and I decided to implement it using cookies. Having never implemented cookies from a programming perspective, I was new to the subject and had to do some research on how to do it in Perl. It turns out to be quite easy, so I figured I would share my newfound knowledge:

Creating a Cookie

Although there are other ways to do this (as always with Perl), this tutorial will be making use of the CGI module's cookie() routine, which is built on the CGI::Cookie module. It makes creating and reading cookies very easy, which is a good thing. Furthermore, this module ships with virtually all Perl distributions! Here's a chunk of code that creates a cookie:

use CGI qw(:all);

my $cgi = new CGI;
my $cookie = $cgi->cookie(-name => 'my_first_cookie',
                          -value => $someValueToStore,
                          -expires => '+1y',
                          -path => '/');

print $cgi->header(-cookie => $cookie);

I first import all of the CGI modules. This isn't exactly necessary, and it might be a little slower than using the :standard include directive, but I needed a number of sub-modules for the tool I was writing. I then create a new CGI object, and use it to call the cookie() subroutine. This routine takes a number of parameters, but the most important ones are shown.

The -name parameter is simply what you want to name this cookie. You should use something that clearly identifies what the cookie is being used for (though you should always be mindful of the associated security implications). The -value parameter is just that: the value you wish to store in the cookie. I believe cookies have a limit of around 4 KB of storage, so remember to limit what you store. Next up is the -expires parameter, which specifies how far into the future (or past) the cookie should expire. The value of '+1y' that we specified in the example above indicates the cookie should expire in one year's time. Values in the past (specified with a minus sign) simply indicate that the cookie should be expired immediately. Omitting the value will cause the cookie to expire when the user closes their browser. Finally, the -path parameter indicates for what paths on your site the cookie should apply. A value of '/cgi-bin/', for example, will only allow the cookie to work for scripts in the /cgi-bin folder of your site. We specified '/' in our example above, which means the cookie is valid for any path at our site.

Finally we print our CGI header, passing along a -cookie parameter with our cookie variable. As always, the documentation for the CGI module will give you lots more information on what's available.

Reading a Cookie

Reading back the value stored in a cookie is even simpler:

use CGI qw(:all);

my $cgi = new CGI;
my $someValue = $cgi->cookie('my_first_cookie');

Again we create our CGI object, but this time we use it to read our cookie, simply by calling the cookie() routine with the name of the cookie we created before. If the cookie is found, the stored value is read and stored into our variable ($someValue in the example above). If the cookie is not found, undef is returned.
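
Under the hood, all of this boils down to a Set-Cookie header going out and a Cookie header coming back. To show what those look like, here's a sketch that builds and parses them with Python's standard http.cookies module. It's for illustration only and isn't part of the Perl tool; the cookie value abc123 is made up, and I've used Max-Age in place of the '+1y' expiry:

```python
from http.cookies import SimpleCookie

# Build the header the server would send
outgoing = SimpleCookie()
outgoing["my_first_cookie"] = "abc123"
outgoing["my_first_cookie"]["path"] = "/"
outgoing["my_first_cookie"]["max-age"] = 365 * 24 * 60 * 60  # roughly one year
header = outgoing.output()
print(header)

# Parse the header the browser would send back on a later request
incoming = SimpleCookie()
incoming.load("my_first_cookie=abc123")
morsel = incoming.get("my_first_cookie")
some_value = morsel.value if morsel is not None else None
print(some_value)

# A cookie that was never set simply isn't there
missing = incoming.get("no_such_cookie")
```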

One Gotcha

In the tool I was working with, I was handling storing and reading the cookie on the same page. Since we have to create our cookie via the header() call, I was concerned about how to handle the case where we weren't creating a cookie. The solution, it turns out, is pretty simple:

use CGI qw(:all);

my $cgi = new CGI;
unless (param())
{
    print $cgi->header;
}

In this example, we print out a generic CGI header only if no parameters were passed in (i.e. the user didn't push us either a POST or GET). If we do have parameters, we want to create a cookie, and we'll send the header after we have done so. Pretty easy!

Perl 5.10

Feb 11, 2008

I just found out about Perl 5.10, which has been out for some time now (released on December 18 ... how did I miss this?). The perldelta documentation goes into detail on what's new, but here's a brief overview of some of the features I find most appealing:

The 'feature' pragma

First and foremost is the feature pragma, which is used to turn on the new features added by 5.10. By default, the new features are disabled, and you explicitly have to request their support (a great idea, in my opinion). A simple use feature ':5.10'; statement (or just use 5.010;) will do the trick.

New 'Defined-Or' operator

A new // operator is now available, for handling the 'defined-or' case. For example:

$a // $b; # This is equivalent to the line below

defined $a ? $a : $b; # Same meaning as above

This new operator has the same precedence as the logical-or operator. In typical Perl fashion, the new operator is simply a shortcut that makes your scripts shorter and more difficult to read one month after you write them. ;)
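
The reason defined-or earns its own operator is that plain logical-or also rejects values that are defined but false, such as 0 or an empty string. Here's a quick sketch of that distinction, using Python's None as a stand-in for Perl's undef (the defined_or helper is hypothetical):

```python
def defined_or(a, b):
    """Return a unless it is None (the stand-in for undef), else b."""
    return a if a is not None else b


print(defined_or(0, 5))     # 0, because 0 is "defined", so it wins
print(0 or 5)               # 5, because logical-or treats 0 as false
print(defined_or(None, 5))  # 5
```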

Switch statements

At long last, Perl has a switch statement. The syntax here is quite different from other programming languages with which you might be familiar:

given ($state)
{
    when ("state_1") { $a = 1; }
    when (/^abcdef/) { $b = 2; }
    default { $c = 0; }
}

The various when tests allow for some powerful options, including: array slices, string compares, regular expression matches, and beyond.

Named captures in regular expressions

Suppose we want to read in a configuration file that contains lines with the following structure: option = value. Today, we could write a regular expression to capture these values like this: /(\w+) = (\w+)/. We would then access the captured values with $1 and $2.

In Perl 5.10, we could write the same expression like this: /(?<option>\w+) = (?<value>\w+)/. Now, the captured values are accessed through either the %+ or %- magical hashes, using each label as the key into each hash (see the perldelta documentation for the differences between the two hashes). This will make complex regular expressions much easier to decipher, and gets rid of the annoying parenthesis counting that we currently have to do.
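
Named captures aren't unique to Perl; Python has had them for a while, spelled (?P<name>...). Here's a sketch of the configuration-file example in that dialect, to give a feel for how the labels read in practice (parse_line is a hypothetical helper):

```python
import re

# Same "option = value" pattern, with a label on each capture group
LINE = re.compile(r"(?P<option>\w+) = (?P<value>\w+)")


def parse_line(line):
    """Return {'option': ..., 'value': ...} for a matching line, else None."""
    match = LINE.match(line)
    return match.groupdict() if match else None


print(parse_line("color = blue"))  # {'option': 'color', 'value': 'blue'}
```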

Just 'say' it

The new say keyword is just like print, but it automatically adds a newline at the end of what it prints. How great is that? This simplifies printing code a little bit, especially for loops. Instead of print "$_\n" for @items; we can now use say for @items;. Clean and simple!

Stackable file tests

Doing multiple file tests is much easier now. Instead of if (-f $file and -w $file and -z $file) we can now write if (-f -w -z $file). Again, this makes things much cleaner.

Better error messages

Have you ever seen this error message? I know I have:

$str = "Hello $name! Today is $day and the time is $time.\n";

Use of uninitialized value in concatenation (.) or string at test.pl line 3.

In 5.10, this same error message will read:


$str = "Hello $name! Today is $day and the time is $time.\n";

Use of uninitialized value $time in concatenation (.) or string at test.pl line 3.

Now I can know exactly where the error occurred! Finally!

And lots more

There are plenty of other new features that I haven't touched here: recursive regular expressions, a new smart matching operator, state ("static") variables, inside-out objects, and lots more. I'm really looking forward to trying out some of these new features.

It's time once again for a programming tips grab bag. As with the previous grab bag, I'll focus on Perl tips since I've been doing some Perl coding recently. Next time, I'll present some tips for PHP.

1. Always use the 'strict' and 'warnings' pragmas for production code

This tip is pretty much a no-brainer. Whenever you write production level code, you must make use of the 'strict' pragma (enabled with 'use strict;'). Not only will it save you from a lot of pain in the long run, but it also forces you to write cleaner code. You should also enable warnings, just for good measure. And don't do this at the end of your development cycle; do it right from the beginning. Always start scripts that you think will be used by others with the following two lines:

#!/usr/bin/perl
use strict;
use warnings;

I can't tell you how many times turning on strict checking has saved me from some goofy problems (such as using square brackets instead of curly braces for a hash reference).

2. Use 'our' to fake global variables

Global variables are generally considered to be bad practice in the world of programming, and rightfully so. They can cause untold amounts of trouble and can be quite dangerous in the hands of novice programmers. Out of the box, Perl treats any variable you haven't declared as a global, which is both a blessing and a curse. For quick and dirty scripts, globals are fine (and encouraged). But for production level code (which uses the 'strict' pragma mentioned above), undeclared globals aren't an option.

But sometimes, you can't avoid having a global variable (and they even make more sense than locals in some instances). I recently made use of the File::Find module in one of my scripts, calling it like this:

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $inSomeState;
find(\&mySearchFunction, $somePathVariable);

sub mySearchFunction {
    if ($inSomeState) {
        # Do something
    }
}

The find() call executes the mySearchFunction subroutine for every file and folder beneath $somePathVariable. I cannot pass any parameters of my own to mySearchFunction, but it needs to be able to check the value of the variable $inSomeState. A variable declared with 'my' is only visible from its declaration to the end of the enclosing block or file; code outside that region (a callback defined in another file, for instance) cannot see it at all, and Perl will complain under 'strict'. We can avoid the problem by making $inSomeState a package (global) variable, which stays reachable from anywhere as $main::inSomeState, using 'our' instead of 'my':

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

our $inSomeState;
find(\&mySearchFunction, $somePathVariable);

sub mySearchFunction {
    if ($inSomeState) {
        # Do something
    }
}

By declaring the variable with 'our', we make it a package variable (a global, in other words) and get a short name for it that is valid for the rest of the current scope, which happens to be the script itself in this case. Very handy!
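If you'd rather avoid a global altogether, one alternative is to hand find() an anonymous subroutine that closes over ordinary 'my' variables. This is only a sketch under my own assumptions: the scratch directory, the file names, and the $wantLogs and @found variables are all made up for the example.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Scratch directory with two files, so the sketch is self-contained
my $dir = tempdir(CLEANUP => 1);
for my $name ('a.txt', 'b.log') {
    open(my $fh, '>', "$dir/$name") or die "Cannot create $name: $!";
    close $fh;
}

my $wantLogs = 1;    # state the callback needs to see
my @found;           # results collected by the callback

# The anonymous sub closes over $wantLogs and @found
find(sub {
    return unless -f $_;
    push @found, $_ if $wantLogs and /\.log$/;
}, $dir);

print "Found: @found\n";
```

The callback still takes no parameters, but because it is defined where the lexical variables are visible, it can read and update them directly.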

3. Capture matched regex expressions inline

The parenthesis capturing functionality in regular expressions is extremely useful. However, I found that I always wrote my capture statements as a part of an if block:

if(m/(\w+)-(\d+)/)
{
    my $word = $1;
    my $number = $2;
}

I recently learned that this same code can be shortened into a one liner:

my ($word, $number) = (m/(\w+)-(\d+)/);

Of course, the match may not occur, so you'd have to test that $word and $number are actually defined before using them, but it's a cleaner way of capturing stuff from a regular expression.
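One way to combine the one-liner with the safety check is to put the list assignment itself inside the if test. The parseTag helper and the "build-42" input below are hypothetical, just to make the sketch runnable:

```perl
use strict;
use warnings;

# Hypothetical helper: split "word-number" strings such as "build-42"
sub parseTag
{
    my ($text) = @_;

    # In list context a failed match returns an empty list, so the
    # assignment counts zero elements and the if-test is false
    if (my ($word, $number) = $text =~ m/(\w+)-(\d+)/) {
        return ($word, $number);
    }
    return;
}

my ($word, $number) = parseTag("build-42");
print "$word / $number\n";
```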

4. Make sure to shift by 8 for return codes

If you're trying to automate something (which I have been doing a lot of recently), the return codes from external processes are generally of great interest. The system call makes executing a process very easy, but getting the return code is (to me at least) a little non-intuitive. Here's how to do it:

system ("some_process.exe");

my $retval = ($? >> 8);

The return code from the some_process.exe program will be stored in the $? variable, but you have to remember to shift the value right by 8 to get the actual return value.
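The lower bits of $? carry information too: a value of -1 means the process never started, and the low seven bits hold a signal number if the process was killed. Here's a rough sketch that decodes all three cases. The describeStatus helper is my own invention, and I'm using the running perl binary ($^X) as a stand-in for some_process.exe.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a raw $? value into a readable description
sub describeStatus
{
    my ($status) = @_;

    return "failed to start" if $status == -1;
    return "killed by signal " . ($status & 127) if $status & 127;
    return "exited with code " . ($status >> 8);
}

# $^X is the running perl binary, used here as a stand-in process
system($^X, '-e', 'exit 3');
print describeStatus($?), "\n";
```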

Another new recurring feature I'm going to try out here at the site is the programming tip 'grab bag.' Each one will feature a few tips I've picked up over the years, which I find highly useful. We'll start out this inaugural article with a few Perl tips:

1. Don't parse command line options yourself

One thing I've learned a number of times over is to never parse command line options yourself. Why? Because the Getopt::Long and Getopt::Std modules do it for you (and they make it both easy and convenient). These standard modules allow you to store away your command line options either in separate variables, or in a hash. There are times you'll want to use Getopt::Long over Getopt::Std (and vice-versa), so know the differences between the two. Either one will save you lots of time and headache. Here's one way to make use of this module:

use Getopt::Std;

our($opt_c, $opt_d, $opt_t);
getopts("cdt:");

my $filename = shift;

This tiny snippet parses the given command line parameters, looking for a 'c', a 'd', or a 't' option. In this example, the 'c' and 'd' options are flags and the 't' option expects a user supplied value (note the trailing colon). If the user passes either '-c' or '-d' on the command line, the corresponding $opt_c or $opt_d variable gets set to a true value (otherwise, it remains undefined). Likewise, if the user passes a '-t' on the command line, the $opt_t variable gets set to the value the user passed in (so the user would need to type something like myScript.pl -t someValue). Otherwise, $opt_t remains undefined. Also note that we are still able to retrieve other values passed in via the command line (in this example, a filename). Quite handy!

One other hidden benefit of the Getopt modules is the fact that they handle combined options. So, myScript.pl -cd would parse just the same as myScript.pl -c -d. Doing this kind of parsing by hand would be tricky, so don't try to do it. Let Getopt do all the work for you.

Getopt::Long allows for long options (which make use of the double dash, such as --verbose), but it can also handle single letter options. Storing options in a hash is also available to both modules, making it very easy to set up if you have lots of options to parse.
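Here's a sketch of the hash-based approach with Getopt::Long. The option names (verbose, debug, tag) and the parseArgs wrapper are made up for illustration, and I'm using GetOptionsFromArray so the example can run without a real command line; in an actual script you'd more likely call GetOptions(\%opts, ...) and let it read @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option names, chosen only for illustration
sub parseArgs
{
    my @args = @_;
    my %opts;

    GetOptionsFromArray(\@args, \%opts, 'verbose', 'debug', 'tag=s')
        or die "Bad command line options\n";

    # Whatever is left in @args was not an option (e.g. a filename)
    return (\%opts, \@args);
}

my ($opts, $rest) = parseArgs('--verbose', '--tag', 'nightly', 'build.log');
print "tag = $opts->{tag}, file = $rest->[0]\n";
```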

2. Use printf (or variants) to print plurals

This tip comes from the excellent Perl Cookbook, and I've used it a number of times. Use either the printf or sprintf functions to handle printing the proper plural (or singular) of a value. For example:

printf "%d item%s returned", $size, $size == 1 ? "" : "s";

If there is only 1 item, this prints 1 item returned. For any other count, the trailing 's' is added: 2 items returned (or 0 items returned). You can use the same trick for words that have strange plurals, like "goose" and "geese," by swapping the whole word instead of just the suffix.
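For the strange plurals, a small wrapper keeps things readable. The pluralize helper below is hypothetical (it isn't from the Cookbook); it simply applies the same ternary trick to whole words:

```perl
use strict;
use warnings;

# Hypothetical helper: pick the singular or plural form for a count
sub pluralize
{
    my ($count, $singular, $plural) = @_;
    return sprintf("%d %s", $count, $count == 1 ? $singular : $plural);
}

print pluralize(1, "goose", "geese"), "\n";
print pluralize(3, "goose", "geese"), "\n";
```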

3. Use File::Spec to handle cross platform file paths

The File::Spec module and its children allow one to easily make cross-platform file paths, useful for those scripts which must operate across operating systems. In one project at work, I made use of the File::Spec::Functions module, which exports a number of handy functions. I find the catfile function very handy, and I use it like so:

use File::Spec::Functions;    # exports catfile by default
my $logFile = catfile('weeklybuild', 'log', 'build.log');

The function takes care of putting the right separators between the values (backslash for Windows, forward slash for Linux, and colons for the classic Mac OS).

A Perl Module Primer

Aug 18, 2007

I've recently been wrangling with some Perl code for a project at work, and have been putting together a Perl module that includes a number of common functions that I need. As such, I had to remind myself how to create a Perl module. During my initial development, I ran into a number of problems, but I eventually worked through all of them. In the hopes of helping myself remember how to do this, and to help any other burgeoning Perl developers, I've written the following little guide. Hopefully it will help shed some light on this subject.

Let me preface this guide with two important statements:

  1. I'm not aiming to show you how to create a module for distribution. Most of the other tutorials cover that topic in depth.
  2. I am going to assume that you have a working knowledge of Perl.

To start, let's take a look at our sample module:

package MyPackage;
use strict;
use warnings;

require Exporter;
our @ISA = ("Exporter");

our %EXPORT_TAGS = ( 'all' => [ qw(sayHello whoAreYou $firstName
    %hashTable @myArray) ] );
our @EXPORT_OK = (@{ $EXPORT_TAGS{'all'} });
our @EXPORT = qw();

our $firstName = "Jonah";
our $lastName = "Bishop";

our %hashTable = ( a => "apple", b => "bird", c => "car" );
our @myArray = ("Monday", "Tuesday", "Wednesday");

sub sayHello
{
    print "Hello World!\n";
}

sub whoAreYou
{
    print "My name is $firstName $lastName\n";
}

1;

We start out by declaring our package name with the package keyword. Special Note: If you intend on having multiple modules, and you use the double colon (::) separator, you're going to need to set up your directory structure correspondingly. For example, if I had two modules, one named Jonah::ModuleOne and another named Jonah::ModuleTwo, I would need to have a folder named Jonah, inside of which would live the code to my two modules.

I next enable the strict and warnings pragmas, since that's good programming practice. Lines 5 and 6 are standard to virtually all Perl modules. First, we require inclusion of the standard Exporter module, then we indicate that our module inherits from said Exporter (the @ISA (is a) array is what sets this).

Line 8 is where things get interesting. We need to specify what symbols we want to export from this module. There are a number of ways of doing this, but I have chosen to use the EXPORT_TAGS hash. Special Note: This is a hash, not an array! I recently spent about an hour trying to debug a strange error message, and it all stemmed from the fact that I had accidentally created this as an array.

The EXPORT_TAGS hash gives us a means of grouping our symbols together. We essentially associate a label with a group of symbols, which makes it easy to selectively choose what you want to import when using the module. In this example, I simply have a tag named 'all' which, as you might guess, allows me to import all of the specified symbols I provide in the associated qw() list. Note that you must precede exported variable names with their appropriate character: $ for scalars, @ for arrays, and % for hashes. Exported subroutines don't need to have the preceding & character, but it doesn't hurt if you put it there.

Line 10 shows the EXPORT_OK array. This array specifies the symbols that are allowed to be requested by the user. I have placed the EXPORT_TAGS{'all'} value here for exporting. I will show how to import this symbol into a script in just a moment. Line 11 is the EXPORT array, which specifies the symbols that are exported by default. Note that I don't export anything by default. Special Note: It is good programming practice to not export anything by default; the user should specifically ask for their desired symbols when they import your package.

Lines 13 through 27 should be self explanatory. We set up two scalar variables, $firstName and $lastName, as well as a hash table and an array. Note that we precede all variables with the our declaration, which puts this variable into the global scope for the given context. Since we're using the strict pragma, we need these our declarations; otherwise we'd get some compilation errors.

Line 29 is very important and can easily be forgotten. When a Perl module is loaded via a use statement, the compiler expects the last statement to produce a true value when executed. This particular line ensures that this is always the case.

Now that we've taken a look at the module, let's take a look at a script that uses it:

#!/usr/bin/perl
use strict;
use warnings;
use MyPackage qw(:all);

sayHello();
whoAreYou();

print "$lastName\n"; # WRONG!
print $MyPackage::lastName . "\n"; # RIGHT!

Most of this should be pretty clear. Note, however, how we import the module on line 4. We do the typical use MyPackage statement, but we also include the symbols we want to import. Since we didn't export anything by default, the user has to explicitly ask for the desired symbols. All we exported was a tag name, so we specify it here. Note the preceding colon! When you are importing a tag symbol, it must be preceded by a single colon. This too caused me a great deal of frustration, and it's a subtlety that's easily missed.

One other interesting note: on line 9, we try to print the $lastName variable. Since we never exported that particular variable in our module, referencing it by name only will result in an error. The correct way to access the variable, even though it wasn't exported, is shown on line 10. You must fully qualify non-exported symbols!
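You also don't have to import the whole tag; any symbol listed in EXPORT_OK can be requested by name. The sketch below shows this with a tiny stand-in module called Greeter, which I've defined inline purely so the example runs as a single file. It also uses the use Exporter 'import' shorthand instead of the @ISA approach shown earlier. Normally the module would live in its own .pm file and you'd write use Greeter qw(greet); instead of calling import by hand.

```perl
use strict;
use warnings;

# A tiny stand-in module defined inline; normally this lives in Greeter.pm
BEGIN {
    package Greeter;
    use Exporter 'import';
    our %EXPORT_TAGS = ( 'all' => [ qw(greet $greeting) ] );
    our @EXPORT_OK   = @{ $EXPORT_TAGS{'all'} };
    our $greeting    = "Hello";
    sub greet { return "$greeting, $_[0]!" }
}

# Equivalent to "use Greeter qw(greet);" for a module in its own file
BEGIN { Greeter->import('greet'); }

print greet("World"), "\n";
print "$Greeter::greeting\n";    # not imported, so fully qualified
```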

Hopefully this quick little guide has made things a little clearer for you. If for no other reason, it will help me remember these subtleties of Perl programming. :-)

While working on my rewrite of Monkey Album, I ran into an interesting programming dilemma. In the past week or so, I've been introduced to the MySQLi extension in PHP. The current Monkey Album implementation makes use of the PHP 4 mysql_*() calls, so I thought I'd try out the MySQLi interface to see how it works.

MySQLi includes support for what are known as "prepared statements" (only available in MySQL 4.1 and later). A prepared statement basically gives you three advantages: (1) SQL logic is separated from the data being supplied, (2) incoming data is sanitized for you which increases security, and (3) performance is increased, since a given statement only needs to be parsed a single time.

It seems to me that the performance benefit can only be seen in situations where the query is executed multiple times (in a loop, for example). An article on prepared statements confirms this suspicion; the author even mentions that prepared statements can be slower for queries executed only once.

So here's the problem I face: the queries that get executed in Monkey Album are, for the most part, only ever executed once. So, do I make use of prepared statements just to get the security benefit? It doesn't seem worth it to me, since I can get the same security by escaping all user input (something I already do today). Does someone with more knowledge of this stuff have an opinion? If so, please share it.

After graduating from school with a bachelor's degree in computer science, I must admit that I knew virtually nothing about developing *NIX based applications (that's UNIX / Linux based applications for the non-geeks out there). Granted, I did do a little bit of non-Windows based programming while in school, but it was always incredibly basic stuff: compiling one or two source files, or occasionally writing a makefile for larger projects (three or four source files). Having never had a Linux or UNIX box to play with outside of school, I just never got a chance to get my feet wet. Thankfully, my job at IBM has changed that.

Over the past few weeks, I've been doing a great deal of Linux programming, thanks to the cross-"platformedness" of one of the projects I'm working on. And this project is way more complicated than your typical school assignment. I'm now horsing around with dynamically linked libraries, also known as "shared objects" in Linux land, like nobody's business. Not only that, the project itself is essentially a multi-threaded shared object, making it all the more exciting. I've learned more about g++, ld, and ldd in the past few weeks than I ever knew before.

Unfortunately, debugging multi-threaded shared objects is easier said than done. The debugging tools in Linux (at least the ones I've played with) all suck so horribly. They make you really appreciate the level of quality in Microsoft's Visual Studio debugger, or better yet, in WinDBG (this thing is hard core, and it's what the MS developers actually use in practice). Fortunately, printf() always saves the day.

One cool trick I recently employed to debug a library loading problem is the LD_DEBUG environment variable. If you set LD_DEBUG to a value of versions, the Linux dynamic linker will print all of the version dependencies for each library used for a given command. If you have a Linux box, try it out. Set the LD_DEBUG environment variable, then do an ls. You'll be amazed at the number of libraries that such a simple command involves.

Although Linux development can be frustrating at times, I've already learned a great deal and consider my experiences a great success. If I come across any more useful tips (like LD_DEBUG above), I'll try my best to post them here (as much for my sake as for yours). Until then, you'll find me knee-deep in my Linux code. I've got a few more bugs to squash.