Browsing all posts tagged programming

CSV Parsing Woes

Nov 14, 2021

An occasional annoyance of my job is having to deal with poorly constructed data. One recent instance of this came through a collection of CSV files. In these files, certain free-form text fields sometimes included either non-escaped double quotes or an embedded newline where there shouldn't be one. Shortened examples of each are shown below:

"Samsung","ABC-12345","2.5 TB SAS 2.5" hard drive","Released","2018-06-01"
"Lenovo","DEF-88776 
PQR-66554","Mechanical chassis","Released","2020-02-22"

The first record above contains an embedded double quote character which has not been escaped. The second record contains a rogue newline, splitting it across two physical lines.

Parsing these problematic cases in Python gets tricky, and the standard library's csv module offers little support for handling malformed data. While thinking about how to handle these situations, it occurred to me that I could use the way the file was constructed to my advantage. These files are output by what is, to me, a black box. Under the hood it's undoubtedly a database query, the results of which are then dumped into CSV format. As a byproduct, each file has a consistent format where every field is quoted and fields are separated by a comma. I can therefore use the "," string (double quote, comma, double quote) as my separator, looking for the fields I expect:

import csv
import re

# Note: infile is a pathlib.Path, and expected_columns holds the number
# of columns each record should contain.
previous_chunk = []
with open(infile, 'r', encoding='utf8') as csvfile:
    with open(f"{infile.stem}-clean.csv", 'w', encoding='utf8', newline='') as outfile:
        writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)

        for line in csvfile:
            line = line.rstrip()  # Trim the trailing newline

            pieces = line.split('","')  # Split on our separator
            pieces[0] = re.sub(r'^"', '', pieces[0])  # Remove the leading double quote
            pieces[-1] = re.sub(r'"$', '', pieces[-1])  # Remove the trailing double quote

            # If we don't have the number of columns we expect, merge
            if len(pieces) != expected_columns:
                previous_chunk = merge_chunks(previous_chunk, pieces)
                if len(previous_chunk) == expected_columns:
                    writer.writerow(previous_chunk)
                    previous_chunk = []
                elif len(previous_chunk) > expected_columns:
                    print(f"ERROR: Overran column count! Expected {expected_columns}, found "
                          f"{len(previous_chunk)}")
            else:
                writer.writerow(pieces)

The merge_chunks function is very simple:

def merge_chunks(a, b):
    """
    Merges lists a and b. The content of the first element of list b will be
    appended to the content of the last element of list a. The result will be
    returned.
    """
    temp = []
    temp.extend(a)

    if a:
        temp[-1] = f"{a[-1]} {b[0]}"
        temp.extend(b[1:])
    else:
        temp.extend(b)

    return temp
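For instance, the rogue-newline record above reassembles like this (the function is repeated here so the snippet runs on its own):

```python
def merge_chunks(a, b):
    """Join a's last element with b's first, then append the rest of b."""
    temp = list(a)
    if a:
        temp[-1] = f"{a[-1]} {b[0]}"
        temp.extend(b[1:])
    else:
        temp.extend(b)
    return temp

# Chunks produced by splitting the two physical lines of the
# "Lenovo" record on the '","' separator:
first = ["Lenovo", "DEF-88776"]
second = ["PQR-66554", "Mechanical chassis", "Released", "2020-02-22"]

merged = merge_chunks(first, second)
print(merged)
# ['Lenovo', 'DEF-88776 PQR-66554', 'Mechanical chassis', 'Released', '2020-02-22']
```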

I believe the only way this could break is if a data field, for some reason, contained the "," separator itself. Given the types of fields I'm working with, this is highly unlikely. Even if it does occur, I can use the format of some of the fields to make a best guess as to where the actual dividers are (e.g. the trailing fields on each line are almost always date stamps).
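If a stray separator ever does show up, that best-guess repair might start with a check like this sketch (the five-column layout and the ISO date-stamp format are assumptions about my particular data, not a general rule):

```python
import re

# Assumption: a complete record ends in an ISO-style date stamp.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def looks_like_record_end(fields, expected_columns=5):
    """Guess that a chunk is a complete record when it has the expected
    number of columns and its trailing field is a date stamp."""
    return len(fields) == expected_columns and bool(DATE_RE.match(fields[-1]))

print(looks_like_record_end(["Lenovo", "DEF-88776 PQR-66554",
                             "Mechanical chassis", "Released", "2020-02-22"]))  # True
print(looks_like_record_end(["Lenovo", "DEF-88776"]))  # False
```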

This is obviously not a general solution, but it sometimes pays to step away from the built-in parsing capability in a language and roll your own scheme.

I recently had to change the URLs of some of my REST endpoints in a project at work. In so doing, I started receiving reports of users who weren't seeing the data they expected from some endpoints. My redirect seemed simple enough:

location ~* ^/rest/(report/.*)$ {
    return 302 https://$host/rest/project/$1;
}

So, I'm sending URLs like https://myhostname/rest/report/1/ to https://myhostname/rest/project/report/1/. The redirect worked, but the problem was that any additional query string that was included in the URL got dropped.

For example, https://myhostname/rest/report/1/?query=somevalue would result in https://myhostname/rest/project/report/1/. The fix for this is easy, but is something I didn't realize you had to pay attention to:

location ~* ^/rest/(report/.*)$ {
    return 302 https://$host/rest/project/$1$is_args$args;
}

The $is_args variable evaluates to a ? if the request has a query string, or to an empty string otherwise. Similarly, $args contains any arguments that happened to be passed. Including these variables ensures that query parameters get passed along with the redirect.

Monolithic Methods

Apr 15, 2021

One of my current projects at work involves adding new functionality to my oldest web tool. I inherited this Django-powered project way back in 2015 (the tool was in its infancy at the time), and have been the sole designer and developer for the project ever since. It's been pretty rock solid, humming along for the past few years without any major modifications or issues. However, this recently has changed, as management wants to track the tool's data using a new dimension that we weren't considering previously.

These new features require adjustments to the database schema, which means that corresponding front-end changes, for both data entry and reporting, are also needed. The end result is that a lot of code needs to be updated. Digging through this ancient code has been both embarrassing and humbling.

When I inherited this project, I didn't know Python. By extension, I knew nothing about Django, which was still in its relatively early days (Django 1.8 was the latest at the time). I had plenty of web-design and programming experience, which made learning both much easier, but I made a ton of mistakes with both the architecture and implementation of the application. I'm now regretting those mistakes.

One of the most egregious errors in this application, and something I honestly still struggle with to a degree, is writing monolithic methods. Some of the view methods in this tool are many hundreds of lines long. Amidst those lines are dozens of calls to various "helper" functions, many of which are equally complex and lengthy. Figuring out what I was doing has been painful, to say the least.

I'm trying to remedy this situation by creating stand-alone classes to act as logic processors. The resulting class is easier to read, even if the length of code is nearly the same. So, a sample view using this methodology might look something like this:

class MyView(View):
    def get(self, request):
        vp = MyViewProcessor(request)
        response = vp.process()
        return JsonResponse(response)

The corresponding processor class would then look as follows:

class MyViewProcessor:
    def __init__(self, request):
        self.request = request
        # Other initialization goes here

    def process(self):
        self.load_user_filters()
        self.load_data()
        self.transform_data()
        return self.build_response()

Each of the calls in the process() method is to another method (not shown) that handles a single task: processing incoming data (from a front-end form), loading data from the database using those filters, and so on. This construct, while not perfect, at least makes the code more readable by breaking the work into discrete units.
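To make the pattern concrete, here is a hedged sketch of what such a processor might look like. The helper names match the process() call chain above, but the bodies are purely illustrative stubs, not the real application logic:

```python
from types import SimpleNamespace

class MyViewProcessor:
    """Sketch of a view logic processor; helper bodies are illustrative."""

    def __init__(self, request):
        self.request = request
        self.filters = {}
        self.data = []

    def load_user_filters(self):
        # e.g. pull filter values out of the request's query parameters
        self.filters = dict(getattr(self.request, "GET", {}))

    def load_data(self):
        # e.g. run a database query constrained by self.filters
        self.data = [{"id": 1, "name": "example"}]

    def transform_data(self):
        # e.g. reshape each row for the front end
        self.data = [{**row, "label": row["name"].title()} for row in self.data]

    def build_response(self):
        return {"filters": self.filters, "results": self.data}

    def process(self):
        self.load_user_filters()
        self.load_data()
        self.transform_data()
        return self.build_response()

# A stand-in for a Django request object:
req = SimpleNamespace(GET={"status": "Released"})
resp = MyViewProcessor(req).process()
print(resp["filters"])  # {'status': 'Released'}
```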

I've been doing web development in some form or fashion since 1999 (as an aside, the Wayback Machine even has a snapshot of one of my old websites; what a world)! I probably started picking up JavaScript in the early 2000s, as my web development knowledge improved. Since web browsers are generally really good at supporting the old way of doing things, my knowledge of JavaScript has been stagnant for a long time.

Not too long ago, I stumbled upon The Modern JavaScript Tutorial, a terrific resource for learning how to do things the modern way. I'm working my way through reading it, even taking the time to go back over the basics. I've already learned a lot; some of what I've been doing has apparently been deprecated for a while now, which was interesting to learn.

I've also learned about features I hadn't seen before (the nullish coalescing operator being one of those). I recommend it if, like me, you're still living in the dark ages.

I use Python virtual environments a bunch at work, and this morning I finally put together a small helper script, saved as a Gist at GitHub, that makes enabling and disabling virtual environments a lot easier. I'm not sure why I didn't do this a lot earlier. Simply type work to enable the virtual environment, and work off to disable it. This script should be in your PATH, if it's not already obvious.

Here's the script itself:

@echo off

if exist "%cd%\venv" (
    if "%1" == "off" (
        echo Deactivating virtual environment
        call "%cd%\venv\Scripts\deactivate.bat"
        echo.
    ) else (
        echo Activating virtual environment
        call "%cd%\venv\Scripts\activate.bat"
    )
) else (
    echo No venv folder found in %cd%.
)

I maintain multiple tools at work that all run in Docker containers on the same machine. The overall setup looks like the following diagram:

Tool Network Diagram

The router container on top (nginx) routes traffic to the various application containers based on the hostname seen in each request (each tool has its own internal domain name). Each application has an nginx container for serving static assets, and a gunicorn container to serve the dynamic parts of the application (using the Django framework).

Earlier this week, I was trying to add a redirect rule to one of my application containers (at the application nginx layer), because a URL was changing. As a convenience for users, I wanted to redirect them to the new location so they don't get the annoying "404: Not Found" error. I set up the redirect as a permanent redirect using a rewrite rule in nginx. For some strange reason, the port of the application's nginx layer, which should never be exposed to the outside world, was being appended to the redirect!

Adding the port_in_redirect off; directive to my nginx rules made no difference (or so I thought), and I struggled for an entire day over why this redirect wasn't working properly. I eventually learned that permanent redirects are aggressively cached by the browser! This means you need to clear your browser's cache to remove bogus redirects. I wasted an entire day because my browser was using a stale cached redirect. Ugh!

SMBC RSS

Jan 3, 2019

One of the web comics I follow is Saturday Morning Breakfast Cereal. The official RSS feed for this comic only includes the comic itself and the associated hover-text joke. To see the extra joke, you have to visit the SMBC website. But no longer!

I've just created a new project on GitHub that fixes this issue. It's another RSS feed generator, and the feed that it generates contains the daily comic, the hover-text joke, and the hidden joke, all inline.

As always, there's room for improvement in a place or two. Let me know if you spot any issues.

Since I no longer subscribe to my local newspaper, I now primarily read daily comic strips through RSS feeds. comicsrss.com carries the vast majority of the strips I read, but several key strips are not included. It turns out that these missing strips are all owned by King Features which, frustratingly, doesn't provide RSS feeds to their strips.

I have now fixed that.

My new project, comics-rss, is now available for users interested in creating RSS feeds for the comic strips provided by King Features. The project is admittedly brittle at the moment, but it has worked well for me so far. A number of improvements are planned:

  1. The script currently caches the comic strips locally, linking to the cached copy. I'd like to provide an option to use direct links instead, skipping the cache altogether.
  2. Cached strips are not currently cleaned up, so the folder into which they are stored will grow each day. I'll be adding an "expired" configuration option to clean things up.
  3. Error checking in the configuration file isn't very robust, and needs to be improved.

I would be interested in any feedback you might have on this project. If you find bugs or have suggestions for improvement, be sure to file them on the project issues board.

A Subtle Python Bug

Feb 23, 2018

I recently had a very subtle bug with an OrderedDict in my Python code at work. I constructed the contents of this object from a SQL query that was output in a specific order (can you spot the bug?):

qs = models.MyModel.objects.all().order_by("-order")
data = OrderedDict({x.id: x.name for x in qs})

My expectation was output like the following, which I was seeing on my development system (Python 3.6):

OrderedDict([(4, 'Four'), (3, 'Three'), (2, 'Two'), (1, 'One')])

However, on my official sandbox test system (which we use for internal testing, running Python 3.5), I was seeing output like this:

OrderedDict([(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four')])

There are actually two issues in play here, and it took me a while to figure out what was going on.

  1. First, I'm constructing the OrderedDict element incorrectly. I'm using a dictionary comprehension as the initialization data for the object's constructor. Dictionaries are (until recently) not guaranteed to preserve insertion order when iterated over. This is where my order was being screwed up.
  2. Second, the above behavior for dictionary order preservation is an implementation detail that changed in Python 3.6. As of 3.6 (in the CPython implementation), dictionaries now preserve the insertion order when iterated over. My development system, running on 3.6, was therefore outputting things as I expected them. The sandbox system, still running 3.5, did not. What an annoyance!

I've learned two valuable lessons here: (a) make sure you're running on the same levels of code in various places, and (b) don't initialize an OrderedDict with a dictionary comprehension.
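One order-safe construction is to hand OrderedDict an already-ordered sequence of (key, value) pairs instead of a dict comprehension. A sketch, with plain tuples standing in for the queryset rows:

```python
from collections import OrderedDict

# Rows as they come back from the ordered query (tuples stand in
# for model instances here).
rows = [(4, "Four"), (3, "Three"), (2, "Two"), (1, "One")]

# Buggy pattern: the dict comprehension is built first, and on
# pre-3.6 interpreters its iteration order is arbitrary.
risky = OrderedDict({rid: name for rid, name in rows})

# Safe pattern: feed OrderedDict an ordered sequence of pairs, so
# insertion order is preserved on every Python version.
safe = OrderedDict((rid, name) for rid, name in rows)

print(list(safe))  # [4, 3, 2, 1]
```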

Born Geek on GitHub

Mar 22, 2014

I have uploaded the source of both CoLT and Googlebar Lite to GitHub.

This should make it way easier for folks to submit new ideas and bug reports for each extension, provide patches (if you feel so inclined), and view sample code for Firefox extension development. I've already posted a few issues to the CoLT repo, and a number should be appearing for Googlebar Lite as well.

In my last post, I complained about my initial experience with Stack Overflow. I decided to give myself 30 days with the service, to see whether or not I warmed up to it. Now that those 30 days are over, I will be posting several of my thoughts and observations. This first post won't be about the site itself; instead, it will cover some of the things I learned during my 30 days. A second upcoming post will cover some problems I think exist with the Stack Overflow model, and my final post will provide a few suggestions for how I think things can be improved.

Let me first say that I learned a lot simply by browsing the site. Reading existing questions and their answers was fascinating, at least for the programming topics I care about. Some of what I learned came through mistakes I made attempting to answer open questions. Other bits of information just came through searching the web for the solution to someone's problem (something that a lot of people at Stack Overflow are apparently too lazy to do). Without further ado, here's a list of stuff I learned, in no particular order (each item lists the corresponding language):

C (with GNU Extension), PHP (5.3+)
The true clause in a ternary compare operation can be omitted. In this case, the first operand (the test) will be returned if true. This is a bizarre shortcut, and one I would never personally use. Here's a PHP example (note that there's no space between the question mark and the colon; in C, a space is necessary):
$a = $b ?: $c; // No true clause (too lazy to type it, I guess)
$a = $b ? $b : $c; // The above is equivalent to this
Regular Expressions (Perl, PHP, possibly others)
The $ in a regular expression doesn't literally match the absolute end of the string; it can also match a new-line character that is the last character in the string. Pattern modifiers are usually available to modify this behavior. This fact was a surprise to me; I've had it wrong all these years!
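Python's re module (one of the "possibly others") behaves the same way, which makes this easy to verify; \Z is the modifier-free way to anchor to the absolute end:

```python
import re

# '$' matches just before a trailing newline, not only at the
# absolute end of the string.
print(bool(re.search(r"world$", "hello world\n")))   # True

# '\Z' (Perl's '\z') anchors to the absolute end, so it rejects
# the trailing newline.
print(bool(re.search(r"world\Z", "hello world\n")))  # False
```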
Bash
I found a terrific article that details the differences between test, [, and [[.
Firefox Extensions (XUL, JS)
You can use the addTab method in the global browser object to inject POST data to a newly opened tab.
Perl
The way I learned to open files for output in Perl (over a decade ago) is no longer advised. It's going to take a lot of effort on my part to change to the new style; old habits, and all that.
# Old way of doing it (how I learned)
open OUT, "> myfile.txt" or die "Failed to open: $!";

# The newer, recommended way (as of Perl 5.6)
open my $out, '>', "myfile.txt" or die "Failed to open: $!";

A couple of years ago, I blogged about two helper functions I wrote to get HTML form data in PHP: getGet and getPost. These functions do a pretty good job, but I have since replaced them with a single function: getData. Seeing as I haven't discussed it yet, I thought I would do so today. First, here's the function in its entirety:

/**
 * Obtains the specified field from either the $_GET or $_POST arrays
 * ($_GET always has higher priority using this function). If the value
 * is a simple scalar, HTML tags are stripped and whitespace is trimmed.
 * Otherwise, nothing is done, and the array reference is passed back.
 * 
 * @return The value from the superglobal array, or null if it's not present
 * 
 * @param $key (Required) The associative array key to query in either
 * the $_GET or $_POST superglobal
 */
function getData($key)
{
    if(isset($_GET[$key]))
    {
        if(is_array($_GET[$key]))
            return $_GET[$key];
        else
            return (strip_tags(trim($_GET[$key])));
    }
    else if(isset($_POST[$key]))
    {
        if(is_array($_POST[$key]))
            return $_POST[$key];
        else
            return (strip_tags(trim($_POST[$key])));
    }
    else
        return null;
}

Using this function saves me from having to do two checks for data, one in $_GET and one in $_POST, and so reduces my code's footprint. I decided to give $_GET the higher priority, but feel free to change that if you like.

As you can see, I first test to see if the given key points to an array in each location. If it is an array, I do nothing but pass the reference along. This is very important to note. I've thought about building in functionality to trim and strip tags on the array's values, but I figure it should be left up to the user of this function to do that work. Be sure to sanitize any arrays that this function passes back (I've been bitten before by forgetting to do this).

If the given key isn't found in either the $_GET or $_POST superglobals, I return null. Thus, a simple if(empty()) test can determine whether or not a value has been provided, which is generally all you care about with form submissions. An is_null() test could also be performed if you so desire. This function has made handling form submissions way easier in my various work with PHP, and it's one tool that's worth having in your toolbox.

I ran into an interesting phenomenon with PHP and MySQL this morning while working on a web application I've been developing at work. Late last week, I noted that page loads in this application had gotten noticeably slower. With the help of Firebug, I was able to determine that a 1-second delay was consistently showing up on each PHP page load. Digging a little deeper, it became clear that the delay was a result of a change I recently made to the application's MySQL connection logic.

Previously, I was using the IP address 127.0.0.1 as the connection host for the MySQL server:

$db = new mysqli("127.0.0.1", "myUserName", "myPassword", "myDatabase");

I recently changed the string to localhost (for reasons I don't recall):

$db = new mysqli("localhost", "myUserName", "myPassword", "myDatabase");

This change yielded the aforementioned 1-second delay. But why? The hostname localhost simply resolves to 127.0.0.1, so where is the delay coming from? The answer, as it turns out, is that IPv6 handling is getting in the way and slowing us down.

I should mention that I'm running this application on a Windows Server 2008 system, which uses IIS 7 as the web server. By default, in the Windows Server 2008 hosts file, you're given two hostname entries:

127.0.0.1 localhost
::1 localhost

I found that if I commented out the IPv6 hostname (the second line), things sped up dramatically. PHP bug #45150, which has since been marked "bogus," helped point me in the right direction to understanding the root cause. A comment in that bug pointed me to an article describing MySQL connection problems with PHP 5.3. The article dealt with the failure to connect, which happily wasn't my problem, but it provided one useful nugget: namely that the MySQL driver is partially responsible for determining which protocol to use. Using this information in my search, I found a helpful comment in MySQL bug #6348:

The driver will now loop through all possible IP addresses for a given host, accepting the first one that works.

So, long story short, it seems as though the PHP MySQL driver searches for the appropriate protocol to use every time (it's amazing that this doesn't get cached). Apparently, Windows Server 2008 uses IPv6 routing by default, even though the IPv4 entry appears first in the hosts file. So, either the initial IPv6 lookup fails and it then tries the IPv4 entry, or the IPv6 route invokes additional overhead; in either case, we get an additional delay.

The easiest solution, therefore, is to continue using 127.0.0.1 as the connection address for the database server. Disabling IPv6, while a potential solution, isn't very elegant and it doesn't embrace our IPv6 future. Perhaps future MySQL drivers will correct this delay, and it might go away entirely once the world switches to IPv6 for good.

As an additional interesting note, the PHP documentation indicates that a local socket gets used when the MySQL server name is localhost, while the TCP/IP protocol gets used in all other cases. But this is only true in *NIX environments. In Windows, TCP/IP gets used regardless of your connection method (unless you have previously enabled named pipes, in which case it will use that instead).

It's incredible to me that in 2011, programming languages still have problems with files larger than 2GB in size. We've had files that size for years, and yet overflow problems in this arena still persist. At work, I ran into this problem trying to get the file size of very large files (between 3 and 4 GB in size). The typical filesize() call, as shown below, would return an overflowed result on a very large file:

$size = filesize($someLargeFile);

Because PHP uses signed 32-bit integers to represent some file function return types, and because a 64-bit version of PHP is not officially available, you have to resort to farming the job out to the OS. In Windows, the most elegant way I've found so far is to use a COM object:

$fsobj = new COM("Scripting.FileSystemObject");
$f = $fsobj->GetFile($someLargeFile);
$size = $f->Size;

Uglier hacks involve capturing the output of the dir command from the command line. There are two bug reports filed on this very issue: 27792 and 34750. The newest of these was filed in late 2005; a little more than 5 years ago! It's sad to see a language as prolific as PHP struggling with a problem so basic. Perhaps this issue will finally get fixed in PHP 6.

I recently ran into a stupid problem using the system() call in C++ on Windows platforms. For some strange reason, calls to system() get passed through the cmd /c command. This has some strange side effects if your paths contain spaces, and you try to use double quotes to allow those paths. From the cmd documentation:

If /C or /K is specified, then the remainder of the command line after the switch is processed as a command line, where the following logic is used to process quote (") characters:
  1. If all of the following conditions are met, then quote characters on the command line are preserved:
    • no /S switch
    • exactly two quote characters
    • no special characters between the two quote characters, where special is one of: &<>()@^|
    • there are one or more whitespace characters between the two quote characters
    • the string between the two quote characters is the name of an executable file
  2. Otherwise, old behavior is to see if the first character is a quote character and if so, strip the leading character and remove the last quote character on the command line, preserving any text after the last quote character.

As you can see from this documentation, if you have any special characters or spaces in your call to system(), you must wrap the entire command in an extra set of double quotes. Here's a working example:

string myCommand = "\"\"C:\\Some Path\\Here.exe\" -various -parameters\"";
int retVal = system(myCommand.c_str());
if (retVal != 0)
{
    // Handle the error
}

Note that I've got a pair of quotes around the entire command, as well as a pair around the path with spaces. This requirement isn't apparent at first glance, but it's something to keep in mind if you ever find yourself in this situation.

Disliking Java

Sep 21, 2010

If you were to ask me which programming language I hated, my first answer would most certainly be Lisp (short for "Lots of Stupid, Irritating Parentheses"). On the right day, my second answer might be Java. But seeing as hate is such a strong word, I'll opt for the statement that I dislike Java instead.

For the first time in probably 7 or 8 years, I'm having to write some Java code for a project at work. In all fairness, one of the main reasons I dislike the language is that I'm simply not very familiar with it. I'm sure that if I spent more time writing Java code, I might warm up to some of its quirks. But there are too many annoyances out of the gate to make me want to write stuff in Java for fun. Jumping back into Java development reminds me just how lucky I am to work with Perl and C++ code on a daily basis. Here are a few of my main gripes:

  1. It's a little ridiculous that the language requires the filename containing a class to exactly match the name of the class (so, a class named MyClass has to be placed in a file named "MyClass.java"). Other than making it easy to find where certain code resides, what's the benefit of this practice? The compiler simply translates your human-readable code into machine-specific byte code; filenames get lost in the translation!
  2. It pains me to have to write System.out.println("Some string"); to print some text, when in Perl it's simply print "Some string";. This leads me to my next major gripe:
  3. Java is way too verbose. I have to write 100 lines of code in Java to do what can be done in 10 lines of Perl. My time is worth something and I'm spending too much of it dealing with Java boilerplate code. In C++, I can use the public: keyword once, and everything that follows is public (until either another similar control keyword is reached or we come to the end of the block). It doesn't look like that's allowed in Java. Instead, I have to place the public keyword in front of each and every member variable and function. Ugh!
  4. Surprisingly, Java's documentation is pretty poor. Examples are few and far between and varying terminology makes it unclear when to use what function. For example, in some list-based data structure classes, getting a count of the items in said list might be getSize(), it might be getLength(), it could be just length(), or it might even be getNumberOfItems(). There's apparently no standard. Every other language manual I've ever used, be it PHP, Perl, or even the official C++ manual, has examples throughout, and relatively sane naming conventions. I can find no such help in Java-land.
  5. Automatic memory management can be handy, but it can also be a bother. I know for a fact that there are folks out there who make competent Java programmers who wouldn't last 10 minutes with C++ code. Pointers still matter in the world of computing. That Java hides all of those concepts from programmers, especially young programmers learning the trade, seems detrimental to me. It pays to know how memory allocation works. Trusting the computer to "just handle it" for you isn't always the best solution.
  6. Nearly all Java IDEs make Visual Studio look like the greatest thing on the planet; and Visual Studio sucks!

All that being said, the language does have a few redeeming features. Packages are a nice way to bundle up chunks of code (I wish C++ had a similar feature). It's also nice that the language recognizes certain data types as top-level objects (strings being one; again, C++ really hurts in this department, and yes I know about STL string which has its own set of problems).

I know there are folks who read this site that make a living writing Java code, so please don't take offense at my views. It's not that I hate Java; it's just that I don't like it.

As I mentioned a while back, I've been wanting to discuss automatic dependency generation using GNU make and GNU gcc. This is something I just recently figured out, thanks to two helpful articles on the web. The following is a discussion of how it works. I'll be going through this material quickly, and I'll be doing as little hand-holding as possible, so hang on tight.

Let's start by looking at the final makefile:

SHELL = /bin/bash

ifndef BC
    BC=debug
endif

CC = g++
CFLAGS = -Wall
DEFINES = -DMY_SYMBOL
INCPATH = -I../some/path

ifeq ($(BC),debug)
    CFLAGS += -g3
else
    CFLAGS += -O2
endif

DEPDIR=$(BC)/deps
OBJDIR=$(BC)/objs

# Build a list of the object files to create, based on the .cpps we find
OTMP = $(patsubst %.cpp,%.o,$(wildcard *.cpp))

# Build the final list of objects
OBJS = $(patsubst %,$(OBJDIR)/%,$(OTMP))

# Build a list of dependency files
DEPS = $(patsubst %.o,$(DEPDIR)/%.d,$(OTMP))

all: init $(OBJS)
    $(CC) -o My_Executable $(OBJS)

init:
    mkdir -p $(DEPDIR)
    mkdir -p $(OBJDIR)

# Pull in dependency info for our objects
-include $(DEPS)

# Compile and generate dependency info
# 1. Compile the .cpp file
# 2. Generate dependency information, explicitly specifying the target name
# 3. The final three lines do a little bit of sed magic. The following
#    sub-items all correspond to the single sed command below:
#    a. sed: Strip the target (everything before the colon)
#    b. sed: Remove any continuation backslashes
#    c. fmt -1: List words one per line
#    d. sed: Strip leading spaces
#    e. sed: Add trailing colons
$(OBJDIR)/%.o : %.cpp
    $(CC) $(DEFINES) $(CFLAGS) $(INCPATH) -o $@ -c $<
    $(CC) -MM -MT $(OBJDIR)/$*.o $(DEFINES) $(CFLAGS) $(INCPATH) \
        $*.cpp > $(DEPDIR)/$*.d
    @cp -f $(DEPDIR)/$*.d $(DEPDIR)/$*.d.tmp
    @sed -e 's/.*://' -e 's/\\\\$$//' < $(DEPDIR)/$*.d.tmp | fmt -1 | \
        sed -e 's/^ *//' -e 's/$$/:/' >> $(DEPDIR)/$*.d
    @rm -f $(DEPDIR)/$*.d.tmp

clean:
    rm -fr debug/*
    rm -fr release/*

Let's blast through the first 20 lines of code real quick, seeing as this is all boring stuff. We first set our working shell to bash, which happens to be the shell I prefer (if you don't specify this, the shell defaults to 'sh'). Next, if the user didn't specify the BC environment variable (short for "Build Configuration"), we default it to a value of 'debug.' This is how I gate my build types in the real world; I pass it in as an environment variable. There are probably nicer ways of doing this, but I like the flexibility that an environment variable gives me. Next, we set up a bunch of common build variables (CC, CFLAGS, etc.), and we do some build configuration specific setup. Finally, we set our DEPDIR (dependency directory) and OBJDIR (object directory) variables. These will allow us to store our dependency and object files in separate locations, leaving our source directory nice and clean.

Now we come to some code that I discussed in my last programming grab bag:

# Build a list of the object files to create, based on the .cpps we find
OTMP = $(patsubst %.cpp,%.o,$(wildcard *.cpp))

# Build the final list of objects
OBJS = $(patsubst %,$(OBJDIR)/%,$(OTMP))

# Build a list of dependency files
DEPS = $(patsubst %.o,$(DEPDIR)/%.d,$(OTMP))

The OTMP variable is assigned a list of file names ending with the .o extension, all based on the .cpp files we found in the current directory. So, if our directory contained three files (a.cpp, b.cpp, c.cpp), the value of OTMP would end up being: a.o b.o c.o.

The OBJS variable modifies this list of object files, sticking the OBJDIR value on the front of each, resulting in our "final list" of object files. We do the same thing for DEPS, this time prepending the DEPDIR value to each entry (giving us our final list of dependency files).
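To make the prefixing concrete, here's a shell sketch of what that second patsubst computes, using the a.o/b.o/c.o names from the example above (with objs standing in for the OBJDIR value):

```shell
# Shell rendition of $(patsubst %,$(OBJDIR)/%,$(OTMP)):
# stick the object directory on the front of each object file name.
OBJDIR=objs
for o in a.o b.o c.o; do
    printf '%s/%s\n' "$OBJDIR" "$o"
done
```

This prints objs/a.o, objs/b.o, and objs/c.o, one per line.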

Next up is our first target, the all target. It depends on the init target (which is responsible for making sure that the DEPDIR and OBJDIR directories exist), as well as our list of object files that we created moments ago. The command in this target will link together the objects to form an executable, after all the objects have been built. The next line is very important:

# Pull in dependency info for our objects
-include $(DEPS)

This line tells make to include all of our dependency files. The minus sign at the front says, "if one of these files doesn't exist, don't complain about it." After all, if the dependency file doesn't exist, neither does the object file, so we'll be recreating both anyway. Let's take a quick look at one of the dependency files to see what they look like, and to understand the help they'll provide us:

objs/myfile.o: myfile.cpp myfile.h
myfile.cpp:
myfile.h:

In this example, our object file depends on two files: myfile.cpp and myfile.h. Note that, after the dependency list, each file is listed by itself as a rule with no dependencies. We do this to exploit a subtle feature of make:

If a rule has no prerequisites or commands, and the target of the rule is a nonexistent file, then make imagines this target to have been updated whenever its rule is run. This implies that all targets depending on this one will always have their commands run.

This feature will help us avoid the dreaded "no rule to make target" error, which is especially helpful if a file gets renamed during development. No longer will you have to make clean in order to pick up those kinds of changes; the dependency files will help make do that work for you!

Back in our makefile, the next giant block is where all the magic happens:

# Compile and generate dependency info
# 1. Compile the .cpp file
# 2. Generate dependency information, explicitly specifying the target name
# 3. The final three lines do a little bit of sed magic. The following
#    sub-items all correspond to the single sed command below:
#    a. sed: Strip the target (everything before the colon)
#    b. sed: Remove any continuation backslashes
#    c. fmt -1: List words one per line
#    d. sed: Strip leading spaces
#    e. sed: Add trailing colons
$(OBJDIR)/%.o : %.cpp
    $(CC) $(DEFINES) $(CFLAGS) $(INCPATH) -o $@ -c $<
    $(CC) -MM -MT $(OBJDIR)/$*.o $(DEFINES) $(CFLAGS) $(INCPATH) \
        $*.cpp > $(DEPDIR)/$*.d
    @cp -f $(DEPDIR)/$*.d $(DEPDIR)/$*.d.tmp
    @sed -e 's/.*://' -e 's/\\$$//' < $(DEPDIR)/$*.d.tmp | fmt -1 | \
        sed -e 's/^ *//' -e 's/$$/:/' >> $(DEPDIR)/$*.d
    @rm -f $(DEPDIR)/$*.d.tmp

This block of code is commented, but I'll quickly rehash what's going on. The first command actually compiles the object file, while the second command generates the dependency file. We then use some sed magic to create the special rules in each dependency file.
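If the sed magic still looks opaque, it helps to run the pipeline by hand. Here's a shell sketch that feeds it the myfile.o dependency entry from earlier (keep in mind that the `$$` and doubled backslashes in the recipe collapse to `$` and `\` by the time the shell sees them):

```shell
# Post-process a sample dependency entry the same way the recipe does:
# strip the target, drop continuation backslashes, put one file name
# per line, trim leading spaces, and append a colon to each.
printf 'objs/myfile.o: myfile.cpp \\\n myfile.h\n' |
    sed -e 's/.*://' -e 's/\\$//' |
    fmt -1 |
    sed -e 's/^ *//' -e 's/$/:/'
```

The output is the two empty rules (`myfile.cpp:` and `myfile.h:`) that get appended to the dependency file.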

Though it's a lot to take in, these makefile tricks are handy to have in your toolbox. Letting make handle the dependency generation for you will save you a ton of time in the long run. It also helps when you're working with very large projects, as I do at work.

If you have a comment or question about this article, feel free to leave one below.

It has once again been ages since the last programming grab bag article was published, so let's dive right into another one, shall we? This time around, we'll be looking at some simple tricks involving GNU make.

1. Let Make Construct Your Object List

One common inefficiency in many makefiles I've seen is a manually maintained list of the object files to build. Let's work with the following example makefile (I realize it has a number of design issues; it's a simple, contrived example for the sake of this discussion). I've highlighted the list of objects below (line 2):

CFLAGS = -Wall
OBJS = class_a.o class_b.o my_helpers.o my_program.o

all: my_program

my_program: $(OBJS)
    gcc -o my_program $(OBJS)

class_a.o: class_a.cpp
    gcc $(CFLAGS) -c class_a.cpp

class_b.o: class_b.cpp
    gcc $(CFLAGS) -c class_b.cpp

my_helpers.o: my_helpers.cpp
    gcc $(CFLAGS) -c my_helpers.cpp

my_program.o: my_program.cpp
    gcc $(CFLAGS) -c my_program.cpp

For very small projects, maintaining a list like this is doable, even if it is a bother. For larger projects, it quickly becomes unmanageable. Why not let make do all this work for us? It can generate our list of object files automatically from the .cpp files it finds. Here's how:

OBJS = $(patsubst %.cpp,%.o,$(wildcard *.cpp))

We are using two built-in functions here: patsubst and wildcard. The first function will do a pattern substitution: the first parameter is the pattern to match, the second is the substitution, and the third is the text in which to do the substitution.

Note that, in our example, the third parameter to the patsubst function is a call to the wildcard function. A call to wildcard will return a space separated list of file names that match the given pattern (in our case, *.cpp). So the resulting string in our example would be: class_a.cpp class_b.cpp my_helpers.cpp my_program.cpp. Given this string, patsubst would change all .cpp instances to .o instead, giving us (at execution time): class_a.o class_b.o my_helpers.o my_program.o. This is exactly what we wanted!

The obvious benefit of this technique is that there's no need to maintain our list anymore; make will do it for us!
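If you'd like to convince yourself of what these two functions compute without writing a full makefile, here's a small shell sketch that mirrors wildcard plus patsubst (the scratch directory and file names are just for illustration):

```shell
# Rough shell equivalent of $(patsubst %.cpp,%.o,$(wildcard *.cpp)):
# glob the .cpp files, then swap each file's extension for .o.
dir=$(mktemp -d)
touch "$dir/class_a.cpp" "$dir/class_b.cpp" \
      "$dir/my_helpers.cpp" "$dir/my_program.cpp"
for src in "$dir"/*.cpp; do
    obj="${src%.cpp}.o"          # pattern substitution, shell style
    printf '%s\n' "${obj##*/}"   # print just the file name
done
rm -rf "$dir"
```

This prints class_a.o, class_b.o, my_helpers.o, and my_program.o — the same list our makefile now builds for itself.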

2a. Use Pattern Rules Where Possible

One other obvious problem in our example makefile above is that all the object targets are identical in nature (only the file names are different). We can solve this maintenance problem by writing a generic pattern rule:

%.o: %.cpp
    gcc -c $< -o $@

Pretty ugly syntax, huh? This rule allows us to build any foo.o from a corresponding foo.cpp file. Again, the % characters here are wildcards in the patterns to match. Note also that the command for this rule uses two special variables: $< and $@. The former corresponds to the name of the first prerequisite from the rule, while the latter corresponds to the file name of the target of this rule.

Combining this pattern rule with the automatic list generation from tip #1 above, results in the following updated version of our example makefile:

CFLAGS = -Wall
OBJS = $(patsubst %.cpp,%.o,$(wildcard *.cpp))

all: my_program

my_program: $(OBJS)
    gcc -o my_program $(OBJS)

%.o: %.cpp
    gcc $(CFLAGS) -c $< -o $@

This is much more maintainable than our previous version, wouldn't you agree?

2b. Potential Problems With This Setup

Astute readers have undoubtedly noticed that my sample makefile has no header (.h) files specified as dependencies. In the real world, it's good to include them so that updates to said files will trigger a build when make is executed. Suppose that our example project had a header file named class_a.h. As the makefile is written now, if we update this header file and then call make, nothing will happen (we would have to make clean, then make again, to pick up the changes).

Header file dependencies aren't likely to be a one-to-one mapping. Fortunately, we can get make to generate our dependencies for us automatically. Furthermore, we can get make to include those automatic dependencies at execution time, without any recursive calls! The process for doing this is beyond the scope of this article, but I will be writing an article on this very subject in the near future (so stay tuned).

3. Target-Specific Variables Can Help

Suppose that we want to build a debug version of our program using a target. Wouldn't it be nice to be able to modify some of our variable values given that specific target? Well, it turns out that we can do just that. Here's how (the added lines have been highlighted):

CFLAGS = -Wall
OBJS = $(patsubst %.cpp,%.o,$(wildcard *.cpp))

all: my_program

debug: CFLAGS += -g3
debug: my_program

my_program: $(OBJS)
    gcc -o my_program $(OBJS)

%.o: %.cpp
    gcc $(CFLAGS) -c $< -o $@

In this example, when we type make debug from the command line, our CFLAGS variable will have the appropriate debug option appended (in this case, -g3), and then the program will be built using the specified dependencies. Being able to override variables in this manner can be quite useful in the right situations.

Do you have your own make tips? If so, leave a comment! I'll be posting more about doing automatic dependency generation with make and gcc in the near future.

One of the things I most appreciate about Perl is that it requires code blocks to be surrounded by curly braces. In my mind, this is particularly important with nested if-else statements. Many programming languages don't require braces to surround code blocks, so nested conditionals can quickly become unreadable and much harder to maintain. Let's take a look at an example:

if (something)
    if (another_thing)
    {
        some_call;
        some_other_call;
        if (yet_another_thing)
        {
            do_it;
            do_it_again;
        }
    }

Note that the outer if-statement doesn't have corresponding curly braces. As surprising as it may seem, this is completely legal code in many languages. In my opinion, it's a dangerous practice: if I wanted to add more logic to the outer if block, I would first have to remember to put the appropriate braces in place.

Had I attempted to use this code in a Perl script, the interpreter would have complained immediately, even if warnings and strict parsing were both disabled! This kind of safety checking prevents me from shooting myself in the foot. Some may complain that requiring braces makes programming slightly more inefficient from a productivity standpoint. My response to that is that any code editor worth its salt can insert the braces for you. My favorite editor, SlickEdit, even supports dynamic brace surrounding, a feature I truly appreciate. It's a shame that more programming languages don't enforce this kind of safety net. Hopefully future languages will keep small matters like this in mind.

An article entitled A Brief, Incomplete, and Mostly Wrong History of Programming Languages offers a very humorous glimpse into the world of programming. My absolute favorite snippet from the article:

1987 - Larry Wall falls asleep and hits Larry Wall's forehead on the keyboard. Upon waking Larry Wall decides that the string of characters on Larry Wall's monitor isn't random but an example program in a programming language that God wants His prophet, Larry Wall, to design. Perl is born.

It's funny because it's true. (Hat tip to Dustin for the pointer to this article.)