Browsing all posts tagged python

Back in 2020, I put together a very simple virtual environment helper script for Python virtual environments in Windows. One issue this previous incarnation had was the inability to walk up the tree structure, looking for virtual environment folders above the current level. As a result, you always had to be in the root of the project to activate or deactivate the virtual environment.

I've finally spent a little time fixing this, and the improved script is below. The script will now walk up the tree structure, allowing you to activate or deactivate the environment, regardless of your depth in the tree. Just save the contents as a batch file named work.bat, put it in your PATH, and use either work or work off to activate or deactivate the environment, respectively.

@echo off

set "startDir=%cd%"

:: Get the drive letter from the start directory
for %%D in (%startDir%) do set "drive=%%~dD"

:: Traverse upwards from here
set "currentDir=%startDir%"
:loop

:: Stop if we reach the root of the drive
if "%cd%"=="%drive%\" (
    echo Unable to locate venv folder in this tree
    exit /b
)

if exist "%cd%\venv" (
    if "%1" == "off" (
        echo Deactivating virtual environment
        call "%cd%\venv\Scripts\deactivate.bat"
        echo.
    ) else (
        echo Activating virtual environment
        call "%cd%\venv\Scripts\activate.bat"
    )
    cd %startDir%
    exit /b
)

:: Move up one level
cd ..
goto loop

Python dict to tuple

Feb 1, 2024

I had no idea Python could do this, but Ruff taught me something new this morning. Suppose I want to transform a dict into a set of tuples. You can do it in a single line, with no comprehension required!

mydict = {
    'a': 123,
    'b': 456,
    'c': 789,
}
myset = set(mydict.items())
# myset is now: {('a', 123), ('b', 456), ('c', 789)}

I love stumbling upon little hidden gems like this!

Here's another thing that linting taught me recently: Python's str.startswith() method, and str.endswith() as well, takes a tuple as the first parameter! This makes checking for multiple options really simple:

# Verbose way of writing it
if mystring.startswith('c.') or mystring.startswith('m.') or mystring.startswith('s.'):
    ...

# Easier way
if mystring.startswith(('c.', 'm.', 's.')):
   ...

I didn't realize the language allowed this!

As I mentioned previously, I'm now using Ruff to lint my Python projects. Several linter warnings continually crop up in my code, which I find interesting, so I thought I'd highlight a few of them (there are plenty that I'm leaving out; I apparently write fairly crude code by these linters' standards).

missing-trailing-comma
This is a common recurrence in places where I'm setting up a dict for something:

mydict = {
    'posts': some_queryset.all(),
    'today': Date.today()  # Missing a trailing comma
}

if-expr-with-false-true
This pops up on occasion for me, though not terribly often. I apparently easily forget about the not operator.

return False if self.errors else True

# The above is a little more legible if we use:
return not self.errors

superfluous-else-return
I was surprised that this occurred so often in my code. Removing these cases flattens the code somewhat, which is a new practice I'm trying to engrain into my programming habits.

if is_ajax_request(request):
    return HttpResponseForbidden('Forbidden')
else:  # This isn't needed
    return redirect(reverse('home'))

# The above looks better as:
if is_ajax_request(request):
    return HttpResponseForbidden('Forbidden')

return redirect(reverse('home'))

explicit-f-string-type-conversion
This warning taught me something I didn't know about f-strings; namely that explicit conversion flags are available. Also that the conversions I was making were mostly not necessary in the first place.

error = f"Part not owned by {str(self.part_owner)}!"

# Better:
error = f"Part not owned by {self.part_owner!s}!."

# Best:
error = f"Part not owned by {self.part_owner}!."

type-comparison
Again, I was surprised by how often I do this. Base types can (and often are) subclassed, so it's better to use isinstance() than type is.

if type(loader) is list:
    return error_response(loader)

# Better:
if isinstance(loader, list):
    return error_response(loader)

Running the linting frameworks has taught me a fair amount about my programming habits, and has also informed me about various aspects of the language. I recommend running linters if you don't already, and I highly recommend Ruff!

Linting With Ruff

Sep 8, 2023

I enjoy using linting frameworks for the code I write, primarily employing flake8 for my Python code, which is about 90% of what I write these days. Recently, however, I saw news on Ruff, a new linting framework written in Rust that is orders of magnitude faster. It's so fast that the entire CPython repository, which contains over 1200 files, can be linted from scratch in only 0.29 seconds. Several testimonial quotes in Ruff's README attest to this blazing speed:

Ruff is so fast that sometimes I add an intentional bug in the code just to confirm it's actually running and checking the code. - Sebastián Ramírez, creator of FastAPI

Just switched my first project to Ruff. Only one downside so far: it's so fast I couldn't believe it was working till I intentionally introduced some errors. - Timothy Crosley, creator of isort

Another benefit on top of its speed is the near-parity it brings with Flake8, which is nice. There are still a number of formatting rules from the pycodestyle package that haven't been implemented, which is an annoyance, but there's an active issue tracking the progress on that front.

To top it all off, Ruff includes rules from dozens of Flake8 plugins, most of which I've never run. Enabling all of them in my projects has been humbling, to say the least, but I'm learning a ton of improved practices from doing so. I don't always agree with some of the rules, and have disabled a number of rule sets that annoy me, but it's been an interesting learning process.

In the coming days I'll be writing about a few of the sloppy practices that this framework has pointed out in my code, so stay tuned.

I don't use the namedtuple often in Python, but every time I do, I ask myself, "Why aren't I using this more often?" Today I ran into a case where it made total sense to use it.

I'm loading data from a database into a dictionary, so that I can later use this data to seed additional tables. To keep things nice and flat, I use a tuple as the key into the dictionary:

ModelKey = namedtuple('ModelKey', 'org role location offset')
model_data = {}
for x in models.DataModelEntry.objects.filter(data_model=themodel):
    key = ModelKey(x.org, x.role, x.location, x.offset)
    model_data.setdefault(key, x.value)

Later, when I use this data, I can use the field names directly, without having to remember in which slot I stored what parameter:

to_create = []
for key, value in model_data.items():
    obj = models.Resource(org=key.org, role=key.role, location=key.location,
                          offset=key.offset, value=value)
    to_create.append(obj)

The first line in the loop is so much clearer than the following:

    obj = models.Resource(org=key[0], role=key[1], location=key[2], offset=key[3], value=value)

Using the field names also makes debugging easier for future you!

Django REST Framework (DRF) is, on the surface, a neat piece of software. It provides web interactivity (for free!) to your REST interfaces, and can make generating those interfaces pretty quick to do. It has baked in authentication and permission handling. With just a few lines of code you're up and running. Or are you?

As you dig deeper into their tutorial, you'll find that this framework is abstraction layer on top of abstraction layer. Using naked Django style views, I could easily write a listing routine for a specific model in my application. Let's take an example model (all code in this post will omit imports, for brevity):

class Person(models.Model):
    email = models.CharField(max_length=60, unique=True)
    first_name = models.CharField(max_length=60)
    last_name = models.CharField(max_length=60)
    display_name = models.CharField(blank=True, max_length=120)
    manager = models.ForeignKey('self', related_name='direct_reports', on_delete=models.CASCADE)

    def __str__(self):
        return (self.display_name if self.display_name
                else f"{self.first_name} {self.last_name}")

This model is simply a few key pieces of data on a person inside my application. A simple view to get the list of people known to my application might look like this:

class PersonList(View):
    def get(self, request):
        people = []
        for x in Person.objects.select_related('manager').all():
            obj = {
                'email': x.email,
                'first_name': x.first_name,
                'last_name': x.last_name,
                'display_name': str(x),
                'manager': str(x.manager),
            }
            people.append(obj)
        return JsonResponse({'people': people})

This, I would argue, is simple and easy to read. It may be a little verbose for me, the programmer, I admit. But if another programmer comes along behind me, they're fairly likely to understand exactly what's going on here; especially if they are a junior programmer. Maintenance of this code therefore becomes trivial.

Let's now look at a DRF example. First I need a serializer:

class PersonSerializer(serializers.ModelSerializer):
    class Meta:
        model = Person
        fields = '__all__'
        depth = 1

This looks good, but it doesn't handle the display_name case correctly, because I want the str() method output for that field, not the field value itself. The same goes for the manager field. So I now have to write some field getters for both. Here's the updated serializer code:

class PersonSerializer(serializers.ModelSerializer):
    display_name = serializers.SerializerMethodField()
    manager = serializers.SerializerMethodField()

    class Meta:
        model = Person
        fields = '__all__'
        depth = 1

    def get_display_name(self, obj):
        return str(obj)

    def get_manager(self, obj):
        return str(obj.manager)

Once my serializer is complete, I still need to set up the view that will be used to actually load the list:

class PersonList(generics.ListAPIView):
    queryset = Person.objects.select_related('manager').all()
    serializer_class = PersonSerializer

I'll admit, this is pretty lean code. You cannot convince me, however, that it's more maintainable. The junior programmer is going to come in and look at this and wonder:

  • Why do only two fields in the serializer have get routines?
  • What even is a SerializerMethodField?
  • Why is the depth value set on this serializer?
  • What does the ListAPIView actually return?
  • Can I inject additional ancillary data into the response if necessary? If so, how?

DRF feels like it was designed by Java programmers (does anyone else get that vibe?). REST interfaces always have weird edge cases, and I'd much rather handle them in what I would consider the more pythonic way: simple, readble, naked views. After all, according to the Zen of Python:

  • Simple is better than complex.
  • Readability counts.

I don't open multiple file handles at once very often in Python, so it surprised me to find out this morning that you have to use the old-school line continuation hack to make it work (in 3.9 or earlier):

# Note the trailing backslash on the next line...
with open('a.txt', 'w') as file_a, \
     open('b.txt', 'w') as file_b:

    # Do something with file_a
    # Do something with file_b

Happily, Python 3.10 fixes this by adding Parenthesized Context Managers, which allows us to use parentheses like you would expect to:

# Only in Python 3.10+
with (
    open('a.txt', 'w') as file_a,
    open('b.txt', 'w') as file_b
):

    # Do something with file_a
    # Do something with file_b

The project I'm working on is still on Python 3.9, but it's good to know this was improved, and is a motivator to upgrade the version I'm using.

CSV Parsing Woes

Nov 14, 2021

An occasional annoyance of my job is having to deal with poorly constructed data. One recent instance of this came through a collection of CSV files. In these files, certain free-form text fields sometimes included either non-escaped double quotes or an embedded newline where there shouldn't be one. Shortened examples of each are shown below:

"Samsung","ABC-12345","2.5 TB SAS 2.5" hard drive","Released","2018-06-01"
"Lenovo","DEF-88776 
PQR-66554","Mechanical chassis","Released","2020-02-22"

The first line above has an embedded double quote character which has not been escaped. The second line showcases a rogue newline character.

Parsing these problematic cases in Python gets real tricky, and the native csv module doesn't have great malformed data handling support. While thinking about how to handle these situations, it occurred to me that I could use the way the file was constructed to my advantage. These files are output by, what is to me, a black box. Under the hood it's undoubtedly a database query, the results of which are then sent into a CSV format. As a byproduct, each file has a consistent format where each field is quoted, and fields are separated by a comma. I can use the "," string (double quote, comma, double quote) as my separator, looking for the fields I expect:

previous_chunk = []
with open(infile, 'r', encoding='utf8') as csvfile:
    with open(f"{infile.stem}-clean.csv", 'w', encoding='utf8', newline='') as outfile:
        writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)

        for line in csvfile.readlines():
            line = line.rstrip()  # Trim the trailing newline

            pieces = line.split('","')  # Split on our separator
            pieces[0] = re.sub(r'^"', '', pieces[0])  # Remove the first double quote
            pieces[-1] = re.sub(r'"$', '', pieces[-1])  # Remove the last double quote

            # If we don't have the number of columns we expect, merge
            if len(pieces) != expected_columns:
                previous_chunk = merge_chunks(previous_chunk, pieces)
                if len(previous_chunk) == expected_columns:
                    writer.writerow(previous_chunk)
                    previous_chunk = []
                elif len(previous_chunk) > expected_columns:
                    print(f"ERROR: Overran column count! Expected {expected_columns}, Found "
                          f"{len(previous_chunk)}")
            else:
                writer.writerow(pieces)

The merge_chunks method is very simple:

def merge_chunks(a, b):
"""
Merges lists a and b. The content of the first element of list b will be appended
to the content of the last element of list a. The result will be returned.
"""
    temp = []
    temp.extend(a)

    if a:
        temp[-1] = f"{a[-1]} {b[0]}"
        temp.extend(b[1:])
    else:
        temp.extend(b)

    return temp

I believe the only way this could potentially break is if the content, for some reason, contained the "," separator somewhere in a data field. Given the types of data fields I'm working with, this is highly unlikely. Even if it does occur, I can use the format of some of the fields to make best guesses as to where the actual dividers are (e.g. the trailing elements on each line are typically always date stamps).

This is obviously not a general solution, but it sometimes pays to step away from the built-in parsing capability in a language and roll your own scheme.

One of the well known tenets of Python is:

There should be one (and preferably only one) obvious way to do it.

There are plenty of places in the Python universe where this tenet is blatantly ignored, but none tickles me quite like shutil.copy and shutil.copy2. Both methods copy files from one location to another, with one (and apparently only one) difference, as the documentation for copy2 spells out:

shutil.copy2(src, dst, *, follow_symlinks=True)
Identical to copy() except that copy2() also attempts to preserve file metadata.

I'd love to know what motivation the author of the (very poorly named) copy2 method had for adding it to the library. Was adding a preserve_metadata argument to copy() not sufficient for some reason? That's what any sane developer might have done.

I ran into a problem at work today with a custom template tag I've written in a Django project. The tag works as follows:

{% if_has_team_role team "role_name_to_check" %}
<!-- block of HTML to be included if true goes here -->
<!-- otherwise, all of this is skipped -->
{% endif_has_team_role %}

I'm using a custom tag here, rather than a simple conditional, because the underlying check is more complicated than should be expressed at the template layer of my code. The problem came when I nested other conditionals in this block:

{% if_has_team_role team "role_name_to_check" %}
  {% if some_other_condition %}
    <!-- a nested element -->
  {% endif %}
{% endif_has_team_role %}

This setup was throwing an error. While searching for a solution to this issue, I stumbled upon this StackOverflow question. Reading the question, it matched the very problem I was having. At the end of the question, I noted that I was the original asker, 5 years ago; ha!

The solution I had accepted back when I asked this was, at best, a workaround. It turns out that a simple typo in the code was to blame, and fixing that typo solves the problem. It's a nice feeling to answer your own question, even if it takes five years to do it.

Python showPath Script

Dec 27, 2020

I occasionally have a need to either view the PATH environment variable from the command line, or search the PATH for something. I wrote a small Python script to make this easy to do, adapting it from an old Perl script I wrote years ago. The script, in its entirety, is shown below (you can also download it here; just save it as a Python script). Note that this script currently has a Windows focus, but could easily be adjusted to work in Linux too.

When used by itself, the script will simply pretty-print all of the paths currently in your environment's PATH. You can pass a --sort option to sort the output, or you can supply a needle to search for. Hopefully someone else will find this as useful as I do.

#!/usr/bin/python
import argparse
import os


parser = argparse.ArgumentParser()
parser.add_argument('searchterm', default='', nargs='?')
parser.add_argument('--sort', action='store_true', default=False,
                    help='Print PATH in a sorted form')

args = parser.parse_args()

path = os.getenv('PATH').split(';')

if args.sort:
    path = sorted(path)

if args.searchterm:
    needle = args.searchterm.lower()
    print(f"\nSearching PATH for {needle}")

    matches = []
    for p in path:
        if needle in p.lower():
            matches.append(p)

    if matches:
        print(f"Found {len(matches)} result{'' if len(matches) == 1 else 's'}")
        for m in matches:
            print(m)
    else:
        print(f"Unable to find {needle} in PATH")
else:
    print("\nShowing PATH:\n")
    for p in path:
        print(p)

Django 3.1 now uses the Python pathlib module for its internal paths. This change caught me off guard when I started developing with it, as I was used to the old os.path way of doing things. Here's a look at the old way and its newer counterpart:

# Old way
import os
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# New way
from pathlib import Path
BASE_DIR = Path(__file__).resolve().parent.parent

Joining paths together uses a very different mechanism as well:

# Old way
STATIC_ROOT = os.path.join(BASE_DIR, 'static')

# New way
STATIC_ROOT = BASE_DIR / 'static'

# Alternate new way (which I prefer)
STATIC_ROOT = BASE_DIR.joinpath('static')

These new mechanisms feel so different since they treat paths as objects, not strings. This works well, however, since paths aren't really strings in the first place. Transitioning to this new way of thinking is taking me some time, since prior to this module, every path in Python was treated as a string. However, I can already see the utility of this module, especially when it comes to resolving relative paths.

The dependency resolver in Python's pip command was recently updated in version 20.3. This fundamental change has a number of improvements, but I discovered today a serious drawback of this new machinery. Using the previous resolver, pip allowed you do the following to discover what versions of a package were available:

pip install markdown==

This command provided output like the following:

ERROR: Could not find a version that satisfies the requirement markdown==
(from versions: 1.7, 2.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0, 2.1.1, 2.2.0, 2.2.1,
2.3, 2.3.1, 2.4, 2.4.1, 2.5, 2.5.1, 2.5.2, 2.6, 2.6.1, 2.6.2, 2.6.3, 2.6.4,
2.6.5, 2.6.6, 2.6.7, 2.6.8, 2.6.9, 2.6.10, 2.6.11, 3.0, 3.0.1, 3.1, 3.1.1,
3.2, 3.2.1, 3.2.2, 3.3, 3.3.1, 3.3.2, 3.3.3)
ERROR: No matching distribution found for markdown==

This trick was often useful to discover what new versions (if any) of required packages are available. Sadly, the new machinery no longer produces output like the above. Instead, all you get is this entirely unhelpful message:

ERROR: Could not find a version that satisfies the requirement markdown==
ERROR: No matching distribution found for markdown==

An open bug in the pip project is tracking this issue, but most of the developer responses so far have been of the "we don't have the funding to fix this" variety. There are a number of recommended solutions in the ticket, none of which seem as simple as the previous trick. Hopefully this is something that can be prioritized and fixed soon.

Note: I now have an improved version of this script.

I use Python virtual environments a bunch at work, and this morning I finally put together a small helper script, saved as a Gist at GitHub, that makes enabling and disabling virtual environments a lot easier. I'm not sure why I didn't do this a lot earlier. Simply type work to enable the virtual environment, and work off to disable it. This script should be in your PATH, if it's not already obvious.

Here's the script itself:

@echo off

if exist "%cd%\venv" (
    if "%1" == "off" (
        echo Deactivating virtual environment
        call "%cd%\venv\Scripts\deactivate.bat"
        echo.
    ) else (
        echo Activating virtual environment
        call "%cd%\venv\Scripts\activate.bat"
    )
) else (
    echo No venv folder found in %cd%.
)

A Subtle Python Bug

Feb 23, 2018

I recently had a very subtle bug with an OrderedDict in my Python code at work. I constructed the contents of this object from a SQL query that was output in a specific order (can you spot the bug?):

qs = models.MyModel.objects.all().order_by("-order")
data = OrderedDict({x.id: x.name for x in qs})

My expectation was output like the following, which I was seeing on my development system (Python 3.6):

OrderedDict([(4, 'Four'), (3, 'Three'), (2, 'Two'), (1, 'One')])

However, on my official sandbox test system (which we use for internal testing, running Python 3.5), I was seeing output like this:

OrderedDict([(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four')])

There are actually two issues in play here, and it took me a while to figure out what was going on.

  1. First, I'm constructing the OrderedDict element incorrectly. I'm using a dictionary comprehension as the initialization data for the object's constructor. Dictionaries are (until recently) not guaranteed to preserve insertion order when iterated over. This is where my order was being screwed up.
  2. Second, the above behavior for dictionary order preservation is an implementation detail that changed in Python 3.6. As of 3.6 (in the CPython implementation), dictionaries now preserve the insertion order when iterated over. My development system, running on 3.6, was therefore outputting things as I expected them. The sandbox system, still running 3.5, did not. What an annoyance!

I've learned two valuable lessons here: (a) make sure you're running on the same levels of code in various places, and (b) don't initialize an OrderedDict with a dictionary comprehension.