An occasional annoyance of my job is having to deal with poorly constructed data. One recent instance of this came through a collection of CSV files. In these files, certain free-form text fields sometimes included either non-escaped double quotes or an embedded newline where there shouldn't be one. Shortened examples of each are shown below:
"Samsung","ABC-12345","2.5 TB SAS 2.5" hard drive","Released","2018-06-01"
"Lenovo","DEF-88776
PQR-66554","Mechanical chassis","Released","2020-02-22"
The first line above has an embedded double quote character which has not been escaped. The second line showcases a rogue newline character.
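For a quick look at how the standard library reacts to the first example, here is a minimal sketch. It feeds the Samsung line to csv.reader twice: once with the default lenient settings and once with strict=True. The variable names are my own, and the only claims I'm making are that the lenient parse does not hand the field back intact and that the strict parse raises csv.Error.

```python
import csv

bad_line = '"Samsung","ABC-12345","2.5 TB SAS 2.5" hard drive","Released","2018-06-01"'

# Default (lenient) settings: no exception is raised, but the third field
# does not come back with the stray quote where it originally sat.
lenient_row = next(csv.reader([bad_line]))
print(lenient_row)

# strict=True makes the reader reject the same input with csv.Error.
try:
    next(csv.reader([bad_line], strict=True))
    strict_error = None
except csv.Error as exc:
    strict_error = exc
print(f"strict mode raised: {strict_error!r}")
```

Neither outcome is what I want: one silently alters the data, the other refuses the line outright.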
Parsing these problematic cases in Python gets tricky, and the native csv module doesn't have great support for handling malformed data. While thinking about how to handle these situations, it occurred to me that I could use the way the file was constructed to my advantage. These files are output by what is, to me, a black box. Under the hood it's undoubtedly a database query, the results of which are then written out as CSV. As a byproduct, each file has a consistent format where every field is quoted and fields are separated by a comma. That means I can use the "," string (double quote, comma, double quote) as my separator, looking for the fields I expect:
import csv
import re

# infile (a pathlib.Path) and expected_columns are assumed to be set earlier
previous_chunk = []
with open(infile, 'r', encoding='utf8') as csvfile:
    with open(f"{infile.stem}-clean.csv", 'w', encoding='utf8', newline='') as outfile:
        writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
        for line in csvfile:
            line = line.rstrip()  # Trim the trailing newline (and other trailing whitespace)
            pieces = line.split('","')  # Split on our separator
            pieces[0] = re.sub(r'^"', '', pieces[0])  # Remove the first double quote
            pieces[-1] = re.sub(r'"$', '', pieces[-1])  # Remove the last double quote

            # If we don't have the number of columns we expect, merge
            if len(pieces) != expected_columns:
                previous_chunk = merge_chunks(previous_chunk, pieces)
                if len(previous_chunk) == expected_columns:
                    writer.writerow(previous_chunk)
                    previous_chunk = []
                elif len(previous_chunk) > expected_columns:
                    print(f"ERROR: Overran column count! Expected {expected_columns}, Found "
                          f"{len(previous_chunk)}")
                    # Start fresh so the bad record doesn't pollute the following ones
                    previous_chunk = []
            else:
                writer.writerow(pieces)
The merge_chunks function is very simple:
def merge_chunks(a, b):
    """
    Merges lists a and b. The content of the first element of list b will be appended
    to the content of the last element of list a. The result will be returned.
    """
    temp = []
    temp.extend(a)
    if a:
        temp[-1] = f"{a[-1]} {b[0]}"
        temp.extend(b[1:])
    else:
        temp.extend(b)
    return temp
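To see merge_chunks at work on the rogue-newline example from the top of the post, here is a small, self-contained sketch. The split_line helper is a hypothetical name of my own that simply repeats the splitting and quote-trimming steps from the main loop; merge_chunks is copied unchanged.

```python
def merge_chunks(a, b):
    """Append b to a, gluing b's first element onto a's last element."""
    temp = []
    temp.extend(a)
    if a:
        temp[-1] = f"{a[-1]} {b[0]}"
        temp.extend(b[1:])
    else:
        temp.extend(b)
    return temp


def split_line(line):
    """Hypothetical helper: split on '","' and trim the outer quotes."""
    pieces = line.rstrip().split('","')
    if pieces[0].startswith('"'):
        pieces[0] = pieces[0][1:]
    if pieces[-1].endswith('"'):
        pieces[-1] = pieces[-1][:-1]
    return pieces


# The Lenovo record arrives as two physical lines because of the rogue newline
first_half = split_line('"Lenovo","DEF-88776\n')
second_half = split_line('PQR-66554","Mechanical chassis","Released","2020-02-22"\n')

merged = merge_chunks(merge_chunks([], first_half), second_half)
print(merged)
```

The two halves contribute two and four pieces respectively, and gluing the boundary pieces together yields the expected five columns. Note that the original newline is replaced by a single space in the merged field.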
I believe the only way this could potentially break is if the content, for some reason, contained the "," separator somewhere in a data field. Given the types of data fields I'm working with, this is highly unlikely. Even if it does occur, I can use the format of some of the fields to make best guesses as to where the actual dividers are (e.g. the trailing elements on each line are almost always date stamps).
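As a rough illustration of that best-guess idea, here is a sketch of how an overrun might be repaired. Everything in it is an assumption for the sake of the example: the repair_overrun name, the YYYY-MM-DD date pattern, and especially the premise that I know which single column holds the free-form text. It is not something my actual script does.

```python
import re

# Assumed date format, based on the sample rows (e.g. 2018-06-01)
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def repair_overrun(pieces, expected_columns, free_text_index):
    """
    Hypothetical repair: if a row has too many pieces and still ends in a
    date stamp, assume the extras came from a '","' inside the free-text
    column and stitch them back together. Returns None when no guess is safe.
    """
    surplus = len(pieces) - expected_columns
    if surplus <= 0 or not DATE_RE.fullmatch(pieces[-1]):
        return None
    end = free_text_index + surplus + 1
    # Restore the separator text that the split consumed
    merged = '","'.join(pieces[free_text_index:end])
    return pieces[:free_text_index] + [merged] + pieces[end:]


# A made-up row whose description field contained the separator
overrun = ['Samsung', 'ABC-12345', 'Cable', '3 m', 'Released', '2018-06-01']
repaired = repair_overrun(overrun, expected_columns=5, free_text_index=2)
print(repaired)
```

If the trailing piece isn't a date, the function gives up and returns None rather than guessing, which is the point where I'd want to inspect the line by hand.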
This is obviously not a general solution, but it sometimes pays to step away from the built-in parsing capability in a language and roll your own scheme.