Python's regex is not always the best option

A few years ago I worked for a small startup with a few developers who knew regular expressions very well, but didn't know Python's way of doing things. We had no API, and our way of exchanging data was parsing files uploaded to our FTP server by our partners (sigh ... I know it's ugly, but hey, I didn't design it). Some of the files were CSV or XML files with dates as strings. The quality of the files was not really good, which means that the dates were not consistent. Dates could be written as:

16-14-03
2016.14.03

We had no way of telling, and there was only one parser for all files. The parser contained a function called parseDateString, which was responsible for converting a string to a Python datetime object. The function was pretty long, and had to be extended every time a new format was encountered, because an unknown format broke the code, which caused the ETL to stop. So how did this function look, and what does it have to do with regex? Here is a small demo extract, not the real code, but close enough to demonstrate the case:

import re
from datetime import datetime

def parseDateString(date_string):

    d = None

    if re.match(r'\d{4}-\d{2}-\d{2}', date_string):
        d = datetime.strptime(date_string, "%Y-%m-%d")
    elif re.match(r'\d{2}-\d{2}-\d{2}', date_string):
        d = datetime.strptime(date_string, "%y-%m-%d")
    elif re.match(r'\d{4}\.\d{2}\.\d{2}', date_string):
        d = datetime.strptime(date_string, "%Y.%m.%d")
    # ... one more elif for every new format ever encountered ...

    return d

So now, if we got a data file with a new date format, for example 2015/01/31, the above code would have returned None, somewhere along the way an exception would have been triggered, the ETL process would have stopped, and somebody would have had to fix the code by adding another regex match.

While this worked, the method had grown to something like 200 lines of code over time, and it needed to be modified every time it broke. This was expensive in two ways: it ran slowly, and even when we used re.compile it was still expensive in terms of developer time spent on the task. As the junior in the company, I had to deal with this task very often, because it was a repetitive task that no one liked doing (I suggested refactoring this part, but since I was a junior and we had no tests, I could not prove my case; but that is another story).
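To be fair, re.compile only helps with the speed part. Here is a minimal sketch (my reconstruction, not the real code) of what a precompiled version would have looked like; the pattern table is hypothetical, and notice that every new format still means a developer editing code:

import re
from datetime import datetime

# hypothetical table of precompiled patterns and their strptime formats
DATE_FORMATS = [
    (re.compile(r'\d{4}-\d{2}-\d{2}'), "%Y-%m-%d"),
    (re.compile(r'\d{2}-\d{2}-\d{2}'), "%y-%m-%d"),
    (re.compile(r'\d{4}\.\d{2}\.\d{2}'), "%Y.%m.%d"),
]

def parseDateString(date_string):
    # the matching is faster now, but the table still lives in code
    for pattern, format_ in DATE_FORMATS:
        if pattern.match(date_string):
            return datetime.strptime(date_string, format_)
    return None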

So how can this be written without regex, in a Pythonic way, without making every new format a programmer's task? Here is just one example:

from datetime import datetime

def parseDateString(date_string):

    formats = ['%Y-%m-%d', '%Y.%m.%d', '%Y/%m/%d']  # add more as you like

    for format_ in formats:
        try:
            return datetime.strptime(date_string, format_)
        except ValueError:
            continue

    return None

Now the formats can easily be read from a configuration file, or a Django admin interface, and maintained by people other than developers (who hate repetitive tasks...). Also, there is no regex involved. The original function was written the way it was because its author was really good at regular expressions, and tended to use them everywhere, even when not needed.
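To make the configuration idea concrete, here is a minimal sketch of loading the format list from a plain text file with one strptime format per line; the file name formats.txt is just an assumption for the example, and parseDateString stays exactly as above:

# hypothetical config file, e.g. containing:
#   %Y-%m-%d
#   %Y.%m.%d
#   %Y/%m/%d
with open('formats.txt') as f:
    formats = [line.strip() for line in f if line.strip()]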

Here is a second example. This time, I am the one responsible for the overwhelmingly complex regex. Luckily for everyone involved, it never really got to production, but I kept the code as a future reference for bad code. The task at hand was to find the average response time of a web server serving API requests. The log file contained entries like the following:

INFO  @ 24 Oct 2013 07:59:35,770 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='1ms'
INFO  @ 24 Oct 2013 07:59:45,770 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='1ms'
INFO  @ 24 Oct 2013 07:59:55,771 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'
INFO  @ 24 Oct 2013 07:37:37,661 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/v2/foo/e0cf059a-5080-40a5-aaf1-67eb866aa48f/secretKey' was answered with statusCode='400' took t='211ms'
INFO  @ 24 Oct 2013 07:59:55,771 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'
INFO  @ 24 Oct 2013 07:59:55,771 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'
INFO  @ 24 Oct 2013 07:36:15,743 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/transactions/async/cc7089aa5a9a54a116fd9593/0e76e80992b416f520d0b' was answered with statusCode='200' took t='294ms'
INFO  @ 24 Oct 2013 07:36:15,786 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='4ms'
INFO  @ 24 Oct 2013 07:36:25,785 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'

More precisely, the task was to find the average response time for each of the URLs served. The way I started solving this problem was too sophisticated: I tried hammering it out late at night with regex. I thought it would be useful to match the url with a regex group. The difficulty lay in the fact that some of the urls contained trailing UUIDs and different secretKey strings, which had to be grouped together.

Here is how I started, wasting an expensive half an hour building a regex:

regex = (".+ - Request to url='/(?P<URL>.*/)(?P<UUID1>([a-f0-9]{0,32})|"
         "([0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}))/(?P<UUID2>[0-9a-f]"
         "{32}|([0-9a-f]{8}-[0-9a-f]{4}-{3}[0-9a-f]{12})|.*).*t="
         "'(?P<time>\d+)ms'$")

lines = open('requests.txt').readlines()

for line in lines:
    print re.match(rgx, t).groupdict()

The above code was spitting out dictionaries which I thought I could group:

{'H2': '', 'H': None, 'U2': None, 'S': None, 'U': 'e0cf059a-5080-40a5-aaf1-67eb866aa48f', 'S2': 'secretKey'}
{'H2': '1a6173566e18097992b416f520dc00b0', 'H': '0e617c6e180a497992b416f520dc00b0', 'U2': None, 'S': None, 'U': None, 'S2': None}
{'H2': '', 'H': '0e617c6e180a497992b416f520dc00b0', 'U2': None, 'S': None, 'U': None, 'S2': 'secretKey'}
{'H2': None, 'H': None, 'U2': '1bcf059a-5080-40a5-aaf1-67eb866aa48f', 'S': None, 'U': 'e0cf059a-5080-40a5-aaf1-67eb866aa48f', 'S2': None}

However, I had created a monster, which was not always working, and took a lot of time to care for. Honestly speaking, reading that regex 6 months later, I don't even understand it quickly. I think other developers who read this regex will not cheer if they have to extend it. After a few tries at fixing that regex, I wrote the following comment in my code:

"""stop using REGEX you fuck"""

I deleted the regex and made a git commit, so I would have a future reference. Then I started again with a clean slate, intending to use regex as little as possible, and where possible use the built-in string methods. Here is what I came up with:

lines = open('requests.txt').readlines()

# keep only the parts after each '=': url, status code and time
lines = [line.strip().split('=')[1:] for line in lines]

# dummy dict with url, n calls and total time
url_table = {'url': {'calls': 0, 'total_time': 0}}

for (url, _, time) in lines:

    url_path = url.replace("'", "").split()[0]
    url_path = url_path.lstrip('/')
    # strip trailing UUIDs and secret keys, so calls group by endpoint
    if url_path.count('/') > 2:
        url_path = '/'.join(url_path.split('/')[:-2])
    if url_path not in url_table:
        url_table[url_path] = {'calls': 1, 'total_time': int(time[1:-3])}
    else:
        calls = url_table[url_path]['calls']
        total_time = url_table[url_path]['total_time']
        url_table[url_path] = {'calls': calls + 1,
                               'total_time': total_time + int(time[1:-3])}

url_table.pop('url')
# print header
print "URL\t |# of calls \t| avg. response time (ms)"
for url, values in url_table.items():
    print "{}\t|{}\t|{}".format(url, values['calls'],
                                values['total_time'] / values['calls'])

# output:

# URL    |# of calls    | avg. response time (ms)
# status    |210    |163
# transactions  |3  |395
# api-docs/v2/merchants |2  |416
# status/details    |28 |852
# api-docs  |2  |173
# v2/merchants  |3  |74
# api-docs/v2/transactions  |2  |331
# transactions/async    |5  |2567
# api-docs/v2/events    |2  |390

It is not the most elegant code, but it's way more readable compared to the monstrous regex I started with.
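If you want it slightly tidier, the bookkeeping collapses nicely with collections.Counter; this is just a sketch of the same logic, not what I actually ran:

from collections import Counter

lines = [line.strip().split('=')[1:] for line in open('requests.txt')]

calls = Counter()
total_time = Counter()

for (url, _, time) in lines:
    url_path = url.replace("'", "").split()[0].lstrip('/')
    if url_path.count('/') > 2:
        url_path = '/'.join(url_path.split('/')[:-2])
    calls[url_path] += 1
    total_time[url_path] += int(time[1:-3])

for url_path in calls:
    print "{}\t|{}\t|{}".format(url_path, calls[url_path],
                                total_time[url_path] / calls[url_path])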

My conclusion is that sometimes Python's string methods are enough, and they are more readable. You have split, count, strip, lstrip, endswith, startswith and more. The last two methods I mentioned accept a string or a tuple of strings to check:

>>> s = "elephant"
>>> s.startswith(('elep', 'p'))
True
>>> s.endswith(('elep', 'nt'))
True

So before you go hacking yourself a nice regex, check the standard library; maybe the code you need is already implemented with a nice API. Don't write your own parseDateString as shown above. Python is there to help you work with text in an elegant way, without much need for regex.
