A few years ago I worked for a small startup with a few developers who knew regular expressions very well, but didn't know Python's way of doing things. We had no API, and our way of exchanging data was parsing files uploaded to our FTP server by our partners (sigh ... I know it's ugly, but hey, I didn't design it). Some of the files were CSV or XML files with dates as strings. The quality of the files was not really good, which means the dates were not consistent. Dates could be written as:
16-14-03
2016.14.03
We had no way of telling, and there was only one parser for all files. The code of the parser contained a function called parseDateString which was responsible for converting a string to a Python datetime object. The function was pretty long, and had to be extended every time a new format was encountered, because an unknown format broke the code, which caused the ETL to stop. So how did this function look, and what does it have to do with regex? Here is a small demo extract, not the real code, but close enough to demonstrate the case:
import re
from datetime import datetime

def parseDateString(date_string):
    d = None
    if re.match(r'\d{4}-\d{2}-\d{2}', date_string):
        d = datetime.strptime(date_string, "%Y-%m-%d")
    elif re.match(r'\d{2}-\d{2}-\d{2}', date_string):
        d = datetime.strptime(date_string, "%y-%m-%d")
    elif re.match(r'\d{4}\.\d{2}\.\d{2}', date_string):
        d = datetime.strptime(date_string, "%Y.%m.%d")
    elif re.match(....):
        ...
    return d
So now, if we got a data file with a new date format, for example 2015/01/31, the above code would have returned None, somewhere along the way an exception would be triggered, the ETL process would stop, and somebody would have to fix the code by adding another regex match.
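For instance, in a hypothetical session, an unseen format simply falls through all the branches:

>>> parseDateString('2015/01/31') is None
True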
While this worked, this method had grown to something like 200 lines of code over time, and it needed to be modified every time it broke. This was expensive: it ran slowly, and even though we used re.compile, it was still expensive in terms of developer time spent on the task. As I was the junior in the company, I had to deal with this task very often, because it was a repetitive task that no one liked doing (I suggested refactoring this part, but since I was junior and we had no tests, I could not prove my case, but that is another story).
So how can this be written without regex, in a Pythonic way, without making it a programmer's task? Here is just one example:
from datetime import datetime

def parseDateString(date_string):
    formats = ['%Y-%m-%d', '%Y.%m.%d', '%Y/%m/%d',]  # add more as you like
    for format_ in formats:
        try:
            return datetime.strptime(date_string, format_)
        except ValueError:
            continue
    return None  # no format matched
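Now the format that previously stopped the ETL parses without touching any branching logic, and unknown strings simply fall through to None:

>>> parseDateString('2015/01/31')
datetime.datetime(2015, 1, 31, 0, 0)
>>> parseDateString('not a date') is None
True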
Now the formats can easily be read from a configuration file, or a Django admin interface, which can be maintained by people other than developers (who hate repetitive tasks...). Also, there is no regex involved. The original function was written the way it was because the original developer was really good at regular expressions, and tended to use them everywhere, even when not needed.
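A minimal sketch of the configuration-file variant, assuming a hypothetical formats.txt with one strptime format per line:

from datetime import datetime

def load_formats(path='formats.txt'):
    # hypothetical config file: one strptime format per line,
    # editable by non-developers
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def parseDateString(date_string, formats=None):
    for format_ in formats or load_formats():
        try:
            return datetime.strptime(date_string, format_)
        except ValueError:
            continue
    return None

Adding support for a new date format is now a one-line change to a text file, not a code change.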
Here is a second example. This time, I am the one responsible for the overwhelmingly complex regex. Luckily for everyone involved, it never really got to production, but I kept the code as a future reference for bad code. The task at hand was to find the average response time of a web server serving API requests. The log file contained entries like the following:
INFO @ 24 Oct 2013 07:59:35,770 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='1ms'
INFO @ 24 Oct 2013 07:59:45,770 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='1ms'
INFO @ 24 Oct 2013 07:59:55,771 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'
INFO @ 24 Oct 2013 07:37:37,661 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/v2/foo/e0cf059a-5080-40a5-aaf1-67eb866aa48f/secretKey' was answered with statusCode='400' took t='211ms'
INFO @ 24 Oct 2013 07:59:55,771 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'
INFO @ 24 Oct 2013 07:59:55,771 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'
INFO @ 24 Oct 2013 07:36:15,743 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/transactions/async/cc7089aa5a9a54a116fd9593/0e76e80992b416f520d0b' was answered with statusCode='200' took t='294ms'
INFO @ 24 Oct 2013 07:36:15,786 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='4ms'
INFO @ 24 Oct 2013 07:36:25,785 @ io.company.api.web.interceptor.ExecuteTimeInterceptor - Request to url='/status' was answered with statusCode='200' took t='2ms'
More precisely, the task was to find the average response time for each of the URLs served. The way I started solving this problem was too sophisticated. I tried hammering it out late at night with regex. I thought it would be useful to match the URL with a regex group. The difficulty lay in the fact that some of the URLs contained trailing UUIDs and different secretKey strings, which had to be grouped together. Here is how I started; I wasted an expensive half an hour building this regex:
regex = (".+ - Request to url='/(?P<URL>.*/)(?P<UUID1>([a-f0-9]{0,32})|"
"([0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}))/(?P<UUID2>[0-9a-f]"
"{32}|([0-9a-f]{8}-[0-9a-f]{4}-{3}[0-9a-f]{12})|.*).*t="
"'(?P<time>\d+)ms'$")
lines = open('requests.txt').readlines()
for line in lines:
print re.match(rgx, t).groupdict()
Iterations of this code were spitting out dictionaries like these, which I thought I could group:
{'H2': '', 'H': None, 'U2': None, 'S': None, 'U': 'e0cf059a-5080-40a5-aaf1-67eb866aa48f', 'S2': 'secretKey'}
{'H2': '1a6173566e18097992b416f520dc00b0', 'H': '0e617c6e180a497992b416f520dc00b0', 'U2': None, 'S': None, 'U': None, 'S2': None}
{'H2': '', 'H': '0e617c6e180a497992b416f520dc00b0', 'U2': None, 'S': None, 'U': None, 'S2': 'secretKey'}
{'H2': None, 'H': None, 'U2': '1bcf059a-5080-40a5-aaf1-67eb866aa48f', 'S': None, 'U': 'e0cf059a-5080-40a5-aaf1-67eb866aa48f', 'S2': None}
However, I had created a monster, which did not always work, and took much time to care for. Honestly speaking, reading that regex six months later, I don't even understand it quickly. I think other developers who read this regex will not cheer if they have to extend it. After a few tries at fixing that regex, I wrote the following comment in my code:
"""stop using REGEX you fuck"""
I deleted the regex and made a git commit, so I would have a future reference.
Then I started again with a clean slate, intending to use regex as little as possible, and if possible to use the built-in methods for strings. Here is what I came up with:
lines = open('requests.txt').readlines()
# each line has three '='-separated values we care about:
# url, status code and response time
lines = [line.strip().split('=')[1:] for line in lines]

# dummy dict with url, number of calls and total time
url_table = {'url': {'calls': 0, 'total_time': 0}}

for (url, _, time) in lines:
    url_path = url.replace("'", "").split()[0]
    url_path = url_path.lstrip('/')
    # group trailing UUIDs, hashes and secret keys under their base URL
    if url_path.count('/') > 2:
        url_path = '/'.join(url_path.split('/')[:-2])
    if url_path not in url_table:
        url_table[url_path] = {'calls': 1, 'total_time': int(time[1:-3])}
    else:
        calls = url_table[url_path]['calls']
        total_time = url_table[url_path]['total_time']
        url_table[url_path] = {'calls': calls + 1,
                               'total_time': total_time + int(time[1:-3])}

url_table.pop('url')

# print header
print "URL\t |# of calls \t| avg. response time (ms)"
for url, values in url_table.items():
    print "{}\t|{}\t|{}".format(url, values['calls'],
                                values['total_time'] / values['calls'])
# output:
# URL |# of calls | avg. response time (ms)
# status |210 |163
# transactions |3 |395
# api-docs/v2/merchants |2 |416
# status/details |28 |852
# api-docs |2 |173
# v2/merchants |3 |74
# api-docs/v2/transactions |2 |331
# transactions/async |5 |2567
# api-docs/v2/events |2 |390
It is not the most elegant code, but it's way more readable compared to the monstrous regex I started with.
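If I were to polish it further, collections.defaultdict would remove the dummy entry and the if/else bookkeeping. A quick sketch with the same grouping logic as above:

from collections import defaultdict

url_table = defaultdict(lambda: {'calls': 0, 'total_time': 0})
for (url, _, time) in lines:
    url_path = url.replace("'", "").split()[0].lstrip('/')
    if url_path.count('/') > 2:
        url_path = '/'.join(url_path.split('/')[:-2])
    # missing keys are created on first access, so no membership check needed
    url_table[url_path]['calls'] += 1
    url_table[url_path]['total_time'] += int(time[1:-3])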
My conclusion is that sometimes Python's string methods are enough, and they are more readable. You have split, count, strip, lstrip, endswith, startswith and more. The last two methods accept either a string or a tuple of strings to check:
>>> str = "elepahant"
>>> str.startswith(('elep', 'p'))
True
>>> str.endswith(('elep', 'nt'))
True
So before you go hacking yourself a nice regex, check the standard library; maybe the code you need is already implemented with a nice API. Don't write your own parseDateString as shown above. Python is there to help you work with text in an elegant way, without much need for regex.