Toiling in the data-mines: what data exploration feels like

2009-10-26T09:34:34+00:00

Useful advice from Tom Armitage on the exploratory development approach required when starting to build a project against a large, complex dataset. Tips include making sure you have a REPL to hand and using tools like gRaphael to generate graphs against pretty much everything, since until you’ve seen their shape you won’t know if they are interesting or not.

Tags: berg, data, exploratoryprogramming, graphael, graphing, programming, repl, tom-armitage

Interactive Python

2003-09-15T21:20:50+00:00

I adore the Python interactive interpreter. I use it for development (it's amazing how many bugs you can skip by testing your code line by line in the interactive environment), I use it for calculations, but recently I've also found myself using it just as a general tool for answering questions.

Here's a classic example. This blog entry describes a campaign to reimburse the 12-year-old girl recently fined $2000 by the RIAA for file sharing. The full amount has been raised, and a list of donors is available along with how much each donated. Being the inquisitive type I am, I wanted to know how much money was raised in total. First, I copied and pasted the list into a Python string in IDLE:

>>> s = """$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
...
$10 - Will Morton"""
>>>

All of the monetary values consist of 2 digits, so next I compiled and tested a regular expression to search for them:

>>> import re
>>> num = re.compile(r'\d\d')
>>> num.findall(s)
['20', '20', '20', '20', '20', '20', '20', ...

Now I can run the sum() function to add them all up:

>>> sum(num.findall(s))

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in -toplevel-
    sum(num.findall(s))
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Oops! sum starts from 0 and adds each item to it, but the list is full of strings, and an integer can't be added to a string. We can use map to apply the int function to every item in the list first:

>>> sum(map(int, num.findall(s)))
2005

And there's the answer. I think this quite neatly demonstrates the power and flexibility of the interactive prompt - for one thing, it shows that errors really don't matter as you can simply try again the next time round. It also shows that most of the time you don't even need to assign additional variables - Python is fast enough that you can just build up more and more complicated expressions. When you're just trying to find a one-off answer to a problem, code readability doesn't really come into the equation.
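
For anyone following along in Python 3, the same steps can be sketched as a short script. This is only an illustration under a couple of assumptions: the sample string holds just the four lines visible in the excerpt above (the rest of the donor list is elided), and the anchored pattern and the total_donations helper are hypothetical names and choices, not part of the original session. Anchoring on the leading "$" means amounts with more or fewer than two digits would also be picked up.

```python
import re

# Only the four lines visible in the excerpt; the full donor list is elided.
sample = '''$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
$10 - Will Morton'''

# Assumption: every line starts with "$" followed by the amount.
# re.MULTILINE makes ^ match at the start of each line, not just the string.
amount = re.compile(r'^\$(\d+)', re.MULTILINE)

def total_donations(text):
    """Sum the dollar amounts found at the start of each line."""
    return sum(int(value) for value in amount.findall(text))

print(total_donations(sample))  # 70 for these four lines
```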

A more interesting problem that came up today was working out the percentage of Netscape 4 visits to the Python.org site in the last month, as part of a mailing list discussion on whether or not the site should embrace a pure CSS layout. The raw data is a huge, ugly file listing 12,000-odd user agent strings along with the number of hits from each. My first step was to copy out the data part of the file and save it as a text file. I also compiled a new regular expression to find all lines that start with a number, which could then be used to ensure the data loaded was in the right format.

>>> num = re.compile(r'^(\d+)')
>>> lines = open('python-browser-stats.txt').readlines() 
>>> lines = [line for line in lines if num.match(line)]
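
The filtering step can also be wrapped in a small function, which makes it easy to try out on a handful of lines before pointing it at the real file. A minimal sketch, assuming Python 3; the data_lines name is hypothetical, the first sample line is the one quoted later in this entry, and the other two are made up to stand in for a header and a blank line.

```python
import re

# Same pattern as in the session: one or more digits at the start of the line.
num = re.compile(r'^(\d+)')

def data_lines(lines):
    """Keep only the lines that begin with a hit count."""
    return [line for line in lines if num.match(line)]

# First line is quoted from this entry; the header and blank line are hypothetical.
raw = [
    '3536       0.05%  Mozilla/4.01 [en](Win95;I)\n',
    'Hits  Percent  User agent\n',
    '\n',
]

print(data_lines(raw))

# With the real file, a context manager closes it once the lines are read:
# with open('python-browser-stats.txt') as f:
#     lines = data_lines(f)
```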

Finding the lines that contained a user agent string for Netscape 4 took a bit of effort, mainly because of the utterly insane way user agent strings have evolved over the years. I eventually settled on the rule that anything with Mozilla/4.x in it without the word 'compatible' was probably a Netscape 4 variant. I excluded anything with 'Gecko' in it as well, but with hindsight this was unnecessary as Gecko browsers all start with Mozilla/5.x.

>>> netscape = [line for line in lines if
    'Mozilla/4' in line and
    'compatible' not in line and
    'Gecko' not in line]

Are you getting the impression that I love list comprehensions yet?
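
The rule described above can be checked in isolation by turning it into a predicate. This is a sketch rather than the code from the session: looks_like_netscape4 is a hypothetical name, the hit counts in the second and third sample lines are invented, and only the first line comes from this entry.

```python
def looks_like_netscape4(line):
    """Heuristic from the text: Mozilla/4.x, not 'compatible', not Gecko."""
    return ('Mozilla/4' in line
            and 'compatible' not in line
            and 'Gecko' not in line)  # redundant, kept to mirror the session

# First line is from this entry; the counts on the other two are hypothetical.
samples = [
    '3536       0.05%  Mozilla/4.01 [en](Win95;I)\n',
    '100  0.00%  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\n',
    '50  0.00%  Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624\n',
]

for line in samples:
    print(looks_like_netscape4(line), line.strip())
```

Only the first sample should pass the test: the second is excluded by 'compatible', and the third never matches 'Mozilla/4' in the first place.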

When working in the interactive prompt it's a good idea to periodically check that the data you are dealing with looks how you expect it to look. I've stripped down the explanation of what I did quite a bit - in fact there was a lot more checking of variables and lists to make sure nothing had gone wrong. At this point, here's what an item in my netscape array looked like:

>>> netscape[0]
'3536       0.05%  Mozilla/4.01 [en](Win95;I)\n'

OK, I now had two arrays, one featuring all of the lines in the input set and another featuring just those lines that referred to a Netscape 4 browser. The final trick is to add up the total numbers for each of those sets. Remember, the total is the sum of all of the numbers at the start of each line. First, I built up new arrays of just those numbers (as integers) using the regular expression defined previously:

>>> nscounts = [int(num.match(line).groups()[0]) for line in netscape] 
>>> allcounts = [int(num.match(line).groups()[0]) for line in lines]

We now have two arrays of numbers. The total for each array can be found with the sum function, but we want the overall percentage of Netscape 4 user agents:

>>> print float(sum(nscounts)) / sum(allcounts) * 100
1.17457446601

The float call is in there because Python disregards the remainder in straight integer division; by casting one of the arguments to a float, floating point division is used instead. As you can see, only approximately 1.17% of visits to Python.org in August were made using Netscape 4^*. The case for CSS seems assured.
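
In Python 3 the / operator always performs floating point division, so the float call is no longer needed there. A minimal sketch of the final step under that assumption; the leading_count and percentage helpers are hypothetical names, and the two sample lines carry invented counts chosen only so the arithmetic is easy to check, not the real Python.org figures.

```python
import re

# Same pattern as in the session: digits at the start of the line.
num = re.compile(r'^(\d+)')

def leading_count(line):
    """Return the integer at the start of a line."""
    return int(num.match(line).group(1))

def percentage(subset, everything):
    """Hits in subset as a percentage of the hits in everything."""
    part = sum(leading_count(line) for line in subset)
    whole = sum(leading_count(line) for line in everything)
    return part / whole * 100  # / is true division in Python 3

# Hypothetical counts, not the real Python.org data.
all_lines = [
    '25  Mozilla/4.01 [en](Win95;I)\n',
    '75  Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624\n',
]
ns_lines = all_lines[:1]

print(percentage(ns_lines, all_lines))  # 25.0
```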

This has turned into a longer entry than I had intended, but I hope it demonstrates the power and versatility of Python's interactive mode.

^* Please note that this figure is not entirely accurate, as it may also include web spiders that pretend to be Netscape 4, Opera users and a few other false positives as well. As an estimate though it's probably pretty good.

Tags: analytics, netscape, python, repl

Simon Willison's Weblog: repl
