<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: repl</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/repl.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2009-10-26T09:34:34+00:00</updated><author><name>Simon Willison</name></author><entry><title>Toiling in the data-mines: what data exploration feels like</title><link href="https://simonwillison.net/2009/Oct/26/toiling/#atom-tag" rel="alternate"/><published>2009-10-26T09:34:34+00:00</published><updated>2009-10-26T09:34:34+00:00</updated><id>https://simonwillison.net/2009/Oct/26/toiling/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://berglondon.com/blog/2009/10/23/toiling-in-the-data-mines-what-data-exploration-feels-like/"&gt;Toiling in the data-mines: what data exploration feels like&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Useful advice from Tom Armitage on the exploratory development approach required when starting to build a project against a large, complex dataset. Tips include making sure you have a REPL to hand and using tools like gRaphael to generate graphs against pretty much everything, since until you’ve seen their shape you won’t know if they are interesting or not.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/berg"&gt;berg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/exploratoryprogramming"&gt;exploratoryprogramming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphael"&gt;graphael&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphing"&gt;graphing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/programming"&gt;programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/repl"&gt;repl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tom-armitage"&gt;tom-armitage&lt;/a&gt;&lt;/p&gt;



</summary><category term="berg"/><category term="data"/><category term="exploratoryprogramming"/><category term="graphael"/><category term="graphing"/><category term="programming"/><category term="repl"/><category term="tom-armitage"/></entry><entry><title>Interactive Python</title><link href="https://simonwillison.net/2003/Sep/15/interactivePython/#atom-tag" rel="alternate"/><published>2003-09-15T21:20:50+00:00</published><updated>2003-09-15T21:20:50+00:00</updated><id>https://simonwillison.net/2003/Sep/15/interactivePython/#atom-tag</id><summary type="html">
    &lt;p&gt;I adore the Python interactive interpreter. I use it for development (it's amazing how many bugs you can skip by testing your code line by line in the interactive environment), I use it for calculations, but recently I've also found myself using it just as a general tool for answering questions.&lt;/p&gt;

&lt;p&gt;Here's a classic example. &lt;a href="http://funkbunny.com/datatype/archives/000076.html" title="The RIAA Are Dicks. We Apologize."&gt;This blog entry&lt;/a&gt; describes a campaign to reimburse the 12-year-old girl recently &lt;a href="http://news.com.com/2100-1027_3-5073717.html" title="RIAA settles with 12-year-old girl"&gt;fined $2000&lt;/a&gt; by the &lt;acronym title="Recording Industry Association of America"&gt;RIAA&lt;/acronym&gt; for file sharing. The full amount has been raised, and a list of donors is available along with how much each donated. Being the inquisitive type, I wanted to know how much money was raised in total. First, I copied and pasted the list into a Python string in IDLE:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; s = """$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
...
$10 - Will Morton"""
&amp;gt;&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All of the monetary values consist of 2 digits, so next I compiled and tested a regular expression to search for them:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; import re
&amp;gt;&amp;gt;&amp;gt; num = re.compile(r'\d\d')
&amp;gt;&amp;gt;&amp;gt; num.findall(s)
['20', '20', '20', '20', '20', '20', '20', ...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now I can run the &lt;code class="python"&gt;sum()&lt;/code&gt; function to add them all up:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; sum(num.findall(s))

Traceback (most recent call last):
  File "&amp;lt;pyshell#4&amp;gt;", line 1, in -toplevel-
    sum(num.findall(s))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Oops! &lt;code class="python"&gt;sum&lt;/code&gt; adds up numbers, but the list is full of strings. We can use &lt;code class="python"&gt;map&lt;/code&gt; to apply the &lt;code class="python"&gt;int&lt;/code&gt; function to every item in the list first:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; sum(map(int, num.findall(s)))
2005
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And there's the answer. I think this quite neatly demonstrates the power and flexibility of the interactive prompt - for one thing, it shows that errors really don't matter, as you can simply try again the next time round. It also shows that most of the time you don't even need to assign additional variables - Python is fast enough that you can just build up more and more complicated expressions. When you're just trying to find a one-off answer to a problem, code readability doesn't really come into the equation.&lt;/p&gt;
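
&lt;p&gt;If some of the amounts had not been two digits long, a slightly stricter pattern would have been needed. What follows is only a sketch, written for Python 3 and using just the four rows visible in the excerpt above; the names &lt;code class="python"&gt;donations&lt;/code&gt;, &lt;code class="python"&gt;amount&lt;/code&gt; and &lt;code class="python"&gt;total&lt;/code&gt; are illustrative and not part of the original session:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import re

# Sample rows copied from the excerpt above; the elided middle of the list is left out.
donations = """$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
$10 - Will Morton"""

# Anchor on the dollar sign at the start of each line so that digits
# elsewhere on a line are ignored and amounts of any length are captured.
amount = re.compile(r'^\$(\d+)', re.MULTILINE)

# findall returns the captured group as strings; convert each to int before summing.
total = sum(int(value) for value in amount.findall(donations))
print(total)  # 70 for the four sample rows above
&lt;/code&gt;&lt;/pre&gt;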

&lt;p&gt;A more interesting problem that came up today was working out the percentage of Netscape 4 visits to the &lt;a href="http://www.python.org/"&gt;Python.org&lt;/a&gt; site in the last month, as part of a mailing list discussion on whether or not the site should embrace a pure &lt;acronym title="Cascading Style Sheets"&gt;CSS&lt;/acronym&gt; layout. The raw data is a &lt;a href="http://www.python.org/wwwstats/agent_200308.html" title="Usage Statistics for www.python.org: August 2003 - User Agent"&gt;huge, ugly file&lt;/a&gt; listing 12,000-odd user agent strings along with the number of hits from each. My first step was to copy out the data part of the file and save it as a text file. I also compiled a new regular expression to find all lines that &lt;em&gt;start&lt;/em&gt; with a number, which could then be used to ensure the data loaded was in the right format.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; num = re.compile(r'^(\d+)')
&amp;gt;&amp;gt;&amp;gt; lines = open('python-browser-stats.txt').readlines() 
&amp;gt;&amp;gt;&amp;gt; lines = [line for line in lines if num.match(line)] 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finding the lines that contained a user agent string for Netscape 4 took a bit of effort, mainly because of the utterly insane way user agent strings have evolved over the years. I eventually settled on the rule that anything with Mozilla/4.x in it &lt;em&gt;without&lt;/em&gt; the word 'compatible' was probably a Netscape 4 variant. I excluded anything with 'Gecko' in it as well, but with hindsight this was unnecessary as Gecko browsers all start with Mozilla/5.x.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; netscape = [line for line in lines if
    'Mozilla/4' in line and
    'compatible' not in line and
    'Gecko' not in line]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Are you getting the impression that I love list comprehensions yet?&lt;/p&gt;

&lt;p&gt;When working in the interactive prompt it's a good idea to periodically check that the data you are dealing with looks how you expect it to look. I've stripped down the explanation of what I did quite a bit - in fact there was a lot more checking of variables and lists to make sure nothing had gone wrong. At this point, here's what an item in my netscape array looked like:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; netscape[0]
'3536       0.05%  Mozilla/4.01 [en](Win95;I)\n'
&lt;/code&gt;&lt;/pre&gt;
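
&lt;p&gt;One quick way to run that kind of spot check is to pull a row apart into its columns. This is a rough sketch, written for Python 3, which assumes every data row follows the same hit count, percentage, user agent layout as the sample; the names &lt;code class="python"&gt;row&lt;/code&gt;, &lt;code class="python"&gt;hits&lt;/code&gt;, &lt;code class="python"&gt;share&lt;/code&gt; and &lt;code class="python"&gt;agent&lt;/code&gt; are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;# The sample row shown above, copied verbatim.
row = '3536       0.05%  Mozilla/4.01 [en](Win95;I)\n'

# Drop the trailing newline, then split on runs of whitespace at most twice,
# so the user agent column keeps its internal spaces.
hits, share, agent = row.strip().split(None, 2)

print(int(hits), share, agent)
&lt;/code&gt;&lt;/pre&gt;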

&lt;p&gt;OK, I now had two arrays, one featuring all of the lines in the input set and another featuring just those lines that referred to a Netscape 4 browser. The final trick is to add up the total numbers for each of those sets. Remember, the total is the sum of all of the numbers at the start of each line. First, I built up new arrays of just those numbers (as integers) using the regular expression defined previously:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; nscounts = [int(num.match(line).groups()[0]) for line in netscape] 
&amp;gt;&amp;gt;&amp;gt; allcounts = [int(num.match(line).groups()[0]) for line in lines]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We now have two arrays of numbers. The total for each array can be found with the sum function, but we want the overall percentage of Netscape 4 user agents:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; print float(sum(nscounts)) / sum(allcounts) * 100
1.17457446601
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The float call is in there because Python discards the remainder in straight integer division; casting one of the arguments to a float makes it use floating point division instead. As you can see, only approximately 1.17% of visits to Python.org in August were made using Netscape 4&lt;sup&gt;*&lt;/sup&gt;. The case for &lt;acronym title="Cascading Style Sheets"&gt;CSS&lt;/acronym&gt; seems assured.&lt;/p&gt;
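
&lt;p&gt;For comparison, here is how the same calculation might look in Python 3, where the division operator always performs true division and &lt;code class="python"&gt;print&lt;/code&gt; is a function. Treat it as a sketch only: the first row is the one from the session above, the other two rows are made-up stand-ins, and the helper &lt;code class="python"&gt;total_hits&lt;/code&gt; is a hypothetical name rather than anything from the original session:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import re

# Only the first row comes from the session above; the other two are made-up stand-ins.
lines = [
    '3536       0.05%  Mozilla/4.01 [en](Win95;I)\n',
    '6464       0.09%  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\n',
    'Total hits listed above\n',
]

num = re.compile(r'^(\d+)')

def total_hits(rows):
    """Add up the leading hit count of every row that starts with a number."""
    matches = (num.match(row) for row in rows)
    return sum(int(m.group(1)) for m in matches if m)

# Same rule as before: Mozilla/4 without the word 'compatible'.
netscape = [row for row in lines
            if 'Mozilla/4' in row and 'compatible' not in row]

# In Python 3 the / operator always performs true division, so no float() call is needed.
percentage = total_hits(netscape) / total_hits(lines) * 100
print(round(percentage, 2))
&lt;/code&gt;&lt;/pre&gt;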

&lt;p&gt;This has turned into a longer entry than I had intended, but I hope it demonstrates the power and versatility of Python's interactive mode.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;sup&gt;*&lt;/sup&gt; Please note that this figure is not entirely accurate, as it may also include web spiders that pretend to be Netscape 4, Opera users and a few other false positives as well. As an estimate though it's probably pretty good.&lt;/em&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/netscape"&gt;netscape&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/repl"&gt;repl&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="analytics"/><category term="netscape"/><category term="python"/><category term="repl"/></entry></feed>