Simon Willison's Weblog: guardian

The Guardian’s nifty old-article trick is a reminder of how news organizations can use metadata to limit misinformation

2019-12-23T09:36:42+00:00

The Guardian’s nifty old-article trick is a reminder of how news organizations can use metadata to limit misinformation

The Guardian displays prominent banners on news stories more than a year old, warning that the article is an older one, to help prevent the accidental or intentional spread of misinformation that uses their content as ammunition. Impressively, they also display the year prominently on the card images they serve as social media previews for older articles.

Via @chronotope

Tags: guardian, news, misinformation

Want to see what one digital future for newspapers looks like? Look at The Guardian, which isn't losing money anymore

2019-05-02T05:49:52+00:00

Want to see what one digital future for newspapers looks like? Look at The Guardian, which isn't losing money anymore

After losing money every single year since 1998, the Guardian just managed to turn a profit! Detailed analysis of how they did it by Joshua Benton.

Via MetaFilter

Tags: guardian, newspapers

Linked Data at the Guardian

2010-10-19T19:11:00+00:00

Linked Data at the Guardian

The Guardian’s Open Platform API can now be queried by MusicBrainz ID and ISBN, opening up some extremely useful new types of query.

Tags: guardian, openplatform, semanticweb, recovered

Today's Guardian, by Phil Gyford

2010-06-09T23:21:00+00:00

Today's Guardian, by Phil Gyford

An alternative interface for reading today’s Guardian, built using the new Open Platform Content API and with extensive design notes from creator Phil Gyford.

Tags: contentapi, design, guardian, newspapers, openplatform, phil-gyford, recovered

What's powering the Content API?

2010-05-24T14:08:00+00:00

What's powering the Content API?

The new Guardian Content API runs on Solr, scaled using EC2 and Solr replication and with a Scala web service layer sitting between Solr and the API’s end users.

Tags: apis, contentapi, ec2, guardian, openplatform, scala, scaling, solr, recovered

OpenPlatform Content API Explorer

2010-05-20T17:42:00+00:00

OpenPlatform Content API Explorer

The new API explorer for the Guardian’s Content API.

Tags: api, guardian, openplatform, recovered, api-explorer

The Guardian's Open Platform is open for business

2010-05-20T17:40:00+00:00

The Guardian's Open Platform is open for business

The Guardian’s Content API is now out of beta. Of particular interest: you can access basic article metadata (headline, URL and tags) without using an API key at all, and the API supports JSONP—just request format=json and include a callback=foo argument.
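To illustrate what a JSONP response looks like from the consumer's side, here is a minimal Python sketch that builds the query string described above and unwraps a callback-wrapped payload. Only format=json and callback=foo come from the text. The sample payload, its field names and the unwrap_jsonp helper are invented for illustration and are not taken from the real API.

```python
import json
import re
from urllib.parse import urlencode

# Query string from the text: format=json plus a callback name.
query = urlencode({"format": "json", "callback": "foo"})

def unwrap_jsonp(payload, callback):
    """Strip a callback(...) wrapper and parse the JSON inside it."""
    pattern = r"\s*%s\((.*)\)\s*;?\s*" % re.escape(callback)
    match = re.fullmatch(pattern, payload, re.DOTALL)
    if match is None:
        raise ValueError("payload is not wrapped in %s(...)" % callback)
    return json.loads(match.group(1))

# Invented sample payload; real responses will have a different shape.
sample = 'foo({"headline": "Example headline", "tags": ["example"]});'
article = unwrap_jsonp(sample, "foo")
```

In a browser the wrapper is never stripped by hand: the page defines a function called foo and loads the URL in a script tag, which is what makes cross-domain requests possible.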

Tags: api, guardian, json, jsonp, openplatform, recovered, content

Live blogging the general election

2010-05-10T16:27:00+00:00

Live blogging the general election

The Guardian’s ongoing live blogs covering the UK election have been the best way of following events that I’ve seen (yes, better than Twitter). Live-blog author Andrew Sparrow explains his approach.

Tags: blogging, guardian, journalism, recovered, election, andrew-sparrow

Breakfast Instapaper

2010-04-29T11:49:02+00:00

Breakfast Instapaper

Handy tool for selecting and bulk-submitting stories from today’s Guardian and NYTimes to your Instapaper account, by Daniel Vydra.

Tags: daniel-vydra, guardian, instapaper, new-york-times

Comprehensive notes from my three hour Redis tutorial

2010-04-25T22:36:16+00:00

Last week I presented two talks at the inaugural NoSQL Europe conference in London. The first was presented with Matthew Wall and covered the ways in which we have been exploring NoSQL at the Guardian. The second was a three hour workshop on Redis, my favourite piece of software to have the NoSQL label applied to it.

I've written about Redis here before, and it has since earned a place next to MySQL/PostgreSQL and memcached as part of my default web application stack. Redis makes write-heavy features such as real-time statistics feasible for small applications, while effortlessly scaling up to handle larger projects as well. If you haven't tried it out yet, you're sorely missing out.

For the workshop, I tried to give an overview of each individual Redis feature along with detailed examples of real-world problems that the feature can help solve. I spent the past day annotating each slide with detailed notes, and I think the result makes a pretty good stand-alone tutorial. Here's the end result:

Redis tutorial slides and notes

In unrelated news, Nat and I both completed the first ever Brighton Marathon last weekend, in my case taking 4 hours, 55 minutes and 17 seconds. Sincere thanks to everyone who came out to support us - until the race I had never appreciated how important the support of the spectators is to keep going to the end. We raised £757 for the Have a Heart children's charity. Thanks in particular to Clearleft who kindly offered to match every donation.

Tags: brightonmarathon, guardian, marathon, nosql, redis, running, my-talks, highlights, annotated-talks

Random Guardian

2010-03-26T16:47:03+00:00

Random Guardian

A random page from today’s Guardian, built by Daniel Vydra.

Via Inside Guardian

Tags: daniel-vydra, guardian, random

A day on Chatroulette, the web's weirdest new outpost

2010-02-15T11:15:00+00:00

A day on Chatroulette, the web's weirdest new outpost

By the Guardian’s Bobbie Johnson.

Tags: bobbie-johnson, chatroulette, guardian

What's hot? Introducing Zeitgeist

2010-02-05T12:17:32+00:00

What's hot? Introducing Zeitgeist

Dan Catt’s first project at the Guardian. “When something appears on the Zeitgeist page, it’s because it performed better (got more attention) than the norm for that content type/section/day”. The application itself is written in Python and runs on Google App Engine.

Tags: dan-catt, google-app-engine, guardian, python, zeitgeist

World Government Data

2010-01-27T12:27:03+00:00

World Government Data

Launched last week, this is the Guardian’s meta-search engine for searching and browsing through data from four different government data sites (with more sites planned). Under the hood it’s Django, Solr, Haystack and the Scrapy crawling library. The application was built by Ben Firshman during an internship over Christmas.

Tags: ben-firshman, data, datagovuk, django, guardian, haystack, projects, python, scrapy, solr

Applications: the real stars of the data.gov.uk launch

2010-01-27T12:23:54+00:00

Applications: the real stars of the data.gov.uk launch

A write-up of the data.gov.uk launch event at the Guardian. I demonstrated the Guardian’s World Government Data search engine and a small data.gov.uk inspired feature on WildlifeNearYou.

Tags: datagovuk, guardian, projects, wildlifenearyou

Notes on designing the Guardian iPhone app

2009-12-20T12:55:48+00:00

Notes on designing the Guardian iPhone app

By John-Henry Barac, the principal designer of the iPhone application who also previously worked on the Guardian’s print transition to the Berliner format.

Tags: design, guardian, iphone, john-henry-barac, mobile

Crowdsourced document analysis and MP expenses

2009-12-20T12:07:53+00:00

As you may have heard, the UK government released a fresh batch of MP expenses documents a week ago on Thursday. I spent that week working with a small team at Guardian HQ to prepare for the release. Here's what we built:

http://mps-expenses2.guardian.co.uk/ Updated March 2021: all links now go to the Internet Archive

It's a crowdsourcing application that asks the public to help us dig through and categorise the enormous stack of documents - around 30,000 pages of claim forms, scanned receipts and hand-written letters, all scanned and published as PDFs.

This is the second time we've tried this - the first was back in June, and can be seen at mps-expenses.guardian.co.uk. Last week's attempt was an opportunity to apply the lessons we learnt the first time round.

Writing crowdsourcing applications in a newspaper environment is a fascinating challenge. Projects have very little notice - I heard about the new document release the Thursday before, giving less than a week to put everything together. In addition to the fast turnaround for the application itself, the 48 hours following the release are crucial. The news cycle moves fast, so if the application launches but we don't manage to get useful data out of it quickly, the story will move on before we can impact it.

ScaleCamp on the Friday meant that development work didn't properly kick off until Monday morning. The bulk of the work was performed by two server-side developers, one client-side developer, one designer and one QA on Monday, Tuesday and Wednesday. The Guardian operations team deftly handled our EC2 configuration and deployment, and we had some extra help on the day from other members of the technology department. After launch we also had a number of journalists helping highlight discoveries and dig through submissions.

The system was written using Django, MySQL (InnoDB), Redis and memcached.

Asking the right question

The biggest mistake we made the first time round was that we asked the wrong question. We tried to get our audience to categorise documents as either "claims" or "receipts" and to rank them as "not interesting", "a bit interesting", "interesting but already known" and "someone should investigate this". We also asked users to optionally enter any numbers they saw on the page as categorised "line items", with the intention of adding these up later.

The line items, with hindsight, were a mistake. 400,000 documents makes for a huge amount of data entry and for the figures to be useful we would need to confirm their accuracy. This would mean yet more rounds of crowdsourcing, and the job was so large that the chance of getting even one person to enter line items for each page rapidly diminished as the news story grew less prominent.

The categorisations worked reasonably well but weren't particularly interesting - knowing if a document is a claim or receipt is useful only if you're going to collect line items. The "investigate this" button worked very well though.

We completely changed our approach for the new system. We dropped the line item task and instead asked our users to categorise each page by applying one or more tags, from a small set that our editors could control. This gave us a lot more flexibility - we changed the tags shortly before launch based on the characteristics of the documents - and had the potential to be a lot more fun as well. I'm particularly fond of the "hand-written" tag, which has highlighted some lovely examples of correspondence between MPs and the expenses office.

Sticking to an editorially assigned set of tags provided a powerful tool for directing people's investigations, and also ensured our users didn't start creating potentially libelous tags of their own.

Breaking it up in to assignments

For the first project, everyone worked together on the same task to review all of the documents. This worked fine while the document set was small, but once we had loaded in 400,000+ pages the progress bar became quite depressing.

This time round, we added a new concept of "assignments". Each assignment consisted of the set of pages belonging to a specified list of MPs, documents or political parties. Assignments had a threshold, so we could specify that a page must be reviewed by at least X people before it was considered reviewed. An editorial tool let us feature one "main" assignment and several alternative assignments right on the homepage.

Clicking "start reviewing" on an assignment sets a cookie for that assignment, and adds the assignment's progress bar to the top of the review interface. New pages are selected at random from the set of unreviewed pages in that assignment.
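The assignment logic described above (a review threshold, a progress bar and random selection from the unreviewed pages) can be sketched in a few lines of plain Python. This is an illustration only, not the code the site ran: the function names and the dictionary of per-page review counts are assumptions made for the example.

```python
import random

def unreviewed_pages(assignment_pages, review_counts, threshold):
    """Pages in the assignment with fewer than `threshold` reviews so far."""
    return [p for p in assignment_pages if review_counts.get(p, 0) < threshold]

def next_page(assignment_pages, review_counts, threshold, rng=random):
    """Pick a random unreviewed page, or None when the assignment is done."""
    candidates = unreviewed_pages(assignment_pages, review_counts, threshold)
    return rng.choice(candidates) if candidates else None

def progress(assignment_pages, review_counts, threshold):
    """Fraction of the assignment that has met the review threshold."""
    if not assignment_pages:
        return 1.0
    remaining = unreviewed_pages(assignment_pages, review_counts, threshold)
    return (len(assignment_pages) - len(remaining)) / len(assignment_pages)
```

With pages [1, 2, 3, 4], review counts {1: 2, 2: 1, 3: 2} and a threshold of 2, pages 2 and 4 are still open and the progress bar sits at 50%.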

The assignments system proved extremely effective. We could use it to direct people to the highest value documents (our top hit list of interesting MPs, or members of the shadow cabinet) while still allowing people with specific interests to pick an alternative task.

Get the button right!

Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big "start reviewing" buttons. Both were broken in different ways.

The first time round, the mistakes were around scalability. I used a SQL "ORDER BY RAND()" statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn't matter since the button would only be clicked occasionally.

Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slow downs and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.
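A rough sketch of that stop-gap, with a plain dictionary standing in for memcached so that it runs anywhere. The cache key, the function names and the choice to sample the pool at random are all assumptions for the example. The text only says that 1,000 unreviewed IDs were pushed in to the cache and one was picked at random.

```python
import random

POOL_KEY = "unreviewed-pool"  # hypothetical cache key
POOL_SIZE = 1000              # pool size mentioned in the text

def refresh_pool(cache, unreviewed_ids, rng=random):
    """Run periodically (e.g. from cron): cache a batch of unreviewed page IDs."""
    ids = list(unreviewed_ids)
    cache[POOL_KEY] = rng.sample(ids, min(POOL_SIZE, len(ids)))

def next_page_id(cache, rng=random):
    """Called by the button: pick one cached ID without touching the database."""
    pool = cache.get(POOL_KEY) or []
    return rng.choice(pool) if pool else None
```

The expensive query now runs once per refresh rather than once per click. The cost is that every reviewer draws from the same small pool until the next refresh.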

This solved the performance problem, but meant that our user activity wasn't nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page - and a random distribution is almost certainly the easiest way to achieve that.

The second time round I turned to my new favourite in-memory data structure server, Redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a Redis set of all IDs that needed to be reviewed for an assignment to be complete, and a separate set of IDs of all pages that had been reviewed. It then uses Redis set difference (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment and then SRANDMEMBER to pick one of those pages.
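To make the set arithmetic concrete without needing a running server, here is a small Python emulation of the two commands. SDIFFSTORE stores the difference between the first set and the others, and SRANDMEMBER returns a random member. The key names are invented for the example, and the two helper functions are stand-ins written for this sketch, not the Redis client API.

```python
import random

def sdiffstore(store, dest, first, *others):
    """Emulate SDIFFSTORE: store (first minus all others) at dest, return its size."""
    result = set(store.get(first, set()))
    for key in others:
        result -= store.get(key, set())
    store[dest] = result
    return len(result)

def srandmember(store, key, rng=random):
    """Emulate SRANDMEMBER: a random member, or None for an empty or missing set."""
    members = store.get(key)
    if not members:
        return None
    return rng.choice(tuple(members))

# Hypothetical key names for one assignment.
store = {
    "assignment:1:pages": {101, 102, 103, 104},
    "reviewed": {102, 104},
}
remaining = sdiffstore(store, "assignment:1:unreviewed",
                       "assignment:1:pages", "reviewed")
page_id = srandmember(store, "assignment:1:unreviewed")
```

One small difference from the real thing: Redis removes the destination key when the result is empty, whereas this emulation stores an empty set.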

This is where the bug crept in. Redis was just being used as an optimisation - the single point of truth for whether a page had been reviewed or not stayed as MySQL. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some - the sets that tracked what pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice I failed to update those sets.

This meant the "next page" button would occasionally turn up a page that didn't exist. I had some very poorly considered fallback logic for that - if the random page didn't exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages every single user was directed to the same page - which subsequently attracted well over a thousand individual reviews.

Next time, I'm going to try and make the "next" button completely bullet proof! I'm also going to maintain a "denormalisation dictionary" documenting every denormalisation in the system in detail - such a thing would have saved me several hours of confused debugging.
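One possible shape for a sturdier "next" button, sketched under assumptions: candidate IDs come from a denormalised set that may be stale, and page_exists is a hypothetical callable that checks the source of truth. Rather than falling back to a fixed page, the sketch discards stale IDs and keeps drawing at random.

```python
import random

def pick_existing_page(candidate_ids, page_exists, rng=random, max_attempts=20):
    """Pick a random candidate that really exists, skipping stale IDs."""
    remaining = set(candidate_ids)
    for _ in range(max_attempts):
        if not remaining:
            break
        page_id = rng.choice(tuple(remaining))
        if page_exists(page_id):
            return page_id
        # Stale entry: never offer it again during this call.
        remaining.discard(page_id)
    # Nothing valid found; the caller can treat the assignment as complete.
    return None
```

Returning None when only stale IDs are left means the caller has to handle "nothing to review" explicitly, which avoids funnelling every user to the same page.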

Exposing the results

The biggest mistake I made last time was not getting the data back out again fast enough for our reporters to effectively use it. It took 24 hours from the launch of the application to the moment the first reporting feature was added - mainly because we spent much of the intervening time figuring out the scaling issues.

This time we handled this a lot better. We provided private pages exposing all recent activity on the site. We also provided public pages for each of the tags, as well as combination pages for party + tag, MP + tag, document + tag, assignment + tag and user + tag. Most of these pages were ordered by most-tagged, with the hope that the most interesting pages would quickly bubble to the top.

This worked pretty well, but we made one key mistake. The way we were ordering pages meant that it was almost impossible to paginate through them and be sure that you had seen everything under a specific tag. If you're trying to keep track of everything going on in the site, reliable pagination is essential. The only way to get reliable pagination on a fast moving site is to order by the date something was first added to a set in ascending order. That way you can work through all of the pages, wait a bit, hit "refresh" and be able to continue paginating where you left off. Any other order results in the content of each page changing as new content comes in.

We eventually added an undocumented /in-order/ URL prefix to address this issue. Next time I'll pay a lot more attention to getting the pagination options right from the start.
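The pagination argument can be checked with a toy example. This is a sketch with invented field names ("id" and "added"), not the site's code: when items are ordered by the time they were first added, ascending, new arrivals only ever extend the final page.

```python
def page_of(items, page, per_page):
    """Return one page of items ordered by when they were first added (ascending)."""
    ordered = sorted(items, key=lambda item: item["added"])
    start = (page - 1) * per_page
    return ordered[start:start + per_page]

tagged = [
    {"id": "a", "added": 1},
    {"id": "b", "added": 2},
    {"id": "c", "added": 3},
]
first_before = page_of(tagged, 1, 2)

# A new page gets tagged while someone is reading page one.
tagged.append({"id": "d", "added": 4})
first_after = page_of(tagged, 1, 2)
second_after = page_of(tagged, 2, 2)
```

Page one is identical before and after the new item arrives, and the newcomer shows up at the end of page two. Ordering by most-tagged or newest-first would reshuffle earlier pages instead.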

Rewarding our contributors

The reviewing experience the first time round was actually quite lonely. We deliberately avoided showing people how others had marked each page because we didn't want to bias the results. Unfortunately this meant the site felt like a bit of a ghost town, even when hundreds of other people were actively reviewing things at the same time.

For the new version, we tried to provide a much better feeling of activity around the site. We added "top reviewer" tables to every assignment, MP and political party as well as a "most active reviewers in the past 48 hours" table on the homepage (this feature was added to the first project several days too late). User profile pages got a lot more attention, with more of a feel that users were collecting their favourite pages in to tag buckets within their profile.

Most importantly, we added a concept of discoveries - editorially highlighted pages that were shown on the homepage and credited to the user that had first highlighted them. These discoveries also added valuable editorial interest to the site, showing up on the homepage and also the index pages for political parties and individual MPs.

Light-weight registration

For both projects, we implemented an extremely light-weight form of registration. Users can start reviewing pages without going through any signup mechanism, and instead are assigned a cookie and an anon-454 style username the first time they review a document. They are then encouraged to assign themselves a proper username and password so they can log in later and take credit for their discoveries.
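A minimal sketch of that flow, with a dictionary standing in for the visitor's cookies and an in-process counter standing in for a database sequence. Both are assumptions for the example, as are the function names. A real implementation would more likely store a session ID in the cookie and keep the username on the server.

```python
import itertools

_anon_ids = itertools.count(1)  # stand-in for a database sequence

def ensure_reviewer(cookies):
    """Give a first-time reviewer an anonymous identity stored in their cookies."""
    if "username" not in cookies:
        cookies["username"] = "anon-%d" % next(_anon_ids)
    return cookies["username"]

def claim_username(cookies, chosen_name):
    """Replace the anonymous name with one the reviewer has chosen."""
    cookies["username"] = chosen_name
    return cookies["username"]
```

The point of the design is that ensure_reviewer runs on the first review, not on arrival, so nobody sees a signup form before they have done something.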

It's difficult to tell how effective this approach really is. I have a strong hunch that it dramatically increases the number of people who review at least one document, but without a formal A/B test it's hard to tell how true that is. The UI for this process in the first project was quite confusing - we gave it a solid makeover the second time round, which seems to have resulted in a higher number of conversions.

Overall lessons

News-based crowdsourcing projects of this nature are both challenging and an enormous amount of fun. For the best chances of success, be sure to ask the right question, ensure user contributions are rewarded, expose as much data as possible and make the "next thing to review" behaviour rock solid. I'm looking forward to the next opportunity to apply these lessons, although at this point I really hope it involves something other than MPs' expenses.

Tags: crowdsourcing, django, guardian, innodb, memcached, mpsexpenses, mysql, nosql, politics, projects, python, redis

Guardian iPhone app

2009-12-14T13:29:29+00:00

Guardian iPhone app

Released today, ad-free, £2.39 for the application, has an excellent offline mode. I helped build the backend web service, which is a Django app running on EC2.

Tags: django, ec2, guardian, iphone, python

UK Scale Camp

2009-11-04T23:12:17+00:00

UK Scale Camp

We’re hosting a one day web performance and scalability unconference at the Guardian on the 4th of December. If you’re involved in running a high-scale website in the UK (or abroad) we’d love you to come along. Spaces are going fast.

Tags: events, guardian, performance, scalability, ukscalecamp, unconference

Why I like Redis

2009-10-22T10:58:21+00:00

I've been getting a lot of useful work done with Redis recently.

Redis is typically categorised as yet another of those new-fangled NoSQL key/value stores, but if you look closer it actually has some pretty unique characteristics. It makes more sense to describe it as a "data structure server" - it provides a network service that exposes persistent storage and operations over dictionaries, lists, sets and string values. Think memcached but with list and set operations and persistence-to-disk.

It's also incredibly easy to set up, ridiculously fast (30,000 reads or writes a second on my laptop with the default configuration) and has an interesting approach to persistence. Redis runs in memory, but syncs to disk every Y seconds or after every X operations. Sounds risky, but it supports replication out of the box so if you're worried about losing data should a server fail you can always ensure you have a replicated copy to hand. I wouldn't trust my only copy of critical data to it, but there are plenty of other cases for which it is really well suited.

I'm currently not using it for data storage at all - instead, I use it as a tool for processing data using the interactive Python interpreter.

I'm a huge fan of REPLs. When programming Python, I spend most of my time in an IPython prompt. With JavaScript, I use the Firebug console. I experiment with APIs, get something working and paste it over in to a text editor. For some one-off data transformation problems I never save any code at all - I run a couple of list comprehensions, dump the results out as JSON or CSV and leave it at that.

Redis is an excellent complement to this kind of programming. I can run a long running batch job in one Python interpreter (say loading a few million lines of CSV in to a Redis key/value lookup table) and run another interpreter to play with the data that's already been collected, even as the first process is streaming data in. I can quit and restart my interpreters without losing any data. And because Redis semantics map closely to Python native data types, I don't have to think for more than a few seconds about how I'm going to represent my data.

Here's a 30 second guide to getting started with Redis:

$ wget http://redis.googlecode.com/files/redis-1.01.tar.gz
$ tar -xzf redis-1.01.tar.gz
$ cd redis-1.01
$ make
$ ./redis-server

And that's it - you now have a Redis server running on port 6379. No need even for a ./configure or make install. You can run ./redis-benchmark in that directory to exercise it a bit.

Let's try it out from Python. In a separate terminal:

$ cd redis-1.01/client-libraries/python/
$ python
>>> import redis
>>> r = redis.Redis()
>>> r.info()
{u'total_connections_received': 1, ... }
>>> r.keys('*') # Show all keys in the database
[]
>>> r.set('key-1', 'Value 1')
'OK'
>>> r.keys('*')
[u'key-1']
>>> r.get('key-1')
u'Value 1'

Now let's try something a bit more interesting:

>>> r.push('log', 'Log message 1', tail=True)
>>> r.push('log', 'Log message 2', tail=True)
>>> r.push('log', 'Log message 3', tail=True)
>>> r.lrange('log', 0, 100)
[u'Log message 3', u'Log message 2', u'Log message 1']
>>> r.push('log', 'Log message 4', tail=True)
>>> r.push('log', 'Log message 5', tail=True)
>>> r.push('log', 'Log message 6', tail=True)
>>> r.ltrim('log', 0, 2)
>>> r.lrange('log', 0, 100)
[u'Log message 6', u'Log message 5', u'Log message 4']

That's a simple capped log implementation (similar to a MongoDB capped collection) - push items on to the tail of a 'log' key and use ltrim to only retain the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever increasing amounts of logging information.
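For comparison, the same capped-log behaviour can be had inside a single Python process with collections.deque. This is an analogy to help picture what push plus ltrim achieves, not a replacement for it. Unlike the Redis list, the deque is neither persistent nor shared between processes.

```python
from collections import deque

def make_capped_log(max_items):
    """A log that silently drops its oldest entries once it is full."""
    return deque(maxlen=max_items)

log = make_capped_log(3)
for n in range(1, 7):
    # appendleft puts the newest message first, matching the lrange output above
    log.appendleft("Log message %d" % n)

recent = list(log)
```

After six messages, recent holds only messages 6, 5 and 4, newest first.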

See the documentation for a full list of Redis commands. I'm particularly excited about the RANDOMKEY and new SRANDMEMBER commands (git trunk only at the moment), which help address the common challenge of picking a random item without ORDER BY RAND() clobbering your relational database. In a beautiful example of open source support in action, I requested SRANDMEMBER on Twitter yesterday and antirez committed just 12 hours later.

I used Redis this week to help create heat maps of the BNP's membership list for the Guardian. I had the leaked spreadsheet of the BNP member details and a (licensed) CSV file mapping 1.6 million postcodes to their corresponding parliamentary constituencies. I loaded the CSV file in to Redis, then looped through the 12,000 postcodes from the membership and looked them up in turn, accumulating counts for each constituency. It took a couple of minutes to load the constituency data and a few seconds to run and accumulate the postcode counts. In the end, it probably involved less than 20 lines of actual Python code.
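The counting step looks roughly like the sketch below. A plain dictionary stands in for the Redis key/value lookup so the example is self-contained, the postcodes and constituency names are placeholders, not real data, and the normalise helper is an assumption about how the two files might be made to agree on formatting.

```python
from collections import Counter

def normalise(postcode):
    """Uppercase and strip spaces so lookups are not sensitive to formatting."""
    return postcode.replace(" ", "").upper()

def count_by_constituency(member_postcodes, lookup):
    """Count members per constituency, skipping postcodes with no match."""
    counts = Counter()
    for postcode in member_postcodes:
        constituency = lookup.get(normalise(postcode))
        if constituency is not None:
            counts[constituency] += 1
    return counts

# Placeholder data; the real lookup held 1.6 million postcodes.
lookup = {"AA11AA": "Constituency A", "BB22BB": "Constituency B"}
counts = count_by_constituency(["aa1 1aa", "AA1 1AA", "BB2 2BB", "ZZ9 9ZZ"], lookup)
```

With Redis in place of the dictionary, lookup.get becomes a GET against the server, which is what lets one interpreter keep loading the CSV while another runs the counts.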

A much more interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath. The code is now open source, and Chris talks a bit more about the implementation (in particular their use of sort in Redis) on his blog. Redis also gets a mention in Tom Preston-Werner's epic writeup of the new scalable architecture behind GitHub.

Tags: chris-wanstrath, github, guardian, hurl, interactivedevelopment, ipython, leah-culver, open-source, performance, python, redis

The Guardian 1000 Novels Everyone Must Read in FluidDB

2009-09-13T23:48:26+00:00

The Guardian 1000 Novels Everyone Must Read in FluidDB

Nicholas J. Radcliffe loaded the Guardian’s list of 1000 novels in to FluidDB, where the ability for users to add their own ratings style metadata makes it an ideal dataset for exploring the capabilities of the platform.

Tags: fluiddb, guardian

Hack Day tools for non-developers

2009-07-28T14:23:53+00:00

We're about to run our second internal hack day at the Guardian. The first was an enormous amount of fun and the second one looks set to be even more productive.

There's only one rule at hack day: build something you can demonstrate at the end of the event (Powerpoint slides don't count). Importantly though, our hack days are not restricted to just our development team: anyone from the technology department can get involved, and we extend the invitation to other parts of the organisation as well. At the Guardian, this includes journalists.

For our first hack day, I put together a list of "tools for non-developers" - sites, services and software that could be used for hacking without programming knowledge as a pre-requisite. I'm now updating that list with recommendations from elsewhere. Here's the list so far:

Freebase

Originally a kind of structured version of Wikipedia, Freebase changed its focus last year towards being a "social database about things you know and love". In other words, it's the most powerful OCD-enabler in the history of the world. Create your own "Base" on any subject you like, set up your own types and start gathering together topics from the millions already available in Freebase - or add your own. Examples include the Battlestar Galactica base, the Tall Ships base and the fabulous Database base. If you are a developer the tools in the Make Things with Freebase section are top notch.

Dabble DB

Dabble is a weird combination of a spreadsheet, an online database and a set of visualisation tools. Watch the 8 minute demo to get an idea of how powerful this is - you can start off by loading in an existing spreadsheet and take it from there. You'll need to sign up for the free 30 day trial.

Google Docs

You can always build a hack in Excel, but Google Spreadsheets is surprisingly powerful and means that you can collaborate with others on your hack (including developers, who can use the Google Docs API to get at the data in your spreadsheet). Check out the following tutorials, which describe ways of using Google Spreadsheets to scrape in data from other webpages and output it in interesting formats:

There's also a simple way to create a form that submits data in to a Google Spreadsheet.

Yahoo! Pipes

Visual tools for combining, filtering and modifying RSS feeds. Combine with the large number of full-content feeds on guardian.co.uk for all sorts of interesting possibilities. Here's a tutorial that incorporates Google Docs as well.

Google My Maps

Google provide a really neat interface for adding your own points, lines and areas to a Google Map. Outputs KML, a handy file format for carting geographic data around between different tools.

If you already have a KML or GeoRSS feed URL from somewhere (e.g. the output of a Yahoo! Pipe), you can paste it directly in to the Google Maps search box to see the points rendered on a map.

Google SketchUp

A simple to use 3D drawing package that lets you create 3D models of real-world buildings and then import them in to Google Earth.

OpenStreetMap

Try your hand at some open source cartography on OpenStreetMap, the geographic world's answer to Wikipedia. If you have the equipment you can contribute GPS traces, otherwise there's a clever online editor that will let you trace out roads from satellite photos - or you could just make sure your favourite pub is included on the map. The export tools can provide vector or static maps, and if you export as SVG you can further edit your map in Illustrator or Inkscape.

CloudMade Maps

Commercial tools built on top of OpenStreetMap, the most exciting of which allows you to create your own map theme by setting your preferred colours and line widths for various types of map feature.

Many Eyes

IBM Research's suite of data visualisation tools, with a wiki-style collaboration platform for publishing data and creating visualisations.

Dapper

Dapper provides a powerful tool for screen scraping websites, without needing to write any code. Output formats include RSS, iCalendar and Google Maps.

TiddlyWiki

TiddlyWiki is a complete wiki in a single HTML file, which you can save locally and use as a notebook, collaboration tool and much more. There's a large ecosystem of plugins and macros which can be used to extend it with new features - see TiddlyVault for an index.

WolframAlpha

The "computational knowledge engine" with the hubristic search-based interface, potentially useful as a source of data and a tool for processing and visualising that data.

Tumblr

Useful as both an input and an output for feeds processed using other tools, and with a smart bookmarklet for collecting bits and pieces from around the web.

The UCSB Toy Chest

An outstanding list of tools that people "without programming skills (but with basic computer and Internet literacy) can use to create interesting projects", compiled by the English department at UC Santa Barbara.

Your help needed

There must be dozens, if not hundreds of useful tools missing from the above. Tell me in the comments and I'll add them to the list.

Tags: freebase, google, google-maps, guardian, hackday, mapping, nondevelopers, openstreetmap, pipes, sketchup, tools, yahoo-pipes, computer-literacy

Curating conversations

2009-07-16T19:34:08+00:00

Curating conversations

Chris Thorpe has open-sourced the Guardian’s moderated Twitter backchannel app, for displaying back channels at high profile (and hence high potential for abuse) events. It’s a Python application that runs on App Engine.

Tags: chris-thorpe, google-app-engine, guardian, python, twitter

Four crowdsourcing lessons from the Guardian's (spectacular) expenses-scandal experiment

2009-06-24T15:31:59+00:00

Four crowdsourcing lessons from the Guardian's (spectacular) expenses-scandal experiment

Michael Andersen from the Nieman Journalism Lab interviewed me about the MP expenses crowdsourcing site.

Tags: crowdsourcing, guardian, interviews, mpsexpences

The breakneck race to build an application to crowdsource MPs' expenses

2009-06-19T22:16:04+00:00

The breakneck race to build an application to crowdsource MPs' expenses

Charles Arthur wrote up a very nice piece on the development effort behind the Guardian’s crowdsourcing expenses app.

Tags: charles-aurthur, crowdsourcing, guardian, mpsexpenses

Investigate your MP's expenses

2009-06-18T23:16:43+00:00

Investigate your MP's expenses

Launched today, this is the project that has been keeping me ultra-busy for the past week—we’re crowdsourcing the analysis of the 700,000+ scanned MP expenses documents released this morning. It’s the Guardian’s first live Django-powered application, and also the first time we’ve hosted something on EC2.

Tags: crowdsourcing, django, ec2, guardian, mpexpenses, projects, python

Dealing with election results data

2009-06-12T18:06:05+00:00

Dealing with election results data

Alf Eaton loaded the Guardian’s European election results spreadsheet in to Google’s new Fusion Tables tool.

Tags: alf-eaton, datablog, datastore, elections, fusiontables, google, guardian

Exactly how well did the BNP do where you live?

2009-06-11T11:37:56+00:00

Exactly how well did the BNP do where you live?

Guardian journalists spent a day and a half calling round different local authorities to get a proper breakdown of the European election results (which are only officially published in aggregate) and published the results as a spreadsheet on the Datablog.

Tags: bnp, datablog, datastore, elections, guardian

You ask, they answer: Neal's Yard Remedies

2009-05-27T10:35:45+00:00

You ask, they answer: Neal's Yard Remedies

After reading the comments, something tells me Neal’s Yard Remedies may be regretting their decision to answer questions from Guardian readers.

Tags: guardian, homeopathy

Muck Rack: Links posted by Guardian Journalists on Twitter

2009-05-22T22:02:34+00:00

Muck Rack: Links posted by Guardian Journalists on Twitter

I’m rather impressed by the Sawhorse Media collection of Twitter aggregation sites (Muck Rack aggregates journalists)—a simple idea very well executed. Here’s a nice example—this page shows links posted to Twitter by known Guardian journalists, but goes a step further and scrapes in the favicon, the real title of the page and resolves the domain from any shortened links.

Tags: aggregation, favicons, guardian, journalists, mashups, muckrack, sawhorsemedia, twitter