Simon Willison's Weblog: mapreduce

The humble hash aggregate

2021-06-06T16:03:51+00:00

Today I learned that “hash aggregate” is the name for the algorithm where you split a list of tuples on a common key, run an aggregation against each resulting group and combine the results back together again. I’d previously thought of this in terms of map/reduce, but hash aggregate is a much older term used widely by SQL engines. I’ve seen it come up in PostgreSQL EXPLAIN output (for GROUP BY) before but didn’t know what it meant.
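Here’s a rough sketch of that idea in Python, using a plain dict as the hash table. The function name `hash_aggregate` and its parameters are made up for illustration and aren’t taken from any SQL engine; real implementations also have to worry about memory limits and spilling to disk, which this ignores.

```python
def hash_aggregate(rows, key, value, init, step):
    """Group rows by key in a hash table, folding each group's values as we go."""
    groups = {}
    for row in rows:
        k = key(row)
        # Look up the running aggregate for this key (or start a new one),
        # then fold the current row's value into it.
        groups[k] = step(groups.get(k, init), value(row))
    return groups


# Roughly: SELECT tag, SUM(n) FROM rows GROUP BY tag
rows = [("sql", 3), ("python", 5), ("sql", 4), ("python", 1), ("bash", 2)]
totals = hash_aggregate(
    rows,
    key=lambda r: r[0],
    value=lambda r: r[1],
    init=0,
    step=lambda acc, v: acc + v,
)
print(totals)  # {'sql': 7, 'python': 6, 'bash': 2}
```

The whole thing is a single pass over the input with one hash lookup per row, which is presumably why a query planner likes it for GROUP BY when the groups fit in memory.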

Via @vboykis

Tags: algorithms, mapreduce, sql

The Friendship That Made Google Huge

2018-12-31T03:56:45+00:00

The New Yorker profiles Jeff Dean and Sanjay Ghemawat, Google’s first and only level 11 Senior Fellows. This is some of the best writing on complex software engineering topics (map-reduce, TensorFlow and the like) aimed at a general audience that I’ve ever seen. Also a very compelling case study in pair programming.

Tags: google, mapreduce, new-yorker, tensorflow

App Engine at Google I/O 2010

2010-05-20T15:30:00+00:00

OpenID and OAuth are now baked into the App Engine users API. They’re also demoing two very exciting new features: a mapper API for doing map/reduce style queries against the datastore, and a Channel API for building Comet applications.

Tags: comet, google, google-app-engine, mapreduce, oauth, openid, google-io, recovered

BashReduce

2009-06-28T15:03:15+00:00

Map/Reduce in Bash is no longer a joke project (if it ever was)—Richard Crowley is extending it and using it for analysis at OpenDNS.

Tags: bash, bashreduce, mapreduce, opendns, richard-crowley

Finding similar items with Amazon Elastic MapReduce, Python, and Hadoop streaming

2009-04-07T09:19:38+00:00

Tutorial for running Hadoop jobs on Elastic MapReduce using Python and the 2005 Audioscrobbler dataset.

Tags: amazon, amazon-web-services, audioscrobbler, elasticmapreduce, hadoop, mapreduce, python

Amazon Elastic MapReduce

2009-04-02T10:25:37+00:00

Hadoop as a service. Basically a web-based GUI around Hadoop: you could roll this yourself on EC2, but for a small markup on regular EC2 prices you get to avoid the extra work of setting everything up. Data processing scripts can be written in Java, Ruby, Perl, Python, PHP, R, or C++ and are loaded into S3 before firing off the job.

Via Joe Drumgoole

Tags: amazon, amazon-web-services, cloud-computing, ec2, hadoop, mapreduce, s3

Cascading

2008-10-01T13:22:19+00:00

A Java API abstraction layer over Hadoop that lets developers think in terms of pipes and filters rather than map/reduce. The Cascading developers claim that this model is easier to understand and less error-prone.

Tags: cascading, hadoop, java, mapreduce, pipesfilters

Python + Hadoop = Flying Circus Elephant

2008-05-31T14:14:56+00:00

Last.fm have released Dumbo, a Python module that lets you easily write Hadoop map/reduce tasks using Python and generators.

Tags: dumbo, generators, hadoop, lastfm, mapreduce, python

Writing An Hadoop MapReduce Program In Python

2007-10-09T11:33:58+00:00

Hadoop (the open source map/reduce framework) can interact with any program that reads from stdin and writes to stdout, so it’s trivial to drop in Python scripts for the map and reduce steps.
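To give a feel for what those scripts look like, here’s a minimal word count sketch. It isn’t the code from the linked tutorial: the function names are my own, and the tab-separated key/value line format and the sort between the two steps are assumptions about how the streaming setup behaves. In a real job each step would live in its own script and call something like `run()` against stdin; here the shuffle is faked with `sorted()` so the example runs locally.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'word<TAB>1' line per word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reduce step: sum the counts of adjacent lines that share a key.

    Assumes the input is already sorted by key, so equal keys are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(step, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical wrapper: what a standalone mapper or reducer script would call."""
    for out in step(stdin):
        stdout.write(out + "\n")


# Local simulation of map -> sort -> reduce, no Hadoop involved.
text = ["the quick fox", "the lazy dog", "the fox"]
counts = list(reduce_lines(sorted(map_lines(text))))
print(counts)  # ['dog\t1', 'fox\t2', 'lazy\t1', 'quick\t1', 'the\t3']
```

The same pipeline can be tried with ordinary shell pipes and `sort` standing in for the framework, which is a handy way to debug the scripts before running them on a cluster.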

Tags: hadoop, mapreduce, python

CouchDB: Thinking beyond the RDBMS

2007-09-03T09:48:43+00:00

CouchDB is a fascinating project: an Erlang-powered non-relational database with a JSON API that lets you define “views” (really computed tables) based on JavaScript functions that execute using map/reduce. Damien Katz, the main developer, currently works for MySQL and used to work on Lotus Notes.

Tags: couchdb, damien-katz, databases, erlang, javascript, json, lotusnotes, mapreduce, mysql