<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: big-data</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/big-data.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-12-01T05:02:42+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting Javi Santana</title><link href="https://simonwillison.net/2024/Dec/1/javi-santana/#atom-tag" rel="alternate"/><published>2024-12-01T05:02:42+00:00</published><updated>2024-12-01T05:02:42+00:00</updated><id>https://simonwillison.net/2024/Dec/1/javi-santana/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://javisantana.com/2024/11/30/learnings-after-4-years-data-eng.html"&gt;&lt;p&gt;Most people don’t have an intuition about what current hardware can and can’t do. There is a simple math that can help you with that: “you can process about 500MB in one second on a single machine”. I know it’s not a universal truth and there are a lot of details that can change that but believe me, this estimation is a pretty good tool to have under your belt.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://javisantana.com/2024/11/30/learnings-after-4-years-data-eng.html"&gt;Javi Santana&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="scaling"/><category term="big-data"/></entry><entry><title>Big Data is Dead</title><link href="https://simonwillison.net/2023/Feb/7/big-data-is-dead/#atom-tag" rel="alternate"/><published>2023-02-07T19:25:40+00:00</published><updated>2023-02-07T19:25:40+00:00</updated><id>https://simonwillison.net/2023/Feb/7/big-data-is-dead/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://motherduck.com/blog/big-data-is-dead/"&gt;Big Data is Dead&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Don’t be distracted by the headline: this is well worth your time. Jordan Tigani spent ten years working on Google BigQuery, during which time he was surprised to learn that the median data storage size for regular customers was much less than 100GB. In this piece he argues that genuine Big Data solutions are relevant to a tiny fraction of companies, and there’s way more value in solving problems for everyone else. I’ve been talking about Datasette as a tool for solving “small data” problems for a while, and this article has given me a whole bunch of new arguments I can use to support that concept.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=34694926"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/small-data"&gt;small-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="big-data"/><category term="small-data"/></entry><entry><title>What I've learned about data recently</title><link href="https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag" rel="alternate"/><published>2021-06-22T17:09:07+00:00</published><updated>2021-06-22T17:09:07+00:00</updated><id>https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://seldo.com/posts/what-i-ve-learned-about-data-recently"&gt;What I&amp;#x27;ve learned about data recently&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/seldo/status/1407370508576780290"&gt;@seldo&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laurie-voss"&gt;laurie-voss&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="data-science"/><category term="laurie-voss"/></entry><entry><title>Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask)</title><link href="https://simonwillison.net/2021/Jan/5/clickhouse-github/#atom-tag" rel="alternate"/><published>2021-01-05T01:02:40+00:00</published><updated>2021-01-05T01:02:40+00:00</updated><id>https://simonwillison.net/2021/Jan/5/clickhouse-github/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gh.clickhouse.tech/explorer/"&gt;Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
ClickHouse by Yandex is an open source column-oriented data warehouse, designed to run analytical queries against TBs of data. They've loaded the full GitHub Archive of events since 2011 into a public instance, which is a great way of both exploring GitHub activity and trying out ClickHouse. Here's a query I just ran that shows the number of watch events per year, for example:
&lt;pre&gt;&lt;code&gt;SELECT toYear(created_at) AS yyyy, count()
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY yyyy
&lt;/code&gt;&lt;/pre&gt;
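&lt;p&gt;toYear() is ClickHouse-specific; as a toy stand-in, the same aggregation can be sketched with Python's stdlib sqlite3, where strftime('%Y', ...) plays the equivalent role (the table and rows here are invented for illustration):&lt;/p&gt;

```python
import sqlite3

# Toy stand-in for the ClickHouse query above, using stdlib sqlite3:
# strftime('%Y', ...) plays the role of ClickHouse's toYear().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE github_events (event_type TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO github_events VALUES (?, ?)",
    [
        ("WatchEvent", "2011-03-01"),
        ("WatchEvent", "2011-07-15"),
        ("PushEvent", "2011-07-15"),
        ("WatchEvent", "2012-01-02"),
    ],
)
rows = conn.execute(
    """
    SELECT strftime('%Y', created_at) AS yyyy, count(*)
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY yyyy
    ORDER BY yyyy
    """
).fetchall()
print(rows)  # [('2011', 2), ('2012', 1)]
```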

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=25638853"&gt;A Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clickhouse"&gt;clickhouse&lt;/a&gt;&lt;/p&gt;



</summary><category term="analytics"/><category term="github"/><category term="sql"/><category term="big-data"/><category term="clickhouse"/></entry><entry><title>Quoting Netflix Technology Blog</title><link href="https://simonwillison.net/2018/Aug/18/netflix-technology-blog/#atom-tag" rel="alternate"/><published>2018-08-18T17:35:45+00:00</published><updated>2018-08-18T17:35:45+00:00</updated><id>https://simonwillison.net/2018/Aug/18/netflix-technology-blog/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://medium.com/netflix-techblog/notebook-innovation-591ee3221233"&gt;&lt;p&gt;Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs against this data, spanning everything from reporting and analysis to machine learning and recommendation algorithms.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://medium.com/netflix-techblog/notebook-innovation-591ee3221233"&gt;Netflix Technology Blog&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/netflix"&gt;netflix&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="netflix"/><category term="big-data"/><category term="jupyter"/></entry><entry><title>Usage of ARIA attributes via HTTP Archive</title><link href="https://simonwillison.net/2018/Jul/12/usage-aria-attributes-http-archive/#atom-tag" rel="alternate"/><published>2018-07-12T03:16:26+00:00</published><updated>2018-07-12T03:16:26+00:00</updated><id>https://simonwillison.net/2018/Jul/12/usage-aria-attributes-http-archive/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://discuss.httparchive.org/t/usage-of-aria-attributes/778"&gt;Usage of ARIA attributes via HTTP Archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way.
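&lt;p&gt;The real analysis runs as SQL over the HTTP Archive tables in BigQuery; as a toy sketch of the underlying idea, here's a Python tally of aria-* attribute names across some invented attribute strings:&lt;/p&gt;

```python
import re
from collections import Counter

# Toy sketch of the idea behind the linked query: tally which aria-*
# attributes appear most often. The sample attribute strings are
# invented; the real analysis is SQL over the HTTP Archive dataset.
samples = [
    'button aria-label="Close" aria-hidden="true"',
    'nav aria-label="Main"',
    'div aria-hidden="true"',
]
counts = Counter(
    attr for s in samples for attr in re.findall(r"aria-[a-z]+", s)
)
print(counts.most_common())  # [('aria-label', 2), ('aria-hidden', 2)]
```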


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aria"&gt;aria&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="aria"/><category term="http"/><category term="internet-archive"/><category term="big-data"/></entry><entry><title>ActorDB</title><link href="https://simonwillison.net/2018/Jun/24/actordb/#atom-tag" rel="alternate"/><published>2018-06-24T21:48:29+00:00</published><updated>2018-06-24T21:48:29+00:00</updated><id>https://simonwillison.net/2018/Jun/24/actordb/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/biokoda/actordb"&gt;ActorDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A distributed SQL database written in Erlang and built on top of SQLite (which in turn sits on LMDB), adding replication via the Raft consensus algorithm (so it's sharded with no single points of failure) and a MySQL protocol interface. An interesting combination of technologies.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/erlang"&gt;erlang&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="erlang"/><category term="scaling"/><category term="sqlite"/><category term="big-data"/></entry><entry><title>Query Parquet files in SQLite</title><link href="https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag" rel="alternate"/><published>2018-06-24T19:44:06+00:00</published><updated>2018-06-24T19:44:06+00:00</updated><id>https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html"&gt;Query Parquet files in SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data—most CSV files, for example. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42 MB (3% of the original)—then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.
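&lt;p&gt;A stdlib-only sketch of why columnar layouts compress so well (this is not Parquet itself, just the general effect demonstrated with zlib on toy data):&lt;/p&gt;

```python
import zlib

# Illustrative sketch (not Parquet itself): columnar layouts put
# duplicate values next to each other, which generic compressors
# exploit far better than row-major text. Toy data, not the census file.
rows = [(f"id-{i}", "CA", "2018-06-24") for i in range(10_000)]

row_major = "\n".join(",".join(r) for r in rows).encode()
columns = list(zip(*rows))
col_major = "\n".join(",".join(col) for col in columns).encode()

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))
print(row_size, col_size)  # column-major compresses smaller
```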


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="big-data"/><category term="datasette"/><category term="parquet"/><category term="colin-dellow"/></entry><entry><title>Mozilla Telemetry: In-depth Data Pipeline</title><link href="https://simonwillison.net/2018/Apr/12/in-depth-data-pipeline-detail/#atom-tag" rel="alternate"/><published>2018-04-12T15:44:42+00:00</published><updated>2018-04-12T15:44:42+00:00</updated><id>https://simonwillison.net/2018/Apr/12/in-depth-data-pipeline-detail/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.telemetry.mozilla.org/concepts/pipeline/data_pipeline_detail.html#a-detailed-look-at-the-data-platform"&gt;Mozilla Telemetry: In-depth Data Pipeline&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Detailed behind-the-scenes look at an extremely sophisticated big data telemetry processing system built using open source tools. Some of this is unsurprising (S3 for storage, Spark and Kafka for streams) but the details are fascinating. They use a custom nginx module for the ingestion endpoint and have a “tee” server written in Lua and OpenResty which lets them route some traffic to an alternative backend.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/reid_write/status/984412694336933889"&gt;@reid_write&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lua"&gt;lua&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nginx"&gt;nginx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kafka"&gt;kafka&lt;/a&gt;&lt;/p&gt;



</summary><category term="analytics"/><category term="lua"/><category term="mozilla"/><category term="nginx"/><category term="big-data"/><category term="kafka"/></entry><entry><title>What can startups do on big data day one?</title><link href="https://simonwillison.net/2013/Sep/16/what-can-startups-do/#atom-tag" rel="alternate"/><published>2013-09-16T13:05:00+00:00</published><updated>2013-09-16T13:05:00+00:00</updated><id>https://simonwillison.net/2013/Sep/16/what-can-startups-do/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-can-startups-do-on-big-data-day-one/answer/Simon-Willison"&gt;What can startups do on big data day one?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Log everything, and then forget about it. That way you'll have data you can analyse later on, but aside from setting up logging and log storage you won't waste any time messing around with Big Data when you haven't yet found product-market fit.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/entrepreneurship"&gt;entrepreneurship&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/startups"&gt;startups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="entrepreneurship"/><category term="startups"/><category term="quora"/><category term="big-data"/></entry><entry><title>I would like to attend a Big Data conference but I am short of funds. Is there any big data conference that helps students attend those conference through scholarship?</title><link href="https://simonwillison.net/2013/Sep/14/i-would-like-to/#atom-tag" rel="alternate"/><published>2013-09-14T15:09:00+00:00</published><updated>2013-09-14T15:09:00+00:00</updated><id>https://simonwillison.net/2013/Sep/14/i-would-like-to/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/I-would-like-to-attend-a-Big-Data-conference-but-I-am-short-of-funds-Is-there-any-big-data-conference-that-helps-students-attend-those-conference-through-scholarship/answer/Simon-Willison"&gt;I would like to attend a Big Data conference but I am short of funds. Is there any big data conference that helps students attend those conference through scholarship?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The traditional route for students who can't afford to attend a conference is for them to volunteer. Contact event organisers of Big Data conferences that look relevant and ask if they are looking for volunteers.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="conferences"/><category term="quora"/><category term="big-data"/></entry><entry><title>What is a good list of conferences, speaking gigs, hackathons, and other technology-centric events where one can reach software architects and developers?</title><link href="https://simonwillison.net/2013/Aug/21/what-is-a-good/#atom-tag" rel="alternate"/><published>2013-08-21T17:15:00+00:00</published><updated>2013-08-21T17:15:00+00:00</updated><id>https://simonwillison.net/2013/Aug/21/what-is-a-good/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-is-a-good-list-of-conferences-speaking-gigs-hackathons-and-other-technology-centric-events-where-one-can-reach-software-architects-and-developers/answer/Simon-Willison"&gt;What is a good list of conferences, speaking gigs, hackathons, and other technology-centric events where one can reach software architects and developers?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We have a pretty comprehensive list of (mostly tech) conferences in the Midwest USA here: &lt;span&gt;&lt;a href="http://lanyrd.com/places/midwestern-usa/"&gt;http://lanyrd.com/places/midwest...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/entrepreneurship"&gt;entrepreneurship&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/startups"&gt;startups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hackathons"&gt;hackathons&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="conferences"/><category term="entrepreneurship"/><category term="startups"/><category term="quora"/><category term="hackathons"/><category term="big-data"/></entry><entry><title>Where can I find an updated DB of countries, states and cities?</title><link href="https://simonwillison.net/2013/Jun/13/where-can-i-find/#atom-tag" rel="alternate"/><published>2013-06-13T09:11:00+00:00</published><updated>2013-06-13T09:11:00+00:00</updated><id>https://simonwillison.net/2013/Jun/13/where-can-i-find/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Where-can-I-find-an-updated-DB-of-countries-states-and-cities/answer/Simon-Willison"&gt;Where can I find an updated DB of countries, states and cities?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a surprisingly complicated question. The first thing you might want to ask yourself is "what's a country" - how do you deal with places on this &lt;span&gt;&lt;a href="http://en.wikipedia.org/wiki/List_of_states_with_limited_recognition"&gt;List of states with limited recognition&lt;/a&gt;&lt;/span&gt; for example?&lt;/p&gt;

&lt;p&gt;That said, data sets do exist that can help you with this. I'm a big fan of Yahoo!'s GeoPlanet dataset, which represents places around the world as a hierarchy. We use it as the basis for the geo features of our site Lanyrd - take a look at the breadcrumbs on &lt;span&gt;&lt;a href="http://lanyrd.com/places/london/"&gt;Conferences in London | Conferences &amp;amp; Events | Lanyrd&lt;/a&gt;&lt;/span&gt; for example.&lt;/p&gt;

&lt;p&gt;You can access GeoPlanet through the GeoPlanet API &lt;span&gt;&lt;a href="http://developer.yahoo.com/geo/geoplanet/"&gt;Yahoo! GeoPlanet™ - YDN&lt;/a&gt;&lt;/span&gt; or using Yahoo!'s YQL API language: &lt;span&gt;&lt;a href="http://developer.yahoo.com/yql/console/#h=select%20*%20from%20geo.places%20where%20text%3D%22san%20francisco%2C%20ca%22"&gt;YQL Console&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Yahoo! used to make a full dump of the GeoPlanet database available at &lt;span&gt;&lt;a href="http://developer.yahoo.com/geo/geoplanet/data/"&gt;Yahoo! GeoPlanet - YDN&lt;/a&gt;&lt;/span&gt; but they no longer publish it. However, there are copies of their old releases available on the Internet Archive: &lt;span&gt;&lt;a href="http://archive.org/search.php?query=geoplanet"&gt;Internet Archive Search: geoplanet&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
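&lt;p&gt;A toy sketch of the hierarchy idea (the IDs and names here are invented, not real WOEIDs): each place records its parent, and a breadcrumb trail is just the walk up to the root:&lt;/p&gt;

```python
# Toy sketch of the hierarchy idea behind GeoPlanet: each place points
# at its parent, and breadcrumbs are the walk up to the root.
# The IDs and names here are invented, not real WOEIDs.
parents = {"london": "england", "england": "uk", "uk": "earth"}
names = {
    "london": "London",
    "england": "England",
    "uk": "United Kingdom",
    "earth": "Earth",
}

def breadcrumbs(place):
    """Return the names from the root down to the given place."""
    trail = [place]
    while place in parents:
        place = parents[place]
        trail.append(place)
    return [names[p] for p in reversed(trail)]

print(" / ".join(breadcrumbs("london")))  # Earth / United Kingdom / England / London
```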
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-development"&gt;web-development&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="databases"/><category term="web-development"/><category term="quora"/><category term="big-data"/></entry><entry><title>What are the best big data conferences?</title><link href="https://simonwillison.net/2011/Jul/21/what-are-the-best/#atom-tag" rel="alternate"/><published>2011-07-21T15:01:00+00:00</published><updated>2011-07-21T15:01:00+00:00</updated><id>https://simonwillison.net/2011/Jul/21/what-are-the-best/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-the-best-big-data-conferences/answer/Simon-Willison"&gt;What are the best big data conferences?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;O'Reilly's Strata is excellent - I went to their first event in February in Santa Clara, and they're running another one in New York on 22nd-23rd September: &lt;span&gt;&lt;a href="http://lanyrd.com/2011/stratany/"&gt;http://lanyrd.com/2011/stratany/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;If you're interested in Hadoop, Hadoop World is coming up in November (again in New York): &lt;span&gt;&lt;a href="http://lanyrd.com/2011/hadoop-world/"&gt;http://lanyrd.com/2011/hadoop-wo...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;OSCON have a separate Big Data event running parallel to the main conference - that's next week 25th-27th July in Portland: &lt;span&gt;&lt;a href="http://lanyrd.com/2011/oscon-data/"&gt;http://lanyrd.com/2011/oscon-data/&lt;/a&gt;&lt;/span&gt; - it's the first time they've done it, but it's likely to be excellent.&lt;/p&gt;

&lt;p&gt;You can track our collection of all the big data conferences we know about on these pages: &lt;span&gt;&lt;a href="http://lanyrd.com/topics/big-data/"&gt;http://lanyrd.com/topics/big-data/&lt;/a&gt;&lt;/span&gt; and &lt;span&gt;&lt;a href="http://lanyrd.com/topics/hadoop/"&gt;http://lanyrd.com/topics/hadoop/&lt;/a&gt;&lt;/span&gt; ( &lt;span&gt;&lt;a href="http://lanyrd.com/topics/nosql/"&gt;http://lanyrd.com/topics/nosql/&lt;/a&gt;&lt;/span&gt; may be of interest as well)&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="conferences"/><category term="quora"/><category term="big-data"/></entry><entry><title>The Seven Secrets of Successful Data Scientists</title><link href="https://simonwillison.net/2010/Sep/3/seven/#atom-tag" rel="alternate"/><published>2010-09-03T00:36:00+00:00</published><updated>2010-09-03T00:36:00+00:00</updated><id>https://simonwillison.net/2010/Sep/3/seven/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://dataspora.com/blog/the-seven-secrets-of-successful-data-scientists/"&gt;The Seven Secrets of Successful Data Scientists&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Some sensible advice, including: pick the right-sized tool, compress everything, split up your data, use open source, and run the analysis where the data is.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="recovered"/></entry></feed>