<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: big-data</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/big-data.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-12-01T05:02:42+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting Javi Santana</title><link href="https://simonwillison.net/2024/Dec/1/javi-santana/#atom-tag" rel="alternate"/><published>2024-12-01T05:02:42+00:00</published><updated>2024-12-01T05:02:42+00:00</updated><id>https://simonwillison.net/2024/Dec/1/javi-santana/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://javisantana.com/2024/11/30/learnings-after-4-years-data-eng.html"&gt;&lt;p&gt;Most people don’t have an intuition about what current hardware can and can’t do. There is a simple math that can help you with that: “you can process about 500MB in one second on a single machine”. I know it’s not a universal truth and there are a lot of details that can change that but believe me, this estimation is a pretty good tool to have under your belt.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://javisantana.com/2024/11/30/learnings-after-4-years-data-eng.html"&gt;Javi Santana&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="scaling"/><category term="big-data"/></entry><entry><title>Big Data is Dead</title><link href="https://simonwillison.net/2023/Feb/7/big-data-is-dead/#atom-tag" rel="alternate"/><published>2023-02-07T19:25:40+00:00</published><updated>2023-02-07T19:25:40+00:00</updated><id>https://simonwillison.net/2023/Feb/7/big-data-is-dead/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://motherduck.com/blog/big-data-is-dead/"&gt;Big Data is Dead&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Don’t be distracted by the headline: this is well worth your time. Jordan Tigani spent ten years working on Google BigQuery, during which time he was surprised to learn that the median data storage size for regular customers was much less than 100GB. In this piece he argues that genuine Big Data solutions are relevant to a tiny fraction of companies, and there’s way more value in solving problems for everyone else. I’ve been talking about Datasette as a tool for solving “small data” problems for a while, and this article has given me a whole bunch of new arguments I can use to support that concept.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=34694926"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/small-data"&gt;small-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="big-data"/><category term="small-data"/></entry><entry><title>What I've learned about data recently</title><link href="https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag" rel="alternate"/><published>2021-06-22T17:09:07+00:00</published><updated>2021-06-22T17:09:07+00:00</updated><id>https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://seldo.com/posts/what-i-ve-learned-about-data-recently"&gt;What I&amp;#x27;ve learned about data recently&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/seldo/status/1407370508576780290"&gt;@seldo&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laurie-voss"&gt;laurie-voss&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="data-science"/><category term="laurie-voss"/></entry><entry><title>Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask)</title><link href="https://simonwillison.net/2021/Jan/5/clickhouse-github/#atom-tag" rel="alternate"/><published>2021-01-05T01:02:40+00:00</published><updated>2021-01-05T01:02:40+00:00</updated><id>https://simonwillison.net/2021/Jan/5/clickhouse-github/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gh.clickhouse.tech/explorer/"&gt;Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
ClickHouse by Yandex is an open source column-oriented data warehouse, designed to run analytical queries against TBs of data. They've loaded the full GitHub Archive of events since 2011 into a public instance, which is a great way of both exploring GitHub activity and trying out ClickHouse. Here's a query I just ran that shows the number of watch events per year, for example:
&lt;pre&gt;&lt;code&gt;SELECT toYear(created_at) AS yyyy, count()
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY yyyy
&lt;/code&gt;&lt;/pre&gt;
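&lt;p&gt;toYear() is ClickHouse-specific; as a toy stand-in, the same aggregation can be sketched with Python's stdlib sqlite3, where strftime('%Y', ...) plays the equivalent role (the table and rows here are invented for illustration):&lt;/p&gt;

```python
import sqlite3

# Toy stand-in for the ClickHouse query above, using stdlib sqlite3:
# strftime('%Y', ...) plays the role of ClickHouse's toYear().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE github_events (event_type TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO github_events VALUES (?, ?)",
    [
        ("WatchEvent", "2011-03-01"),
        ("WatchEvent", "2011-07-15"),
        ("PushEvent", "2011-07-15"),
        ("WatchEvent", "2012-01-02"),
    ],
)
rows = conn.execute(
    """
    SELECT strftime('%Y', created_at) AS yyyy, count(*)
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY yyyy
    ORDER BY yyyy
    """
).fetchall()
print(rows)  # [('2011', 2), ('2012', 1)]
```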

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=25638853"&gt;A Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clickhouse"&gt;clickhouse&lt;/a&gt;&lt;/p&gt;



</summary><category term="analytics"/><category term="github"/><category term="sql"/><category term="big-data"/><category term="clickhouse"/></entry><entry><title>Quoting Netflix Technology Blog</title><link href="https://simonwillison.net/2018/Aug/18/netflix-technology-blog/#atom-tag" rel="alternate"/><published>2018-08-18T17:35:45+00:00</published><updated>2018-08-18T17:35:45+00:00</updated><id>https://simonwillison.net/2018/Aug/18/netflix-technology-blog/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://medium.com/netflix-techblog/notebook-innovation-591ee3221233"&gt;&lt;p&gt;Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs against this data, spanning everything from reporting and analysis to machine learning and recommendation algorithms.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://medium.com/netflix-techblog/notebook-innovation-591ee3221233"&gt;Netflix Technology Blog&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/netflix"&gt;netflix&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="netflix"/><category term="big-data"/><category term="jupyter"/></entry><entry><title>Usage of ARIA attributes via HTTP Archive</title><link href="https://simonwillison.net/2018/Jul/12/usage-aria-attributes-http-archive/#atom-tag" rel="alternate"/><published>2018-07-12T03:16:26+00:00</published><updated>2018-07-12T03:16:26+00:00</updated><id>https://simonwillison.net/2018/Jul/12/usage-aria-attributes-http-archive/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://discuss.httparchive.org/t/usage-of-aria-attributes/778"&gt;Usage of ARIA attributes via HTTP Archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way.
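&lt;p&gt;The real analysis runs as SQL over the HTTP Archive tables in BigQuery; as a toy sketch of the underlying idea, here's a Python tally of aria-* attribute names across some invented attribute strings:&lt;/p&gt;

```python
import re
from collections import Counter

# Toy sketch of the idea behind the linked query: tally which aria-*
# attributes appear most often. The sample attribute strings are
# invented; the real analysis is SQL over the HTTP Archive dataset.
samples = [
    'button aria-label="Close" aria-hidden="true"',
    'nav aria-label="Main"',
    'div aria-hidden="true"',
]
counts = Counter(
    attr for s in samples for attr in re.findall(r"aria-[a-z]+", s)
)
print(counts.most_common())  # [('aria-label', 2), ('aria-hidden', 2)]
```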


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aria"&gt;aria&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="aria"/><category term="http"/><category term="internet-archive"/><category term="big-data"/></entry><entry><title>ActorDB</title><link href="https://simonwillison.net/2018/Jun/24/actordb/#atom-tag" rel="alternate"/><published>2018-06-24T21:48:29+00:00</published><updated>2018-06-24T21:48:29+00:00</updated><id>https://simonwillison.net/2018/Jun/24/actordb/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/biokoda/actordb"&gt;ActorDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A distributed SQL database written in Erlang and built on top of SQLite (which in turn sits on LMDB), adding replication via the Raft consensus algorithm (so it's sharded with no single points of failure) and a MySQL protocol interface. An interesting combination of technologies.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/erlang"&gt;erlang&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="erlang"/><category term="scaling"/><category term="sqlite"/><category term="big-data"/></entry><entry><title>Query Parquet files in SQLite</title><link href="https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag" rel="alternate"/><published>2018-06-24T19:44:06+00:00</published><updated>2018-06-24T19:44:06+00:00</updated><id>https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html"&gt;Query Parquet files in SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data—most CSV files, for example. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42 MB (3% of the original)—then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.
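&lt;p&gt;A stdlib-only sketch of why columnar layouts compress so well (this is not Parquet itself, just the general effect demonstrated with zlib on toy data):&lt;/p&gt;

```python
import zlib

# Illustrative sketch (not Parquet itself): columnar layouts put
# duplicate values next to each other, which generic compressors
# exploit far better than row-major text. Toy data, not the census file.
rows = [(f"id-{i}", "CA", "2018-06-24") for i in range(10_000)]

row_major = "\n".join(",".join(r) for r in rows).encode()
columns = list(zip(*rows))
col_major = "\n".join(",".join(col) for col in columns).encode()

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))
print(row_size, col_size)  # column-major compresses smaller
```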


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="big-data"/><category term="datasette"/><category term="parquet"/><category term="colin-dellow"/></entry><entry><title>Mozilla Telemetry: In-depth Data Pipeline</title><link href="https://simonwillison.net/2018/Apr/12/in-depth-data-pipeline-detail/#atom-tag" rel="alternate"/><published>2018-04-12T15:44:42+00:00</published><updated>2018-04-12T15:44:42+00:00</updated><id>https://simonwillison.net/2018/Apr/12/in-depth-data-pipeline-detail/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.telemetry.mozilla.org/concepts/pipeline/data_pipeline_detail.html#a-detailed-look-at-the-data-platform"&gt;Mozilla Telemetry: In-depth Data Pipeline&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Detailed behind-the-scenes look at an extremely sophisticated big data telemetry processing system built using open source tools. Some of this is unsurprising (S3 for storage, Spark and Kafka for streams) but the details are fascinating. They use a custom nginx module for the ingestion endpoint and have a “tee” server written in Lua and OpenResty which lets them route some traffic to an alternative backend.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/reid_write/status/984412694336933889"&gt;@reid_write&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lua"&gt;lua&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nginx"&gt;nginx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kafka"&gt;kafka&lt;/a&gt;&lt;/p&gt;



</summary><category term="analytics"/><category term="lua"/><category term="mozilla"/><category term="nginx"/><category term="big-data"/><category term="kafka"/></entry><entry><title>What can startups do on big data day one?</title><link href="https://simonwillison.net/2013/Sep/16/what-can-startups-do/#atom-tag" rel="alternate"/><published>2013-09-16T13:05:00+00:00</published><updated>2013-09-16T13:05:00+00:00</updated><id>https://simonwillison.net/2013/Sep/16/what-can-startups-do/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-can-startups-do-on-big-data-day-one/answer/Simon-Willison"&gt;What can startups do on big data day one?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Log everything, and then forget about it. That way you'll have data you can analyse later on, but aside from setting up logging and log storage you won't waste any time messing around with Big Data when you haven't yet found product-market fit.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/entrepreneurship"&gt;entrepreneurship&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/startups"&gt;startups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="entrepreneurship"/><category term="startups"/><category term="quora"/><category term="big-data"/></entry><entry><title>I would like to attend a Big Data conference but I am short of funds. Is there any big data conference that helps students attend those conference through scholarship?</title><link href="https://simonwillison.net/2013/Sep/14/i-would-like-to/#atom-tag" rel="alternate"/><published>2013-09-14T15:09:00+00:00</published><updated>2013-09-14T15:09:00+00:00</updated><id>https://simonwillison.net/2013/Sep/14/i-would-like-to/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/I-would-like-to-attend-a-Big-Data-conference-but-I-am-short-of-funds-Is-there-any-big-data-conference-that-helps-students-attend-those-conference-through-scholarship/answer/Simon-Willison"&gt;I would like to attend a Big Data conference but I am short of funds. Is there any big data conference that helps students attend those conference through scholarship?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The traditional route for students who can't afford to attend a conference is for them to volunteer. Contact event organisers of Big Data conferences that look relevant and ask if they are looking for volunteers.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="conferences"/><category term="quora"/><category term="big-data"/></entry><entry><title>What is a good list of conferences, speaking gigs, hackathons, and other technology-centric events where one can reach software architects and developers?</title><link href="https://simonwillison.net/2013/Aug/21/what-is-a-good/#atom-tag" rel="alternate"/><published>2013-08-21T17:15:00+00:00</published><updated>2013-08-21T17:15:00+00:00</updated><id>https://simonwillison.net/2013/Aug/21/what-is-a-good/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-is-a-good-list-of-conferences-speaking-gigs-hackathons-and-other-technology-centric-events-where-one-can-reach-software-architects-and-developers/answer/Simon-Willison"&gt;What is a good list of conferences, speaking gigs, hackathons, and other technology-centric events where one can reach software architects and developers?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We have a pretty comprehensive list of (mostly tech) conferences in the Midwest USA here: &lt;span&gt;&lt;a href="http://lanyrd.com/places/midwestern-usa/"&gt;http://lanyrd.com/places/midwest...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/entrepreneurship"&gt;entrepreneurship&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/startups"&gt;startups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hackathons"&gt;hackathons&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="conferences"/><category term="entrepreneurship"/><category term="startups"/><category term="quora"/><category term="hackathons"/><category term="big-data"/></entry><entry><title>Where can I find an updated DB of countries, states and cities?</title><link href="https://simonwillison.net/2013/Jun/13/where-can-i-find/#atom-tag" rel="alternate"/><published>2013-06-13T09:11:00+00:00</published><updated>2013-06-13T09:11:00+00:00</updated><id>https://simonwillison.net/2013/Jun/13/where-can-i-find/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Where-can-I-find-an-updated-DB-of-countries-states-and-cities/answer/Simon-Willison"&gt;Where can I find an updated DB of countries, states and cities?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a surprisingly complicated question. The first thing you might want to ask yourself is "what's a country" - how do you deal with places on this &lt;span&gt;&lt;a href="http://en.wikipedia.org/wiki/List_of_states_with_limited_recognition"&gt;List of states with limited recognition&lt;/a&gt;&lt;/span&gt; for example?&lt;/p&gt;

&lt;p&gt;That said, data sets do exist that can help you with this. I'm a big fan of Yahoo!'s GeoPlanet dataset, which represents places around the world as a hierarchy. We use it as the basis for the geo features of our site Lanyrd - take a look at the breadcrumbs on &lt;span&gt;&lt;a href="http://lanyrd.com/places/london/"&gt;Conferences in London | Conferences &amp;amp; Events | Lanyrd&lt;/a&gt;&lt;/span&gt; for example.&lt;/p&gt;

&lt;p&gt;You can access GeoPlanet through the GeoPlanet API &lt;span&gt;&lt;a href="http://developer.yahoo.com/geo/geoplanet/"&gt;Yahoo! GeoPlanet™ - YDN&lt;/a&gt;&lt;/span&gt; or using Yahoo!'s YQL API language: &lt;span&gt;&lt;a href="http://developer.yahoo.com/yql/console/#h=select%20*%20from%20geo.places%20where%20text%3D%22san%20francisco%2C%20ca%22"&gt;YQL Console&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Yahoo! used to make a full dump of the GeoPlanet database available at &lt;span&gt;&lt;a href="http://developer.yahoo.com/geo/geoplanet/data/"&gt;Yahoo! GeoPlanet - YDN&lt;/a&gt;&lt;/span&gt; but they no longer publish it. However, there are copies of their old releases available on the Internet Archive: &lt;span&gt;&lt;a href="http://archive.org/search.php?query=geoplanet"&gt;Internet Archive Search: geoplanet&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
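&lt;p&gt;A toy sketch of the hierarchy idea (the IDs and names here are invented, not real WOEIDs): each place records its parent, and a breadcrumb trail is just the walk up to the root:&lt;/p&gt;

```python
# Toy sketch of the hierarchy idea behind GeoPlanet: each place points
# at its parent, and breadcrumbs are the walk up to the root.
# The IDs and names here are invented, not real WOEIDs.
parents = {"london": "england", "england": "uk", "uk": "earth"}
names = {
    "london": "London",
    "england": "England",
    "uk": "United Kingdom",
    "earth": "Earth",
}

def breadcrumbs(place):
    """Return the names from the root down to the given place."""
    trail = [place]
    while place in parents:
        place = parents[place]
        trail.append(place)
    return [names[p] for p in reversed(trail)]

print(" / ".join(breadcrumbs("london")))  # Earth / United Kingdom / England / London
```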
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-development"&gt;web-development&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="databases"/><category term="web-development"/><category term="quora"/><category term="big-data"/></entry><entry><title>What are the best big data conferences?</title><link href="https://simonwillison.net/2011/Jul/21/what-are-the-best/#atom-tag" rel="alternate"/><published>2011-07-21T15:01:00+00:00</published><updated>2011-07-21T15:01:00+00:00</updated><id>https://simonwillison.net/2011/Jul/21/what-are-the-best/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-the-best-big-data-conferences/answer/Simon-Willison"&gt;What are the best big data conferences?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;O'Reilly's Strata is excellent - I went to their first event in February in Santa Clara, and they're running another one in New York on 22nd-23rd September: &lt;span&gt;&lt;a href="http://lanyrd.com/2011/stratany/"&gt;http://lanyrd.com/2011/stratany/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;If you're interested in Hadoop, Hadoop World is coming up in November (again in New York): &lt;span&gt;&lt;a href="http://lanyrd.com/2011/hadoop-world/"&gt;http://lanyrd.com/2011/hadoop-wo...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;OSCON have a separate Big Data event running parallel to the main conference - that's next week 25th-27th July in Portland: &lt;span&gt;&lt;a href="http://lanyrd.com/2011/oscon-data/"&gt;http://lanyrd.com/2011/oscon-data/&lt;/a&gt;&lt;/span&gt; - it's the first time they've done it, but it's likely to be excellent.&lt;/p&gt;

&lt;p&gt;You can track our collection of all the big data conferences we know about on these pages: &lt;span&gt;&lt;a href="http://lanyrd.com/topics/big-data/"&gt;http://lanyrd.com/topics/big-data/&lt;/a&gt;&lt;/span&gt; and &lt;span&gt;&lt;a href="http://lanyrd.com/topics/hadoop/"&gt;http://lanyrd.com/topics/hadoop/&lt;/a&gt;&lt;/span&gt; ( &lt;span&gt;&lt;a href="http://lanyrd.com/topics/nosql/"&gt;http://lanyrd.com/topics/nosql/&lt;/a&gt;&lt;/span&gt; may be of interest as well)&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="conferences"/><category term="quora"/><category term="big-data"/></entry><entry><title>The Seven Secrets of Successful Data Scientists</title><link href="https://simonwillison.net/2010/Sep/3/seven/#atom-tag" rel="alternate"/><published>2010-09-03T00:36:00+00:00</published><updated>2010-09-03T00:36:00+00:00</updated><id>https://simonwillison.net/2010/Sep/3/seven/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://dataspora.com/blog/the-seven-secrets-of-successful-data-scientists/"&gt;The Seven Secrets of Successful Data Scientists&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Some sensible advice, including: pick the right-sized tool, compress everything, split up your data, use open source, and run the analysis where the data is.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="recovered"/></entry></feed>