<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: crisishacking</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/crisishacking.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2017-10-10T06:48:07+00:00</updated><author><name>Simon Willison</name></author><entry><title>Changelogs to help understand the fires in the North Bay</title><link href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-tag" rel="alternate"/><published>2017-10-10T06:48:07+00:00</published><updated>2017-10-10T06:48:07+00:00</updated><id>https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-tag</id><summary type="html">
    &lt;p&gt;The situation in the counties north of San Francisco &lt;a href="http://www.sfgate.com/bayarea/article/Latest-on-North-Bay-fires-A-really-rough-12263721.php"&gt;is horrifying right now&lt;/a&gt;. I’ve repurposed some of &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;the tools I built for the Irma Response project&lt;/a&gt; last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.&lt;/p&gt;
&lt;p&gt;I’m scraping a number of sources relevant to the crisis, and making the data available in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a repository on GitHub&lt;/a&gt;. Because it’s a git repository, changes to those sources are tracked automatically. The value I’m providing here isn’t so much the data itself as the history of the data. If you need to see what has changed and when, my repository’s commit log should have the answers for you. Or maybe you’ll just want to occasionally hit refresh on &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;this history of changes&lt;/a&gt; to &lt;a href="https://srcity.org/610/Emergency-Information"&gt;srcity.org/610/Emergency-Information&lt;/a&gt; to see when they edited the information.&lt;/p&gt;
&lt;p&gt;The sources I’m tracking right now are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://srcity.org/610/Emergency-Information"&gt;Santa Rosa Fire Department’s Emergency Information&lt;/a&gt; page. This is being maintained by hand so it’s not a great source of structured data, but it has key details like the location and availability of shelters and it’s useful to know what was changed and when. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;History of changes to that page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.pge.com/#outages"&gt;PG&amp;amp;E power outages&lt;/a&gt;. This is probably the highest quality dataset with the &lt;a href="https://github.com/simonw/irma-scraped-data/commit/50ab3d3f3a5f117054e3209c7f0d520e6b483f0e#diff-2432d375ba73b2c87c88f55b12a0a2f0"&gt;neatest commit messages&lt;/a&gt;. The &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/pge-outages-individual.json"&gt;commit history of these&lt;/a&gt; shows exactly when new outages are reported and how many customers were affected.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://roadconditions.sonoma-county.org/"&gt;Road Conditions in the County of Sonoma&lt;/a&gt;. If you want to understand how far the fire has spread, this is a useful source of data as it shows which roads have been closed due to fire or other reasons. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/sonoma-road-conditions.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;California Highway Patrol Incidents, extracted from a KML feed on &lt;a href="http://quickmap.dot.ca.gov/"&gt;quickmap.dot.ca.gov&lt;/a&gt;. Since these cover the whole state of California there’s a lot of stuff in here that isn’t directly relevant to the North Bay, but the incidents that mention fire still help tell the story of what’s been happening. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/chp-incidents.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The code for the scrapers can be &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/north_bay.py"&gt;found in north_bay.py&lt;/a&gt;. Please leave comments, feedback or suggestions on other useful potential sources of data &lt;a href="https://github.com/simonw/simonwillisonblog/issues/4"&gt;in this GitHub issue&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry><entry><title>Scraping hurricane Irma</title><link href="https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-tag" rel="alternate"/><published>2017-09-10T06:21:17+00:00</published><updated>2017-09-10T06:21:17+00:00</updated><id>https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://www.irmaresponse.org/"&gt;Irma Response project&lt;/a&gt; is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The &lt;a href="https://irma-api.herokuapp.com/"&gt;Irma API&lt;/a&gt; is an attempt to gather key information in one place, verify it and publish it in a reuseable way. It currently powers the &lt;a href="https://www.irmashelters.org/"&gt;irmashelters.org&lt;/a&gt; website.&lt;/p&gt;
&lt;p&gt;To aid this effort, I built a collection of screen scrapers that pull data from a number of different websites and APIs. That data is then stored in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a Git repository&lt;/a&gt;, providing a clear history of changes made to the various sources that are being tracked.&lt;/p&gt;
&lt;p&gt;Some of the scrapers also publish their findings to Slack in a format designed to make it obvious when key events happen, such as shelters being added to or removed from public listings.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Tracking_changes_over_time_8"&gt;&lt;/a&gt;Tracking changes over time&lt;/h3&gt;
&lt;p&gt;A key goal of this screen scraping mechanism is to allow changes to the underlying data sources to be tracked over time. This is achieved using git, via the GitHub API. Each scraper pulls down data from a source (an API or a website) and reformats that data into a sanitized JSON format. That JSON is then written to the git repository. If the data has changed since the last time the scraper ran, those changes will be captured by git and made available in the commit log.&lt;/p&gt;
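&lt;p&gt;The write-only-when-changed step described above can be sketched roughly as follows. This is a minimal illustration that assumes a local file rather than the GitHub API; the function names &lt;code&gt;to_stable_json&lt;/code&gt; and &lt;code&gt;write_if_changed&lt;/code&gt; are hypothetical and are not taken from the scraper source.&lt;/p&gt;

```python
import json
import os
import tempfile


def to_stable_json(records):
    # Sorted keys and fixed indentation keep the output byte-identical
    # when the data is unchanged, so a diff only shows real changes.
    return json.dumps(records, indent=2, sort_keys=True) + "\n"


def write_if_changed(path, records):
    """Write records to path; return True only if the file content changed."""
    new_text = to_stable_json(records)
    old_text = None
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            old_text = f.read()
    if new_text == old_text:
        # Nothing to record: git would see no difference either.
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_text)
    return True


if __name__ == "__main__":
    # Hypothetical record layout, used only for this demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "example.json")
    print(write_if_changed(demo_path, [{"name": "Example", "county": "Demo"}]))
    print(write_if_changed(demo_path, [{"county": "Demo", "name": "Example"}]))
```

&lt;p&gt;Serializing with sorted keys matters here: without a stable key order, a source that returns the same data in a different order would look like a change on every run.&lt;/p&gt;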
&lt;p&gt;Recent changes tracked by the scraper collection can be seen here: &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master"&gt;https://github.com/simonw/irma-scraped-data/commits/master&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Generating_useful_commit_messages_14"&gt;&lt;/a&gt;Generating useful commit messages&lt;/h3&gt;
&lt;p&gt;The most complex code for most of the scrapers isn’t in fetching the data: it’s in generating useful, human-readable commit messages that summarize the underlying change. For example, here is &lt;a href="https://github.com/simonw/irma-scraped-data/commit/7919aeff0913ec26d1bea8dc"&gt;a commit message&lt;/a&gt; generated by the scraper that tracks the &lt;a href="http://www.floridadisaster.org/shelters/summary.aspx"&gt;http://www.floridadisaster.org/shelters/summary.aspx&lt;/a&gt; page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;florida-shelters.json: 2 shelters added

Added shelter: Atwater Elementary School (Sarasota County)
Added shelter: DEBARY ELEMENTARY SCHOOL (Volusia County)
Change detected on http://www.floridadisaster.org/shelters/summary.aspx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full commit also shows the changes to the underlying JSON, but the human-readable message provides enough information that people who are not JSON-literate programmers can still derive value from the commit.&lt;/p&gt;
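&lt;p&gt;As a rough illustration of how a message in that shape could be derived, here is a sketch that compares the previous and current shelter lists. The record keys &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;county&lt;/code&gt;, the function names and the "Removed shelter" wording are all assumptions made for this example, not details of the actual scraper.&lt;/p&gt;

```python
def describe(shelter):
    # Assumed record layout: {"name": ..., "county": ...}
    return "{} ({} County)".format(shelter["name"], shelter["county"])


def plural(count, noun):
    return "{} {}{}".format(count, noun, "" if count == 1 else "s")


def commit_message(filename, old, new, source_url):
    """Summarize added and removed shelters, or return None if nothing changed."""
    old_keys = {(s["name"], s["county"]) for s in old}
    new_keys = {(s["name"], s["county"]) for s in new}
    added = [s for s in new if (s["name"], s["county"]) not in old_keys]
    removed = [s for s in old if (s["name"], s["county"]) not in new_keys]
    if not added and not removed:
        return None
    parts = []
    if added:
        parts.append(plural(len(added), "shelter") + " added")
    if removed:
        parts.append(plural(len(removed), "shelter") + " removed")
    # First line is the short summary, then a blank line, then the details.
    lines = ["{}: {}".format(filename, ", ".join(parts)), ""]
    lines += ["Added shelter: " + describe(s) for s in added]
    lines += ["Removed shelter: " + describe(s) for s in removed]
    lines.append("Change detected on " + source_url)
    return "\n".join(lines)


if __name__ == "__main__":
    current = [
        {"name": "Atwater Elementary School", "county": "Sarasota"},
        {"name": "DEBARY ELEMENTARY SCHOOL", "county": "Volusia"},
    ]
    print(commit_message(
        "florida-shelters.json", [], current,
        "http://www.floridadisaster.org/shelters/summary.aspx",
    ))
```

&lt;p&gt;Identifying a shelter by its name and county is a simplification; a source that renames a shelter would show up here as one removal plus one addition.&lt;/p&gt;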
&lt;h3&gt;&lt;a id="Publishing_to_Slack_26"&gt;&lt;/a&gt;Publishing to Slack&lt;/h3&gt;
&lt;p&gt;The Irma Response team use Slack to co-ordinate their efforts. You can join their Slack here: &lt;a href="https://irma-response-slack.herokuapp.com/"&gt;https://irma-response-slack.herokuapp.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Some of the scrapers publish detected changes in their data source to Slack, as links to the commits generated for each change. The human-readable message is posted directly to the channel.&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="http://static.simonwillison.net.s3.amazonaws.com/static/2017/irma-slack.jpg" alt="Bot publishing to Slack" /&gt;&lt;/p&gt;
&lt;p&gt;The source code for all of the scrapers can be found at &lt;a href="https://github.com/simonw/irma-scrapers"&gt;https://github.com/simonw/irma-scrapers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This entry started out as a &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/README.md"&gt;README file&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry></feed>