<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: haki-benita</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/haki-benita.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-08-05T15:15:30+00:00</updated><author><name>Simon Willison</name></author><entry><title>How to Get or Create in PostgreSQL</title><link href="https://simonwillison.net/2024/Aug/5/how-to-get-or-create-in-postgresql/#atom-tag" rel="alternate"/><published>2024-08-05T15:15:30+00:00</published><updated>2024-08-05T15:15:30+00:00</updated><id>https://simonwillison.net/2024/Aug/5/how-to-get-or-create-in-postgresql/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hakibenita.com/postgresql-get-or-create"&gt;How to Get or Create in PostgreSQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Get or create - for example to retrieve an existing tag record from a database table if it already exists or insert it if it doesn’t - is a surprisingly difficult operation.&lt;/p&gt;
&lt;p&gt;Haki Benita uses it to illustrate a variety of interesting PostgreSQL concepts.&lt;/p&gt;
&lt;p&gt;New to me: a pattern that runs &lt;code&gt;INSERT INTO tags (name) VALUES (tag_name) RETURNING *;&lt;/code&gt; and, when that raises a constraint violation, catches the error and returns the existing record instead, has a disadvantage at scale: “The table contains a dead tuple for every attempt to insert a tag that already existed” - so until vacuum runs you can end up with significant table bloat!&lt;/p&gt;
&lt;p&gt;Haki’s conclusion is that the best solution relies on a feature &lt;a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c649fa24a42ba89bf5460c7110e4fc8eeca65959"&gt;coming in PostgreSQL 17&lt;/a&gt;: the ability to combine the &lt;a href="https://www.postgresql.org/docs/current/sql-merge.html"&gt;MERGE operation&lt;/a&gt; with a RETURNING clause:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;WITH new_tags AS (
    MERGE INTO tags
    USING (VALUES ('B'), ('C')) AS t(name)
    ON tags.name = t.name
    WHEN NOT MATCHED THEN
        INSERT (name) VALUES (t.name)
    RETURNING *
)
SELECT * FROM tags WHERE name IN ('B', 'C')
    UNION ALL
SELECT * FROM new_tags;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I wonder what the best pattern for this in SQLite is. Could it be as simple as this?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSERT OR IGNORE INTO tags (name) VALUES ('B'), ('C');
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The SQLite &lt;a href="https://www.sqlite.org/lang_insert.html"&gt;INSERT documentation&lt;/a&gt; doesn't currently provide extensive details for &lt;code&gt;INSERT OR IGNORE&lt;/code&gt;, but there are some hints &lt;a href="https://sqlite.org/forum/forumpost/f13dc431f9f3e669"&gt;in this forum thread&lt;/a&gt;. &lt;a href="https://hoelz.ro/blog/with-sqlite-insert-or-ignore-is-often-not-what-you-want"&gt;This post&lt;/a&gt; by Rob Hoelz points out that &lt;code&gt;INSERT OR IGNORE&lt;/code&gt; will silently ignore &lt;em&gt;any&lt;/em&gt; constraint violation, so &lt;code&gt;INSERT INTO tags (name) VALUES ('C'), ('D') ON CONFLICT(name) DO NOTHING&lt;/code&gt; may be a better option.&lt;/p&gt;
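&lt;p&gt;Here is a minimal sketch of how that might look as a complete get-or-create from Python's &lt;code&gt;sqlite3&lt;/code&gt; module. The table schema and the &lt;code&gt;get_or_create_tags&lt;/code&gt; helper are my own assumptions for illustration, not something from Haki's article, and the &lt;code&gt;ON CONFLICT&lt;/code&gt; clause needs SQLite 3.24.0 or later.&lt;/p&gt;

```python
import sqlite3


def get_or_create_tags(conn, names):
    """Return {name: id} for every name, inserting the ones that are missing.

    Hypothetical helper: assumes a "tags" table with a UNIQUE "name" column.
    """
    names = list(dict.fromkeys(names))  # de-duplicate while keeping order
    if not names:
        return {}
    with conn:  # commit on success, roll back if anything raises
        # Only a conflict on the unique "name" column is skipped; other
        # constraint violations (such as NOT NULL) still raise an error.
        conn.executemany(
            "INSERT INTO tags (name) VALUES (?) ON CONFLICT(name) DO NOTHING",
            [(name,) for name in names],
        )
        placeholders = ", ".join("?" for _ in names)
        rows = conn.execute(
            f"SELECT name, id FROM tags WHERE name IN ({placeholders})",
            names,
        ).fetchall()
    return dict(rows)


# Assumed schema for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
first = get_or_create_tags(conn, ["A", "B"])
second = get_or_create_tags(conn, ["B", "C"])
```

&lt;p&gt;The second call reuses the existing row for &lt;code&gt;B&lt;/code&gt; and only inserts &lt;code&gt;C&lt;/code&gt;. Running the insert and the select as two statements inside one transaction is a design choice I made to keep the sketch simple; I haven't measured how it compares with other approaches.&lt;/p&gt;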

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41159797"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/haki-benita"&gt;haki-benita&lt;/a&gt;&lt;/p&gt;



</summary><category term="postgresql"/><category term="sql"/><category term="sqlite"/><category term="haki-benita"/></entry><entry><title>Fastest Way to Read Excel in Python</title><link href="https://simonwillison.net/2024/Jan/3/fastest-way-to-read-excel-in-python/#atom-tag" rel="alternate"/><published>2024-01-03T20:04:00+00:00</published><updated>2024-01-03T20:04:00+00:00</updated><id>https://simonwillison.net/2024/Jan/3/fastest-way-to-read-excel-in-python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hakibenita.com/fast-excel-python"&gt;Fastest Way to Read Excel in Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Haki Benita produced a meticulously researched and written exploration of the options for reading a large Excel spreadsheet into Python. He explored Pandas, Tablib, Openpyxl, shelling out to LibreOffice, DuckDB and python-calamine (a Python wrapper of a Rust library). Calamine was the winner, taking 3.58s to read 500,000 rows, compared to Pandas in last place at 32.98s.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/5tugrd/fastest_way_read_excel_python"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/excel"&gt;excel&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/haki-benita"&gt;haki-benita&lt;/a&gt;&lt;/p&gt;



</summary><category term="excel"/><category term="pandas"/><category term="python"/><category term="rust"/><category term="duckdb"/><category term="haki-benita"/></entry><entry><title>Practical SQL for Data Analysis</title><link href="https://simonwillison.net/2021/May/4/practical-sql-for-data-analysis/#atom-tag" rel="alternate"/><published>2021-05-04T03:11:37+00:00</published><updated>2021-05-04T03:11:37+00:00</updated><id>https://simonwillison.net/2021/May/4/practical-sql-for-data-analysis/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hakibenita.com/sql-for-data-analysis"&gt;Practical SQL for Data Analysis&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a really great SQL tutorial: it starts with the basics, but quickly moves on to a whole array of advanced PostgreSQL techniques - CTEs, window functions, efficient sampling, rollups, pivot tables and even linear regressions executed directly in the database using &lt;code&gt;regr_slope()&lt;/code&gt;, &lt;code&gt;regr_intercept()&lt;/code&gt; and &lt;code&gt;regr_r2()&lt;/code&gt;. I picked up a whole bunch of tips for things I didn't know you could do with PostgreSQL here.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/be_haki/status/1386646932588740612"&gt;@be_haki&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/haki-benita"&gt;haki-benita&lt;/a&gt;&lt;/p&gt;



</summary><category term="postgresql"/><category term="sql"/><category term="haki-benita"/></entry><entry><title>Optimizing Django Admin Paginator</title><link href="https://simonwillison.net/2018/Nov/6/optimizing-django-admin-paginator/#atom-tag" rel="alternate"/><published>2018-11-06T18:17:42+00:00</published><updated>2018-11-06T18:17:42+00:00</updated><id>https://simonwillison.net/2018/Nov/6/optimizing-django-admin-paginator/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hakibenita.com/optimizing-the-django-admin-paginator"&gt;Optimizing Django Admin Paginator&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Django admin paginator uses a &lt;code&gt;count(*)&lt;/code&gt; query to calculate the total number of rows, so it knows how many pages to display. This makes it unpleasantly slow over large datasets. Haki Benita has an ingenious solution: drop in a custom paginator which first runs the PostgreSQL “SET LOCAL statement_timeout TO 200” statement (a 200ms limit), then returns 9999999999 as the count if a timeout error is raised. This means small tables get accurate page counts, while giant tables still display in the admin within a reasonable time.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/haki-benita"&gt;haki-benita&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="django-admin"/><category term="postgresql"/><category term="haki-benita"/></entry></feed>