Simon Willison's Weblog: orm

SQLModel

2021-08-24T23:16:42+00:00

A new project by FastAPI creator Sebastián Ramírez: SQLModel builds on top of both SQLAlchemy and Sebastián’s Pydantic validation library to provide a new ORM that’s designed around Python 3’s optional typing. The real brilliance here is that a SQLModel subclass is simultaneously a valid SQLAlchemy ORM model AND a valid Pydantic validation model, saving on duplicate code by allowing the same class to be used both for form/API validation and for interacting with the database.

Tags: orm, python, sql, sqlalchemy, pydantic

PugSQL

2019-07-03T18:19:38+00:00

PugSQL

Interesting new twist on a definitely-not-an-ORM library for Python. With PugSQL you define SQL queries in files, give them names and then load them into a module which allows you to execute them as Python methods with keyword arguments. You can mark statements as only returning a single row (or a single scalar value) with a comment at the top of their file.

Tags: orm, python, sql, dan-mckinley

Sqorn

2018-09-19T18:34:16+00:00

Sqorn

JavaScript library for building SQL queries that makes really smart usage of ES6 tagged template literals. The magic of tagged template literals is that they let you intercept and process interpolated values, making them ideally suited to escaping parameters in SQL queries. Sqorn takes that basic ability and layers on some really interesting API design to allow you to further compose queries.

Via Michele Nasti

Tags: javascript, orm, sql

Describing events in code

2018-03-28T15:41:59+00:00

Describing events in code

Phil Gyford built an online directory of every play, movie, gig and exhibition he has been to in the past 38 years using a combination of digital archaeology and saved ticket stubs. He built it using Django and published this piece extensively describing the process he went through to design the data model.

Via @philgyford

Tags: django, orm, phil-gyford

Building a combined stream of recent additions using the Django ORM

2018-03-25T00:47:54+00:00

I’m a big believer in the importance of a “recent additions” feed. Any time you’re building an application that involves users adding and editing records it’s useful to have a page somewhere that shows the most recent objects that have been created across multiple different types of data.

I’ve used a number of techniques to build these in the past - from an extra database table (e.g. the Django Admin’s LogEntry model) to a Solr or Elasticsearch index that exists just to serve recent additions.

For a recent small project I found myself needing a recent additions feed and realized that there’s a new, simple way to build one thanks to the QuerySet.union() method introduced in Django 1.11 back in April 2017.

Consider a number of different ORM models that can be added by users, each with a created timestamp field.

Prior to QuerySet.union(), building a combined recent additions feed across multiple models was difficult: it’s easy to show recent additions for a single model, but how can we intersperse and paginate additions made to models stored across more than one table?

Using .union() to combine records from different models

Consider the following three models:

class Project(models.Model):
    name = models.CharField(max_length=128)
    description = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

class Image(models.Model):
    project = models.ForeignKey(
        Project, related_name='images', on_delete=models.CASCADE
    )
    image = models.ImageField()
    created = models.DateTimeField(auto_now_add=True)

class Comment(models.Model):
    project = models.ForeignKey(
        Project, related_name='comments', on_delete=models.CASCADE
    )
    comment = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

Let’s build a single QuerySet that returns objects from all three models ordered by their created dates, most recent first.

Using .values() we can reduce these different models to a common subset of fields, which we can then .union() together like so:

recent = Project.objects.values(
    'pk', 'created'
).union(
    Image.objects.values('pk', 'created'),
    Comment.objects.values('pk', 'created'),
).order_by('-created')[:4]

Now if we print out list(recent) it will look something like this:

[{'created': datetime.datetime(2018, 3, 24, 1, 27, 23, 625195, tzinfo=<UTC>),
  'pk': 28},
 {'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15},
 {'created': datetime.datetime(2018, 3, 23, 20, 14, 3, 31648, tzinfo=<UTC>),
  'pk': 5},
 {'created': datetime.datetime(2018, 3, 23, 18, 57, 36, 585376, tzinfo=<UTC>),
  'pk': 11}]

We’ve successfully combined recent additions from three different tables! Here’s what the SQL for that looks like:

>>> from django.db import connection
>>> print(connection.queries[-1]['sql'])
(SELECT "myapp_project"."id", "myapp_project"."created" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created" FROM "myapp_comment")

There’s just one problem: we got back a bunch of pk and created records, but we don’t know which model each of those rows represents.

Using .annotate() to add a type constant to the rows

We can fix this by using Django’s annotate() method combined with a Value() object to attach a constant string to each record specifying the type of the row it represents. Here’s how to do that for a single model:

>>> from django.db.models import Value, CharField
>>> list(Image.objects.annotate(
...     type=Value('image', output_field=CharField()
... )).values('pk','type', 'created')[:2])
[{'created': datetime.datetime(2018, 3, 22, 17, 16, 33, 964900, tzinfo=<UTC>),
  'pk': 3,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 22, 17, 49, 47, 527907, tzinfo=<UTC>),
  'pk': 4,
  'type': 'image'}]

We’ve added the key/value pair 'type': 'image' to every record returned from the querystring. Now let’s do that to all three of our models and combine the results using .union():

recent = Project.objects.annotate(
    type=Value('project', output_field=CharField())
).values(
    'pk', 'created', 'type'
).union(
    Image.objects.annotate(
        type=Value('image', output_field=CharField())
    ).values('pk', 'created', 'type'),
    Comment.objects.annotate(
        type=Value('comment', output_field=CharField())
    ).values('pk', 'created', 'type'),
).order_by('-created')[:4]

If we run list(recent) we get this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'pk': 14,
  'type': 'comment'}]

This is looking pretty good! We’ve successfully run a single SQL UNION query across three different tables and returned the combined results in reverse chronological order. Thanks to the type column we know which model each record corresponds to.

Inflating the full referenced objects

Now we need to inflate those primary key references a full ORM object from each corresponding table.

The most efficient way to do this is to collect together the IDs for each type and then run a single SQL query per type to load the full objects.

Here’s code that does exactly that: it first collects the list of primary keys that need to be loaded for each type, then executes an efficient SQL IN query against each type to fetch the underlying objects:

records = list(recent)

type_to_queryset = {
    'image': Image.objects.all(),
    'comment': Comment.objects.all(),
    'project': Project.objects.all(),
}

# Collect the pks we need to load for each type:
to_load = {}
for record in records:
    to_load.setdefault(record['type'], []).append(record['pk'])

# Fetch them 
fetched = {}
for type, pks in to_load.items():
    for object in type_to_queryset[type].filter(pk__in=pks):
        fetched[(type, object.pk)] = object

# Annotate 'records' with loaded objects
for record in records:
    key = (record['type'], record['pk'])
    record['object'] = fetched[key]

After running the above code, records looks like this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

We can now feed this to a template and use it to render our recent additions page.

Wrapping it in a re-usable function

Here’s a function that implements the above in a re-usable way:

def combined_recent(limit, **kwargs):
    datetime_field = kwargs.pop('datetime_field', 'created')
    querysets = []
    for key, queryset in kwargs.items():
        querysets.append(
            queryset.annotate(
                recent_changes_type=Value(
                    key, output_field=CharField()
                )
            ).values('pk', 'recent_changes_type', datetime_field)
        )
    union_qs = querysets[0].union(*querysets[1:])
    records = []
    for row in union_qs.order_by('-{}'.format(datetime_field))[:limit]:
        records.append({
            'type': row['recent_changes_type'],
            'when': row[datetime_field],
            'pk': row['pk']
        })
    # Now we bulk-load each object type in turn
    to_load = {}
    for record in records:
        to_load.setdefault(record['type'], []).append(record['pk'])
    fetched = {}
    for key, pks in to_load.items():
        for item in kwargs[key].filter(pk__in=pks):
            fetched[(key, item.pk)] = item
    # Annotate 'records' with loaded objects
    for record in records:
        record['object'] = fetched[(record['type'], record['pk'])]
    return records

This is also available as a gist.

I can now use that function to combine arbitrary querysets (provided they share a created datestamp field) like so:

recent = combined_recent(
    20,
    project=Project.objects.all(),
    image=Image.objects.all(),
    comment=Comment.objects.all(),
)

This will return the most recent 20 records across all three types, with the results looking like this:

[{'when': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'when': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'when': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'when': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

Efficient object loading with select/prefetch_related

If you’re going to render these objects on a page, it’s pretty likely you’ll need to load additional data about them. My example models above are deliberately simplified, but in any serious Django project it’s likely they will have additional references to other tables.

We can apply Django’s magic select_related() and prefetch_related() methods directly to the querysets we pass to the function, like so:

recent = combined_recent(
    20,
    project=Project.objects.all().prefetch_related('tags'),
    image=Image.objects.all().select_related('uploaded_by'),
    comment=Comment.objects.all().select_related('author'),
)

Django’s query optimizer is smart enough to ignore those calls entirely when building the initial union queries, so even with the above extras the initial union query will still look like this:

(SELECT "myapp_project"."id", "myapp_project"."created", 'project' AS "recent_changes_type" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created", 'image' AS "recent_changes_type" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created", 'comment' AS "recent_changes_type" FROM "myapp_comment")
ORDER BY (2) DESC LIMIT 20

The select_related() and prefetch_related() clauses will then be incorporated into the subsequent SQL queries that are used to efficiently inflate the full objects from the database.

Taking it further

There are a bunch of interesting extensions that can be made to this pattern.

Want pagination? The initial unioned queryset can be paginated using offset/limit by slicing the queryset, or using the Django Paginator class.

Want more efficient pagination (since offset/limit tends to get slow after the first few thousand rows)? We’re ordering by created already which means it’s not difficult to build efficient range-based pagination, requesting all records where the created date is less than the earliest date seen on the previous page.

Since everything is based on regular Django querysets, it’s possible to build all kinds of variants of the recent additions feed. So far we’ve just built one showing all changes across an entire application, but it’s not hard to apply additional filters to only show changes made by a specific user, or changes made relating to a specific foreign key relationship. If you can represent it as a collection of querysets that each expose a created column you can combine them into a single feed.

You don’t even need to use records that share a created column: if you have objects with columns of differing names you can use an annotation to alias those columns, like so:

recent = combined_recent(
    20,
    project=Project.objects.annotate(
        when=models.F('updated')
    ).prefetch_related('tags'),
    image=Image.objects.annotate(
        when=models.F('uploaded_at')
    ).select_related('uploaded_by'),
    comment=Comment.objects.annotate(
        when=models.F('commented_at')
    ).select_related('created_by'),
    datetime_field='when'
)

I haven’t extensively load-tested this pattern, but I expect it will work fine for databases with tens-of-thousands of records but may start running into trouble if you have millions of records (though an index on the created column should help a lot). If you need a recent additions feed on something larger scale than that you should probably look at a separate logging table or an external index in something like Elasticsearch instead.

For another interesting thing you can do with .union() check out my article on Implementing faceted search with Django and PostgreSQL.

Tags: django, orm

Implementing faceted search with Django and PostgreSQL

2017-10-05T14:12:27+00:00

I’ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:

It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.

I love this kind of search interface, because the counts tell you so much more about the underlying data. Turns out I was most active in quoting people talking about JavaScript back in 2007, for example.

I usually build faceted search engines using either Solr or Elasticsearch (though the first version of search on this blog was actually powered by Hyper Estraier) - but I’m hosting this blog as simply and inexpensively as possible on Heroku and I don’t want to shell out for a SaaS search solution or run an Elasticsearch instance somewhere myself. I thought I’d have to go back to using Google Custom Search.

Then I read Postgres full-text search is Good Enough! by Rachid Belaid - closely followed by Postgres Full-Text Search With Django by Nathan Shafer - and I decided to have a play with the new PostgreSQL search functionality that was introduced in Django 1.10.

… and wow! Full-text search is yet another example of a feature that’s been in PostgreSQL for nearly a decade now, incrementally improving with every release to the point where it’s now really, really good.

At its most basic level a search system needs to handle four things:

It needs to take user input and find matching documents.
It needs to understand and ignore stopwords (common words like “the” and “and”) and apply stemming - knowing that “ridicule” and “ridiculous” should be treated as the same root, for example. Both of these features need to be language-aware.
It needs to be able to apply relevance ranking, calculating which documents are the best match for a search query.
It needs to be fast - working against some kind of index rather than scanning every available document in full.

Modern PostgreSQL ticks all of those boxes. Let’s put it to work.

Simple search without an index

Here’s how to execute a full-text search query against a simple text column:

from blog.models import Entry
from django.contrib.postgres.search import SearchVector

results = Entry.objects.annotate(
    searchable=SearchVector('body')
).filter(searchable='django')

The generated SQL looks something like this:

SELECT "blog_entry"."id", ...,
to_tsvector(COALESCE("blog_entry"."body", %s)) AS "searchable"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body", "django"))
    @@ (plainto_tsquery("django")) = true
ORDER BY "blog_entry"."created" DESC

The SearchVector class constructs a stemmed, stopword-removed representation of the body column ready to be searched. The resulting queryset contains entries that are a match for “django”.

My blog entries are stored as HTML, but I don’t want search to include those HTML tags. One (extremely un-performant) solution is to use Django’s Func helper to apply a regular expression inside PostgreSQL to strip tags before they are considered for search:

from django.db.models import Value, F, Func

results = Entry.objects.annotate(
    searchable=SearchVector(
        Func(
            F('body'), Value('<.*?>'), Value(''), Value('g'),
            function='regexp_replace'
        )
    )
).filter(searchable='http')

Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL ts_vector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.

This works, but performance isn’t great. PostgreSQL ends up having to scan every row and construct a list of search vectors for each one every time you execute a query.

If you want it to go fast, you need to add a special search vector column to your table and then create the appropriate index on it. As of Django 1.11 this is trivial:

from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Entry(models.Model):
    # ...
    search_document = SearchVectorField(null=True)

    class Meta:
        indexes = [
            GinIndex(fields=['search_document'])
        ]

Django’s migration system will automatically add both the field and the special GIN index.

What’s trickier is populating that search_document field. Django does not yet support a easy method to populate it directly in your initial INSERT call, instead recommending that you populated with a SQL UPDATE statement after the fact. Here is a one-liner that will populate the field for everything in that table (and strip tags at the same time):

def strip_tags_func(field):
    return Func(
        F(field), Value('<.*?>'), Value(''), Value('g'),
        function='regexp_replace'
    )
 
Entry.objects.update(
    search_document=(
        SearchVector('title', weight='A') +
        SearchVector(strip_tags_func('body'), weight='C')
    )
)

I’m using a neat feature of the SearchVector class here: it can be concatenated together using the + operator, and each component can be assigned a weight of A, B, C or D. These weights affect ranking calculations later on.

Updates using signals

We could just set this up to run periodically (as I did in my initial implementation), but we can get better real-time results by ensuring this field gets updated automatically when the rest of the model is modified. Some people solve this with PostgreSQL triggers, but I’m still more comfortable handling this kind of thing in python code - so I opted to use Django’s signals mechanism instead.

Since I need to run search queries across three different types of blog content - Entries, Blogmarks and Quotations - I added a method to each model that returns the text fragments corresponding to each of the weight values. Here’s that method for my Quotation model:

class Quotation(models.Model):
    quotation = models.TextField()
    source = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag, blank=True)

    def index_components(self):
        return {
            'A': self.quotation,
            'B': ' '.join(self.tags.values_list('tag', flat=True)),
            'C': self.source,
        }

As you can see, I’m including the tags that have been assigned to the quotation in the searchable document.

Here are my signals - loaded once via an import statement in my blog application’s AppConfig.ready() method:

@receiver(post_save)
def on_save(sender, **kwargs):
    if not issubclass(sender, BaseModel):
        return
    transaction.on_commit(make_updater(kwargs['instance']))

@receiver(m2m_changed)
def on_m2m_changed(sender, **kwargs):
    instance = kwargs['instance']
    model = kwargs['model']
    if model is Tag:
        transaction.on_commit(make_updater(instance))
    elif isinstance(instance, Tag):
        for obj in model.objects.filter(pk__in=kwargs['pk_set']):
            transaction.on_commit(make_updater(obj))

def make_updater(instance):
    components = instance.index_components()
    pk = instance.pk

    def on_commit():
        search_vectors = []
        for weight, text in components.items():
            search_vectors.append(
                SearchVector(Value(text, output_field=models.TextField()), weight=weight)
            )
        instance.__class__.objects.filter(pk=pk).update(
            search_document=reduce(operator.add, search_vectors)
        )
    return on_commit

(The full code can be found here).

The on_save method is pretty straightforward - it checks if the model that was just saved has my BaseModel as a base class, then it calls make_updater to get a function to be executed by the transaction.on_commit hook.

The on_m2m_changed handler is significantly more complicated. There are a number of scenarios in which this will be called - I’m reasonably confident that the idiom I use here will capture all of the modifications that should trigger a re-indexing operation.

Running a search now looks like this:

results = Entry.objects.filter(
    search_document=SearchQuery('django')
)

We need one more thing though: we need to sort our search results by relevance. PostgreSQL has pretty good relevance built in, and sorting by the relevance score can be done by applying a Django ORM annotation:

query = SearchQuery('ibm')

results = Entry.objects.filter(
    search_document=query
).annotate(
    rank=SearchRank(F('search_document'), query)
).order_by('-rank')

We now have basic full text search implemented against a single Django model, making use of a GIN index. This is lightning fast.

Searching multiple tables using queryset.union()

My site has three types of content, represented in three different models and hence three different underlying database tables.

I’m using an abstract base model to define common fields shared by all three: the created date, the slug (used to construct permalink urls) and the search_document field populated above.

As of Django 1.11 It’s possible to combine queries across different tables using the SQL union operator.

Here’s what that looks like for running a search across three tables, all with the same search_document search vector field. I need to use .values() to restrict the querysets I am unioning to the same subset of fields:

query = SearchQuery('django')
rank_annotation = SearchRank(F('search_document'), query)
qs = Blogmark.objects.annotate(
    rank=rank_annotation,
).filter(
    search_document=query
).values('pk', 'created', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
).order_by('-rank')[:5]

# Output
<QuerySet [
    {'pk': 186, 'rank': 0.875179, 'created': datetime.datetime(2008, 4, 8, 13, 48, 18, tzinfo=<UTC>)},
    {'pk': 134, 'rank': 0.842655, 'created': datetime.datetime(2007, 10, 20, 13, 46, 56, tzinfo=<UTC>)},
    {'pk': 1591, 'rank': 0.804502, 'created': datetime.datetime(2009, 9, 28, 23, 32, 4, tzinfo=<UTC>)},
    {'pk': 5093, 'rank': 0.788616, 'created': datetime.datetime(2010, 2, 26, 19, 22, 47, tzinfo=<UTC>)},
    {'pk': 2598, 'rank': 0.786928, 'created': datetime.datetime(2007, 1, 26, 12, 38, 46, tzinfo=<UTC>)}
]>

This is not enough information though - I have the primary keys, but I don’t know which type of model they belong to. In order to retrieve the actual resulting objects from the database I need to know which type of content is represented by each of those results.

I can achieve that using another annotation:

qs = Blogmark.objects.annotate(
    rank=rank_annotation,
    type=models.Value('blogmark', output_field=models.CharField())
).filter(
    search_document=query
).values('pk', 'type', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
        type=models.Value('entry', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
        type=models.Value('quotation', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
).order_by('-rank')[:5]

# Output:
<QuerySet [
    {'pk': 186, 'type': u'quotation', 'rank': 0.875179},
    {'pk': 134, 'type': u'quotation', 'rank': 0.842655},
    {'pk': 1591, 'type': u'entry', 'rank': 0.804502},
    {'pk': 5093, 'type': u'blogmark', 'rank': 0.788616},
    {'pk': 2598, 'type': u'blogmark', 'rank': 0.786928}
]>

Now I just need to write function which can take a list of types and primary keys and return the full objects from the database:

def load_mixed_objects(dicts):
    """
    Takes a list of dictionaries, each of which must at least have a 'type'
    and a 'pk' key. Returns a list of ORM objects of those various types.
    Each returned ORM object has a .original_dict attribute populated.
    """
    to_fetch = {}
    for d in dicts:
        to_fetch.setdefault(d['type'], set()).add(d['pk'])
    fetched = {}
    for key, model in (
        ('blogmark', Blogmark),
        ('entry', Entry),
        ('quotation', Quotation),
    ):
        ids = to_fetch.get(key) or []
        objects = model.objects.prefetch_related('tags').filter(pk__in=ids)
        for obj in objects:
            fetched[(key, obj.pk)] = obj
    # Build list in same order as dicts argument
    to_return = []
    for d in dicts:
        item = fetched.get((d['type'], d['pk'])) or None
        if item:
            item.original_dict = d
        to_return.append(item)
    return to_return

One last challenge: when I add filtering by type, I’m going to want to selectively union together only a subset of these querysets. I need a queryset to start unions against, but I don’t yet know which queryset I will be using. I can abuse Django’s queryset.none() method to crate an empty ValuesQuerySet in the correct shape like this

qs = Entry.objects.annotate(
    type=models.Value('empty', output_field=models.CharField()),
    rank=rank_annotation
).values('pk', 'type', 'rank').none()

Now I can progressively build up my union in a loop like this:

for klass in (Entry, Blogmark, Quotation):
    qs = qs.union(klass.objects.annotate(
        rank=rank_annotation,
        type=models.Value('quotation', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'))

The Django ORM is smart enough to compile away the empty queryset when it constructs the SQL, which ends up looking something like this:

(((SELECT "blog_entry"."id",
            "entry" AS "type",
            ts_rank("blog_entry"."search_document", plainto_tsquery(%s)) AS "rank"
     FROM "blog_entry"
     WHERE "blog_entry"."search_document" @@ (plainto_tsquery(%s)) = TRUE
     ORDER BY "blog_entry"."created" DESC))
 UNION
   (SELECT "blog_blogmark"."id",
           "blogmark" AS "type",
           ts_rank("blog_blogmark"."search_document", plainto_tsquery(%s)) AS "rank"
    FROM "blog_blogmark"
    WHERE "blog_blogmark"."search_document" @@ (plainto_tsquery(%s)) = TRUE
    ORDER BY "blog_blogmark"."created" DESC))
UNION
  (SELECT "blog_quotation"."id",
          "quotation" AS "type",
          ts_rank("blog_quotation"."search_document", plainto_tsquery(%s)) AS "rank"
   FROM "blog_quotation"
   WHERE "blog_quotation"."search_document" @@ (plainto_tsquery(%s)) = TRUE
   ORDER BY "blog_quotation"."created" DESC)

Applying filters

So far, our search engine can only handle user-entered query strings. If I am going to build a faceted search interface I need to be able to handle filtering as well. I want the ability to filter by year, tag and type.

The key difference between filtering and querying (borrowing these definitions from Elasticsearch) is that querying is loose - it involves stemming and stopwords - while filtering is exact. Additionally, querying affects the calculated relevance score while filtering does not - a document either matches the filter or it doesn’t.

Since PostgreSQL is a relational database, filtering can be handled by simply constructing extra SQL where clauses using the Django ORM.

Each of the filters I need requires a slightly different approach. Filtering by type is easy - I just selectively include or exclude that model from my union queryset.

Year and month work like this:

selected_year = request.GET.get('year', '')
selected_month = request.GET.get('month', '')
if selected_year:
    qs = qs.filter(created__year=int(selected_year))
if selected_month:
    qs = qs.filter(created__month=int(selected_month))

Tags involve a join through a many-2-many relationship against the Tags table. We want to be able to apply more than one tag, for example this search for all items tagged both python and javascript. Django’s ORM makes this easy:

selected_tags = request.GET.getlist('tag')
for tag in selected_tags:
    qs = qs.filter(tags__tag=tag)

Adding facet counts

There is just one more ingredient needed to complete our faceted search: facet counts!

Again, the way we calculate these is different for each of our filters. For types, we need to call .count() on a separate queryset for each of the types we are searching:

queryset = make_queryset(Entry, 'entry')
type_counts['entry'] = queryset.count()

(the make_queryset function is defined here)

For years we can do this:

from django.db.models.functions import TruncYear

for row in queryset.order_by().annotate(
    year=TruncYear('created')
).values('year').annotate(n=models.Count('pk')):
    year_counts[row['year']] = year_counts.get(
        row['year'], 0
    ) + row['n']

Tags are trickiest. Let’s take advantage of he fact that Django’s ORM knows how to construct sub-selects if you pass another queryset to the __in operator.

tag_counts = {}
type_name = 'entry'
queryset = make_queryset(Entry, 'entry')
for tag, count in Tag.objects.filter(**{
    '%s__in' % type_name: queryset
}).annotate(
    n=models.Count('tag')
).values_list('tag', 'n'):
    tag_counts[tag] = tag_counts.get(tag, 0) + count

Rendering it all in a template

Having constructed the various facets counts in the view function, the template is really simple:

{% if type_counts %}
    <h3>Types</h3>
    <ul>
        {% for t in type_counts %}
            <li><a href="{% add_qsarg "type" t.type %}">{{ t.type }}</a> {{ t.n }}</a></li>
        {% endfor %}
    </ul>
{% endif %}
{% if year_counts %}
    <h3>Years</h3>
    <ul>
        {% for t in year_counts %}
            <li><a href="{% add_qsarg "year" t.year|date:"Y" %}">{{ t.year|date:"Y" }}</a> {{ t.n }}</a></li>
        {% endfor %}
    </ul>
{% endif %}
{% if tag_counts %}
    <h3>Tags</h3>
    <ul>
        {% for t in tag_counts %}
            <li><a href="{% add_qsarg "tag" t.tag %}">{{ t.tag }}</a> {{ t.n }}</a></li>
        {% endfor %}
    </ul>
{% endif %}

I am using custom templates tags here to add arguments to the current URL. I’ve built systems like this in the past where the URLs are instead generated in the view logic, which I think I prefer. As always, perfect is the enemy of shipped.

And because the results are just a Django queryset, we can use Django’s pagination helpers for the pagination links.

The final implementation

The full current version of the code at time of writing can be seen here. You can follow my initial implementation of this feature through the following commits: 7e3a0217 c7e7b30c 7f6b524c a16ddb5e 7055c7e1 74c194d9 f3ffc100 6c24d9fd cb88c2d4 2c262c75 776a562a b8484c50 0b361c78 1322ada2 79b1b13d 3955f41b 3f5ca052.

And that’s how I built faceted search on top of PostgreSQL and Django! I don’t have my blog comments up and running yet, so please post any thoughts or feedback over on this GitHub issue or over on this thread on Hacker News.

Update 9th September 2021: A few years after implementing this I started to notice performance issues with my blog, which turned out to be caused by search engine crawlers hitting every possible combination of facets, triggering a ton of expensive SQL queries. I excluded /search from being crawled using robots.txt which fixed the problem.

Tags: django, full-text-search, orm, postgresql, projects, search, facetedsearch

Easier custom Model Manager Chaining

2010-07-20T18:21:00+00:00

Easier custom Model Manager Chaining

A neat solution to the problem of wanting to write a custom QuerySet method (.published() for example) which is also available on that model’s objects manager, without having to write much boilerplate.

Tags: django, orm, queryset, recovered

On Django And Migrations

2010-06-02T16:27:00+00:00

On Django And Migrations

South author Andrew Godwin on the plans for migrations in Django. His excellent South migration library will be split in to two parts—one handling database abstraction, dependency resolution and history tracking and the other providing autodetection and the South user interface. The former will go in to Django proper, encouraging other migration libraries to share the same core abstractions.

Tags: andrew-godwin, django, migrations, orm, south, recovered

Appending the request URL to SQL statements in Django

2010-06-02T09:09:00+00:00

Appending the request URL to SQL statements in Django

A clever frame-walking monkey-patch which pulls the most recent HttpRequest object out of the Python stack and adds the current request.path to each SQL query as an SQL comment, so you can see it in debugging tools such as slow query logs and the PostgreSQL “select * from pg_stat_activity” query.

Tags: chris-lamb, debugging, django, orm, postgresql, python, sql, recovered, monkeypatch

Cache Machine: Automatic caching for your Django models

2010-03-11T19:35:32+00:00

Cache Machine: Automatic caching for your Django models

This is the third new ORM caching layer for Django I’ve seen in the past month! Cache Machine was developed for zamboni, the port of addons.mozilla.org to Django. Caching is enabled using a model mixin class (to hook up some post_delete hooks) and a custom caching manager. Invalidation works by maintaining a “flush list” of dependent cache entries for each object—this is currently stored in memcached and hence has potential race conditions, but a comment in the source code suggests that this could be solved by moving to redis.

Tags: cachemachine, caching, django, memcached, mozilla, orm, ormcaching, python, redis

Announcing django-cachebot

2010-03-06T12:48:39+00:00

Announcing django-cachebot

The ORM caching space around Django is heating up. django-cachebot is used in production at mingle.com and takes a more low level approach to cache invalidation than Johnny Cache, enabling you to specifically mark the querysets you wish to cache and providing some advanced options for cache invalidation. Unfortunately it currently relies on a patch to Django core to enable its own manager.

Tags: cachebot, caching, django, mingle, orm, ormcaching

Johnny Cache

2010-02-28T22:55:15+00:00

Johnny Cache

Clever twist on ORM-level caching for Django. Johnny Cache (great name) monkey-patches Django’s QuerySet classes and caches the result of every single SELECT query in memcached with an infinite expiry time. The cache key includes a “generation” ID for each dependent database table, and the generation is changed every single time a table is updated. For apps with infrequent writes, this strategy should work really well—but if a popular table is being updated constantly the cache will be all but useless. Impressively, the system is transaction-aware—cache entries created during a transaction are held in local memory and only pushed to memcached should the transaction complete successfully.

Tags: caching, databases, django, memcached, orm, ormcaching, performance, python

django-batch-select

2009-11-23T16:19:52+00:00

django-batch-select

A smart attempt at solving select_related for many-to-many relationships in Django. Add a custom manager to your model and call e.g. Entry.objects.all()[:10].batch_select("tags") to execute two queries - one pulling back the first ten entries and another using an "IN" query against the tags table to pull back all of the tags for those entries in one go.

Via Psychic Origami

Tags: batchselect, django, john-montgomery, manytomany, orm, python, selectrelated, sql

Django 1.2 planned features

2009-10-26T10:38:06+00:00

Django 1.2 planned features

The votes are in and the plan for Django 1.2 has taken shape - features are split in to high, medium and low priority. There's some really exciting stuff in there - outside of the things I've already talked about, I'm particularly excited about multidb, Model.objects.raw(SQL), the smarter {% if %} tag and class-based generic views.

Tags: classbasedviews, django, multidb, orm, python

Django 1.1 release notes

2009-07-29T09:34:04+00:00

Django 1.1 release notes

Django 1.1 is out! Congratulations everyone who worked on this, it’s a fantastic release. New features include aggregate support in the ORM, proxy models, deferred fields and some really nice admin improvements. Oh, and the testing framework is now up to 10 times thanks to smart use of transactions.

Via Django | Weblog | Django 1.1 released

Tags: aggregates, django, django-admin, open-source, orm, python, releases

South's Design

2009-05-13T12:30:45+00:00

South's Design

Andrew Godwin explains why South resorts to parsing your models.py file in order to construct information about for creating automatic migrations.

Tags: andrew-godwin, django, models, orm, parsing, python, south

Haystack

2009-04-17T21:53:49+00:00

Haystack

A brand new modular search plugin for Django, by Daniel Lindsley. The interface is modelled after the Django ORM (complete with declarative classes for defining your search schema) and it ships with backends for both Solr and pure-python Whoosh, with more on the way. Excellent documentation.

Via Toast Driven

Tags: daniel-lindsley, django, haystack, orm, python, search, solr, whoosh

Southerly Breezes

2009-03-15T13:17:20+00:00

Southerly Breezes

Andrew Godwin is slowly assimilating the best ideas from other Django migration systems in to South—the latest additions include ORM Freezing from Migratory and automatic change detection. Exciting stuff.

Tags: andrew-godwin, databases, django, migrations, orm, south

DB2 support for Django is coming

2009-02-18T22:58:50+00:00

DB2 support for Django is coming

From IBM, under the Apache 2.0 License. I’m not sure if this makes it hard to bundle it with the rest of Django, which uses the BSD license.

Tags: antonio-cangiano, bsd, databases, db2, django, ibm, licensing, open-source, orm, python

Secrets of the Django ORM

2008-11-08T23:49:15+00:00

Secrets of the Django ORM

An undocumented (and unsupported) method of poking a Django QuerySet’s internal query to add group_by and having clauses to a SQL query.

Tags: django, groupby, having, orm, python, queryset, sql

Django 1.0 alpha release notes

2008-07-22T06:04:29+00:00

Django 1.0 alpha release notes

The big features are newforms-admin, unicode everywhere, the queryset-refactor ORM improvements and auto-escaping in templates.

Tags: alpha, autoescaping, django, django-admin, newformsadmin, orm, python, querysetrefactor, unicode

jQuery style chaining with the Django ORM

2008-05-01T12:31:17+00:00

Django's ORM is, in my opinion, the unsung gem of the framework. For the subset of SQL that's used in most web applications it's very hard to beat. It's a beautiful piece of API design, and I tip my hat to the people who designed and built it.

Lazy evaluation

If you haven't spent much time with the ORM, two key features are lazy evaluation and chaining. Consider the following statement:

entries = Entry.objects.all()

Assuming you have created an Entry model of some sort, the above statement will create a Django QuerySet object representing all of the entries in the database. It will not result in the execution of any SQL - QuerySets are lazily evaluated, and are only executed at the last possible moment. The most common situation in which SQL will be executed is when the object is used for iteration:

for entry in entries:
    print entry.title

This usually happens in a template:

<ul>
{% for entry in entries %}
  <li>{{ entry.title }}</li>
{% endfor %}
</ul>

Lazy evaluation works nicely with template fragment caching - even if you pass a QuerySet to a template it won't be executed if the fragment it is used in can be served from the cache.

You can modify QuerySets as many times as you like before they are executed:

entries = Entry.objects.all()
today = datetime.date.today()
entries_this_year = entries.filter(
    posted__year = today.year
)
entries_last_year = entries.filter(
    posted__year = today.year - 1
)

Again, no SQL has been executed, but we now have two QuerySets which, when iterated, will produce the desired result.

Chaining

Chaining comes in when you want to apply multiple modifications to a QuerySet. Here are blog entries from 2006 that weren't posted in January:

Entry.objects.filter(
    posted__year = 2006
).exclude(posted__month = 1)

And here's entries from that year posted to the category named "Personal", ordered by title:

Entry.objects.filter(
    posted__year = 2006
).filter(
    category__name = "Personal"
).order_by('title')

The above can also be expressed like this:

Entry.objects.filter(
    posted__year = 2006,
    category__name = "Personal"
).order_by('title')

Chaining in jQuery

The parallels to jQuery are pretty clear. The jQuery API is built around chaining, and the jQuery animation library even uses a form of lazy evaluation to automatically queue up effects to run in sequence:

jQuery('div#message').addClass(
	'borderfade'
).animate({
   'borderWidth': '+10px'
}, 1000).fadeOut();

One of the neatest things about jQuery is the plugin model, which takes advantage of JavaScript's prototype inheritance and makes it trivially easy to add new chainable methods. If we wanted to package the above dumb effect up as a plugin, we could do so like this:

jQuery.fn.dumbBorderFade = function() {
    return this.addClass(
        'borderfade'
    ).animate({
       'borderWidth': '+10px'
    }, 1000).fadeOut();
};

Now we can apply it to an element like so:

jQuery('div#message').dumbBorderFade();

Custom QuerySet methods in Django

Django supports adding custom methods for accessing the ORM through the ability to implement a custom Manager. In the above examples, Entry.objects is the Manager. The downside of this approach is that methods added to a manager can only be used at the beginning of the chain.

Luckily, Managers also provide a hook for returning a custom QuerySet. This means we can create our own QuerySet subclass and add new methods to it, in a way that's reminiscent of jQuery:

from django.db import models
from django.db.models.query import QuerySet
import datetime

class EntryQuerySet(QuerySet):
    def on_date(self, date):
        next = date + datetime.timedelta(days = 1)
        return self.filter(
            posted__gt = date,
            posted__lt = next
        )

class EntryManager(models.Manager):
    def get_query_set(self):
        return EntryQuerySet(self.model)

class Entry(models.Model):
    ...
    objects = EntryManager()

The above gives us a new method on the QuerySets returned by Entry.objects called on_date(), which lets us filter entries down to those posted on a specific date. Now we can run queries like the following:

Entry.objects.filter(
    category__name = 'Personal'
).on_date(datetime.date(2008, 5, 1))

Reducing the boilerplate

This method works fine, but it requires quite a bit of boilerplate code - a QuerySet subclass and a Manager subclass plus the wiring to pull them all together. Wouldn't it be neat if you could declare the extra QuerySet methods inside the model definition itself?

It turns out you can, and it's surprisingly easy. Here's the syntax I came up with:

from django.db.models.query import QuerySet

class Entry(models.Model):
   ...
   objects = QuerySetManager()
   ...
   class QuerySet(QuerySet):
       def on_date(self, date):
           return self.filter(
               ...
           )

Here I've made the custom QuerySet class an inner class of the model definition. I've also replaced the default manager with a QuerySetManager. All this class does is return the QuerySet inner class for the current model from get_query_set. The implementation looks like this:

class QuerySetManager(models.Manager):
    def get_query_set(self):
        return self.model.QuerySet(self.model)

I'm pretty happy with this; it makes it trivial to add custom QuerySet methods and does so without any monkeypatching or deep reliance on Django ORM internals. I think the ease with which this can be achieved is a testament to the quality of the ORM API.

Tags: api, chaining, django, jquery, orm, python, queryset

Queryset-refactor branch has been merged into trunk

2008-04-27T07:21:13+00:00

Queryset-refactor branch has been merged into trunk

Malcolm’s latest Django masterpiece is complete.

Tags: branch, django, malcolm-tredinnick, orm, python, qsrf

mysql_cluster

2008-03-21T08:45:57+00:00

mysql_cluster

My Russian isn’t all that good, but this looks like a neat way of getting Django to talk to a master/slave setup, written by Ivan Sagalaev. UPDATE: English docs are linked from the comments.

Via A comment on "Django's tipping point"

Tags: django, ivansagalaev, masterslave, mysql, mysqlcluster, orm, python, replication

Queryset Implementation

2008-03-19T09:43:08+00:00

Queryset Implementation

Malcolm explains the work that has gone in to the queryset-refactor branch. Executive summary: Python’s ORM is probably a lot better at SQL than you are.

Tags: django, malcolm-tredinnick, orm, python, querysetrefactor, sql

Caching Layer for Django ORM

2008-01-23T15:18:19+00:00

Caching Layer for Django ORM

Interesting extension to Django’s ORM that adds automatic caching of querysets and smart cache invalidation.

Tags: caching, david-cramer, django, orm, ormcaching, python

Announcing StaticGenerator for Django

2008-01-07T21:26:41+00:00

Announcing StaticGenerator for Django

Simple but powerful static file generator for Django applications—just tell it about your model instances and it will create an entire static site based on calling get_absolute_url() on each one. Uses signals to repopulate the cache when a model changes.

Tags: caching, django, jared-kuolt, orm, performance, static, static-generator

Django Evolution

2007-11-23T23:49:10+00:00

Django Evolution

Really smart take on the problem of updating database tables to reflect changes to Django models. Code that automatically modifies your database tables can be pretty scary, but Evolution seems to hit the right balance.

Tags: databases, django, djangoevolution, migration, orm, schema

Using the extra() QuerySet modifier in Django for WeGoEat

2007-10-24T19:28:20+00:00

Using the extra() QuerySet modifier in Django for WeGoEat

You can use select() on a QuerySet to obtain extra values using subqueries.

Tags: django, orm, python, queryset, ryan-kanno, subqueries

tranquil

2007-10-09T02:30:29+00:00

tranquil

Inspired take on the Django ORM to SQLAlchemy problem: lets you define your models with the Django ORM but use SQLAlchemy to run queries against them.

Tags: django, djangoorm, models, orm, python, sqlalchemy, tranquil