Introduction
At Nilenso, I’m working on an (opensource!) app to design and conduct surveys.
Here’s a simple survey being designed:
Internally, this is represented as:
A survey is made up of many questions. A number of questions can (optionally) be grouped together in a category. Our actual data model is a bit more complicated than this (subquestions, especially), but let’s assume that we’re just working with questions and categories.
Here’s how we preserve the ordering of questions and categories in this survey.
Each question and category has an order_number
field. It’s an integer that specifies the relative ordering of itself and its siblings.
For example, in this case,
the question Bar
will have an order_number
that is less than the order number of Baz
.
This guarantees that the a category can fetch its subquestions in the right order:
1 2 3 4 5 

In fact, this is how we fetch the entire survey at this point. Each category calls sub_questions_in_order
on each of its children, and so on down the entire tree.
This gives us a depthfirst ordering of this tree:
For large surveys with 5+ levels of nesting, and more than a hundred questions, this is pretty slow.
Recursive Queries
I’ve used gems like awesome_nested_set
before, but as far as I could find, none of them supported fetching results across multiple models.
Then I stumbled on a page describing PostgreSQL’s support for recursive queries! That seemed perfect.
Let’s solve this particular problem using recursive queries. (My understanding of this is still very basic, so please don’t take my word for any of this)
To define a recursive Postgres query, we need to define an initial query, which is called the nonrecursive term.
In our case, that would be the top level questions and categories.
Top level elements don’t have a parent category, so their category_id
is NULL
.
1 2 3 4 5 6 7 8 9 

(That query, and all subsequent ones, assume that the survey we’re querying has id = 2)
This gives us the toplevel elements.
Now on to the recursive term. According to the Postgres docs:
Our recursive term simply has to find all the direct children of the elements fetched by the nonrecursive term.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 

But wait. The recursive term is only fetching questions. What if the child of a firstlevel category is a category?
Postgres doesn’t let us reference the nonrecursive term more than once. So doing a UNION
of categories and questions is out.
We have to be a little creative here:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 

We perform a UNION
of categories and questions before joining it with the nonrecursive term.
This yields all the survey elements:
Unfortunately, it looks like the ordering is way off.
Ordering a Recursive Query
The problem is that since we’re effectively appending secondlevel elements to firstlevel elements, we’re effectively performing a breadthfirst search instead of a depthfirst search.
How can we correct that?
Postgres has arrays that can be built during a query.
Let’s build an array of order numbers for each element that we fetch. Let’s call this the path
. The path
for an element is:
the path of it’s parent category (if one exists) + its own order_number
If we sort the final result by path
, we’ve converted it into a depthfirst search!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

That’s almost right. There are two entries for What’s your favourite song?
This is happening because when we do an ID comparison to look for children:
1


fle
contains both questions and categories. But we want this to match only categories (because questions can’t have children).
Let’s hardcode a type into each of these queries, so that we don’t try and check for children of a question:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 

That seems about right. We’re done here.
Let’s see how much of a performance boost this gives us.
Performance
Using this script (after creating a survey from the UI), I generated 10 subquestion chains; each 6 levels deep.
1 2 3 4 5 6 7 8 

Each chain looks like this:
Let’s see if recursive queries are faster.
1 2 3 4 5 

31x faster? Not bad.