What is a query system?
A query system is an optimizing system, that is able to cache the results of computations, but invalidate them in case one of the dependencies change. This is very similar to a build system, that tries to perform the least work possible, by only rebuilding parts that went out of date.
Example
Let's say we define the following computations we want to perform:
- Read a list of numbers from a file, called
read_numbers(file)
- Multiply a list of numbers, called
product(numbers)
Now let's say we have two files, A.txt
and B.txt
:
// A.txt
2 3 4 5
// B.txt
1 3 5 7
Let's say we want to multiply all the numbers from A.txt
and B.txt
. We'd do that by:
a_nums := read_numbers('A.txt')
b_nums := read_numbers('B.txt')
a_prod := product(a_nums)
b_prod := product(b_nums)
result := a_prod * b_prod
Then let's change the contents of A.txt
to something else, like
If we want to recompute the product of all numbers in A.txt
and B.txt
, the algoritm could go something like this:
// Re-read the numbers from A.txt, the contents changed
a_nums := read_numbers('A.txt')
// We can just recall the last result of read_numbers('B.txt'), it didn't change
b_nums := read_numbers('B.txt')
// We need to perform this again, the numbers changed
a_prod := product(a_nums)
// We can just recall the old results again, the input did not change
b_prod := product(b_nums)
result := a_prod * b_prod
Which means that two operations didn't truly execute out of the 5 we had in this algorithm, they could just return their cached results!
We call these computations queries.
Observations
There are some key things/rules to take away from the example:
- There are inputs to the system, that will be the root dependencies of computations
- The computations have to be side-effect free, you can not cache results of a computation that has side-effects
- Computations can depend on each other, but they should never form a cycle
- Parameters and the results of the computations have to have value-based equality
Why do we need a query system?
Syntax analysis is relatively easy to make incremental without any fancy systems, because the theory for them has existed for a long time. The problem is that we are now past that, we need to make the rest of the frontend incremental, and that is not trivial by any means. A system like this could make it somewhat easier, because we would mostly have to focus on the actual logic, the incrementality would be given by this system.
Existing systems/literature
Designing such a system is hard, so it might be worth to look at some inspiration. I have collected out a few resources:
Internals
If you are interested about how such a system can work, you could watch this video by one of the Salsa authors (I promise it's interesting!), or you can take a peek at the pseudo-code I wrote that extracts out the core algorithm.
Designs I currently have
An old system inspired by the old Salsa framework
I have implemented a system like this, heavily inspired by an old version of Salsa that you can find here (usage is detailed in the README right there!). The gist of it is that you define the queries in interfaces, then register them in a DI framework with a special method. The method aliases the originally injected interface with a proxy that the framework generates. This proxy does all the versioning and caching behind the scenes. When the DI injects a service, it will be the proxy of it.
I really dislike this solution: it's easy to forget to call a method through the proxy if it's on the same instance, and the heavy dependency on SGs makes it really closed off, hard to fix bugs in it, hard to configure.
A new, also unattractive design I have
On the other extreme, I have a very dynamic API idea that requires no code generation, but it's not type-safe or even method-safe at all! Implementing the initial example with it would look something like so:
// NOTE: Sequence<T> is just a type that compares by values
static Sequence<int> ReadNumbers(Database db, string file) =>
db.Get<string>("file_text", file).Split(' ').Select(int.Parse).ToSequence();
static int Product(Database db, Sequence<int> ns) =>
ns.Aggregate((a, b) => a * b);
static int ProductOfNumbersInFiles(Database db)
{
var aNums = db.Get<Sequence<int>>("ReadNumbers", "A.txt");
var bNums = db.Get<Sequence<int>>("ReadNumbers", "B.txt");
var aProd = db.Get<int>("Product", aNums);
var bProd = db.Get<int>("Product", bNums);
return aProd * bProd;
}
The pros of this system:
- The
Database
type is exposed for easy configuration, it's not some magic internal type.
- It's easier to extend, you don't need to modify an interface to add a query
- It works with static functions
The cons of this system:
- You reference the queries by some value key, which implies a few other points on this list
- The DB has no type-safe info about what the called query returns
- The calls to other queries are severely uglified
An ideal implementation would look like so (only showing the last query):
static int ProductOfNumbersInFiles(Database db)
{
var aNums = ReadNumbers(db, "A.txt");
var bNums = ReadNumbers(db, "B.txt");
var aProd = Product(db, aNums);
var bProd = Product(db, bNums);
return aProd * bProd;
}
Key differences are that calls to other queries actually look like calls to queries, no key references through some value, type safety. The problem is that currently, this is not possible, Source Generators can't wrap up existing methods to allow for this.