Get distinct values through direct query of an index


I have a relatively large SQLite database that I'm building. The general access profile of this database is:

  1. Initial load of 300 million rows.
  2. Initial read of each row, and an UPDATE of approximately 30 million rows.
  3. Ongoing read-only access through a Python script with a MyDatabaseReader class.

The data structure is a flat denormalized table and queries are of the form:

WHERE chromosome=TEXT AND position=INT AND reference_sequence=TEXT AND alternate_sequence=TEXT

I would like to validate that an incoming query requests a valid chromosome name, based on the contents of the database. Requesting chromosome='chr13' is valid, but in the case of '13' or 'chr31' I would like my Python script to throw an error instead of silently returning zero rows. To enable this, my Python class makes an initial query during init to get the set of valid chromosome names:

SELECT DISTINCT chromosome from dbsnp;

This query takes a LONG time, and bogs down the execution. I have tried both a compound index on (chromosome, position, reference_sequence, alternate_sequence), as well as a single index on chromosome, and have verified with EXPLAIN QUERY PLAN that in both cases the index gets used.

0|0|TABLE dbsnp WITH INDEX chromosome ORDER BY

My question: Is there some SQL trick to query directly from the chromosome index? I don't care about anything else in the row, and it seems that the index is a pre-built version of the data that I want to return.

Alternatively, I'm considering constructing a chromosome_names table that I populate with my SELECT DISTINCT query above after database load and update. Because it scares me to construct a static table that can fall out of sync with the main table, I'm considering triggers to update the chromosome_names table whenever the main table changes. However, I'm concerned that this may cause significant churn should I update rows in the main table and, more importantly, that I'm reinventing what is essentially contained in an index.

Is there a good way to run my distinct values query against the index directly? Alternatively, is there a way to have SQLite throw an error when the queried value for chromosome is outside the set of contained values? (Note: position, ref, and alt sequences are expected to sometimes be queried with values that are not in the database, so raising an error on zero rows returned will not work.)


Best Answer

I've never used SQLite so bear with me here. But it seems as if this problem is common among many RDBMS platforms.

When you select distinct values from your column, you end up scanning all rows in the index (the query plan shows an index scan).

This can be a great strategy if there aren't many rows in the table or if the column doesn't have very many duplicate values. But if you have millions of rows for each distinct value then you'll scan millions of rows just to return a single unique value. For data sets like that, it can sometimes be better to get the first distinct value, then skip to the next value, and so on. This can be accomplished via recursion in some platforms. You can also run one query at a time with each getting the next distinct value. For example, you could get the first value with this query:

SELECT MIN(chromosome) FROM dbsnp;

Then get the next value with this query (substituting the value returned by the first query into the filter):

SELECT chromosome FROM dbsnp WHERE chromosome > 'TEST_1' ORDER BY chromosome LIMIT 1;

And the next:

SELECT chromosome FROM dbsnp WHERE chromosome > 'TEST_2' ORDER BY chromosome LIMIT 1;

And so on. For these queries the query plan shows index seeks instead of an index scan.

On my relatively small test data set, the single DISTINCT query took about 320 ms, while the series of LIMIT 1 queries took only 4 ms. You'll of course need to write more code to use this solution, but it might be worth a shot.
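
Since the reader class in the question is written in Python, the series of queries above could be driven by a small loop. This is only a sketch under assumptions: the function name distinct_chromosomes is made up for illustration, conn is assumed to be an open sqlite3 connection, and the dbsnp table is assumed to have an index whose leading column is chromosome.

```python
import sqlite3


def distinct_chromosomes(conn):
    """Collect distinct chromosome values, one small query per value.

    Hypothetical helper: assumes a table named dbsnp with an index
    whose leading column is chromosome.
    """
    values = []
    # First value: the smallest chromosome name (None if the table is empty).
    current = conn.execute("SELECT MIN(chromosome) FROM dbsnp").fetchone()[0]
    while current is not None:
        values.append(current)
        # Next value: the smallest name strictly greater than the last one.
        row = conn.execute(
            "SELECT chromosome FROM dbsnp "
            "WHERE chromosome > ? ORDER BY chromosome LIMIT 1",
            (current,),
        ).fetchone()
        current = row[0] if row else None
    return values
```

The loop issues one query per distinct value plus one final query that finds nothing, so for a few dozen chromosome names it stays at a few dozen small lookups regardless of how many rows share each name. Rows where chromosome is NULL are skipped by both queries.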

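
The recursive variant mentioned earlier can also be expressed as a single statement, since SQLite supports recursive common table expressions (version 3.8.3 and later). The sketch below is an untested-on-your-data illustration rather than a definitive implementation: the function name distinct_chromosomes_cte and the CTE name t are made up, and the same dbsnp table with an index on chromosome is assumed.

```python
import sqlite3

# Each recursive step asks for the smallest chromosome greater than the
# previous one; the recursion stops once that lookup yields NULL.
SKIP_QUERY = """
WITH RECURSIVE t(c) AS (
    SELECT MIN(chromosome) FROM dbsnp
    UNION ALL
    SELECT (SELECT MIN(chromosome) FROM dbsnp WHERE chromosome > t.c)
    FROM t
    WHERE t.c IS NOT NULL
)
SELECT c FROM t WHERE c IS NOT NULL
"""


def distinct_chromosomes_cte(conn):
    """Return distinct chromosome values using one recursive query.

    Hypothetical helper: assumes a table named dbsnp with an index
    whose leading column is chromosome.
    """
    return [row[0] for row in conn.execute(SKIP_QUERY)]
```

Whether this actually seeks through the index should be confirmed with EXPLAIN QUERY PLAN on the real database before relying on it.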