PostgreSQL IN query performance on a larger array

optimizationpostgresqlpostgresql-performance

I'm fairly new to PostgreSQL and have a question about query performance.

I get a set of values by parsing a file and want to count how many of them already exist in a specific column in my database (PostgreSQL). Currently I do this by parsing the whole file, storing its values in an array and then executing an IN query as follows:

SELECT COUNT(DISTINCT(id)) FROM "account"
WHERE name IN ('name1', 'name2', 'name3');

The column "name" which is being queried has a unique index on it.

I am wondering if this approach would be optimal if the array grows larger (20.000+ values) and/or I execute the query on a table with 40.000+ existing rows. Or is there a better approach?

Best Answer

Correctness

First things first.

I [...] want to count how many of the words already exist in a specific column

Your query does not do that if there can be duplicates in the input, and the misplaced DISTINCT keyword in the query suggests as much.

When passing ('name1', 'name1', 'name1'), with 'name1' in the table, the correct answer is: 3. But your query says 1.

The DISTINCT keyword makes sure of it, but it's a complete waste. The IN construct folds duplicates in the input implicitly, and there are no duplicates in your unique column. IN is the wrong tool to answer your question to begin with. One of these queries would do that:

SELECT count(*)
FROM   unnest(ARRAY['name1', 'name1', 'name1']) name
JOIN   account USING (name);

SELECT count(*)
FROM   unnest(ARRAY['name1', 'name1', 'name1']) i(name)
WHERE  EXISTS (SELECT FROM account a WHERE a.name = i.name);

The first one counts duplicates in the table (multiplied with duplicates on the same name in the input), the second one does not. Since there are no duplicates in the table, both do the same.

Adapted question

Let's assume you meant to ask:

How many of the words in my table are listed in the input (at least once)?

Now duplicates in the input don't matter (logically), and IN fits the job description. But we still don't need DISTINCT while the table column is unique (and rows are not multiplied by joins or similar).

Removing DISTINCT makes it faster already. Next, there is no need to involve the additional column id. Insignificant as though this may seem, it can force an index scan instead of an index only scan. count(name) is typically faster (and safer). But use count(*) instead. Faster, yet. And equivalent while id is the PK:

SELECT COUNT(*) FROM account WHERE name IN ('name1', 'name2', 'name3');

With modern Postgres that should be about the optimum for your case. You should see index-only scans on the unique index (if the table is vacuumed enough, read the manual), which is as fast as it gets, even with thousands of search terms. In particular do not rewrite with UNION or OR as has been suggested. That would get you a much more verbose and slower query.

Nor will it be faster to put the words in a (temporary) table first. That might be useful if you go on to do more with it, but not for this query alone.