PostgreSQL – Removing duplicates (deduplifying) from an existing, huge table

Tags: normalization, postgresql

I have a huge table. Each row has an ID, some columns which relate to that ID specifically, a short (DNA) sequence, and some columns which relate to that sequence. If the sequence columns have not yet been calculated for a given row they are null, but the calculations always come out identical for a given sequence.

In case it proves relevant: the sequence is indexed, and the DBMS is Postgres.

There are lots of duplicate sequences. Obviously this is suboptimal: we don't want to store the duplicates, and we don't want to waste time recalculating properties for sequences we have already seen. As things stand, the table already contains duplicate calculated properties.

So I want to move the sequence's properties into a new table, using the sequence as a foreign key. The trouble here is the size of the table – hundreds of millions of records, and the properties are quite large as well.

With a small table this would be easy enough, but I need a better strategy for a huge table.
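Roughly what I have in mind, sketched with placeholder names (the real table has many more columns; sequence_properties, property_a, property_b and main_table are just names I'm using here for illustration):

CREATE TABLE sequence_properties
(
    dna        VARCHAR(50) PRIMARY KEY,  -- one row per distinct sequence
    property_a NUMERIC,                  -- placeholder calculated columns
    property_b NUMERIC
);

-- the main table keeps its own per-ID columns and just references the sequence
ALTER TABLE main_table
    ADD CONSTRAINT main_table_dna_fkey
    FOREIGN KEY (dna) REFERENCES sequence_properties (dna);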

Best Answer

Yes, the delete step is going to be a pain and might cause (slight) performance issues, but see here for a ray of hope in that regard.

The best solution (prevention is better than cure) is to create a UNIQUE index on your dna field (or whatever you call it); then this problem cannot arise in the future, and the steps below become a one-off operation. When loading a new batch/run, use a staging table that allows duplicates, copy across only the DNA sequences that are not already present, and then TRUNCATE the staging table (a sketch of that load follows the index below).

CREATE UNIQUE INDEX sequence_idx ON sequence (dna);
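By way of illustration, the batch load via a staging table might look something like this (staging_sequence is just a name I'm using here; the ON CONFLICT clause relies on the unique index created above):

-- Staging table: same shape as the target, but no uniqueness constraint
CREATE TABLE staging_sequence (dna VARCHAR(50) NOT NULL);

-- ... bulk-load the new batch into staging_sequence (e.g. with COPY) ...

-- Copy across only the sequences not already present, then clear the staging table
INSERT INTO sequence (dna)
SELECT DISTINCT dna FROM staging_sequence
ON CONFLICT (dna) DO NOTHING;

TRUNCATE staging_sequence;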

You could also consider PARTITIONing before the operation (only HASH partitions make sense in this case). However, if you have no other reason to partition, it is pointless for what, thanks to the UNIQUE constraint, will only ever be a one-off operation!
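For completeness, a HASH-partitioned variant of the table would look roughly like this (declarative partitioning, PostgreSQL 11 or later; sketch only, with four partitions picked arbitrarily):

CREATE TABLE sequence_part
(
    id  BIGINT GENERATED ALWAYS AS IDENTITY,
    dna VARCHAR(50) NOT NULL
) PARTITION BY HASH (dna);

CREATE TABLE sequence_part_0 PARTITION OF sequence_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sequence_part_1 PARTITION OF sequence_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sequence_part_2 PARTITION OF sequence_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sequence_part_3 PARTITION OF sequence_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);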

Note that you are not normalising, rather you are deduplifying (if that's a word?). Here's a post from a guy who really knows his way around PostgreSQL on this topic.

A simple outline of the deduplification steps follows (see fiddle here):

CREATE TABLE sequence
(
    id  SERIAL PRIMARY KEY,
    dna VARCHAR(50) NOT NULL
);

Populate it:

INSERT INTO sequence (dna) VALUES ('agctagctagct');
INSERT INTO sequence (dna) VALUES ('agctagctagct');
INSERT INTO sequence (dna) VALUES ('agctagctagct');

INSERT INTO sequence (dna) VALUES ('tagctagctagc');
INSERT INTO sequence (dna) VALUES ('tagctagctagc');
INSERT INTO sequence (dna) VALUES ('tagctagctagc');
INSERT INTO sequence (dna) VALUES ('tagctagctagc');

INSERT INTO sequence (dna) VALUES ('ggcggcggcggc');
INSERT INTO sequence (dna) VALUES ('ggcggcggcggc');

You can find your dups using this:

SELECT s.dna, COUNT(dna)
FROM sequence s
GROUP BY dna
HAVING COUNT(dna) > 1
ORDER BY s.dna;

Then delete your dups as follows:

-- deletes every row for which a duplicate with a higher id exists,
-- i.e. keeps the highest id for each dna value
DELETE FROM sequence s1 USING sequence s2
WHERE
  s1.id < s2.id AND
  s1.dna = s2.dna;
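On a table with hundreds of millions of rows, that DELETE can be very expensive. One alternative (not part of the steps above, just a sketch) is to build a deduplicated copy and swap it in, which avoids the massive in-place delete at the cost of temporarily holding two copies of the data:

-- Build a deduplicated copy rather than deleting in place
CREATE TABLE sequence_dedup (LIKE sequence INCLUDING ALL);

INSERT INTO sequence_dedup
SELECT DISTINCT ON (dna) *
FROM sequence
ORDER BY dna, id DESC;   -- keeps the highest id per dna, matching the DELETE above

BEGIN;
ALTER TABLE sequence RENAME TO sequence_old;
ALTER TABLE sequence_dedup RENAME TO sequence;
COMMIT;

-- DROP TABLE sequence_old;  -- once the new table has been verified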

Check:

SELECT s.dna, COUNT(dna)
FROM sequence s
GROUP BY dna
ORDER BY s.dna;

Result:

dna          count
agctagctagct     1
ggcggcggcggc     1
tagctagctagc     1