Postgresql – What indexing to be used over string columns(text type) with string length around 3000 characters


I am using open source columnar dbms (postgresql+Cstore).

I am trying to speed up the response time of queries. My database has tables with strings mostly. String columns contain huge data (up to 3000 characters per value). A linear look up will take it too long to give results. My queries vary from simple '=' to like, contain and wildcards.

I am unable to conclude on what indexing should I use on string columns. I have hash table, trie structure, B+ tree index in mind.

The problem with hash table is collisions and even if use perfect hashing, there are chances of returning false positives. Moreover, hash index cannot solve like, contains and wild card queries. Apart from that, my data have duplicate values in columns like company name etc., and building perfect hash is getting difficult over values with duplicates in them.

Trie and B+ tree: Trie seems useful for all the queries, but as I mentioned, my strings are of huge length. Can trie be scaled to that extent? I have the same question regarding B+ Tree also.

So, to sum up, I have lengthy string columns and I need speed in query execution. Which indexing should I use, or is there any other method to speed up searches, apart from what I've mentioned?

Best Answer

Are you looking to use the existing methods already implemented in PostgreSQL, or to implement your own new indexing methods?

PostgreSQL's implementation of b-tree indexes cannot index values longer than about 2712 (although it will compress the value if it is highly compressible, before failing due to size) unless you recompile PostgreSQL with a non-standard block size.

PostgreSQL has hash indexes, but they are discouraged because they are not crash-safe. And as you say, it would only support '=' operators.

You might be able to use the pg_trgm extension which creates a type of GIN index which can support LIKE queries. It should be able to accommodate your long strings, but it might not be very effective with them. But with some C programming you could perhaps customize it to suit your usage.