I'm inspecting a legacy and quite large database that has been growing faster lately than before.
select count(*) from information_schema.tables where table_schema = 'public' returns a bit more than 120k. Despite this, most of the tables share the same pretty simple structure, and the data in these tables are independent of each other.
I reckon the database design badly needs some modifications, applied as soon as possible, but first I would like to share my thoughts.
There is one perfectly ordinary table, Users(id, name, …), and many (~100k) other tables have foreign keys referencing Users.id. The idea is very straightforward: each user story has its own tables to keep its data isolated, each table has its own increment sequence, and in general this looks like manual sharding.
- As far as I know, Postgres has no hard limit on the number of tables, but on the other hand, isn't performance degradation quite possible in this case?
- pg_dump is a nightmare even now, because it needs lots of locks, which also makes OOM quite possible. Moreover, PostgreSQL has a limit on the number of relations.
I'm thinking about declarative partitioning, but under the hood the server still organizes storage in separate files per partition, so the same OS limits apply, and most importantly: what would I actually win? I never need to SELECT data across thousands of tables with the same structure; all operations are logically separated from each other. In other words, it boils down to this: I don't really need to reorganize the tables, but thousands of foreign keys referencing the same column look like a huge issue to me. The one idea I have is to simply get rid of these constraints and check the id in business logic, or in a trigger before each INSERT, but that is ugly in my opinion.
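To make the trigger idea concrete, here is a rough sketch of what I mean (table and function names are made up for illustration, and I know this, unlike a real foreign key, would not protect against a later DELETE from users):

```sql
-- Hypothetical replacement for a foreign key: verify the referenced user
-- exists before every INSERT. Names here are illustrative only.
CREATE FUNCTION check_user_exists() RETURNS trigger AS $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM users WHERE id = NEW.user_id) THEN
        RAISE EXCEPTION 'user % does not exist', NEW.user_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- One trigger per case table (messages_12345 is a made-up example name):
CREATE TRIGGER messages_check_user
    BEFORE INSERT ON messages_12345
    FOR EACH ROW EXECUTE FUNCTION check_user_exists();
```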
What are the best practices in such use cases? Please share your thoughts.
Thank you both for joining the discussion and sorry for the late reply due to unexpected personal circumstances :\ Give me one more try, please.
Within our business logic we have a Case class where some telemetry is logged and discussion happens. I would say it is a sort of (very specific) chat plus an issue tracker.
For each independent Case we have 3 tables in the public schema; each table name has a common prefix and a suffix with the case id.
Simplified data structure:
Messages(id, user_id, body, date_time)
Logs(id, user_id, incident)
- ids are bigserial (each table has its own private auto-increment sequence, so I cannot merge them into one general table with a single bigserial field)
- user_ids are foreign keys referencing Users.id
There are 60k such tables in the database for now. The other "half" of the db consists of tables for different types of journals, but all of them, as far as I understand, have the same primitive motivation: isolated auto-increment sequences and avoiding locks on inserts within independent Cases or Journals.
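To make the current layout concrete, here is roughly what the trio for one case looks like (the exact names and column types are my guesses based on the description above, with 12345 as an example case id):

```sql
-- One such pair (plus a participants table) exists per case;
-- the _12345 suffix is the case id.
CREATE TABLE messages_12345 (
    id        bigserial PRIMARY KEY,
    user_id   bigint REFERENCES users (id),
    body      text,
    date_time timestamptz
);

CREATE TABLE logs_12345 (
    id       bigserial PRIMARY KEY,
    user_id  bigint REFERENCES users (id),
    incident text
);
```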
Here is what I came up with; I would be very grateful for criticism.
Merge everything into just three common tables for participants, messages and logs, but:
- use a plain bigint id field instead of the bigserial ones, and add a case_id column to distinguish cases in each table
- make the primary key (case_id, id)
- create a sequence for each case and call nextval('case_12345_id_seq') manually on insertion
- create partitions by case_id
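A minimal sketch of the plan above, assuming list partitioning by case_id (column types and the example case id 12345 are illustrative):

```sql
-- One common table instead of 60k per-case ones, partitioned by case_id.
CREATE TABLE messages (
    case_id   bigint NOT NULL,
    id        bigint NOT NULL,
    user_id   bigint,
    body      text,
    date_time timestamptz,
    PRIMARY KEY (case_id, id)
) PARTITION BY LIST (case_id);

-- Created when a case is opened: its partition and its private sequence.
CREATE TABLE messages_case_12345
    PARTITION OF messages FOR VALUES IN (12345);
CREATE SEQUENCE case_12345_id_seq;

-- Each insert draws from the case's own sequence explicitly:
INSERT INTO messages (case_id, id, user_id, body, date_time)
VALUES (12345, nextval('case_12345_id_seq'), 42, 'hello', now());
```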
Sounds like a simple plan with additional benefits such as sharding, but here is what I'm not sure about:
- As far as I know, I have to create a foreign key referencing Users.id in each partition table, which means the same limits, so it is not worth it even at first glance
- What about locks? What will be locked on insertion: the whole table, or just the partition for the particular case_id?
- Points 1 and 2 boil down to setting max_locks_per_transaction to an enormous value; otherwise even pg_dump is still a nightmare, and a DELETE from Users could possibly lock every partition table.
- If this can still work, what should I prefer: declarative partitioning, or inheritance-based partitioning?
I'm trying to compare two approaches: separate tables (as in our db now) versus one partitioned table. What will happen if I drop a partition? As far as I understand, that DROP takes an exclusive lock on the parent table as well. Besides this, pg_dump is practically impossible on such a huge database (with lots of relations) in prod, because it acquires share locks; from what I know, the one technique is to dump from an asynchronously replicated, read-only database, i.e. make a backup from a backup, in order to run tests on real data in a sandbox.
The other important thing to me is the limit on the number of relations: how do experts deal with it? My guess is sharding, but then which instance across all the databases owns a given relation? Do I need to care about that?
I strongly believe that I can't merge everything into just one ordinary set of 3 tables distinguished by case_id, as @LaurenzAlbe suggests, and that I need to be prepared for rapid data growth. On the other hand, I'm ready to accept that all of this is OCD; that's why I'm thinking about a global database redesign and asking for help.
Thank you again.