A Huge Database: Picking the right schema and data model


We want to store genomic variant data, but there are two main problems: the data's immense size and its variability.

1) Variant data can be huge. For example, a single individual's variant data could feasibly one day require a million rows in a table, or a gigabyte of raw storage on disk. Multiply this over several thousand individuals, and you could end up with terabytes of information that you need to make sense of.

2) Each client and/or system we integrate with will expose, or want to see, the data slightly differently depending on their needs and use cases. This can lead to hundreds of fields we might need to store, in different configurations for each client. So the variant data model needs to keep this in mind in order to remain easy to use, expandable, and most importantly, scalable in the long term.

What do you think is better for such a problem? We were thinking of having some columns in each table that point to an external database, or even a file, where we save the huge BLOB data.

Best Answer

I have not encountered this exact scenario, but based on my present knowledge I would suggest a table that contains the union of all the columns any client might need. For example, say client A wants columns 1 and 2, client B wants columns 2 and 3, and these are your only two clients: create a table with columns 1, 2, and 3, and leave column 3 null for client A's rows and column 1 null for client B's rows. Once that is done, create a view for each client: view A exposes columns 1 and 2 and is used for client A, and similarly for every other client.
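A minimal sketch of this "union table plus per-client views" idea, using SQLite for portability. The column names (`sample_id`, `ref_allele`, `alt_allele`, `quality`) and the two clients are hypothetical, invented only to illustrate the layout described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One wide table holding the union of every column any client needs.
cur.execute("""
    CREATE TABLE variant (
        id INTEGER PRIMARY KEY,
        sample_id TEXT NOT NULL,
        ref_allele TEXT,   -- only client A needs this
        alt_allele TEXT,   -- both clients need this
        quality REAL       -- only client B needs this
    )
""")

# Client A's rows leave quality NULL; client B's rows leave ref_allele NULL.
cur.execute("INSERT INTO variant (sample_id, ref_allele, alt_allele) "
            "VALUES ('S1', 'A', 'G')")
cur.execute("INSERT INTO variant (sample_id, alt_allele, quality) "
            "VALUES ('S2', 'T', 99.5)")

# Per-client views expose only the columns each client cares about.
cur.execute("CREATE VIEW client_a AS "
            "SELECT sample_id, ref_allele, alt_allele FROM variant "
            "WHERE ref_allele IS NOT NULL")
cur.execute("CREATE VIEW client_b AS "
            "SELECT sample_id, alt_allele, quality FROM variant "
            "WHERE quality IS NOT NULL")

print(cur.execute("SELECT * FROM client_a").fetchall())  # [('S1', 'A', 'G')]
print(cur.execute("SELECT * FROM client_b").fetchall())  # [('S2', 'T', 99.5)]
```

Each client queries only its own view, so new clients (or new fields) mean adding columns and a view rather than restructuring existing tables.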

Also, I would suggest keeping the data in the table itself rather than in separate files on disk, as external files create synchronization issues between the database and the filesystem.
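To illustrate keeping the bulk data inside the database rather than pointing at external files, here is a small sketch that stores a compressed payload as a BLOB column. The VCF-like payload and the `variant_blob` table are hypothetical examples, not from the original post:

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE variant_blob ("
            "sample_id TEXT PRIMARY KEY, "
            "payload BLOB)")  # bulk data lives in the row, not on disk

# Hypothetical raw variant text, compressed before storage to save space.
raw = b"chr1\t12345\t.\tA\tG\t99\tPASS\n" * 1000
cur.execute("INSERT INTO variant_blob VALUES (?, ?)",
            ("S1", zlib.compress(raw)))
conn.commit()

# The payload round-trips through the same transactional store as the
# rest of the row, so there is no separate file to fall out of sync.
stored = cur.execute("SELECT payload FROM variant_blob "
                     "WHERE sample_id = 'S1'").fetchone()[0]
assert zlib.decompress(stored) == raw
```

Because the BLOB participates in the same transactions and backups as the rest of the schema, a deleted or updated row can never leave an orphaned file behind.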