Database for timeseries data


I work in a fintech company and we are redesigning our data system.

We have around 120 GB of compressed data stored in the filesystem.
The data concerns around 80K different timeseries each one of them with around 8K up to 5M rows.
The total estimated number of rows is 8 billion. Each row should hold 6 numeric values, time and an identifier.

The queries we want to run are:

  1. Give me all data for a timeseries between two timestamps
  2. Give me all data for a timeseries in single element buckets of n minutes

Everyday we update around 10K of these timeseries with 8K up to 40K records that concern the data of the last month. That is around 8M rows and takes around 4 hours.

We are looking for something that can handle the daily write load in less than 3 hours while maintaining fast query performance.
Our queries concern 3K timeseries, they scan around 3M rows and return 3.5 K rows for each one of them. This currently happens in 1 minute.

The current system is not so sophisticated, all data is kept in a file server and is synchronized daily to the local computers of the interested parties.
The system worked well for some time and now we are looking for something more robust and scalable.

Our hard requirement is the daily ingestion time but we have some flexibility on how much the reading queries should take.

Can a database give us performance and scalability?
If so what would you choose and why?
What kind of infrastructure do we need to support such a system?

Best Answer

Creating 8M new records should take seconds, not hours. You just need to do Bulk Loading instead of slow-by-slow loading. (calling COMMIT after every row would be a "bad idea" as it will limit you to 1row/5ms).

Same concept goes for UPDATE of the data. (Although. I would expect minutes). Bulk operations is key.

From my experience, your main concern will be Disk I/O performance. You improve this by using RAID systems. You'll probably have an extra terabyte or 2, but your performance, and redundancy, should be greatly improved over current setup.

Which RDBMS?

I would go with Oracle. But, that is because local resources (me) are experts in that technology. (It also comes with a cool web based application development frame work).

Actually, I recommend you try out each of them with your actual data and pick the one that fits your company the best.