I work at a fintech company, and we are redesigning our data system.
We have around 120 GB of compressed data stored in the filesystem.
The data consists of around 80K different timeseries, each with between 8K and 5M rows.
The total estimated row count is 8 billion. Each row holds six numeric values, a timestamp, and an identifier.
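For context, here is a back-of-envelope sizing check based on the figures above. All byte widths are assumptions (8-byte floats for the numeric values, 8-byte timestamp and identifier), not something we have measured:

```python
# Rough sizing from the stated figures; byte widths per field are assumptions.
rows = 8_000_000_000
compressed_gb = 120

bytes_per_row_compressed = compressed_gb * 1024**3 / rows  # ~16 B/row
# Assumed raw row: 6 numeric values as 8-byte floats + 8-byte timestamp + 8-byte id.
bytes_per_row_raw = 6 * 8 + 8 + 8                          # 64 B/row
raw_gb = rows * bytes_per_row_raw / 1024**3                # ~477 GB uncompressed

print(f"~{bytes_per_row_compressed:.1f} B/row compressed, "
      f"~{raw_gb:.0f} GB raw, "
      f"~{bytes_per_row_raw / bytes_per_row_compressed:.0f}x compression")
```

So the dataset is roughly 500 GB uncompressed with about 4x compression already, which is well within what a single beefy machine can hold.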
The queries we want to run are:
- Give me all data for a timeseries between two timestamps
- Give me the data for a timeseries downsampled to one element per bucket of n minutes
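To make the two query shapes concrete, here is a minimal in-memory sketch of their semantics. The row layout (a `(timestamp, value)` pair per row, series id and the other columns omitted) and the choice of keeping the first element per bucket are illustrative assumptions, not our actual schema:

```python
# Hypothetical sketch of the two query shapes over in-memory rows.
# A row is (timestamp, value); series id and extra columns are omitted.
from datetime import timedelta

def range_query(rows, start, end):
    """All rows of one timeseries between two timestamps (start inclusive, end exclusive)."""
    return [r for r in rows if start <= r[0] < end]

def bucket_query(rows, start, end, minutes):
    """One element per n-minute bucket: here, the first row falling in each bucket."""
    width = timedelta(minutes=minutes)
    buckets = {}
    for ts, value in range_query(rows, start, end):
        key = (ts - start) // width           # integer bucket index
        buckets.setdefault(key, (ts, value))  # keep the first element per bucket
    return [buckets[k] for k in sorted(buckets)]
```

In a SQL-flavoured timeseries store the second query would typically map to a built-in bucketing function rather than application code.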
Every day we update around 10K of these timeseries with 8K up to 40K records covering the last month of data. That is around 8M rows in total, and the process currently takes around 4 hours.
We are looking for something that can handle the daily write load in less than 3 hours while maintaining fast query performance.
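The write-rate figures implied by these numbers are easy to work out (the arithmetic below just restates the totals above as rows per second):

```python
# Write-throughput targets implied by the stated daily load.
rows_per_day = 8_000_000
current_hours, target_hours = 4, 3

current_rate = rows_per_day / (current_hours * 3600)  # ~556 rows/s
target_rate = rows_per_day / (target_hours * 3600)    # ~741 rows/s

print(f"current ~{current_rate:.0f} rows/s, target ~{target_rate:.0f} rows/s")
```

A few hundred rows per second is a very modest sustained rate for any modern database, which suggests the current bottleneck is in the file-based pipeline rather than in raw write capacity.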
Our queries concern 3K timeseries: they scan around 3M rows and return around 3.5K rows each. This currently takes about 1 minute.
The current system is not very sophisticated: all data is kept on a file server and synchronized daily to the local machines of the interested parties.
The system has worked well for some time, but now we are looking for something more robust and scalable.
Our hard requirement is the daily ingestion time; we have some flexibility on how long the read queries may take.
Can a database give us performance and scalability?
If so, what would you choose and why?
What kind of infrastructure do we need to support such a system?