Scalable Relational Database Design

database-design

I'm designing a web application (using Django) for a client that will need a fairly large database behind it. I've never worked with databases that handle large amounts of data before so I'm hoping to get some help.

The basic layout of this database is simple: there are users, each user can have multiple child entries, and each entry is composed of a description and up to 25 images. Each user is permitted to have 40 entries per month.

enter image description here

So far this is pretty simple. The client says he wants the initial release to be able to accommodate his expected 20,000 users. This web application will be used by individuals within an organization, not the general public. So his estimate of initial users is very accurate.

If the timescale we're looking at is a year, that means that I need a database that can accommodate 20,000 user rows, 9,600,000 entry rows, and 240,000,000 images.

The first thing I want to know is how to store these images. I think the best thing to do is to store URLs to the images and actually store the images themselves outside of the database. But what is the best way to store these URLs? Should I have a comma separated list within each entry? This seems like a simple approach to me, that way I would only need two tables, one for users and one for entries. The entries table would contain an ID, a description, and a comma separated list of image URLs that can be processed by Django to serve content. Again, I've never done anything like this before so suggestions are very much appreciated.

My more complicated question is about scalability. My client says he can expect to have these 20,000 users, but that if his service takes off that number might go as high as 250,000. Is there a way to implement a SQL style database that can handle 20,000 users here but also leave room for expansion, rather than simply buying hosting and using a database such that the possible 250,000 users would be accommodated from the start?

Best Answer

So you're going to want to store your images on a fast CDN like Cloudfront. Normally I recommend storing pictures in-database, but not 240M of them :)

I'd avoid storing the whole URL, as it's likely to be repetitive. Just store the file name (a five-char case-sensitive name using a-z 0-9 would give you ~900M combinations. If your CDN is case-insensitive wrt URLs then use 6 chars ).

Are there really going to be 240M unique images, or can they be "shared" among entries? Do you expect every user to fill every entry with the max # of images?

You'd want some tables like this:

create table users (
  user_id serial primary key,
  ...
);

create table entries (
  entry_id serial primary key,
  user_id int not null references users(user_id), --index me
  description text,
  ...
);


create table pictures (
  file_name char(5) primary key,
  entry_id int not null references entries(entry_id) --index me
);

or perhaps:

create table pictures (
  picture_id serial primary key,
  file_name char(5) not null unique,
  entry_id int not null references entries(entry_id) -- index me
);

Use connection pooling!