Better database design approach to store thousands of users and their validations (composed by many attributes) about thousands of elements


I’m looking for the best database design approach of a future web application that will store thousands of films and where (maybe) thousands of users will save their own films validations with up to thirty attributes each validation (attributes are likely to increase as the app grows). Validations are private; they’re not shared with other users.

With this scenario, when I designed the database months ago I ended up that the best approach would be the following:

  • One table for “users” containing UserID, nickname, password…
  • As many tables as necessary like “users_xxxx” to store related data of all users together.
  • One table to store “films” (FilmID, Name…)
  • UserID-numbered-tables like “film_validations_[user-ID]” with the following columns: FilmID and the thirty attributes (at least). So it could store the validation of all films of a unique user.

Sadly, the PROS I think I am getting with this approach are merely based on suppositions, not on performance tests (I am too novice to know how to do this kind of task). And the CONS about maintenance and performance are an imminent incoming headache for sure, like:

  • Thousands of tables to store all film validations.
  • Almost impossible to offer statistics based on global users film validations. Because it will be really hard to join too many tables, also having in mind the total will grow as new users sign up into the application.
  • Any day I want to offer more film validation attributes I will have to massively execute ALTER TABLE.
  • The DBMS has to operate with thousands of tables and open them before it can execute the queries.

I am still aware about these problems, but I really can’t figure out how to design it better.

Designing a single table with the UserID column, plus the thousands FilmID as columns (these would contain a Boolean value), plus the tens of film attributes looks really weird.

If we accept this approach, here is another problem: a heavy amount of columns, which in addition, will grow in the future.

Although I have read that a database has no problem managing millions of records in a single table (it would contain the thousands of film validations of [I hope] thousands of users), and the MySQL columns limit is much higher than my requirements (4.096 columns as described here), I don’t know if the performance would be poor.

Other design errors I consider about the single table approach are:

  • Each record would contain as many empty columns as films minus one, which is awful. I mean, each film validation would carry the rest of FilmID columns, and that has no sense because it wouldn’t be neither necessary nor optimised.
  • Querying for films with a validation would be a waste of memory. I guess I would have to code a query within a for loop to check which of the thousands FilmID columns contains a “true” value. In my current design, film validations are inserted into the table “film_validations_[user-ID]” as the user fill them. So with just a “SELECT * FROM film_validations_[user-ID]” is enough because I am sure it only contains rated films.

I am a recent graduate student as a web developer. Although I think I have read and learned enough about database normalization (Wikipedia, many StackOverflow Q&A before asking you, and other sources) to consider the single table approach as a bad option in this scenario, maybe in this case is the opposite having in mind all its future maintenance benefits.

Well, after some months, I am still wondering what would be the best approach designing this kind of database. Right now I’m working on moving the unfinished web application fully coded as procedural to a MVC pattern. So I think this is the right moment to correct my database design if necessary.

That’s why am asking for your suggestions based on professional expertise.

Thank you very much in advance.

Best Answer

Just looking at two aspects of the design:

First, it looks like you need a user table (to store information for each user of the site), and a film table (to store information about each film to be "validated", whatever that means).

From that point, you have a many-to-many relationship between users and films: each user can validate many films, and each film can be validated by many users.

This is normally handled by an intermediate table (a "mapping" or "bridge" table). This might be named something like user_film, and would have at least two or three columns:

user_film_id : integer, auto-incremented; primary key for this table
user_id : same type (and ideally, same name) as the primary key for the user table
film_id : same type (and ideally, same name) as the primary key for the film table

You can choose to simply use the user_id and film_id together as the primary key for user_film. Primary keys are generally used as a "clustered index", and the data is stored in the table based on that. Using a combined key means that each new entry could wind up anywhere in the table, making page splits and fragmentation much more likely. Using a "surrogate" key like an auto-incremented integer value means that each new record will go to the "end" of the table, saving time on inserts.

If you do use a surrogate key (as suggested), I'd put a unique index on user_id and film_id on the user_film table, to make sure that each user only has one entry for any given film.

Second, as far as the validations go, it sounds like a tagging system: you would have a number values that might help categorize the film (say, "color", "black&white", "silent", "sci-fi", "horror", "romantic-comedy", "oscar", etc.)

If that's the case, then there are a couple of options, depending on your database.

One is a simplified version of what's called an "entity-attribute-value" model.

In your case, you'd have a validation table, with an ID value (again, I recommend an auto-incremented integer value) and validation name, plus any other columns (perhaps a description of what the validation is to be used for).

Then, you'd have a user_film_validation table, with the primary key from user_film and the ID value from validation, and an auto-incrementing integer (possibly a bigint or the equivalent) to be the primary key of user_film_validation (again, you could just use the first two values as the primary key here).

This makes it easy to modify the validation names: you just have to change one record to change "black&white" to "B&W", for example, instead of updating thousands of records. It also lets you add new validations without having to alter existing tables: simply adding one record to validation makes the new entry available to every user, on every movie.