MongoDB – Designing for big data with MongoDB


I'm currently implementing the following feature:
Say I have a MongoDB collection with a very large number of entries (say 100M), where each entry has the form

{"uniqId", "properties"}

for example:

 {"uniqId" : "u1" , "properties" : ["p1", "p2"]}

I have a runtime component that reads the data by uniqId, and a batch process that inserts data into this collection.
There is a huge volume of reads, and updates are very rare (once a week).

The problem is that I always receive the full file (which can be very big, millions of lines) and have to insert all of it.

The solution I have in mind is as follows:

  • There will be a unique index on the "uniqId" field.
  • Each document will have a timestamp field.
  • I can use upsert operations to overwrite existing entries.
  • Each operation will be a bulk of a reasonable size (say 1000 documents).
  • A batch process will then remove the documents with an 'old' timestamp.
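
The steps in this list could look roughly like the following in Python. This is only a sketch under assumptions: it uses the pymongo driver, the field name "ts" and the helper names `batched`, `upsert_spec` and `load_full_file` are made up for illustration, and `collection` is assumed to already have the unique index on "uniqId".

```python
from datetime import datetime, timezone
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def upsert_spec(record, load_ts):
    """Return the (filter, update) pair used to upsert one record."""
    return (
        {"uniqId": record["uniqId"]},
        {"$set": {"properties": record["properties"], "ts": load_ts}},
    )


def load_full_file(collection, records, batch_size=1000):
    """Upsert every record with one shared timestamp, then purge stale ones.

    `collection` is assumed to be a pymongo Collection; `records` yields
    dicts such as {"uniqId": "u1", "properties": ["p1", "p2"]}.
    """
    from pymongo import UpdateOne  # lazy import; needs pymongo installed

    load_ts = datetime.now(timezone.utc)
    for chunk in batched(records, batch_size):
        ops = [UpdateOne(*upsert_spec(r, load_ts), upsert=True) for r in chunk]
        # Unordered bulks let the server process the operations in any order.
        collection.bulk_write(ops, ordered=False)

    # Documents not touched by this load still carry an older timestamp.
    # This line is only reached if every bulk above succeeded.
    collection.delete_many({"ts": {"$lt": load_ts}})
```

Using one timestamp for the whole load means "old" is simply "anything with a smaller ts", so the cleanup needs no separate bookkeeping.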

There is another solution:

  • Create a new collection each time, then swap the new one in for the old one
    (but if I use sharding, I saw that Mongo does not support rename in that case)
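
For the unsharded case, the swap could be sketched as below. The assumptions here: `db` is a pymongo Database, the "_staging" suffix and the function name are hypothetical, and the rename relies on pymongo's `Collection.rename` with the `dropTarget` option so the new collection replaces the old one in a single step.

```python
def swap_in_new_collection(db, live_name, records, batch_size=1000):
    """Load `records` into a staging collection, then rename it over the live one.

    `db` is assumed to be a pymongo Database on a replica set without sharding.
    """
    staging = db[live_name + "_staging"]  # hypothetical naming scheme
    staging.drop()  # discard leftovers from a previously failed load
    staging.create_index("uniqId", unique=True)

    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            staging.insert_many(batch, ordered=False)
            batch = []
    if batch:
        staging.insert_many(batch, ordered=False)

    # Replace the live collection; readers see either the old or the new data.
    staging.rename(live_name, dropTarget=True)
```

With this variant no timestamp field or cleanup pass is needed, because entries missing from the new file never make it into the staging collection.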

Does this design seem OK?
How might it affect MongoDB performance?

I run MongoDB in a replica set configuration, and I have not used sharding yet.

I'd appreciate any useful ideas.

Best Answer

If the files are sorted, you can find the id or timestamp of the last inserted document in the database, then insert only the records in the file whose ids come after it.

If they are sorted by _id or timestamp, this should be trivial.
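
A minimal sketch of that idea, assuming a pymongo Collection, that the file is sorted ascending by "uniqId", and that Python's string comparison matches the file's sort order (the function names are made up for illustration):

```python
def last_loaded_id(collection):
    """Return the largest uniqId already stored, or None if the collection is empty."""
    doc = collection.find_one(
        {}, projection={"uniqId": 1, "_id": 0}, sort=[("uniqId", -1)]
    )
    return doc["uniqId"] if doc else None


def records_after(sorted_records, last_id):
    """Yield only the records whose uniqId sorts after `last_id`."""
    for record in sorted_records:
        if last_id is None or record["uniqId"] > last_id:
            yield record
```

The records yielded by `records_after` can then be passed to `insert_many` in batches. Note that this only picks up ids beyond the last one loaded; it would not notice changed properties or removed entries for ids that are already stored.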

The solutions you came up with are great as well.

However, be aware that creating entirely new collections will be slower than what I proposed, and upserting will be slower still.