Mongodb – How to efficiently design MongoDB database for Uber-like application

database-design mongodb nosql

I'm designing a MongoDB database for an Uber-like application, but since I'm new to NoSQL, I have several doubts.

I have three main collections: users, requests, and messages. A user can post a ride request. If another user wants to give them a ride, that user accepts the request, and text communication between the two users begins.

So, in SQL world I would have the following tables:

USERS
user_id
username

REQUESTS
request_id
request_details
gps_location
passenger_id
driver_id
timestamp

MESSAGES
message_id
message_body
request_id
sender_id
timestamp

But we're in NoSQL territory now, and things tend to look a little different here. I know that I should forget about the SQL mindset and JOINs, and get familiar with embedded documents.

The question is – what's the proper way of designing the database in this case?

I could of course have such a design:

USERS
{
    "user_id": "001",
    "username": "John"
}

REQUESTS
{
    "request_id": "001",
    "request_details": "Chicago - NYC",
    "gps_location": [21.0,42.0],
    "passenger_username": "John",
    "driver_username": "Claire",
    "timestamp": "30-Mar-2016"
}

But there would be data redundancy: passenger and driver usernames would exist in both the Users and Requests collections, and if a username changed, I would need to update all the requests for that user. Using user_id instead of username could be a solution, but it would create a need for a JOIN, and in NoSQL we try to avoid that, don't we? And there are still messages left: how can we efficiently associate them with requests? I guess this is not the proper way and I need something else…

So maybe such a design?

{
    "user_id": "001",
    "username": "John",
    "requests": [{
        "request_id": "001",
        "request_details": "Chicago - NYC",
        "gps_location": [21.0,42.0],
        "passenger_username": "John",
        "driver_username": "Claire",
        "timestamp": "30-Mar-2016",
        "messages": [{
            "message_id": "001",
            "message_body": "Hi, how are you?",
            "timestamp": "30-Mar-2016-14-30-00"
        }]
    }]
}

But now several other things come to my mind. I will very often need to search for requests within a given radius and within a certain timespan, e.g. "show me all requests within 10 miles not older than 24 hours", and for each such request I need to display all its details and the username of the user who posted it. In fact, I will query requests more often than users (and each user will carry more information: not only an id and name, but also a Google/Facebook username, photo, phone number, etc.). Isn't that a problem with this design?

Aren't all these collections (users/requests/messages) too coupled with each other?

What would be the best way to design a database here?

Best Answer

As you demonstrated in your question, there is a clear relational structure to your data.

  • There are different objects (Users, Requests, and Messages), which all relate to each other, potentially in multiple ways.
  • These objects have attributes (e.g. driver or passenger names) that would be referenced many times from other objects (Requests). This requires either data duplication or normalization (and thus joins).
  • The different objects are tightly coupled to each other, and your anticipated data access patterns reflect the need to query from multiple angles.
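The duplication-versus-normalization trade-off in the second point can be sketched in plain Python. This is only an illustration: the dictionaries and lists below are hypothetical in-memory stand-ins for the collections, not MongoDB API calls, and the function names and the second request are made up for the example.

```python
# Hypothetical in-memory stand-ins for the collections (not MongoDB calls).
users = {"001": {"username": "John"}, "002": {"username": "Claire"}}

# Option A: denormalized. Usernames are copied into every request.
requests_denormalized = [
    {"request_id": "001", "passenger_username": "John", "driver_username": "Claire"},
    {"request_id": "002", "passenger_username": "John", "driver_username": None},
]

# Option B: normalized. Requests store only user ids.
requests_normalized = [
    {"request_id": "001", "passenger_id": "001", "driver_id": "002"},
    {"request_id": "002", "passenger_id": "001", "driver_id": None},
]


def rename_user_denormalized(old_name, new_name):
    """Rewrite every copied username; assumes usernames are unique."""
    touched = 0
    for req in requests_denormalized:
        for field in ("passenger_username", "driver_username"):
            if req[field] == old_name:
                req[field] = new_name
                touched += 1
    return touched


def passenger_name(request):
    """Application-side 'join': resolve the username by id at read time."""
    return users[request["passenger_id"]]["username"]


# A rename is a single write in the users collection...
users["001"]["username"] = "Johnny"
# ...but the denormalized copies each need their own update.
touched = rename_user_denormalized("John", "Johnny")
# The normalized requests pick up the new name through the lookup.
names = [passenger_name(r) for r in requests_normalized]
```

With duplication, the number of writes per rename grows with the number of requests the user appears in. With references, the rename is one write, but every read needs the extra lookup, which is the join the question was hoping to avoid.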

The title of this article is intentionally provocative, but it contains a lot of good content: "Why You Should Never Use MongoDB".

In particular, the section "How MongoDB Stores Data" and the several sections after that discuss precisely the questions you ask regarding Normalization vs Denormalization, and issues with how to efficiently query your data.

Ultimately, you have one main question:

What would be the best way to design a database here?

The answer is: Because your data is relational, don't use MongoDB.

MongoDB may be the right solution for storing unstructured (or less structured) data related to your application, but for the specific data you mention, it is the wrong choice.
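As a rough sketch of the relational alternative, the three tables from the question can be tried out with Python's built-in sqlite3 module. Several things here are assumptions made for illustration: SQLite as the engine, integer ids, ISO-formatted text timestamps (named created_at), Claire as the sender of the sample message, and the helper name recent_requests. The gps_location column is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);
CREATE TABLE requests (
    request_id      INTEGER PRIMARY KEY,
    request_details TEXT,
    passenger_id    INTEGER NOT NULL REFERENCES users(user_id),
    driver_id       INTEGER REFERENCES users(user_id),
    created_at      TEXT NOT NULL
);
CREATE TABLE messages (
    message_id   INTEGER PRIMARY KEY,
    message_body TEXT NOT NULL,
    request_id   INTEGER NOT NULL REFERENCES requests(request_id),
    sender_id    INTEGER NOT NULL REFERENCES users(user_id),
    created_at   TEXT NOT NULL
);
""")

# Sample rows based on the documents in the question.
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "John"), (2, "Claire")])
conn.execute(
    "INSERT INTO requests VALUES (?, ?, ?, ?, ?)",
    (1, "Chicago - NYC", 1, 2, "2016-03-30 14:00:00"),
)
conn.execute(
    "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
    (1, "Hi, how are you?", 1, 2, "2016-03-30 14:30:00"),
)


def recent_requests(since):
    """Return requests created at or after `since`, with usernames joined in."""
    return conn.execute(
        """
        SELECT r.request_details, p.username, d.username
        FROM requests r
        JOIN users p ON p.user_id = r.passenger_id
        LEFT JOIN users d ON d.user_id = r.driver_id
        WHERE r.created_at >= ?
        ORDER BY r.created_at
        """,
        (since,),
    ).fetchall()


before = recent_requests("2016-03-30 00:00:00")

# A username change is one UPDATE; no request rows need to be touched.
conn.execute("UPDATE users SET username = ? WHERE user_id = ?", ("Johnny", 1))
after = recent_requests("2016-03-30 00:00:00")

# Messages for a request, with the sender's current username.
thread = conn.execute(
    """
    SELECT m.message_body, u.username
    FROM messages m
    JOIN users u ON u.user_id = m.sender_id
    WHERE m.request_id = ?
    ORDER BY m.created_at
    """,
    (1,),
).fetchall()
```

Each username lives in exactly one row, and requests and messages can be queried from either direction with a join. The "within 10 miles" filter from the question would additionally need location columns and a distance calculation or a spatial extension, which this sketch does not cover.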