Use a Database to Store Data Extracted from the Internet

database-recommendation · graph · neo4j · nosql

I'm mining 500 million users and their "followers" from a social network using its API. The extraction itself is not a problem, since I can do it with my scripts. However, holding 500 million users and their followers in lists in memory would be very costly.

My script keeps two lists: one with the users whose followers I have already fetched, and one with the users still to be looked at. (I get each user, put their followers in the queue, write to file, and then go on to the next one.) These two lists are too long to handle in memory, so I thought of a database.
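As an illustration of moving both lists out of memory, here is a minimal sketch (assuming Python; `fetch_followers` is a hypothetical stand-in for the real API call) that keeps the "done" set, the frontier queue, and the collected edges in SQLite instead of in RAM:

```python
import sqlite3

def crawl(seed_ids, fetch_followers, db_path="crawl.db"):
    """Breadth-first crawl with the visited set and the to-visit queue
    persisted in SQLite, so neither has to fit in memory.
    fetch_followers(user_id) is assumed to return a list of follower IDs."""
    db = sqlite3.connect(db_path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS done    (user_id INTEGER PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS queue   (user_id INTEGER PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS follows (user_id INTEGER, follower_id INTEGER);
    """)
    db.executemany("INSERT OR IGNORE INTO queue VALUES (?)",
                   [(u,) for u in seed_ids])
    while True:
        row = db.execute("SELECT user_id FROM queue LIMIT 1").fetchone()
        if row is None:          # frontier exhausted
            break
        uid = row[0]
        db.execute("DELETE FROM queue WHERE user_id = ?", (uid,))
        db.execute("INSERT OR IGNORE INTO done VALUES (?)", (uid,))
        for f in fetch_followers(uid):
            db.execute("INSERT INTO follows VALUES (?, ?)", (uid, f))
            # enqueue only users not already processed
            if db.execute("SELECT 1 FROM done WHERE user_id = ?",
                          (f,)).fetchone() is None:
                db.execute("INSERT OR IGNORE INTO queue VALUES (?)", (f,))
        db.commit()
    return db
```

The same pattern carries over directly to MongoDB or Neo4j; SQLite is used here only because it ships with Python.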

So, finally, my question: is it better for me to use a relational database, or a NoSQL graph database like Neo4j? The only information I'm collecting now is each user's ID and the IDs of their followers, which I later want to analyse (for graph-theory research). I thought of a database because I might try to add more information later as well.

Thank you.

Best Answer

On the surface this sounds like a graph database problem. If you're going to be walking the edges between users, Neo4j or the like may be the one for you.

You might be able to do more generic processing using a document database, where every user document has an _id equal to the user_id and an array of follower _ids.
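For illustration, one possible document shape (the field names beyond _id are assumptions, not a fixed schema) would be one document per user, with the follower IDs embedded as an array:

```python
import json

# Hypothetical per-user document: _id is the user's own ID,
# followers is the array of follower IDs collected from the API.
user_doc = {
    "_id": 42,
    "followers": [7, 99, 1024],
}

# Serialized as it might be exported or bulk-inserted.
print(json.dumps(user_doc))
```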

Perhaps you could output to MongoDB, then use Neo4j to build the graph(s) for specialised work and MongoDB for more general work. MapReduce and the aggregation framework in MongoDB are pretty good (speaking from experience, although MapReduce is currently much more powerful than the aggregation framework).
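As a taste of the aggregation framework, a pipeline like the following (a sketch; it assumes each document has an _id and a followers array) would compute each user's follower count and return the ten largest:

```python
# MongoDB aggregation pipeline, expressed as the plain Python list
# that pymongo would accept. $size measures the followers array.
pipeline = [
    {"$project": {"n_followers": {"$size": "$followers"}}},
    {"$sort": {"n_followers": -1}},
    {"$limit": 10},
]

# With pymongo and a live server this would run as:
#   db.users.aggregate(pipeline)
```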

Since the schema is likely to morph and you don't know what the additional data will be, you might prefer a document or graph database over a relational one. If you prefer to work in a relational manner at a later point, you can generate CSV extracts to upload to your RDBMS of choice once you have defined a schema.
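Generating such an extract is straightforward; a sketch, assuming the edges are available as (user_id, follower_id) pairs:

```python
import csv
import io

def edges_to_csv(edges, fileobj):
    """Write (user_id, follower_id) pairs as a CSV edge list,
    suitable for bulk-loading into an RDBMS (or Neo4j's CSV import)."""
    writer = csv.writer(fileobj)
    writer.writerow(["user_id", "follower_id"])
    writer.writerows(edges)

# Demo against an in-memory buffer; in practice, open a real file.
buf = io.StringIO()
edges_to_csv([(1, 2), (1, 3)], buf)
print(buf.getvalue())
```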