SQLite or CouchDB for Locational Data Mining Project

Tags: couchdb, database-recommendation, sqlite

I am currently designing a data mining project in which I am going to harvest rather large volumes of Twitter data in order to analyse locational data (geocoded tweets) and do some machine learning with it.

What I want to do: I'll have some scripts running 24/7 on a small Samsung netbook (<2 GHz, 1 GB RAM, 200 GB disk), limited in frequency only by the rate limit of the Twitter API. These scripts will save various sorts of data in a database, which will later serve as the basis for analysing the data.

I am quite experienced with RDBMSs, so I also know their limitations. I just read about CouchDB and its ability to store JSON in so-called documents. This would come in handy, because the responses from the Twitter API are JSON, and some of those structures are quite nested and complex.
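For instance, storing a raw tweet in CouchDB would be as simple as POSTing the JSON to the database over HTTP. A rough sketch using Python's requests library (the database name and the tweet fields are just placeholders):

```python
import requests

# A trimmed-down example of the kind of nested JSON the Twitter API returns
tweet = {
    "id_str": "123456789",
    "text": "hello world",
    "user": {"id_str": "42", "screen_name": "someone"},
    "geo": {"type": "Point", "coordinates": [52.52, 13.40]},
}

# CouchDB speaks plain HTTP: POSTing a JSON document to a database stores
# it as-is. Assumes a local CouchDB with a database named "tweets".
resp = requests.post("http://localhost:5984/tweets", json=tweet)
resp.raise_for_status()
print(resp.json())  # e.g. {'ok': True, 'id': '...', 'rev': '...'}
```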

On the other hand, I don't really want to give up relational functionality: I have, for example, a user table which saves general data about a Twitter account, and a geo table which saves place-time tuples, each of which references a particular user.
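In SQLite that relational part would look roughly like this (a minimal sketch; the exact columns are placeholders for whatever I end up keeping):

```python
import sqlite3

conn = sqlite3.connect("tweets.db")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs with this on
conn.executescript("""
    CREATE TABLE IF NOT EXISTS user (
        id          INTEGER PRIMARY KEY,      -- Twitter user id
        screen_name TEXT NOT NULL,
        followers   INTEGER
    );
    CREATE TABLE IF NOT EXISTS geo (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL REFERENCES user(id),
        lat     REAL NOT NULL,
        lon     REAL NOT NULL,
        ts      TEXT NOT NULL                 -- tweet timestamp, ISO 8601
    );
""")
conn.commit()
```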

For analysis, the content of geo will be used in every possible way. I have not thought about geospatial analysis in depth yet, but there will be aggregation, distance calculation, and that sort of thing. From what I've read, this might be done with CouchDB's reduce functions in JavaScript? If I used an SQLite DB, I would just stick to Python and do everything there.
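For instance, the distance part in Python could be a plain haversine over the geo table sketched above (a sketch; the reference point and user id are made up):

```python
import math
import sqlite3

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. the average distance of one user's tweets from Berlin
conn = sqlite3.connect("tweets.db")
rows = conn.execute("SELECT lat, lon FROM geo WHERE user_id = ?", (42,)).fetchall()
dists = [haversine_km(lat, lon, 52.52, 13.40) for lat, lon in rows]
print(sum(dists) / len(dists) if dists else None)
```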

I don't really know what is possible in CouchDB, since I am really new to the concept. I just saw that it is easy to store JSON, and that the structure of the database is not fixed at all, so I could easily introduce new types of data or drop old ones (DROP COLUMN is not possible in SQLite). Also, since I know JavaScript pretty well (actually better than Python), it might be easier for me to do the analysis with it.

What do you think? Is there a striking advantage in using NoSQL for doing that sort of thing or should I stick to what I can do best?

Best Answer

I think the answer mainly depends on how much time you want to spend learning new databases versus how much time you want to spend on the machine learning itself.

For example, PostgreSQL has lots of GIS functionality built in (via the PostGIS extension), which I assume could be very useful for your queries. CouchDB has many useful features with its map/reduce machinery, but I find it a bit limited. If you think you're going to add many columns later on based on new algorithms, you might want to look into wide-column stores like Apache Cassandra.
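To give a flavour of what that GIS support buys you: with PostGIS, a radius query is a one-liner in SQL. A rough sketch (the table and column names are made up, and it assumes the tweet positions are stored in a geography column):

```python
import psycopg2

conn = psycopg2.connect("dbname=tweets")
cur = conn.cursor()
# Find all geocoded tweets within 10 km of a point. ST_DWithin on the
# geography type measures in metres; assumes a table geo(user_id, ts, pos geography).
cur.execute(
    """
    SELECT user_id, ts
    FROM geo
    WHERE ST_DWithin(pos, ST_MakePoint(%s, %s)::geography, 10000)
    """,
    (13.40, 52.52),  # note: ST_MakePoint takes (lon, lat)
)
print(cur.fetchall())
```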

However, I would recommend splitting the problem into two parts:

  • Data gathering
  • Data analysis

The first part you seem to have a good plan for already. I would write a super simple app that just pulls the data in from Twitter and stores each raw response as a BLOB in a table, or simply as a file on disk.
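A minimal version of that gathering app could look like this (a sketch; fetch_tweets is a hypothetical stand-in for whatever Twitter API client you end up using):

```python
import json
import sqlite3
import time

conn = sqlite3.connect("raw.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw (fetched_at REAL, payload BLOB)")

def fetch_tweets():
    """Hypothetical stand-in: replace with a real Twitter API call."""
    return []

while True:
    for tweet in fetch_tweets():
        # Store the untouched JSON; no schema decisions are made here.
        conn.execute(
            "INSERT INTO raw VALUES (?, ?)",
            (time.time(), json.dumps(tweet).encode("utf-8")),
        )
    conn.commit()
    time.sleep(60)  # stay well under the Twitter API rate limit
```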

Once you have that, you can go from the raw data and load it into any kind of backend. You can then choose the backend based on the current problem (algorithm) you're working on, instead of forcing one solution to fit all your problems. The key point is to extract exactly the data you need: since you only pick out the parts that matter, you never have to worry that the data is nested inside a Twitter document. If you go that way, I think an RDBMS would serve you very well, as it can run all kinds of queries.
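The extraction step then becomes a small, replaceable script per analysis. A sketch against the raw table above, pulling only the geocoded tweets into the relational schema from your question (the field names follow the Twitter JSON layout and are assumptions):

```python
import json
import sqlite3

src = sqlite3.connect("raw.db")
dst = sqlite3.connect("tweets.db")  # the user/geo schema sketched in the question

for (payload,) in src.execute("SELECT payload FROM raw"):
    tweet = json.loads(payload)
    geo = tweet.get("geo")
    if not geo:
        continue  # only geocoded tweets matter for this analysis
    lat, lon = geo["coordinates"]  # assumed [lat, lon], as in Twitter's geo field
    dst.execute(
        "INSERT INTO geo (user_id, lat, lon, ts) VALUES (?, ?, ?, ?)",
        (int(tweet["user"]["id_str"]), lat, lon, tweet["created_at"]),
    )
dst.commit()
```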

Make sense?