SQLite or CouchDB for Locational Data Mining Project

Tags: couchdb, database-recommendation, sqlite

I am currently designing a data mining project in which I am going to harvest rather large volumes of Twitter data in order to analyse locational data (geocoded tweets) and do some machine learning with it.

What I want to do: I'll have some scripts running 24/7 on a small Samsung netbook (<2 GHz, 1 GB RAM, 200 GB disk), limited in frequency only by the rate limit of the Twitter API. These scripts will save various sorts of data in a database, which will later serve as the basis for analysing the data.

I am quite experienced with RDBMSs, so I also know their limitations. I just read about CouchDB and its ability to store JSON in so-called documents. This would come in handy, because the responses from the Twitter API are JSON, and some of those structures are quite nested and complex.
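For instance, storing a raw tweet in CouchDB would be as simple as POSTing the JSON to the database over HTTP. A rough sketch using Python's requests library (the database name and the tweet fields are just placeholders):

```python
import requests

# A trimmed-down example of the kind of nested JSON the Twitter API returns
tweet = {
    "id_str": "123456789",
    "text": "hello world",
    "user": {"id_str": "42", "screen_name": "someone"},
    "geo": {"type": "Point", "coordinates": [52.52, 13.40]},
}

# CouchDB speaks plain HTTP: POSTing a JSON document to a database stores
# it as-is. Assumes a local CouchDB with a database named "tweets".
resp = requests.post("http://localhost:5984/tweets", json=tweet)
resp.raise_for_status()
print(resp.json())  # e.g. {'ok': True, 'id': '...', 'rev': '...'}
```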

On the other hand, I don't really want to give up relational functionality: I have, for example, a user table which saves general data about a Twitter account, and a geo table which saves place-time tuples, each of which references a particular user.
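In SQLite that relational part would look roughly like this (a minimal sketch; the exact columns are placeholders for whatever I end up keeping):

```python
import sqlite3

conn = sqlite3.connect("tweets.db")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs with this on
conn.executescript("""
    CREATE TABLE IF NOT EXISTS user (
        id          INTEGER PRIMARY KEY,      -- Twitter user id
        screen_name TEXT NOT NULL,
        followers   INTEGER
    );
    CREATE TABLE IF NOT EXISTS geo (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL REFERENCES user(id),
        lat     REAL NOT NULL,
        lon     REAL NOT NULL,
        ts      TEXT NOT NULL                 -- tweet timestamp, ISO 8601
    );
""")
conn.commit()
```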

For analysis, the content of geo will be used in every possible way. I have not thought about geospatial analysis in depth yet, but there will be aggregation, distance calculation, and that sort of thing. From what I've read, this might be done with CouchDB's reduce functions in JavaScript? If I used an SQLite DB, I would just stick to Python and do everything there.
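For instance, the distance part in Python could be a plain haversine over the geo table sketched above (a sketch; the reference point and user id are made up):

```python
import math
import sqlite3

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. the average distance of one user's tweets from Berlin
conn = sqlite3.connect("tweets.db")
rows = conn.execute("SELECT lat, lon FROM geo WHERE user_id = ?", (42,)).fetchall()
dists = [haversine_km(lat, lon, 52.52, 13.40) for lat, lon in rows]
print(sum(dists) / len(dists) if dists else None)
```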

I don't really know what is possible in CouchDB, since I am really new to the concept. I just saw that it is easy to store JSON, and that the structure of the database is not fixed at all, so I could easily introduce new types of data or drop old ones (DROP COLUMN is not possible in SQLite). Also, since I know JavaScript pretty well (actually better than Python), it might be easier for me to do the analysis with it.

What do you think? Is there a striking advantage in using NoSQL for doing that sort of thing or should I stick to what I can do best?

Best Answer

I think the answer mainly depends on how much time you want to spend learning new databases versus how much time you want to spend on the machine learning itself.

For example, PostgreSQL has lots of GIS functionality built in (via the PostGIS extension), which I assume could be very useful for your queries. CouchDB has many useful features with its map/reduce machinery, but I find it a bit limited. If you think you're going to add many columns later on based on new algorithms, you might want to look into wide-column stores like Apache Cassandra.
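To give a flavour of what that GIS support buys you: with PostGIS, a radius query is a one-liner in SQL. A rough sketch (the table and column names are made up, and it assumes the tweet positions are stored in a geography column):

```python
import psycopg2

conn = psycopg2.connect("dbname=tweets")
cur = conn.cursor()
# Find all geocoded tweets within 10 km of a point. ST_DWithin on the
# geography type measures in metres; assumes a table geo(user_id, ts, pos geography).
cur.execute(
    """
    SELECT user_id, ts
    FROM geo
    WHERE ST_DWithin(pos, ST_MakePoint(%s, %s)::geography, 10000)
    """,
    (13.40, 52.52),  # note: ST_MakePoint takes (lon, lat)
)
print(cur.fetchall())
```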

However, I would recommend splitting the problem into two parts:

  • Data gathering
  • Data analysis

The first part you seem to have a good plan for already. I would write a super simple app that just pulls the data in from Twitter and stores each raw response as a BLOB in a table, or simply as a file on disk.
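A minimal version of that gathering app could look like this (a sketch; fetch_tweets is a hypothetical stand-in for whatever Twitter API client you end up using):

```python
import json
import sqlite3
import time

conn = sqlite3.connect("raw.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw (fetched_at REAL, payload BLOB)")

def fetch_tweets():
    """Hypothetical stand-in: replace with a real Twitter API call."""
    return []

while True:
    for tweet in fetch_tweets():
        # Store the untouched JSON; no schema decisions are made here.
        conn.execute(
            "INSERT INTO raw VALUES (?, ?)",
            (time.time(), json.dumps(tweet).encode("utf-8")),
        )
    conn.commit()
    time.sleep(60)  # stay well under the Twitter API rate limit
```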

Once you have that, you can go from the raw data and load it into any kind of backend. You can then choose the backend based on the current problem (algorithm) you're working on, instead of forcing one solution to fit all your problems. The key point is to extract exactly the data you need: since you only pick out the parts that matter, you never have to worry that the data is nested inside a Twitter document. If you go that way, I think an RDBMS would serve you very well, as it can run all kinds of queries.
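The extraction step then becomes a small, replaceable script per analysis. A sketch against the raw table above, pulling only the geocoded tweets into the relational schema from your question (the field names follow the Twitter JSON layout and are assumptions):

```python
import json
import sqlite3

src = sqlite3.connect("raw.db")
dst = sqlite3.connect("tweets.db")  # the user/geo schema sketched in the question

for (payload,) in src.execute("SELECT payload FROM raw"):
    tweet = json.loads(payload)
    geo = tweet.get("geo")
    if not geo:
        continue  # only geocoded tweets matter for this analysis
    lat, lon = geo["coordinates"]  # assumed [lat, lon], as in Twitter's geo field
    dst.execute(
        "INSERT INTO geo (user_id, lat, lon, ts) VALUES (?, ?, ?, ?)",
        (int(tweet["user"]["id_str"]), lat, lon, tweet["created_at"]),
    )
dst.commit()
```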

Make sense?