How to store data having a fixed number of columns and a variable number of rows

I have posted this question in the stack overflow site and a more experienced user than me suggested I post it here.

I am quite new at collecting data and am collecting a new dataset using R from online media articles together with each article's disqus comments and other related information. My "record" will have 12 columns of information (see image below). My confusion lies with the fact that these media articles might have anything from 0 to 500 comments (maybe more). I would like to save this data as efficiently as possible and for now the only idea to my inexperienced knowledge is to save each in its own .csv file although I'm not sure about this.

And that is where I'm stumped. My first thought was to divide each record in two: the first part would contain six columns and will always have one row and the second part will contain another six columns with a varying number of rows. A sort of ID will link the two together. This would possibly be the "dirty" solution.

What I would like to kindly ask the Community is about newer technologies (for examples: Mongodb, NoSQL. Or ust SQL?) to store this sort of data. Eventually, this data will not be online as it will be queried or processed using some or all of its elements of data.

The image below shows an example of part of a data record starting out with 1 row but rows increase according to the number of comments in an article. This particular record would have 23 rows in the last 6 columns if all data is entered.

It would greatly help me to have an idea on a way forward rather than having to test or research a technology only to find that it is not what I need. Any comments or suggestions are most welcome.

create table articles ( article_id integer PRIMARY KEY, title varchar(256) NOT NULL, url varchar(256) NOT NULL, summary varchar(256) NOT NULL, body varchar(4000) NOT NULL ); create table article_comments ( comment_id integer PRIMARY KEY, -- primary key, unique, does not map to the "comment number" in your question article_id integer NOT NULL, -- references articles.article_id screen_name varchar(100), disqus_url varchar(256), comment_dt datetime, -- both the date & time from your spreadsheet comment_text varchar(4000), comment_url varchar(256), comment_post_id varchar(25) );

Best Answer

Based on your image, something like the following will be fine (using a RDBMS solution):

Basically, there's a one-to-many relationship between articles and comments.

You don't say what RDBMS you'd want to use, so the data types will need changing accordingly (bigger VARCHARs/TEXT columns etc will be needed for long articles/comments etc). Constraint DDL is also largely RDBMS-dependent, so you'll have to look the syntax up for yourself once you've made a decision. I can recommend Postgres.

Best Answer

Related Solutions

Mongodb – How do databases store index key values (on-disk) for variable length fields

Postgresql – Behavior of PostgreSQL COPY FROM

Related Question