How to store data having a fixed number of columns and a variable number of rows

csvnosql

I have posted this question in the stack overflow site and a more experienced user than me suggested I post it here.

I am quite new at collecting data and am collecting a new dataset using R from online media articles together with each article's disqus comments and other related information. My "record" will have 12 columns of information (see image below). My confusion lies with the fact that these media articles might have anything from 0 to 500 comments (maybe more). I would like to save this data as efficiently as possible and for now the only idea to my inexperienced knowledge is to save each in its own .csv file although I'm not sure about this.

And that is where I'm stumped. My first thought was to divide each record in two: the first part would contain six columns and will always have one row and the second part will contain another six columns with a varying number of rows. A sort of ID will link the two together. This would possibly be the "dirty" solution.

What I would like to kindly ask the Community is about newer technologies (for examples: Mongodb, NoSQL. Or ust SQL?) to store this sort of data. Eventually, this data will not be online as it will be queried or processed using some or all of its elements of data.

The image below shows an example of part of a data record starting out with 1 row but rows increase according to the number of comments in an article. This particular record would have 23 rows in the last 6 columns if all data is entered.

It would greatly help me to have an idea on a way forward rather than having to test or research a technology only to find that it is not what I need. Any comments or suggestions are most welcome.

enter image description here

Best Answer

Based on your image, something like the following will be fine (using a RDBMS solution):

create table articles
(
  article_id     integer PRIMARY KEY,
  title          varchar(256) NOT NULL,
  url            varchar(256) NOT NULL,
  summary        varchar(256) NOT NULL,
  body           varchar(4000) NOT NULL
);

create table article_comments
(
  comment_id    integer PRIMARY KEY, -- primary key, unique, does not map to the "comment number" in your question
  article_id    integer NOT NULL, -- references articles.article_id
  screen_name   varchar(100),
  disqus_url    varchar(256),
  comment_dt    datetime, -- both the date & time from your spreadsheet
  comment_text  varchar(4000),
  comment_url   varchar(256),
  comment_post_id varchar(25)
);

Basically, there's a one-to-many relationship between articles and comments.

You don't say what RDBMS you'd want to use, so the data types will need changing accordingly (bigger VARCHARs/TEXT columns etc will be needed for long articles/comments etc). Constraint DDL is also largely RDBMS-dependent, so you'll have to look the syntax up for yourself once you've made a decision. I can recommend Postgres.