How to set up a distributed data storage system for The Real Junkfood Project

database-design, database-recommendation, users

Context

The Real Junkfood Project consists of a network of pay as you feel (PAYF) cafes. We need to log the data from each of ~100 cafes, each generating perhaps 50 rows per day. I estimate that this could (if we expand) generate around 1.8 million rows per year. How best to store this data in a system that lets computer noobs log their data in a way that is as easy and user-friendly as possible? That's the essence of my question. Details below.

Introduction to data storage

I know that the appropriate way to store data is described by the term 'tidy data'
[@Wickham2014]. The rules for data stored in this way are (illustrated in the sketch after the list):

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.
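
For example, here is a minimal sketch of what those rules imply for the food-item data (the column names are placeholders I have made up, not the actual TRJFP fields):

```python
# Rules 1 and 2: each intercepted food item is one row (observation) and each
# variable is one column. Column names are illustrative placeholders.
food_items = [
    {"cafe_id": 1, "date": "2015-03-02", "item": "bread",  "weight_kg": 2.0},
    {"cafe_id": 1, "date": "2015-03-02", "item": "apples", "weight_kg": 5.5},
    {"cafe_id": 2, "date": "2015-03-02", "item": "milk",   "weight_kg": 1.0},
]

# Rule 3: cafes are a different observational unit, so their details live in a
# separate table linked by cafe_id rather than being repeated on every row.
cafes = [
    {"cafe_id": 1, "name": "Example cafe A"},
    {"cafe_id": 2, "name": "Example cafe B"},
]
```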

TRJFP data

The data collected by the TRJFP operates at several levels, so several
tables are needed. These are:

  1. Food items intercepted
  2. Daily totals (e.g. meals served, finances)
  3. Interception points
  4. Cafes

The first two tables are usually collected daily by each cafe. The 3rd table,
on 'interception points' (where food is picked up from),
is also maintained by each cafe but is only updated
when a new interception point is logged. The 4th table provides a global overview
of all the cafes and is updated every time a new cafe is added.

By far the largest table in the database will be table 1, which could store perhaps 50
new items per cafe per day. Assuming (optimistically) that in the future cafes are diligent in
logging the data, we could log data from 100 cafes. That would be 5,000
rows per day, or around 1.8 million rows of data per year. This is not really
Big Data, but it is worth thinking carefully about how it is stored.

The variables associated with each table are described below.
Critical to the usability of the data is that the datasets can be
linked, so the linking variables that connect each table are also described.
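
To illustrate how those linking variables might work, here is a rough sketch written as SQLite for concreteness. All table and column names are illustrative, not a final schema; the real column lists would come from the variable descriptions below.

```python
import sqlite3

# Illustrative only: the point is the linking variables. Tables 1-3 all carry
# a cafe_id, and each food item also references its interception point.
conn = sqlite3.connect("trjfp.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS cafes (               -- table 4: one row per cafe
    cafe_id    INTEGER PRIMARY KEY,
    name       TEXT,
    city       TEXT
);

CREATE TABLE IF NOT EXISTS interception_points ( -- table 3: where food is picked up
    point_id   INTEGER PRIMARY KEY,
    cafe_id    INTEGER REFERENCES cafes(cafe_id),
    supplier   TEXT
);

CREATE TABLE IF NOT EXISTS daily_totals (        -- table 2: one row per cafe per day
    cafe_id    INTEGER REFERENCES cafes(cafe_id),
    date       TEXT,
    meals_served  INTEGER,
    donations_gbp REAL,
    PRIMARY KEY (cafe_id, date)
);

CREATE TABLE IF NOT EXISTS food_items (          -- table 1: the large, fast-growing one
    item_id    INTEGER PRIMARY KEY,
    cafe_id    INTEGER REFERENCES cafes(cafe_id),
    point_id   INTEGER REFERENCES interception_points(point_id),
    date       TEXT,
    description TEXT,
    weight_kg  REAL
);
""")
conn.close()
```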

Options

There are 3 main options for storing these tables, from simple to complex.

  1. On Google Fusion Tables
  2. As regularly backed-up .csv files managed through a web application framework such as Shiny
  3. On a fully fledged database system, such as PostgreSQL

Best Answer

Generally speaking, I would recommend a fully fledged relational database. This will give you everything you need and allow you to grow gracefully.

As for which exact engine to use, that depends on personal preference, budgetary constraints, hosting requirements, compatibility with the application layer, etc. Personally I would probably use the Microsoft stack (SQL Server database, C# for the application layer, etc.) because that is what I'm most familiar with and would be most efficient coding in. Others may choose other platforms for their own reasons. The truth is that any mainstream relational database engine will support your needs.

I wouldn't recommend .csv files, because that approach is highly likely to become unmanageable. Given only those two options, I would overwhelmingly recommend SQLite over .csv files. Both are lightweight, but SQLite at least offers some database functionality. SQLite's own documentation gives a good summary of what it is good for: basically, local storage and/or replacing CSV/custom data storage files. It is not designed to replace client/server SQL database engines (SQL Server, Oracle, PostgreSQL, etc.).
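
To make that comparison concrete, here is a rough sketch of the kind of daily logging that would otherwise be appended to a .csv file, written against a local SQLite file instead (the file and column names are just examples):

```python
import sqlite3

# Same single-file, zero-server convenience as a .csv, but with types,
# constraints and SQL queries. File and column names are illustrative.
conn = sqlite3.connect("cafe_log.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS food_items "
    "(cafe_id INTEGER, date TEXT, description TEXT, weight_kg REAL)"
)

# Logging a new intercepted item is one parameterised insert...
conn.execute(
    "INSERT INTO food_items VALUES (?, ?, ?, ?)",
    (1, "2015-03-02", "bread", 2.0),
)
conn.commit()

# ...and a question like "how much food did cafe 1 intercept this month?"
# becomes a query rather than a spreadsheet exercise.
total = conn.execute(
    "SELECT SUM(weight_kg) FROM food_items "
    "WHERE cafe_id = ? AND date LIKE '2015-03%'",
    (1,),
).fetchone()[0]
print(total)
conn.close()
```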

I also wouldn't recommend Google Fusion Tables, because I believe it will become harder to manage as you grow. I could be wrong, but that is my general experience with those types of solutions. Also, Fusion Tables is an experimental app, so I would be worried about what happens if the "experiment" fails...

I wouldn't recommend NoSQL, because I don't think you will gain much, if anything, and relational-style reporting will be complicated.
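
By relational-style reporting I mean cross-table queries like the one sketched below (meals served per cafe per month, against the illustrative schema from the question), which is a single join-and-aggregate in SQL but typically has to be stitched together by hand in a NoSQL store:

```python
import sqlite3

# Assumes the illustrative trjfp.db sketched in the question, with the
# cafes and daily_totals tables populated.
conn = sqlite3.connect("trjfp.db")
rows = conn.execute("""
    SELECT c.name, substr(d.date, 1, 7) AS month, SUM(d.meals_served) AS meals
    FROM daily_totals AS d
    JOIN cafes AS c ON c.cafe_id = d.cafe_id
    GROUP BY c.name, month
    ORDER BY c.name, month
""").fetchall()
for name, month, meals in rows:
    print(name, month, meals)
conn.close()
```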