How to model data from arbitrary CSVs in a database

database-design database-recommendation

I am setting up a web interface for data management. Users upload CSV or similarly structured files, and I want to store these in the database so they can do manipulations on them — filtering, sorting, graphing, etc.

I do not know how to model this properly in a database. I have several ideas, but none seem to be the right way to do it.

1. Make a new table for each uploaded CSV. This would mean each column could be appropriately typed (integers, strings, dates, etc.) and each record would trivially correspond to a line in the CSV file. This seems like a natural conceptualization of the problem, but would performance become an issue if I had to make a new table for each uploaded file?
2. Make a table where each record represents a dataset (CSV) and have other tables where data points have the id of their dataset in their record. This would mean that all the data from a given dataset is spread across different tables and there would be lots of redundancy (since each data point would store the id of the dataset). However, it would mean tables would not have to be created per dataset.
3. Other variations on 2. Most of my other thoughts were variations on number 2 with various amounts of indirection.
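
To make option 2 concrete, here is a minimal sketch of what it might look like. It uses Python's built-in sqlite3 module as a stand-in for whatever database is actually in use, and the table and column names (datasets, data_points, row_num, col_name, value) are made up for illustration. Note that every value ends up stored as text and that rebuilding a row takes a pivot query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasets (
    dataset_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE data_points (
    dataset_id INTEGER NOT NULL REFERENCES datasets(dataset_id),
    row_num    INTEGER NOT NULL,
    col_name   TEXT    NOT NULL,
    value      TEXT,               -- one column must hold every type
    PRIMARY KEY (dataset_id, row_num, col_name)
);
""")

# One uploaded CSV with columns x and y and two rows (hypothetical data).
conn.execute("INSERT INTO datasets VALUES (1, 'example.csv')")
conn.executemany(
    "INSERT INTO data_points VALUES (1, ?, ?, ?)",
    [(1, "x", "10"), (1, "y", "2.5"), (2, "x", "9"), (2, "y", "3.5")],
)

# Rebuilding the original rows means pivoting col_name back into columns.
rows = conn.execute("""
SELECT row_num,
       MAX(CASE WHEN col_name = 'x' THEN value END) AS x,
       MAX(CASE WHEN col_name = 'y' THEN value END) AS y
FROM data_points
WHERE dataset_id = 1
GROUP BY row_num
ORDER BY row_num
""").fetchall()
```

The values come back as strings, so numeric filtering and sorting would need a cast on every query: compared as text, "10" sorts before "9".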

My question is essentially "How do I model this properly?", that is, with the ability to scale reasonably.

Most of the data will be scientific, so how do I deal with many data sets of sizes varying from trivial (say, 10 columns and 100 rows) to massive (hundreds of columns and thousands/millions of rows)?

tl;dr: How do I model arbitrary data from an arbitrary number of well-formed CSVs in a database, and would a new table for each CSV perform acceptably?

Best Answer

The question is so wide open that it is hard to say much.

And why are you using a database at all? Excel does everything you've described so far.

Seriously though, avoid making an inner system in your database that can store any kind of data. You have one already, called a DATABASE. Make code that creates the tables you need. What's so bad about that? I say option 1.
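
As a rough illustration of "code that creates the tables you need", here is a minimal sketch, not a definitive implementation. It assumes Python with the built-in sqlite3 module standing in for the real database, a header row in every file, and well-formed rows. The helper names (quote_ident, infer_type, load_csv) are made up for this example, and the type inference only distinguishes INTEGER, REAL and TEXT; dates would need extra handling.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote a table or column name; identifiers cannot be bound as parameters."""
    return '"' + name.replace('"', '""') + '"'


def infer_type(values):
    """Return the narrowest of INTEGER, REAL, TEXT that fits all non-empty values."""
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for value in values:
                if value != "":
                    cast(value)
        except ValueError:
            continue  # this type does not fit, try the next one
        return sql_type
    return "TEXT"


def load_csv(conn, table, text):
    """Create `table` from CSV text (header row first) and insert its rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    types = [infer_type([row[i] for row in data]) for i in range(len(header))]
    columns = ", ".join(f"{quote_ident(h)} {t}" for h, t in zip(header, types))
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({columns})")
    marks = ", ".join("?" for _ in header)
    # Empty cells are stored as NULL rather than as empty strings.
    conn.executemany(
        f"INSERT INTO {quote_ident(table)} VALUES ({marks})",
        [[value if value != "" else None for value in row] for row in data],
    )
    return types
```

Usage would be something like load_csv(conn, "upload_1", file_contents), with one call and one table per uploaded file. After that, filtering and sorting are ordinary queries against typed columns.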

You know, if you want infinite flexibility how about a square-mile white board?

Related Solutions

Sql-server – How to select a subset of data from a large ER model

Unfortunately there isn't a cookie-cutter way to do this. You have to start with some base records, write queries that join against them and then repeat. As you go deeper the queries get more complex as they have to go right back to the root.

If you have the foreign keys then you can automate the generation of the queries, as the joins will just be through the foreign key columns. It's not as hard as it sounds, but here are a few pointers if you take this approach:

The queries are only going to retrieve the contents of the table at their leaf, so they are fairly straightforward to generate.
Get the data out in a form that will load through bcp. As a bonus, you can generate bcp control files from the system metadata.
You can probably quite safely just copy all of the reference data, which probably accounts for the majority of your tables anyway.
You may have to start from multiple roots, and you may have to manually add join metadata if you have missing foreign keys. If your relationships to your leaf tables are not purely hierarchical then you may have to manually edit the metadata for the generator. In practice this means you probably want to start from the system data dictionary and then make up a configuration file for your generator that works around this.
You may also need to add custom queries in for certain data items if the selection criteria involve more than one path to the record.
Generate or write a script that uses bcp to load the tables to the destination in the right order.
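
The query generation described in the pointers above can be sketched roughly as follows. This is only an illustration in Python: the function name subset_query, the shape of the parents mapping and the Customers/Orders/OrderLines tables are all hypothetical. In SQL Server the foreign key metadata could be read from the sys.foreign_keys and sys.foreign_key_columns catalog views. The sketch assumes a purely hierarchical path from leaf to root with single-column keys, and it emits a `?` placeholder for the starting key.

```python
def subset_query(leaf, parents, root, key_column):
    """Build a SELECT for `leaf` that joins back to `root` through foreign keys.

    `parents` maps child table -> (parent table, FK column in child,
    key column in parent). Raises KeyError if the path to `root` is missing.
    """
    joins = []
    table = leaf
    while table != root:
        parent, child_col, parent_col = parents[table]
        joins.append(
            f"JOIN {parent} ON {table}.{child_col} = {parent}.{parent_col}"
        )
        table = parent
    # Only the leaf table's columns are selected, as noted above.
    lines = [f"SELECT {leaf}.*", f"FROM {leaf}"] + joins
    lines.append(f"WHERE {root}.{key_column} = ?")
    return "\n".join(lines)


# Hypothetical metadata for a three-level hierarchy.
parents = {
    "OrderLines": ("Orders", "OrderID", "OrderID"),
    "Orders": ("Customers", "CustomerID", "CustomerID"),
}
print(subset_query("OrderLines", parents, "Customers", "CustomerID"))
```

Missing foreign keys or multiple paths to a record would have to be patched into the parents mapping by hand, which is where the configuration file mentioned above comes in.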

Automation will probably get you 95% of the way with some manual intervention and the scripts aren't all that hard to write. It's not a 5 minute job, but it's certainly possible.
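
Working out "the right order" for the load is a topological sort over the foreign key dependencies. A minimal sketch, assuming Python 3.9+ (for the standard graphlib module) and a hypothetical depends_on mapping built from the same metadata as the queries:

```python
from graphlib import TopologicalSorter


def load_order(depends_on):
    """Return table names so that every referenced table comes before its children.

    `depends_on` maps each table to the set of tables its foreign keys point at.
    graphlib raises CycleError if the foreign keys form a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies: OrderLines references Orders and Products, etc.
depends_on = {
    "OrderLines": {"Orders", "Products"},
    "Orders": {"Customers"},
}
order = load_order(depends_on)
```

Tables that only appear as parents (Customers and Products here) are still included in the result. A cycle in the foreign keys is one of the cases that would need manual intervention.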

You could programmatically generate SSIS packages to do the load, but I think that generating bcp control files is probably easier. Another option would be to just structure the scripts so they copy the data from the queries into another shadow database. This can then be backed up and restored to wherever you want.

It's not a trivial undertaking but it's certainly not beyond the wit of man. As a bonus, if you can configure your script so the starting keys can be parameterised, then you can make a generalised utility to copy subsets of your application database. This will be quite useful for rolling out test environments.

Record with variable number of columns

This sounds like a really simple one-to-many relationship.

For SQL Server, I would write this like:

CREATE TABLE Devices
(
    DeviceID INT
    , DeviceName nvarchar(255)
);

CREATE TABLE Cards
(
    CardID INT
    , CardName nvarchar(255)
    , DeviceID INT
);

CREATE TABLE Ports
(
    PortID INT
    , PortName nvarchar(255)
    , CardID INT
);


INSERT INTO Devices VALUES (1, 'Test Device 1');
INSERT INTO Devices VALUES (2, 'Test Device 2');
INSERT INTO Cards VALUES (1, 'Card 1 in Test Device 1', 1);
INSERT INTO Cards VALUES (2, 'Card 2 in Test Device 1', 1);
INSERT INTO Cards VALUES (3, 'Card 1 in Test Device 2', 2);
INSERT INTO Cards VALUES (4, 'Card 2 in Test Device 2', 2);
INSERT INTO Ports VALUES (1, 'Port in Card 1, Device 1', 1);
INSERT INTO Ports VALUES (2, 'Port in Card 2, Device 2', 4);

SELECT *
FROM Devices;

SELECT *
FROM Cards;

SELECT *
FROM Ports;

This allows a Device to have multiple Cards, which in turn can have multiple Ports.

The results:

[Image: result sets of the three SELECT statements]

The 3 tables can be JOINed together like this:

SELECT DeviceName, CardName, PortName
FROM Devices 
    INNER JOIN Cards ON Devices.DeviceID = Cards.DeviceID
    INNER JOIN Ports ON Cards.CardID = Ports.CardID
ORDER BY DeviceName, CardName, PortName;

[Image: results of the INNER JOIN query]

If you use LEFT JOIN like this:

SELECT DeviceName, CardName, PortName
FROM Devices 
    LEFT JOIN Cards ON Devices.DeviceID = Cards.DeviceID
    LEFT JOIN Ports ON Cards.CardID = Ports.CardID
ORDER BY DeviceName, CardName, PortName;

you get these results:

[Image: results of the LEFT JOIN query]
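
To see the difference between the two joins without SQL Server at hand, the same tables and rows can be loaded into an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for SQL Server, and TEXT replaces nvarchar(255).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Devices (DeviceID INT, DeviceName TEXT);
CREATE TABLE Cards   (CardID INT, CardName TEXT, DeviceID INT);
CREATE TABLE Ports   (PortID INT, PortName TEXT, CardID INT);

INSERT INTO Devices VALUES (1, 'Test Device 1'), (2, 'Test Device 2');
INSERT INTO Cards VALUES
    (1, 'Card 1 in Test Device 1', 1),
    (2, 'Card 2 in Test Device 1', 1),
    (3, 'Card 1 in Test Device 2', 2),
    (4, 'Card 2 in Test Device 2', 2);
INSERT INTO Ports VALUES
    (1, 'Port in Card 1, Device 1', 1),
    (2, 'Port in Card 2, Device 2', 4);
""")

# The same query text is run twice, once per join type.
query = """
SELECT DeviceName, CardName, PortName
FROM Devices
    {join} Cards ON Devices.DeviceID = Cards.DeviceID
    {join} Ports ON Cards.CardID = Ports.CardID
ORDER BY DeviceName, CardName, PortName
"""
inner = conn.execute(query.format(join="INNER JOIN")).fetchall()
left = conn.execute(query.format(join="LEFT JOIN")).fetchall()

for row in left:
    print(row)
```

The INNER JOIN returns only the two cards that have a port. The LEFT JOIN returns all four cards, with PortName coming back as None (NULL) for the two cards without ports.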

This is an image showing the table relationships:

[Image: diagram of the relationships between Devices, Cards and Ports]
