Windows – What platform for a large record based mathematical database

database-recommendationrecordwindows

Hi I am a civil engineer with some experience of programming but I am not familiar with the vast range of options available today. Hope you could give me any pointers the best way to proceed.

I want to make and query a database of ground level survey measurements in a grid format. There will be many measurements for each grid location at various times through the earthmoving job so there is a 4th dimension of time.

The observations will most likely be read in from a text file. In each record there will be an (2 x integer) grid position (row and column) a (floating point) ground level and various string information codes (maybe up to 30 characters in total).

The grids could be about 10000 rows x 10000 columns. Not every location on the grid would be have a record in each survey, but they would typically have up to a hundred records. A lot of the grid locations would have no records at all (the site will not be perfectly rectangular).

I want to search the records, extract data and do calculations eg calculate the lowest or highest ground level for each grid location.
I am fairly confident I would have the ability to program this fairly simply in a language like FORTRAN, BASIC or C using arrays. A lot of the array elements would be empty though and I am guessing that this is not the right way to do it though and big databases like this need special tools that I will have to learn how to use.

I am thinking of possible options for the platform –

  1. Use a database program. I am not familiar with how powerful these can be but I imagine they would have a lot of overhead with GUI.

  2. Use SQL? This I don't know much about but it seems to be the language for databases. I have always used imperative languages rather than declarative and as I understand from wikipedia that SQL is declarative I am a bit nervous of the change. I don't fully understand the process for using it. Is there a compiler that makes console programs? Is the database stored on disk? Sorry for such stupid questions.

  3. Use an API like c-treeACE? I think this may be the way to go offering me the familiarity of a "do this, then do that" language (unfortunately this is the way I think as an engineer!). But I am hoping that the behind the scenes memory and processing management offered by the API would be superior to what I could achieve with huge arrays.

  4. Or could i do it with an object orientated language and let the computer worry about the storage requirements. eg if i stored the records as objects with methods and properties that would help me get the results I need out of each record – would it be a huge bloated program compared to 3)

There are likely to be hundreds of millions of records and I want to be able to query and process them in minutes not hours (preferably seconds!) on a modern PC running windows.
To be more specific mine is an i7 processor with 6Gb ram and 120Gb SSD running Windows 7 64 bit.

Hope someone has time to share a couple of words of wisdom with a newbie.

Best Answer

There are a number of options and please don't limit yourself to my answer here. In particular you may find array-native databases to be of help. My answer is going to be specifically about your questions on SQL-based databases.

It sounds to me like this is a question of geospacial information. SQL-based databases are in fact used in such fields quite well, but this is also a specialist field within databases.

Among the SQL databases in this area, PostgreSQL, with the PostGIS add-on is considered one of the best. If I were you, this is where I would start. The primary advantage of SQL is that it preserves flexibility down the road regarding re-use of your data for uses you haven't thought of yet. Doing this with good geospacial support means that you can calculate distance across a large area without worrying about the specifics of spherical trig.

Of course this only becomes a factor with very large grids. For smaller grids, where the curvature of the earth can be disregarded, PostgreSQL also has a range of geometric types including points on a coordinate system which can be used. I mention this because it isn't clear how large of an area is being surveyed and whether one can assume plane geometry or not.

Even so PostGIS may still simplify things by allowing representations and calculations on 3- and 4-dimensional geometric coordinate systems.

Also note that you say your sites are not necessarily square. In PostgreSQL one thing you can do (either using the geometric types or PostGIS) is define a non-rectangular boundary to each site so you can check to make sure a point is inside the bounds of the site before saving the measurement.

Declarative Language Impacts

This concern I think is overblown. People can and do write SQL queries as if they are a part of the imperative language of the program they are calling them from. For most of your queries it won't matter.

What people mean by a declarative language is that within a query, the structure tells the database what information you want, not how to get it. This is important when you want complex information from the database because basically it means that if you can ask the right question (and your data is valid) you will get the right answer.

The big difference that occurs however is that long SQL queries are easier to debug than long imperative subroutines, simply because one can more quickly narrow down where in the query the malfunction occurs.

How this would work

Chances are if you go this route you'd have a database and a program written in a language of your choice. The program would send queries to the database and get the answers back. You can also (in PostgreSQL and many other relational DB's) put your queries inside functions which can then be called by the application, giving more of an imperative or functional interface. Data would be stored on disk and accessed from a separate piece of software than your program. You could also connect with another program (from MS Access to pgAdmin) and run queries or generate reports.

In essence you can think of the RDBMS as a "math engine" which manages your data, and your program interacts with it to do what you need.

Related Question