There is one big reason you should think about partitioning the table.
Every index you have on a giant table, even just one index, can generate a lot of CPU load and disk I/O just to perform index maintenance when executing INSERTs, UPDATEs, and DELETEs.
I wrote an earlier post back on October 7, 2011 on why Table Partitioning would be a big help. Here is one excerpt from my past post:
Partitioning of data should serve to group data that belong logically and
cohesively in the same class. Performance of searching each partition
need not be the main consideration as long as the data is correctly
grouped. Once you have achieved the logical partitioning, then
concentrate on search time. If you are just separating data by id
only, it is possible that many rows of data may never be accessed for
reads or writes. Now, that should be a major consideration: Locate all
ids most frequently accessed and partition by that. All less
frequently accessed ids should reside in one big archive table that is
still accessible by index lookup for that 'once in a blue moon' query.
You can read my entire post on this later.
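To make the excerpt concrete, here is a hedged sketch of partitioning by id range in MySQL 5.1+. The table name, id boundaries, and partition names are all hypothetical; note also that MySQL requires the partitioning column to be part of every unique key on the table.

```sql
-- Hypothetical sketch: keep frequently accessed id ranges in
-- small partitions and everything else in one catch-all partition.
ALTER TABLE mytable
PARTITION BY RANGE (id) (
    PARTITION p_hot1    VALUES LESS THAN (100000),
    PARTITION p_hot2    VALUES LESS THAN (200000),
    PARTITION p_archive VALUES LESS THAN MAXVALUE
);
```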
To cut right to the chase, you need to research and find out what data is rarely used in your 10GB table. That data should be placed in an archive table that is readily accessible should you need ad hoc queries of a historical nature. Migrating that archival data out of the 10GB table, followed by OPTIMIZE TABLE
on the 10GB table, can result in a Working Set that is faster for SELECTs, INSERTs, UPDATEs, and DELETEs. Even DDL would go faster on a 2GB Working Set than on a 10GB table.
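As a hedged sketch of that migration (table and column names are hypothetical; I'm assuming a `created_at` column and a 90-day cutoff as the "rarely used" criterion):

```sql
-- Create the archive with the same structure as the source table.
CREATE TABLE mytable_archive LIKE mytable;

-- Copy the cold rows out, then delete them from the working set.
INSERT INTO mytable_archive
SELECT * FROM mytable
WHERE created_at < NOW() - INTERVAL 90 DAY;

DELETE FROM mytable
WHERE created_at < NOW() - INTERVAL 90 DAY;

-- Defragment and shrink the working set after the mass delete.
OPTIMIZE TABLE mytable;
```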
UPDATE 2012-02-24 16:19 EDT
Two points to consider
- From your comment, it sounds like normalization is what you may need.
- You may need to migrate everything over 90 days old into an archive table but still access the archive and working set at the same time. If your data is all MyISAM, I recommend using the MERGE storage engine. You create the MERGE table map once; it unites a working-set MyISAM table and an archive MyISAM table. You would keep data less than 91 days old in one MyISAM table and roll over any data more than 90 days old into the archive. You would query only the MERGE table map.
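A minimal sketch of that setup, with hypothetical table and column names; both base tables must be MyISAM with identical column and index definitions:

```sql
-- Working set: rows 90 days old or less.
CREATE TABLE mydata_working (
    id         INT NOT NULL,
    created_at DATETIME NOT NULL,
    payload    VARCHAR(255),
    KEY idx_created (created_at)
) ENGINE=MyISAM;

-- Archive: rows more than 90 days old (same structure).
CREATE TABLE mydata_archive LIKE mydata_working;

-- The MERGE "table map": query this instead of the base tables.
CREATE TABLE mydata (
    id         INT NOT NULL,
    created_at DATETIME NOT NULL,
    payload    VARCHAR(255),
    KEY idx_created (created_at)
) ENGINE=MERGE UNION=(mydata_working, mydata_archive)
  INSERT_METHOD=FIRST;  -- new rows land in mydata_working
```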
Here are two posts I made on how to use it:
Here is an additional post I made on tables with a lot of columns
Too many columns in MySQL
Because you're doing this to learn database design, I'd get strict. I'd also recommend Louis Davidson's book on database design. He focuses on SQL Server, but it's a really well-done book no matter which database platform you're using, and his chapter on normalization is outstanding.
Your user table has repeating attributes: two mail configuration IDs. In a real-world business scenario, this might slide, but it is a violation of first normal form. The way to correct that is to break those out into a separate table, turning columns into rows.
Looking at hosts and ports, there are a couple of possibilities. Do you want to limit appropriate ports, or "know" which port is which (e.g., 25 for SMTP, 110 for POP3, 143 for IMAP)? If so, create a ports table. In that case, you could use either a surrogate key or the port number as the primary key because well-known mail ports are distinct. On the other hand, because servers can use "weird" ports, you might not be able to create that table (or you may need to leave it open for users to add new ports to the list). In a business application, this would be a business decision. In an academic application, I'd probably create the table and make a foreign key connection to the mail server configuration table.
Regarding hosts, you have a similar question. If you want to limit hosts, check for common misspellings, or be able to display more information about the particular host (e.g., pop3.google.com's application name is GMail), you would want to create a Host lookup table, as well as application logic which lets users create new hosts. But there's something else to think about with hosts: how far down the rabbit hole do you want to go? Are you going to analyze based on the host's top-level domain (TLD)? If so, you would want to break out the domain into its own table. In fact, you might have three separate tables: host prefixes (mail, pop3, smtp, imap, etc.), host name (google, yahoo, etc.), and host top-level domains (.com, .co.uk, etc.). Your Host table would then have a Prefix ID, a Hostname ID, and a TLD ID and you could build the final hostname up from that. Whether this makes any sense at all for you to do depends upon the application: if you're using the mail server information to shoot e-mails out but aren't analyzing it further, then this is going too far; otherwise, if you do need to split out hostnames or domains as part of your application, keeping it all as "pop3.google.com" would be a violation of first normal form.
Finally, looking at the mail server configuration table, you have port on there. If your users have different credentials for receiving and sending e-mails, that would make sense, but otherwise, you end up duplicating data: username Bob with password PW at mail.yahoo.com will be the same regardless of whether he's connecting to port 25 or 110. The port really is more for communication than authorization, so it makes some sense here to split it out. You'll still have some "duplication" of sorts because of hosts that have pop3.hostname.com and smtp.hostname.com servers, but I wouldn't worry too much about that, especially if we don't need to split hosts out.
Assuming that you don't need to split out hosts any further, the table design would look a bit like the following:
Host
----------------------------
HostId smallint (PK; surrogate)
Name varchar(75) (AK1)
[additional host attributes]
Port
----------------------------
PortId smallint (PK; surrogate)
Number int (AK1)
Name varchar(20)
Direction { Inbound; Outbound }
----------------------------
DirectionId tinyint (PK; surrogate)
Name varchar(8) (AK1)
User
----------------------------
UserId int (PK; surrogate)
[additional user attributes]
UserHostCredential
----------------------------
UserHostCredentialId int (PK; surrogate)
HostId smallint (FK; AK1)
UserId int (FK; AK1)
Username
Password
UserMailConnection
----------------------------
UserMailConnectionId int (PK; surrogate)
UserHostCredentialId int (FK; AK1)
DirectionId tinyint (FK; AK1)
PortId smallint (FK)
AK here means alternate key. You'd put a unique key constraint wherever you see that. I numbered them to show you instances in which there are multi-column alternate keys, like mail server configuration.
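To make the AK notation concrete, here is a hedged MySQL-flavored DDL sketch of the tables above. The Username/Password sizes, NOT NULL choices, and AUTO_INCREMENT surrogates are my assumptions, not part of the design itself:

```sql
CREATE TABLE Host (
    HostId SMALLINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Name   VARCHAR(75) NOT NULL,
    UNIQUE KEY AK1 (Name)
    -- additional host attributes
);

CREATE TABLE Port (
    PortId SMALLINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Number INT NOT NULL,
    Name   VARCHAR(20),
    UNIQUE KEY AK1 (Number)
);

CREATE TABLE Direction (
    DirectionId TINYINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(8) NOT NULL,   -- 'Inbound' or 'Outbound'
    UNIQUE KEY AK1 (Name)
);

CREATE TABLE User (
    UserId INT NOT NULL AUTO_INCREMENT PRIMARY KEY
    -- additional user attributes
);

CREATE TABLE UserHostCredential (
    UserHostCredentialId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    HostId   SMALLINT NOT NULL,
    UserId   INT NOT NULL,
    Username VARCHAR(75) NOT NULL,  -- assumed size
    Password VARCHAR(75) NOT NULL,  -- assumed size
    UNIQUE KEY AK1 (HostId, UserId),
    FOREIGN KEY (HostId) REFERENCES Host (HostId),
    FOREIGN KEY (UserId) REFERENCES User (UserId)
);

CREATE TABLE UserMailConnection (
    UserMailConnectionId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    UserHostCredentialId INT NOT NULL,
    DirectionId TINYINT NOT NULL,
    PortId SMALLINT NOT NULL,
    UNIQUE KEY AK1 (UserHostCredentialId, DirectionId),
    FOREIGN KEY (UserHostCredentialId)
        REFERENCES UserHostCredential (UserHostCredentialId),
    FOREIGN KEY (DirectionId) REFERENCES Direction (DirectionId),
    FOREIGN KEY (PortId) REFERENCES Port (PortId)
);
```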
This is a bit more work than your two-table design, but not really that much more. I would tend toward adding the Host and Port tables simply to get rid of potential update inconsistencies; for example, if Google's new POP3 server is pop3.googlemail.com instead of pop3.google.com, you would have to update all of the rows to match this. But if somebody misspelled a row (or you didn't catch it in your updating process), you now have a data anomaly. In the hosts table, you make one update and you're done.
But one important thing to note (and something that I've tried to hint at up above) is that business rules shape database design. I'm working from one set of assumptions, some of which may not exactly match a particular scenario. For example, I'm assuming that a username and password typically should relate to one person at one host. But if a number of different people all use the same credentials (for whatever reason), this may no longer apply. In that case, I'd move UserId off of UserHostCredential and onto UserMailConnection and then rename UserHostCredential to HostCredential. Both of those are "valid" database designs, and which one is correct depends upon the business purpose.
Similarly, I'm assuming that Port isn't necessarily determinative of direction (inbound or outbound). If you are able to make that determination--e.g., that 25 is always outbound, and there are no ports which are both inbound and outbound--then the direction stops being an attribute of a particular connection and instead becomes an attribute of the port itself. If that is the case, you would move DirectionId from UserMailConnection up to Port and make PortId part of the alternate key for UserMailConnection. Again, both are valid, fully-normalized table designs, but which business rule holds would determine which is the "correct" one for your purposes.
Best Answer
I'm not sure you need audit tables. I think a simple schema like this would be sufficient:
Each `scan` has a `start` and `end` so you know when it took place. From this you can easily determine which is the most recent scan. Since `hosts` has a foreign key to `scans`, you can always tell which hosts were found with a specific scan. Similar for `ports`.

In this example, `host_id` is simply a numeric identifier generated by the database. If you want to store something like `myhost.com`, add a `hostname` column to `hosts`. Same thing for `ports`.
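A minimal DDL sketch of that schema, with assumed column types; I'm also assuming `ports` hangs off `hosts` rather than directly off `scans`:

```sql
CREATE TABLE scans (
    scan_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `start`  DATETIME NOT NULL,
    `end`    DATETIME            -- NULL while the scan is still running
);

CREATE TABLE hosts (
    host_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    scan_id  INT NOT NULL,
    hostname VARCHAR(255),       -- optional, e.g. 'myhost.com'
    FOREIGN KEY (scan_id) REFERENCES scans (scan_id)
);

CREATE TABLE ports (
    port_id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    host_id     INT NOT NULL,
    port_number INT NOT NULL,
    FOREIGN KEY (host_id) REFERENCES hosts (host_id)
);

-- Finding the most recent scan is then a simple ORDER BY:
SELECT * FROM scans ORDER BY `start` DESC LIMIT 1;
```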