There is one big reason you should think about partitioning the table.
Every index you have on a giant table, even just one index, can generate a lot of CPU load and disk I/O just to perform index maintenance when executing INSERTs, UPDATEs, and DELETEs.
I wrote an earlier post back on October 7, 2011 on why Table Partitioning would be a big help. Here is one excerpt from my past post:
Partitioning of data should serve to group data that belong logically and
cohesively in the same class. Performance of searching each partition
need not be the main consideration as long as the data is correctly
grouped. Once you have achieved the logical partitioning, then
concentrate on search time. If you are just separating data by id
only, it is possible that many rows of data may never be accessed for
reads or writes. Now, that should be a major consideration: Locate all
ids most frequently accessed and partition by that. All less
frequently accessed ids should reside in one big archive table that is
still accessible by index lookup for that 'once in a blue moon' query.
You can read my entire post on this later.
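To make the excerpt concrete, here is a hedged sketch of partitioning by id range in MySQL 5.1+. The table name, id boundaries, and partition names are all hypothetical; note also that MySQL requires the partitioning column to be part of every unique key on the table.

```sql
-- Hypothetical sketch: keep frequently accessed id ranges in
-- small partitions and everything else in one catch-all partition.
ALTER TABLE mytable
PARTITION BY RANGE (id) (
    PARTITION p_hot1    VALUES LESS THAN (100000),
    PARTITION p_hot2    VALUES LESS THAN (200000),
    PARTITION p_archive VALUES LESS THAN MAXVALUE
);
```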
To cut right to the chase, you need to research and find out what data is rarely used in your 10GB table. That data should be placed in an archive table that is readily accessible should you need ad hoc queries of a historical nature. Migrating that archival data out of the 10GB table, followed by OPTIMIZE TABLE
on the 10GB table, can result in a Working Set that is faster for SELECTs, INSERTs, UPDATEs, and DELETEs. Even DDL would go faster on a 2GB Working Set than on a 10GB table.
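As a hedged sketch of that migration (table and column names are hypothetical; I'm assuming a `created_at` column and a 90-day cutoff as the "rarely used" criterion):

```sql
-- Create the archive with the same structure as the source table.
CREATE TABLE mytable_archive LIKE mytable;

-- Copy the cold rows out, then delete them from the working set.
INSERT INTO mytable_archive
SELECT * FROM mytable
WHERE created_at < NOW() - INTERVAL 90 DAY;

DELETE FROM mytable
WHERE created_at < NOW() - INTERVAL 90 DAY;

-- Defragment and shrink the working set after the mass delete.
OPTIMIZE TABLE mytable;
```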
UPDATE 2012-02-24 16:19 EDT
Two points to consider
- From your comment, it sounds like normalization is what you may need.
- You may need to migrate everything over 90 days old into an archive table but still access the archive and working set at the same time. If your data is all MyISAM, I recommend using the MERGE storage engine. You create the MERGE table map once; it unites a working-set MyISAM table and an archive MyISAM table. You would keep data less than 91 days old in one MyISAM table and roll over any data more than 90 days old into the archive. You would query only the MERGE table map.
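A minimal sketch of that setup, with hypothetical table and column names; both base tables must be MyISAM with identical column and index definitions:

```sql
-- Working set: rows 90 days old or less.
CREATE TABLE mydata_working (
    id         INT NOT NULL,
    created_at DATETIME NOT NULL,
    payload    VARCHAR(255),
    KEY idx_created (created_at)
) ENGINE=MyISAM;

-- Archive: rows more than 90 days old (same structure).
CREATE TABLE mydata_archive LIKE mydata_working;

-- The MERGE "table map": query this instead of the base tables.
CREATE TABLE mydata (
    id         INT NOT NULL,
    created_at DATETIME NOT NULL,
    payload    VARCHAR(255),
    KEY idx_created (created_at)
) ENGINE=MERGE UNION=(mydata_working, mydata_archive)
  INSERT_METHOD=FIRST;  -- new rows land in mydata_working
```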
Here are two posts I made on how to use it:
Here is an additional post I made on tables with a lot of columns
Too many columns in MySQL
Because you're doing this to learn database design, I'd get strict. I'd also recommend Louis Davidson's book on database design. He focuses on SQL Server, but it's a really well-done book no matter which database platform you're using, and his chapter on normalization is outstanding.
Your user table has repeating attributes: two mail configuration IDs. In a real-world business scenario, this might slide, but it is a violation of first normal form. The way to correct that is to break those out into a separate table, turning columns into rows.
Looking at hosts and ports, there are a couple of possibilities. Do you want to limit appropriate ports, or "know" which port is which (e.g., 25 for SMTP, 110 for POP3, 143 for IMAP)? If so, create a ports table. In that case, you could use either a surrogate key or the port number as the primary key because well-known mail ports are distinct. On the other hand, because servers can use "weird" ports, you might not be able to create that table (or you may need to leave it open for users to add new ports to the list). In a business application, this would be a business decision. In an academic application, I'd probably create the table and make a foreign key connection to the mail server configuration table.
Regarding hosts, you have a similar question. If you want to limit hosts, check for common misspellings, or be able to display more information about the particular host (e.g., pop3.google.com's application name is GMail), you would want to create a Host lookup table, as well as application logic which lets users create new hosts. But there's something else to think about with hosts: how far down the rabbit hole do you want to go? Are you going to analyze based on the host's top-level domain (TLD)? If so, you would want to break out the domain into its own table. In fact, you might have three separate tables: host prefixes (mail, pop3, smtp, imap, etc.), host name (google, yahoo, etc.), and host top-level domains (.com, .co.uk, etc.). Your Host table would then have a Prefix ID, a Hostname ID, and a TLD ID and you could build the final hostname up from that. Whether this makes any sense at all for you to do depends upon the application: if you're using the mail server information to shoot e-mails out but aren't analyzing it further, then this is going too far; otherwise, if you do need to split out hostnames or domains as part of your application, keeping it all as "pop3.google.com" would be a violation of first normal form.
Finally, looking at the mail server configuration table, you have port on there. If your users have different credentials for receiving and sending e-mails, that would make sense, but otherwise, you end up duplicating data: username Bob with password PW at mail.yahoo.com will be the same regardless of whether he's connecting to port 25 or 110. The port really is more for communication than authorization, so it makes some sense here to split it out. You'll still have some "duplication" of sorts because of hosts that have pop3.hostname.com and smtp.hostname.com servers, but I wouldn't worry too much about that, especially if we don't need to split hosts out.
Assuming that you don't need to split out hosts any further, the table design would look a bit like the following:
Host
----------------------------
HostId smallint (PK; surrogate)
Name varchar(75) (AK1)
[additional host attributes]
Port
----------------------------
PortId smallint (PK; surrogate)
Number int (AK1)
Name varchar(20)
Direction { Inbound; Outbound }
----------------------------
DirectionId tinyint (PK; surrogate)
Name varchar(8) (AK1)
User
----------------------------
UserId int (PK; surrogate)
[additional user attributes]
UserHostCredential
----------------------------
UserHostCredentialId int (PK; surrogate)
HostId smallint (FK; AK1)
UserId int (FK; AK1)
Username
Password
UserMailConnection
----------------------------
UserMailConnectionId int (PK; surrogate)
UserHostCredentialId int (FK; AK1)
DirectionId tinyint (FK; AK1)
PortId smallint (FK)
AK here means alternate key. You'd put a unique key constraint wherever you see that. I numbered them to show you instances in which there are multi-column alternate keys, like mail server configuration.
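To make the AK notation concrete, here is a hedged MySQL-flavored DDL sketch of the tables above. The Username/Password sizes, NOT NULL choices, and AUTO_INCREMENT surrogates are my assumptions, not part of the design itself:

```sql
CREATE TABLE Host (
    HostId SMALLINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Name   VARCHAR(75) NOT NULL,
    UNIQUE KEY AK1 (Name)
    -- additional host attributes
);

CREATE TABLE Port (
    PortId SMALLINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Number INT NOT NULL,
    Name   VARCHAR(20),
    UNIQUE KEY AK1 (Number)
);

CREATE TABLE Direction (
    DirectionId TINYINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(8) NOT NULL,   -- 'Inbound' or 'Outbound'
    UNIQUE KEY AK1 (Name)
);

CREATE TABLE User (
    UserId INT NOT NULL AUTO_INCREMENT PRIMARY KEY
    -- additional user attributes
);

CREATE TABLE UserHostCredential (
    UserHostCredentialId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    HostId   SMALLINT NOT NULL,
    UserId   INT NOT NULL,
    Username VARCHAR(75) NOT NULL,  -- assumed size
    Password VARCHAR(75) NOT NULL,  -- assumed size
    UNIQUE KEY AK1 (HostId, UserId),
    FOREIGN KEY (HostId) REFERENCES Host (HostId),
    FOREIGN KEY (UserId) REFERENCES User (UserId)
);

CREATE TABLE UserMailConnection (
    UserMailConnectionId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    UserHostCredentialId INT NOT NULL,
    DirectionId TINYINT NOT NULL,
    PortId SMALLINT NOT NULL,
    UNIQUE KEY AK1 (UserHostCredentialId, DirectionId),
    FOREIGN KEY (UserHostCredentialId)
        REFERENCES UserHostCredential (UserHostCredentialId),
    FOREIGN KEY (DirectionId) REFERENCES Direction (DirectionId),
    FOREIGN KEY (PortId) REFERENCES Port (PortId)
);
```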
This is a bit more work than your two-table design, but not really that much more. I would tend toward adding the Host and Port tables simply to get rid of potential update inconsistencies; for example, if Google's new POP3 server is pop3.googlemail.com instead of pop3.google.com, you would have to update all of the rows to match this. But if somebody misspelled a row (or you didn't catch it in your updating process), you now have a data anomaly. In the hosts table, you make one update and you're done.
But one important thing to note (and something that I've tried to hint at up above) is that business rules shape database design. I'm working from one set of assumptions, some of which may not exactly match a particular scenario. For example, I'm assuming that a username and password typically should relate to one person at one host. But if a number of different people all use the same credentials (for whatever reason), this may no longer apply. In that case, I'd move UserId off of UserHostCredential and onto UserMailConnection and then rename UserHostCredential to HostCredential. Both of those are "valid" database designs, and which one is correct depends upon the business purpose.
Similarly, I'm assuming that Port isn't necessarily determinative of direction (inbound or outbound). If you are able to make that determination--e.g., that 25 is always outbound, and there are no ports which are both inbound and outbound--then the direction stops being an attribute of a particular connection and instead becomes an attribute of the port itself. If that is the case, you would move DirectionId from UserMailConnection up to Port and make PortId part of the alternate key for UserMailConnection. Again, both are valid, fully-normalized table designs, but which business rule holds would determine which is the "correct" one for your purposes.
Best Answer
I'm not sure you need audit tables. I think a simple schema like this would be sufficient:
Each `scan` has a `start` and `end` so you know when it took place. From this you can easily determine which is the most recent scan. Since `hosts` has a foreign key to `scans`, you can always tell which hosts were found with a specific scan. Similar for `ports`.

In this example, `host_id` is simply a numeric identifier generated by the database. If you want to store something like `myhost.com`, add a `hostname` column to `hosts`. Same thing for `ports`.
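A minimal DDL sketch of that schema, with assumed column types; I'm also assuming `ports` hangs off `hosts` rather than directly off `scans`:

```sql
CREATE TABLE scans (
    scan_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `start`  DATETIME NOT NULL,
    `end`    DATETIME            -- NULL while the scan is still running
);

CREATE TABLE hosts (
    host_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    scan_id  INT NOT NULL,
    hostname VARCHAR(255),       -- optional, e.g. 'myhost.com'
    FOREIGN KEY (scan_id) REFERENCES scans (scan_id)
);

CREATE TABLE ports (
    port_id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    host_id     INT NOT NULL,
    port_number INT NOT NULL,
    FOREIGN KEY (host_id) REFERENCES hosts (host_id)
);

-- Finding the most recent scan is then a simple ORDER BY:
SELECT * FROM scans ORDER BY `start` DESC LIMIT 1;
```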