You can definitely keep all your dimensions and measures in one fact table and not use any dimension tables. Make sure your OLAP tool supports this, though.
Normalizing out your dimensions into other tables is done mostly to minimize the size of the fact table, which can get large fast.
With no dimension tables, you're looking at about 336 MB per year (not counting indexes), which isn't so bad.
With dimension tables, you're looking at about 34 MB per year, plus a couple dozen MB for storing dimension details. Indexes will be smaller too.
You'll want to expand your date column into something more analyzable (year, month, quarter, etc.), which will add to the size.
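For concreteness, here's a rough sketch of the two designs (all table and column names here are made up, not from your schema):
-- One-table design: date parts and dimension values repeated on every fact row
CREATE TABLE fact_flat (
    sale_date DATE NOT NULL,
    sale_year SMALLINT NOT NULL,
    sale_quarter TINYINT NOT NULL,
    sale_month TINYINT NOT NULL,
    product VARCHAR(100) NOT NULL,
    amount DECIMAL(10,2) NOT NULL
);
-- Star-schema design: the fact table carries only small surrogate keys
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,        -- e.g. 20240115
    full_date DATE NOT NULL,
    year SMALLINT NOT NULL,
    quarter TINYINT NOT NULL,
    month TINYINT NOT NULL
);
CREATE TABLE fact_sales (
    date_key INT NOT NULL,
    product_key INT NOT NULL,
    amount DECIMAL(10,2) NOT NULL
);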
You'll want to index all fields. Drop the indexes before bulk inserts and re-create them afterwards.
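A sketch of that pattern, assuming the fact_sales table from above already has an index named idx_date and the CSV path is just an example:
ALTER TABLE fact_sales DROP INDEX idx_date;
LOAD DATA INFILE '/tmp/facts.csv'
INTO TABLE fact_sales
FIELDS TERMINATED BY ',';
ALTER TABLE fact_sales ADD INDEX idx_date (date_key);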
You can use a tool like Pentaho Aggregation Designer to find useful aggregates and generate them for you.
I hope you are willing to do some programming outside of SQL.
To find the recipients that need the most pruning:
SELECT recipient
FROM inbox
GROUP BY recipient
HAVING COUNT(*) > 30
ORDER BY COUNT(*) DESC
LIMIT 100
Then, for each recipient, first find the 31st id:
SELECT id
FROM inbox
WHERE recipient = $recipient
ORDER BY id DESC
LIMIT 30,1    -- offset 30, count 1: returns the 31st-newest id
And delete the excess baggage:
DELETE FROM inbox
WHERE recipient = $recipient
AND id < $id
You would need INDEX(recipient, id).
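In MySQL that's a one-liner (the index name recipient_id is arbitrary):
ALTER TABLE inbox ADD INDEX recipient_id (recipient, id);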
This process could be running continually. The first SQL would be the slowest, but it would run "Using index" (a covering-index scan) and probably not be too bad, maybe 0.1 sec for 200K rows in inbox.
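To verify, run EXPLAIN on the query and look for "Using index" in the Extra column:
EXPLAIN SELECT recipient
FROM inbox
GROUP BY recipient
HAVING COUNT(*) > 30
ORDER BY COUNT(*) DESC
LIMIT 100;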
If you want to stay in SQL, this would let you do one recipient at a time:
BEGIN;
-- Lock the single most over-quota recipient
SELECT @recipient := recipient
FROM inbox
GROUP BY recipient
HAVING COUNT(*) > 30
ORDER BY COUNT(*) DESC
LIMIT 1
FOR UPDATE;
-- Find that recipient's 31st-newest id
SELECT @id := id
FROM inbox
WHERE recipient = @recipient
ORDER BY id DESC
LIMIT 30,1
FOR UPDATE;
-- Delete everything older
DELETE FROM inbox
WHERE recipient = @recipient
AND id < @id;
COMMIT;
Then, put that in a Stored Procedure with a loop around it. And, if you want to be further 'nice', add a SELECT SLEEP(1); in the loop but outside the transaction.
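Here's a rough sketch of such a procedure; the name prune_inbox, the NULL reset as the loop's exit test, and DO SLEEP(1) in place of SELECT SLEEP(1) are my choices, not gospel:
DELIMITER //
CREATE PROCEDURE prune_inbox()
BEGIN
  REPEAT
    SET @recipient := NULL;    -- reset so an empty result ends the loop
    START TRANSACTION;         -- BEGIN is ambiguous inside a procedure body
    SELECT @recipient := recipient
      FROM inbox
      GROUP BY recipient
      HAVING COUNT(*) > 30
      ORDER BY COUNT(*) DESC
      LIMIT 1
      FOR UPDATE;
    IF @recipient IS NOT NULL THEN
      SELECT @id := id
        FROM inbox
        WHERE recipient = @recipient
        ORDER BY id DESC
        LIMIT 30,1
        FOR UPDATE;
      DELETE FROM inbox
        WHERE recipient = @recipient
          AND id < @id;
    END IF;
    COMMIT;
    DO SLEEP(1);               -- the 'nice' pause, outside the transaction
  UNTIL @recipient IS NULL END REPEAT;
END //
DELIMITER ;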
(Trying to do the entire thing in a single statement makes my brain hurt.)
Best Answer
Generally it is best to hold data in a normalised structure.
If testing at scale, on production-grade hardware, proves there is a problem, that the problem is caused by an over-large table, and that index tuning and SQL refactoring do not bring it within acceptable bounds, then splitting the table is a recognised technique.
Moving some rows to a different table is called horizontal partitioning or sharding. Moving some columns to a different table is called vertical partitioning. Both entail additional design and maintenance work, and there will be run-time implications if a single query needs values from several partitions.
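In MySQL, for instance, horizontal partitioning can be done natively; this sketch (table, columns, and ranges all illustrative) splits rows by year:
CREATE TABLE orders (
    id BIGINT NOT NULL,
    order_date DATE NOT NULL,
    total DECIMAL(10,2),
    PRIMARY KEY (id, order_date)   -- the partition key must be part of every unique key
)
PARTITION BY RANGE (YEAR(order_date)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- Vertical partitioning is just a second table holding the wide, rarely-read columns:
CREATE TABLE orders_notes (
    id BIGINT NOT NULL PRIMARY KEY,
    notes TEXT
);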
There is some further discussion at this answer.