OK, I am making a lot of assumptions (INT instead of VARCHAR(50) being one of them) with this answer, so feel free to correct me if needed. The problem with Option B is that it introduces a new join to relate Users to Alerts without any real added benefit. If you are joining on UserID, it is best to index UserID so your joins can use seeks.
For Option A, UserID will be the clustering key (the index key of the clustered index) on the Users table, and UserID will be a nonclustered index key on the Alerts table. This will cost 16 bytes per Alert.
For Option B, UserID will be the clustering key on the Users table. UserID will probably be the clustering key on UserMap too, to make joining more efficient. UserKey (assuming this is an INT) would then be a nonclustered index key on the Alerts table. This will cost 4 bytes per Alert, and 20 bytes per UserMap row.
Looking at the big picture: one relationship under Option A costs 16 bytes of storage and involves 1 join operation, whereas one relationship under Option B costs 24 bytes of storage and involves 2 join operations.
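To make the comparison concrete, here is a minimal sketch of how I am picturing the two options (everything beyond UserID and UserKey, including the table and index names, is an assumption on my part):

-- Option A: Alerts references Users directly on the uniqueidentifier.
CREATE TABLE dbo.Users
(
  UserID UNIQUEIDENTIFIER NOT NULL PRIMARY KEY CLUSTERED
);

CREATE TABLE dbo.Alerts_OptionA
(
  AlertID INT IDENTITY(1,1) PRIMARY KEY,
  UserID UNIQUEIDENTIFIER NOT NULL
    FOREIGN KEY REFERENCES dbo.Users(UserID) -- 16 bytes per Alert
);
CREATE INDEX IX_Alerts_UserID ON dbo.Alerts_OptionA(UserID);

-- Option B: a UserMap table translates the uniqueidentifier to an INT surrogate.
CREATE TABLE dbo.UserMap
(
  UserID UNIQUEIDENTIFIER NOT NULL PRIMARY KEY CLUSTERED, -- 16 bytes
  UserKey INT IDENTITY(1,1) NOT NULL UNIQUE -- + 4 bytes = 20 bytes per UserMap row
);

CREATE TABLE dbo.Alerts_OptionB
(
  AlertID INT IDENTITY(1,1) PRIMARY KEY,
  UserKey INT NOT NULL
    FOREIGN KEY REFERENCES dbo.UserMap(UserKey) -- 4 bytes per Alert
);
CREATE INDEX IX_Alerts_UserKey ON dbo.Alerts_OptionB(UserKey);

With Option A, Alerts joins straight to Users on UserID; with Option B, anything that starts from the uniqueidentifier has to go through UserMap first, which is the extra join I mentioned above.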
Furthermore, there are roughly 3.4 × 10^38 (2^128) possible uniqueidentifiers and only 4,294,967,296 (2^32) INTs. Implementing a uniqueidentifier-to-INT map for this type of relationship could cause unexpected results once you start reusing INTs.
The only reason for creating this type of map table is if you plan on creating a many-to-many relationship between Users and Alerts.
Taking all of this into consideration, I would recommend Option A.
I hope this helps,
Matt
To reduce space requirements you might consider a few things:
- Not bothering to store http:// or the leading www. - this is just wasted space (though in a few cases the www. is required because people don't know how to configure their web sites properly). A small cleanup sketch follows the schema below.
- Making sure you use data compression. Most systems are still I/O-bound, not CPU-bound.
- Only storing each domain name once, and storing the page URLs separately. Both may repeat, and storing them multiple times is wasteful. So, for example, instead of:
CREATE TABLE dbo.Pages
(
  ID BIGINT IDENTITY(1,1) PRIMARY KEY, -- need BIGINT for 500 billion rows
  URL VARCHAR(2048) -- far too large to apply UNIQUE
);

CREATE TABLE dbo.PageLinks
(
  PageID BIGINT NOT NULL FOREIGN KEY REFERENCES dbo.Pages(ID),
  LinkID BIGINT NOT NULL FOREIGN KEY REFERENCES dbo.Pages(ID),
  PRIMARY KEY (PageID, LinkID)
);
You could do something like this:
CREATE TABLE dbo.Domains
(
  DomainID INT IDENTITY(1,1) PRIMARY KEY, -- probably no more than 2BN domains
  DomainName VARCHAR(255) NOT NULL
) WITH (DATA_COMPRESSION = PAGE);

CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName)
  WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE dbo.URLs
(
  URLID INT IDENTITY(1,1) PRIMARY KEY, -- maybe you need BIGINT here
  URL VARCHAR(2048) NOT NULL -- still can't apply UNIQUE here
  -- but you can have the same URL (e.g. /page.php) from two different
  -- domains only listed once.
);

CREATE TABLE dbo.DomainURLs
(
  DomainURLID INT IDENTITY(1,1) PRIMARY KEY, -- may also need BIGINT here
  DomainID INT NOT NULL FOREIGN KEY REFERENCES dbo.Domains(DomainID),
  URLID INT NOT NULL FOREIGN KEY REFERENCES dbo.URLs(URLID)
) WITH (DATA_COMPRESSION = PAGE);

CREATE UNIQUE INDEX du ON dbo.DomainURLs(DomainID, URLID)
  WITH (DATA_COMPRESSION = PAGE);
Yes, table design and query semantics will be much more complex, but it will scale much better once you index many pages on the same site (or many pages with the same URL across different sites).
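As for not bothering to store http:// or the leading www. (the first bullet above), the cleanup can happen before you ever split and store the value; a rough, untested sketch (the variable names are just placeholders):

DECLARE @RawURL VARCHAR(2048) = 'http://www.example.com/page.php';
DECLARE @CleanURL VARCHAR(2048) = @RawURL;

-- strip the scheme, then any leading www., before splitting into domain and page
IF @CleanURL LIKE 'http://%'  SET @CleanURL = SUBSTRING(@CleanURL, 8, 2048);
IF @CleanURL LIKE 'https://%' SET @CleanURL = SUBSTRING(@CleanURL, 9, 2048);
IF @CleanURL LIKE 'www.%'     SET @CleanURL = SUBSTRING(@CleanURL, 5, 2048);

SELECT @CleanURL; -- example.com/page.php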
Just to demonstrate the potential for space savings here, let's take just the storage of the URLs, ignoring the links between them, and look at only the Pages, Domains, and URLs tables (and to be fair, I'll even test your Pages table with compression).
Note that we take advantage of the fact that you will probably index each site in turn, alphabetically, rather than hash and index random URLs each iteration. This allows compression to work as well as possible in this case.
For some sample data: on my system this generates about 121,000 rows, but that will vary depending on a lot of factors, such as how many databases are on your system, how many objects are in your database, the design of any non-system objects (e.g. number of columns), and even @@VERSION:
;WITH x AS
(
  SELECT
    d = t.name + CONVERT(VARCHAR(5), d.database_id) + '.com',
    p = '/' + c.name + '.php'
  FROM sys.all_objects AS t
  CROSS JOIN sys.databases AS d
  INNER JOIN sys.all_columns AS c
    ON t.[object_id] = c.[object_id]
)
SELECT d, p
FROM x
ORDER BY d, p;
Sample results:
all_columns1.com /collation_name.php
all_columns1.com /column_id.php
all_columns1.com /default_object_id.php
all_columns1.com /is_ansi_padded.php
all_columns1.com /is_column_set.php
...
Now let's use that to populate our four tables:
;WITH x AS
(
  SELECT
    d = t.name + CONVERT(VARCHAR(5), d.database_id) + '.com',
    p = '/' + c.name + '.php'
  FROM sys.all_objects AS t
  CROSS JOIN sys.databases AS d
  INNER JOIN sys.all_columns AS c
    ON t.[object_id] = c.[object_id]
)
SELECT d, p
INTO #blat
FROM x
ORDER BY d, p;

CREATE CLUSTERED INDEX x ON #blat(d, p);
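-- Note: dbo.Pages_compressed (used below) is assumed to have the same definition as
-- dbo.Pages, just created WITH (DATA_COMPRESSION = PAGE), per the note above about
-- also testing your Pages table with compression.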
INSERT dbo.Pages(URL) SELECT 'http://www.' + d + p FROM #blat ORDER BY d, p;
INSERT dbo.Pages_compressed(URL) SELECT 'http://www.' + d + p FROM #blat ORDER BY d, p;
INSERT dbo.Domains(DomainName) SELECT DISTINCT d FROM #blat ORDER BY d;
INSERT dbo.URLs(URL) SELECT DISTINCT p FROM #blat;
Now of course you would have to build the junction table here by cross-referencing the Domains and the URLs (a sketch for that follows the space figures below). But this will still show how much space you can save by not storing the same domain name or page name twice, even if it means more complicated logic to put these things together:
EXEC sp_spaceused 'dbo.Pages'; -- 8,904 KB
EXEC sp_spaceused 'dbo.Pages_compressed'; -- 4,552 KB
EXEC sp_spaceused 'dbo.Domains'; -- 656 KB
EXEC sp_spaceused 'dbo.URLs'; -- 136 KB
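To build the junction table I mentioned, something like this should work (an untested sketch, joining the staged rows back to the two lookup tables):

INSERT dbo.DomainURLs(DomainID, URLID)
SELECT DISTINCT dom.DomainID, u.URLID
FROM #blat AS b
INNER JOIN dbo.Domains AS dom
  ON dom.DomainName = b.d
INNER JOIN dbo.URLs AS u
  ON u.URL = b.p;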
That's with 121,000 URLs. For 500 billion? Let's extrapolate. Per complete URL, without compression, you are storing about 73 bytes. With compression, 38 bytes per URL. My method: 6.5 bytes per URL. (Discard, for a moment, how unrealistic my sample data is. Let's pretend I've come somewhere close to the average URL length, and that both (a) you will be indexing many URLs on every domain and (b) you will get duplication of page paths across many domains.)
So ignoring those realities, simple math shows more than 90% reduction in storage space between your uncompressed method and my method:
Your method, uncompressed = 33.2 TB
Your method, compressed = 17.1 TB
My method = 2.9 TB
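The arithmetic behind those totals is just bytes per URL × 500,000,000,000 rows ÷ 1024^4 bytes per TB; for example, using the rounded per-URL sizes above (the small differences from the quoted figures are only rounding of the per-URL sizes):

DECLARE @Rows FLOAT = 500000000000;  -- 500 billion URLs
DECLARE @TB   FLOAT = 1099511627776; -- bytes per TB (1024^4)

SELECT
  YoursUncompressed = 73.0 * @Rows / @TB, -- ~33 TB
  YoursCompressed   = 38.0 * @Rows / @TB, -- ~17 TB
  MyMethod          =  6.5 * @Rows / @TB; -- ~3 TB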
And again, it's more complicated, so more work up front, but usually this is worth it. You only design the schema and write the code around it once; you'll be maintaining this for, well, how long do you expect the service to exist?
Best Answer
A long time ago, back in 2000, I had a similar question and posted it on Google Groups. I received an awesome response from a guy called Joe Celko, and it fits right into your problem. I've been using this methodology ever since; it has great performance and works really well.
https://groups.google.com/forum/#!search/pascaljr/microsoft.public.sqlserver.programming/knQ5WXIDfeo/E0vfeC2kbTsJ
It uses two integer values, a left and a right value, to store the hierarchy. I've made some adaptations to the table and procs over the years, but the essence remains the same.
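In case that link ever disappears, here is a minimal sketch of the idea as I use it (my own simplified table and column names, not the originals from that thread): every node stores a left and a right number, and a node's descendants are exactly the rows whose numbers fall inside its own range.

CREATE TABLE dbo.Hierarchy
(
  NodeID INT NOT NULL PRIMARY KEY,
  NodeName VARCHAR(50) NOT NULL,
  lft INT NOT NULL,
  rgt INT NOT NULL
);

INSERT dbo.Hierarchy(NodeID, NodeName, lft, rgt)
VALUES (1, 'Root',   1, 8),
       (2, 'Child1', 2, 5),
       (3, 'Leaf1',  3, 4),
       (4, 'Child2', 6, 7);

-- all descendants of Root: their (lft, rgt) pairs sit inside Root's range
SELECT child.NodeName
FROM dbo.Hierarchy AS parent
INNER JOIN dbo.Hierarchy AS child
  ON child.lft > parent.lft
 AND child.rgt < parent.rgt
WHERE parent.NodeName = 'Root';

The big win is that a whole subtree comes back from a single range predicate instead of a recursive join.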