To reduce space requirements you might consider a few things:
- Not bothering to store http:// or the leading www. - this is just wasted space (though in a few cases the www. is required because people don't know how to configure their web sites properly).
- Making sure you use data compression. Most systems are still I/O-bound, not CPU-bound.
- Only storing any domain name once, and storing the page URLs separately. Both may actually repeat, and storing them multiple times is wasteful. So, for example, instead of:
CREATE TABLE dbo.Pages
(
ID BIGINT IDENTITY(1,1) PRIMARY KEY, -- need BIGINT for 500 billion rows
URL VARCHAR(2048) -- far too large to apply UNIQUE
);
CREATE TABLE dbo.PageLinks
(
PageID BIGINT NOT NULL FOREIGN KEY REFERENCES dbo.Pages(ID),
LinkID BIGINT NOT NULL FOREIGN KEY REFERENCES dbo.Pages(ID),
PRIMARY KEY (PageID, LinkID)
);
You could do something like this:
CREATE TABLE dbo.Domains
(
DomainID INT IDENTITY(1,1) PRIMARY KEY, -- probably no more than 2BN domains
DomainName VARCHAR(255) NOT NULL
) WITH (DATA_COMPRESSION = PAGE);
CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName)
WITH (DATA_COMPRESSION = PAGE);
CREATE TABLE dbo.URLs
(
URLID INT IDENTITY(1,1) PRIMARY KEY, -- maybe you need BIGINT here
URL VARCHAR(2048) NOT NULL -- still can't apply UNIQUE here
-- but you can have the same URL (e.g. /page.php) from two different
-- domains only listed once.
);
CREATE TABLE dbo.DomainURLs
(
DomainURLID INT IDENTITY(1,1) PRIMARY KEY, -- may also need BIGINT here
DomainID INT NOT NULL FOREIGN KEY REFERENCES dbo.Domains(DomainID),
URLID INT NOT NULL FOREIGN KEY REFERENCES dbo.URLs(URLID)
) WITH (DATA_COMPRESSION = PAGE);
CREATE UNIQUE INDEX du ON dbo.DomainURLs(DomainID, URLID)
WITH (DATA_COMPRESSION = PAGE);
Yes, table design and query semantics will be much more complex, but it will scale much better once you index many pages on the same site (or many pages with the same URL across different sites).
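To give a flavor of that extra complexity: reassembling a full URL now takes a three-way join. A minimal sketch, using only the tables and columns defined above:

```sql
-- Rebuild the complete URL strings that the schema stores in pieces
SELECT du.DomainURLID,
       FullURL = 'http://' + d.DomainName + u.URL
FROM dbo.DomainURLs AS du
INNER JOIN dbo.Domains AS d
    ON d.DomainID = du.DomainID
INNER JOIN dbo.URLs AS u
    ON u.URLID = du.URLID;
```

Every lookup pays this join cost, but the join columns are all narrow integers with supporting indexes, so it stays cheap relative to the storage savings.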
Just to demonstrate the potential for space savings here, let's take just the storage of the URLs, ignore the links between them, and look only at the Pages, Domains and URLs tables (and, to be fair, I'll even test your Pages table with compression).
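One note: the test below inserts into a Pages_compressed table that isn't defined above. Presumably it simply mirrors dbo.Pages with page compression applied, something like:

```sql
-- Assumed definition: identical to dbo.Pages, but page-compressed
CREATE TABLE dbo.Pages_compressed
(
  ID BIGINT IDENTITY(1,1) PRIMARY KEY,
  URL VARCHAR(2048)
) WITH (DATA_COMPRESSION = PAGE);
```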
Note that we take advantage of the fact that you will probably index each site in turn, alphabetically, rather than hash and index random URLs each iteration. This allows compression to work as well as possible in this case.
For some sample data, on my system this generates about 121,000 rows, but that will vary depending on a lot of factors, such as how many databases are on your system, how many objects in your database, the design of any non-system objects (e.g. number of columns), and even @@VERSION:
;WITH x AS
(
SELECT
d = t.name + CONVERT(VARCHAR(5), d.database_id) + '.com',
p = '/' + c.name + '.php'
FROM sys.all_objects AS t
CROSS JOIN sys.databases AS d
INNER JOIN sys.all_columns AS c
ON t.[object_id] = c.[object_id]
)
SELECT d, p
FROM x
ORDER BY d, p;
Sample results:
all_columns1.com /collation_name.php
all_columns1.com /column_id.php
all_columns1.com /default_object_id.php
all_columns1.com /is_ansi_padded.php
all_columns1.com /is_column_set.php
...
Now let's use that to populate our four tables:
;WITH x AS
(
SELECT
d = t.name + CONVERT(VARCHAR(5), d.database_id) + '.com',
p = '/' + c.name + '.php'
FROM sys.all_objects AS t
CROSS JOIN sys.databases AS d
INNER JOIN sys.all_columns AS c
ON t.[object_id] = c.[object_id]
)
SELECT d, p
INTO #blat
FROM x
ORDER BY d, p;
CREATE CLUSTERED INDEX x ON #blat(d, p);
INSERT dbo.Pages(URL) SELECT 'http://www.' + d + p FROM #blat ORDER BY d, p;
INSERT dbo.Pages_compressed(URL) SELECT 'http://www.' + d + p FROM #blat ORDER BY d, p;
INSERT dbo.Domains(DomainName) SELECT DISTINCT d FROM #blat ORDER BY d;
INSERT dbo.URLs(URL) SELECT DISTINCT p FROM #blat;
Now of course you would have to build the junction table here by cross-referencing the Domains and the URLs. But this will still show how much space you can save by not storing the same domain name or page name twice, even if it means more complicated logic to put these things together:
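That cross-referencing step could be as simple as joining the staging table back to the two lookup tables; a sketch reusing the #blat table from above:

```sql
-- Build the junction table from the same staging data:
-- match each distinct (domain, path) pair back to its surrogate keys
INSERT dbo.DomainURLs(DomainID, URLID)
SELECT d.DomainID, u.URLID
FROM (SELECT DISTINCT d, p FROM #blat) AS b
INNER JOIN dbo.Domains AS d
    ON d.DomainName = b.d
INNER JOIN dbo.URLs AS u
    ON u.URL = b.p;
```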
EXEC sp_spaceused 'dbo.Pages'; -- 8,904 KB
EXEC sp_spaceused 'dbo.Pages_compressed'; -- 4,552 KB
EXEC sp_spaceused 'dbo.Domains'; -- 656 KB
EXEC sp_spaceused 'dbo.URLs'; -- 136 KB
That's with 121,000 URLs. For 500 billion? Let's extrapolate. Per complete URL, without compression, you are storing about 73 bytes. With compression, 38 bytes per URL. My method: 6.5 bytes per URL. (Discard, for a moment, how unrealistic my sample data is. Let's pretend I've come somewhere close to the average URL length, and that both (a) you will be indexing many URLs on every domain and (b) you will get duplication of page paths across many domains.)
So ignoring those realities, simple math shows more than 90% reduction in storage space between your uncompressed method and my method:
Your method, uncompressed = 33.2 TB
Your method, compressed = 17.1 TB
My method = 2.9 TB
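If you want to check the arithmetic, it's just bytes-per-URL times row count, converted to terabytes (figures are the per-URL estimates above, so expect small rounding differences):

```sql
-- Back-of-envelope extrapolation to 500 billion URLs
SELECT
    UncompressedTB = 73.0 * 500000000000 / POWER(CONVERT(FLOAT, 1024), 4),
    CompressedTB   = 38.0 * 500000000000 / POWER(CONVERT(FLOAT, 1024), 4),
    NormalizedTB   =  6.5 * 500000000000 / POWER(CONVERT(FLOAT, 1024), 4);
```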
And again, it's more complicated, so more work up front, but usually this is worth it. You only design the schema and write the code around it once; you'll be maintaining this for, well, how long do you expect the service to exist?
Best Answer
The way you've modelled it is fine. Your model ensures that the business rules are enforced by the database.
There are a couple of things you could do as an alternative.
One would be to eliminate the surrogate key on Roles_Applications. As an intersection table, you could use the two foreign keys together as a composite primary key. If you did this, that would propagate Role and Application down to your Applications_Permissions_Roles table. This would have the advantage of giving you more of a "one stop shop" for your application permission data (i.e. fewer joins) without compromising your normalization in any way.
Another way you could go would be to simplify slightly and define a default permission for each application. You could call it whatever you like, such as "default access" or "basic access" or "user" or whatever makes sense to you. This would allow you to flatten your model and essentially drop the Roles_Applications table and join Applications_Permissions_Roles straight to Roles. This would change the nature of the query that you would use to ask "which roles can access which applications?", but your business rules would still be enforced by the schema.