To reduce space requirements you might consider a few things:
- Not bothering to store http:// or the leading www. - this is just wasted space (though in a few cases the www. is required because people don't know how to configure their web sites properly).
- Making sure you use data compression. Most systems are still I/O-bound, not CPU-bound.
- Only storing any domain name once, and storing the page URLs separately. Both may actually repeat, and storing them multiple times is wasteful. So, for example, instead of:
CREATE TABLE dbo.Pages
(
  ID  BIGINT IDENTITY(1,1) PRIMARY KEY, -- need BIGINT for 500 billion rows
  URL VARCHAR(2048)                     -- far too large to apply UNIQUE
);

CREATE TABLE dbo.PageLinks
(
  PageID BIGINT NOT NULL FOREIGN KEY REFERENCES dbo.Pages(ID),
  LinkID BIGINT NOT NULL FOREIGN KEY REFERENCES dbo.Pages(ID),
  PRIMARY KEY (PageID, LinkID)
);
You could do something like this:
CREATE TABLE dbo.Domains
(
  DomainID   INT IDENTITY(1,1) PRIMARY KEY, -- probably no more than 2BN domains
  DomainName VARCHAR(255) NOT NULL
) WITH (DATA_COMPRESSION = PAGE);

CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName)
  WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE dbo.URLs
(
  URLID INT IDENTITY(1,1) PRIMARY KEY, -- maybe you need BIGINT here
  URL   VARCHAR(2048) NOT NULL         -- still can't apply UNIQUE here
  -- but you can have the same URL (e.g. /page.php) from two different
  -- domains only listed once.
);

CREATE TABLE dbo.DomainURLs
(
  DomainURLID INT IDENTITY(1,1) PRIMARY KEY, -- may also need BIGINT here
  DomainID    INT NOT NULL FOREIGN KEY REFERENCES dbo.Domains(DomainID),
  URLID       INT NOT NULL FOREIGN KEY REFERENCES dbo.URLs(URLID)
) WITH (DATA_COMPRESSION = PAGE);

CREATE UNIQUE INDEX du ON dbo.DomainURLs(DomainID, URLID)
  WITH (DATA_COMPRESSION = PAGE);
Yes, table design and query semantics will be much more complex, but it will scale much better once you index many pages on the same site (or many pages with the same URL across different sites).
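Just to make that extra join work concrete, here is a rough sketch (nothing official, just against the tables above) of reconstructing full URLs under the split design, re-adding the http://www. prefix we chose not to store:

-- Rebuild the full URLs by joining the junction table back to both lookup tables.
SELECT FullURL = 'http://www.' + d.DomainName + u.URL
FROM dbo.DomainURLs AS du
INNER JOIN dbo.Domains AS d
  ON d.DomainID = du.DomainID
INNER JOIN dbo.URLs AS u
  ON u.URLID = du.URLID;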
Just to demonstrate the potential for space savings here, let's take only the storage of the URLs and ignore the links between them, looking at just the Pages, Domains, and URLs tables (and, to be fair, I'll even test your Pages table with compression).
Note that we take advantage of the fact that you will probably index each site in turn, alphabetically, rather than hash and index random URLs each iteration. This allows compression to work as well as possible in this case.
For some sample data: on my system this generates about 121,000 rows, but that will vary depending on a lot of factors, such as how many databases are on your system, how many objects are in your database, the design of any non-system objects (e.g. number of columns), and even @@VERSION:
;WITH x AS
(
  SELECT
    d = t.name + CONVERT(VARCHAR(5), d.database_id) + '.com',
    p = '/' + c.name + '.php'
  FROM sys.all_objects AS t
  CROSS JOIN sys.databases AS d
  INNER JOIN sys.all_columns AS c
    ON t.[object_id] = c.[object_id]
)
SELECT d, p
FROM x
ORDER BY d, p;
Sample results:
all_columns1.com /collation_name.php
all_columns1.com /column_id.php
all_columns1.com /default_object_id.php
all_columns1.com /is_ansi_padded.php
all_columns1.com /is_column_set.php
...
Now let's use that to populate our four tables:
;WITH x AS
(
  SELECT
    d = t.name + CONVERT(VARCHAR(5), d.database_id) + '.com',
    p = '/' + c.name + '.php'
  FROM sys.all_objects AS t
  CROSS JOIN sys.databases AS d
  INNER JOIN sys.all_columns AS c
    ON t.[object_id] = c.[object_id]
)
SELECT d, p
INTO #blat
FROM x
ORDER BY d, p;

CREATE CLUSTERED INDEX x ON #blat(d, p);
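One assumption worth spelling out: the dbo.Pages_compressed table used in the next batch isn't defined earlier in this answer; presumably it is just a copy of dbo.Pages with page compression enabled, along these lines:

CREATE TABLE dbo.Pages_compressed
(
  ID  BIGINT IDENTITY(1,1) PRIMARY KEY, -- same shape as dbo.Pages
  URL VARCHAR(2048)
) WITH (DATA_COMPRESSION = PAGE);       -- only difference: page compression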
INSERT dbo.Pages(URL) SELECT 'http://www.' + d + p FROM #blat ORDER BY d, p;
INSERT dbo.Pages_compressed(URL) SELECT 'http://www.' + d + p FROM #blat ORDER BY d, p;
INSERT dbo.Domains(DomainName) SELECT DISTINCT d FROM #blat ORDER BY d;
INSERT dbo.URLs(URL) SELECT DISTINCT p FROM #blat;
Now of course you would also have to build the junction table here by cross-referencing the Domains and the URLs; a rough sketch of that step is below. But the space-used figures that follow will still show how much you can save by not storing the same domain name or page name twice, even if it means more complicated logic to put these things together.
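Here is a hedged sketch of populating dbo.DomainURLs from the same #blat staging table (this step is not part of the space measurements below):

-- Map each distinct (domain, path) pair back to its surrogate keys.
-- Note: the join on URL can't use a unique index (the column is too wide), so at real
-- scale you would stage this differently; this is just to complete the picture.
INSERT dbo.DomainURLs(DomainID, URLID)
SELECT dom.DomainID, u.URLID
FROM (SELECT DISTINCT d, p FROM #blat) AS b
INNER JOIN dbo.Domains AS dom
  ON dom.DomainName = b.d
INNER JOIN dbo.URLs AS u
  ON u.URL = b.p;

With that in place, the space used by each table: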
EXEC sp_spaceused 'dbo.Pages'; -- 8,904 KB
EXEC sp_spaceused 'dbo.Pages_compressed'; -- 4,552 KB
EXEC sp_spaceused 'dbo.Domains'; -- 656 KB
EXEC sp_spaceused 'dbo.URLs'; -- 136 KB
That's with 121,000 URLs. For 500 billion? Let's extrapolate. Per complete URL, without compression, you are storing about 73 bytes. With compression, 38 bytes per URL. My method: 6.5 bytes per URL. (Disregard, for a moment, how unrealistic my sample data is. Let's pretend I've come somewhere close to the average URL length, and that both (a) you will be indexing many URLs on every domain and (b) you will get duplication of page paths across many domains.)
So ignoring those realities, simple math shows more than 90% reduction in storage space between your uncompressed method and my method:
Your method, uncompressed = 33.2 TB
Your method, compressed = 17.1 TB
My method = 2.9 TB
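If you want to sanity-check the arithmetic, it's just bytes-per-URL times 500 billion, expressed in binary terabytes (2^40 bytes); a quick throwaway query using the approximate per-URL figures quoted above:

SELECT
  your_method_uncompressed_TB = 73.0 * 500000000000 / 1099511627776, -- roughly 33 TB
  your_method_compressed_TB   = 38.0 * 500000000000 / 1099511627776, -- roughly 17 TB
  my_method_TB                =  6.5 * 500000000000 / 1099511627776; -- roughly 3 TB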
And again, it's more complicated, so more work up front, but usually this is worth it. You only design the schema and write the code around it once; you'll be maintaining this for, well, how long do you expect the service to exist?
Best Answer
A sliding time scale is handled very efficiently by using PARTITIONs. Implement 14 hourly partitions using PARTITION BY RANGE(..). Use REORGANIZE PARTITION to create a new partition every hour. Use "transportable tablespaces" (assuming you are using 5.6, or preferably 5.7, and InnoDB) to move the 'old' partition away from the main table to the archive table. More on timed partitions (including code).
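A minimal sketch of that pattern, assuming a hypothetical log table with a DATETIME column named dt (the table and partition names here are illustrative, not taken from the question):

CREATE TABLE hits
(
  dt  DATETIME NOT NULL,
  url VARCHAR(2048) NOT NULL,
  KEY (dt)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_SECONDS(dt))
(
  PARTITION p2016010100 VALUES LESS THAN (TO_SECONDS('2016-01-01 01:00:00')),
  PARTITION p2016010101 VALUES LESS THAN (TO_SECONDS('2016-01-01 02:00:00')),
  -- ... one partition per hour ...
  PARTITION pfuture     VALUES LESS THAN MAXVALUE
);

-- Every hour, split the catch-all partition so the next hour gets its own partition:
ALTER TABLE hits REORGANIZE PARTITION pfuture INTO
(
  PARTITION p2016010102 VALUES LESS THAN (TO_SECONDS('2016-01-01 03:00:00')),
  PARTITION pfuture     VALUES LESS THAN MAXVALUE
);

-- The oldest hour can then be dropped, or exchanged out and archived via
-- transportable tablespaces as described above:
ALTER TABLE hits DROP PARTITION p2016010100;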
More thoughts:
- TRIGGERs may be functional, but I worry that they are costly in your situation. Can you elaborate? Perhaps partitioning can serve a purpose there, too.
- Set query_cache_type = 0 and query_cache_size = 0; otherwise the query cache could be causing some of the slowdown (see the sketch after this list).
- How many INSERTs per day? One row at a time? Or batched (much better)?
- Could you provide more numbers -- how many users? inserts/sec?
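For the query cache item above, a minimal sketch of turning it off at runtime (MySQL 5.6/5.7; put the same values in my.cnf so they persist across a restart):

-- Disable the query cache; on a write-heavy workload its mutex often hurts more than it helps.
SET GLOBAL query_cache_size = 0;
SET GLOBAL query_cache_type = 0;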