Sql-server – Storing IP addresses – varchar(45) vs varbinary(16)

database-designnonclustered-indexsql serversql-clrsql-server-2012

I am going to create a table with two fields – ID as BIGINT and IPAddress as either varchar(45) or varbinary(16). The idea is to store all unique IP addresses and use a reference ID instead the actual IP address in other tables.

Generally, I am going to create a stored procedure that is returning the ID for given IP address or (if the address was not found) insert the address and return the generated ID.

I am expecting to have many records (I cannot tell exactly how many), but I need the stored procedure above to be executed as fast as possible. So, I am wondering how to stored the actual IP address – in text or bytes format. Which is going to be better?

I have already written SQL CLR functions for transforming IP address bytes to string and the reverse, so transformation is not an issue (working with both IPv4 and IPv6).

I guess I need to create an index to optimize the search, but I am not sure should I include the IP address field to the clustered index, or to create a separate index and with which type the search will be faster?

Best Answer

how to stored the actual IP address - in text or bytes format. Which is going to be better?

Since "text" here refers to VARCHAR(45) and "bytes" refers to VARBINARY(16), I would say: neither.

Given the following information (from Wikipedia article on IPv6):

Address representation
The 128 bits of an IPv6 address are represented in 8 groups of 16 bits each. Each group is written as 4 hexadecimal digits and the groups are separated by colons (:). The address 2001:0db8:0000:0000:0000:ff00:0042:8329 is an example of this representation.

For convenience, an IPv6 address may be abbreviated to shorter notations by application of the following rules, where possible.

  • One or more leading zeroes from any groups of hexadecimal digits are removed; this is usually done to either all or none of the leading zeroes. For example, the group 0042 is converted to 42.
  • Consecutive sections of zeroes are replaced with a double colon (::). The double colon may only be used once in an address, as multiple use would render the address indeterminate. RFC 5952 recommends that a double colon must not be used to denote an omitted single section of zeroes.[41]

An example of application of these rules:

        Initial address: 2001:0db8:0000:0000:0000:ff00:0042:8329
        After removing all leading zeroes in each group: 2001:db8:0:0:0:ff00:42:8329
        After omitting consecutive sections of zeroes: 2001:db8::ff00:42:8329

I would start by using 8 VARBINARY(2) fields to represent the 8 groups. The fields for Groups 5 - 8 should be NULL as they will only be used for IPv6 addresses. The fields for Groups 1 - 4 should be NOT NULL as they will be used for both IPv4 and IPv6 addresses.

By keeping each group independent (as opposed to combining them into either a VARCHAR(45) or a VARBINARY(16) or even two BIGINT fields) you get two main benefits:

  1. It is much easier to reconstruct the address into any particular representation. Otherwise, in order to replace consecutive groups of zeroes with (::) you would have to parse it out. Keeping them separate allows for simple IF / IIF / CASE statements to facilitate this.
  2. You will save a ton of space on IPv6 addresses by enabling either ROW COMPRESSION or PAGE COMPRESSION. Since both types of COMPRESSION will allow for fields that are 0x00 to take up 0 bytes, all of those groups of zeroes will now not cost you anything. On the other hand, if you stored the example address from above (in the Wikipedia quote), then the 3 sets of all zeroes in the middle would take up their full amount of space (unless you were doing the VARCHAR(45) and went with the reduced notation, but that might not work well for indexing and would require special parsing to reconstruct it to the full format, so let's assume that is not an option ;-).

IF you need to capture the Network, create a TINYINT field for that called, um, [Network] :-)

For more info on the Network value, here is some info from another Wikipedia article on the IPv6 address:

Networks

An IPv6 network uses an address block that is a contiguous group of IPv6 addresses of a size that is a power of two. The leading set of bits of the addresses are identical for all hosts in a given network, and are called the network's address or routing prefix.

Network address ranges are written in CIDR notation. A network is denoted by the first address in the block (ending in all zeroes), a slash (/), and a decimal value equal to the size in bits of the prefix. For example, the network written as 2001:db8:1234::/48 starts at address 2001:db8:1234:0000:0000:0000:0000:0000 and ends at 2001:db8:1234:ffff:ffff:ffff:ffff:ffff.

The routing prefix of an interface address may be directly indicated with the address by CIDR notation. For example, the configuration of an interface with address 2001:db8:a::123 connected to subnet 2001:db8:a::/64 is written as 2001:db8:a::123/64.


For indexing, I would say create a Non-Clustered index on the 8 Group fields, and possibly the Network field if you decide to include that.


End result should be something like the following:

CREATE TABLE [IPAddress]
(
  IPAddressID INT          NOT NULL IDENTITY(-2147483648, 1),
  Group8      VARBINARY(2) NULL, -- IPv6 only, NULL for IPv4
  Group7      VARBINARY(2) NULL, -- IPv6 only, NULL for IPv4
  Group6      VARBINARY(2) NULL, -- IPv6 only, NULL for IPv4
  Group5      VARBINARY(2) NULL, -- IPv6 only, NULL for IPv4
  Group4      VARBINARY(2) NOT NULL, -- both
  Group3      VARBINARY(2) NOT NULL, -- both
  Group2      VARBINARY(2) NOT NULL, -- both
  Group1      VARBINARY(2) NOT NULL, -- both
  Network     TINYINT      NULL
);

ALTER TABLE [IPAddress]
  ADD CONSTRAINT [PK_IPAddress]
  PRIMARY KEY CLUSTERED
  (IPAddressID ASC)
  WITH (FILLFACTOR = 100, DATA_COMPRESSION = PAGE);

CREATE NONCLUSTERED INDEX [IX_IPAddress_Groups]
  ON [IPAddress] (Group1 ASC, Group2 ASC, Group3 ASC, Group4 ASC,
         Group5 ASC, Group6 ASC, Group7 ASC, Group8 ASC, Network ASC)
  WITH (FILLFACTOR = 100, DATA_COMPRESSION = PAGE);

Notes:

  • I recognize that you plan on using BIGINT for the ID field, but do you really expect to capture more than 4,294,967,295 unique values? If so then just change the field to be BIGINT and you can then even change the seed value to be 0. But otherwise you are better off using INT and starting with the minimum value so that you can make use of the entire range of that datatype.
  • If desired, you can add one or more NONpersisted Computed Columns to this table to return text representations of the IPAddress.
  • The Group* fields are arranged purposefully going down, from 8 to 1, in the table so that doing SELECT * will return the fields in the expected order. But the index has them going up, from 1 to 8, as that is how they are filled out.
  • An example (unfinished) of a computed column to represent the values in text form is:

    ALTER TABLE [IPAddress]
      ADD TextAddress AS (
    IIF([Group8] IS NULL,
        -- IPv4
        CONCAT(CONVERT(TINYINT, [Group4]), '.', CONVERT(TINYINT, [Group3]), '.',
          CONVERT(TINYINT, [Group2]), '.', CONVERT(TINYINT, [Group1]),
          IIF([Network] IS NOT NULL, CONCAT('/', [Network]), '')),
        -- IPv6
        LOWER(CONCAT(
          CONVERT(VARCHAR(4), [Group8], 2), ':', CONVERT(VARCHAR(4), [Group7], 2), ':',
          CONVERT(VARCHAR(4), [Group6], 2), ':', CONVERT(VARCHAR(4), [Group5], 2), ':',
          CONVERT(VARCHAR(4), [Group4], 2), ':', CONVERT(VARCHAR(4), [Group3], 2), ':',
          CONVERT(VARCHAR(4), [Group2], 2), ':', CONVERT(VARCHAR(4), [Group1], 2),
          IIF([Network] IS NOT NULL, CONCAT('/', [Network]), '')
         ))
       ) -- end of IIF
    );
    

    Test:

    INSERT INTO IPAddress VALUES (127, 0, 0, 0, 4, 22, 222, 63, NULL); -- IPv6
    INSERT INTO IPAddress VALUES (27, 10, 1234, 0, 45673, 200, 1, 6363, 48); -- IPv6
    INSERT INTO IPAddress VALUES (NULL, NULL, NULL, NULL, 192, 168, 2, 63, NULL); -- v4
    INSERT INTO IPAddress VALUES (NULL, NULL, NULL, NULL, 192, 168, 137, 29, 16); -- v4
    
    SELECT [IPAddressID], [Group8], [Group1], [Network], [TextAddress]
    FROM IPAddress ORDER BY [IPAddressID];
    

    Result:

    IPAddressID   Group8   Group1   Network  TextAddress
    -----------   ------   ------   -------  ---------------------
    -2147483646   0x007F   0x003F   NULL     007f:0000:0000:0000:0004:0016:00de:003f
    -2147483645   0x001B   0x18DB   48       001b:000a:04d2:0000:b269:00c8:0001:18db/48
    -2147483644   NULL     0x003F   NULL     192.168.2.63
    -2147483643   NULL     0x001D   16       192.168.137.29/16