Sql-server – Suggest better design pattern for reverse DNS lookups

dnssql serversql-clrsql-server-2008-r2

I'm working on a project where I need to look up the hostnames associated with IP addresses who were logged making HTTP requests. The lookups currently happen as part of a daily ETL job. The current method is to use a scalar CLR function (similar code to this is posted a number of places on the web, posted below with my revisions; I'm not sure who the original author was):

using System.Data.SqlTypes;
using System.Net;
using System.Security;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class udfn_GetHostName
{

[Microsoft.SqlServer.Server.SqlFunction]
public static string udfn_GetHostname(string IPAddr)
{
    try
    {
        /* 
         Using deprecated method intentionally.
         GetHostEntry() is now recommended.
         But it does some irritating things like returning an error if a PTR
         record points to a name that doesn't have an A record.
        */
        IPHostEntry IpEntry = Dns.GetHostByAddress(IPAddr);
        // Test whether the record returned has at least one alphabetic character
        // If it does, then it's a name
        // Otherwise the DNS server might have returned the IP address
        Match match = Regex.Match(IpEntry.HostName.ToString(), @"[a-zA-Z]+");

        if (match.Success)
        {
            return IpEntry.HostName.ToString();
        }
        else
        {
            return "None";
        }      

     }
    catch(Exception ex)
    {
        return "Failed";
        //return ex.Message.ToString();
    }
  }
} 

I'm not a C# developer so the quality of the CLR code is likely not great.

Then I call the function like this after loading new rows into the dimension:

-- Update only rows that we just inserted
UPDATE DIM.Network_Addresses
SET reverse_dns = dbo.[udfn_GetHostname](client_ip) 
WHERE reverse_dns IS NULL 
AND is_current = 1
AND created_date = (SELECT MAX(created_date) FROM DIM.API_Network_Address);

This method works but is very slow, for at least a couple reasons.

1) Using a scalar function makes SQL Server call the CLR function
once per row that needs to be updated, using a new SQL context.

2) The function calls themselves are very slow due to how GetHostname() and other CLR name resolution functions work: long timeouts, sometimes multiple round trips across the network that all time out if a DNS server doesn't respond or there is no PTR record, etc.

Could any recommend a design pattern that would improve the performance of looking up reverse DNS records and updating the table?

I'm considering a few different things:

1) Move this work outside of the database and use a tool such as dig to do the lookups in parallel.

2) Try to find some way to call the function in parallel or convert it to an in-line function (not making much progress on this!)

However, any ideas would be welcome.

Best Answer

Do not do it in SQL. From the few things that can be categorized as 'stupid use of SQLCLR', making expensive lengthy network calls ranks as #1. At the very very very very least, make sure the CLR code calls Thread.BeginThreadAffinity() before it goes into waiting for the intertubez to respond (DNS lookup and reverse lookup included).

The proper way to handle this is to use an external process, place the IPs to resolve in a queue, dequeue in batches and resolve several (tens) of IPs in parallel using asynchronous I/O, eg. the non-obsolete Dns.BeginGetHostEntry().