Some time ago I was developing project what reports all action of user on web site. I hit big number of problems, mainly performance problem. Biggest site had around 40,000 unique users each day. And one requirement is that it must recognize country by IP of the user. I get list of countries from MaxMind what I used for previous projects. I was able to insert the CSV file by BULK LOAD.
CREATE TABLE [dbo].[GeoIPCountry_Import] ( [startIp] [varchar](200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [endIp] [varchar](200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [startIpNum] [varchar](200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [endIpNum] [varchar](200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [countryCode] [char](40) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [country] [nvarchar](500) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ); CREATE PROCEDURE [dbo].[UpdateGeoCountryList] @file nvarchar(400) AS BEGIN SET NOCOUNT ON; TRUNCATE TABLE [dbo].[GeoIPCountry_Import] EXEC('BULK INSERT [dbo].[GeoIPCountry_Import] FROM N''' + @file + ''' WITH ( FIELDTERMINATOR ='','', ROWTERMINATOR =''\n'', CODEPAGE = ''OEM'', KEEPNULLS )') DECLARE @count int SELECT @count = COUNT(*) FROM [dbo].[GeoIPCountry_Import] IF(@count < 90000) BEGIN RAISERROR('File is too small or in wrong format. It must contain at least 90k records, fieldterminator = '','', rowterminator = ''\\n''.',15,1) END UPDATE [dbo].[GeoIPCountry_Import] SET startIp = [staging].[Trim](REPLACE(startIp, '"', '')), endIp = [staging].[Trim](REPLACE(endIp, '"', '')), startIpNum = [staging].[Trim](REPLACE(startIpNum, '"', '')), endIpNum = [staging].[Trim](REPLACE(endIpNum, '"', '')), countryCode = [staging].[Trim](REPLACE(countryCode, '"', '')), country = [staging].[Trim](REPLACE(country, '"', '')) SELECT @count = COUNT(*) FROM [dbo].[GeoIPCountry_Import] WHERE startIpNum IS NULL OR startIpNum = '' OR endIpNum IS NULL OR endIpNum = '' IF(@count > 0) BEGIN RAISERROR('Some IP''s cannot be converted into ip number.',15,1) END END
This is the first part and it’s possible to use this format to getting country representation for ip. What is realy faster than to have ip range like ‘from 45.10.8.50 to 45.10.8.138’ is to have them like integer. We will create scalar function which will convert ip into integer.
CREATE FUNCTION [staging].[IPtoInt]? (@strIP varchar(15)) RETURNS bigint AS BEGIN DECLARE @intIPNum bigint IF (LEN(@strIP)-LEN(REPLACE(@strIP,'.','')))/LEN('.') = 3 -- if there 3 periods SET @intIPNum = (16777216 * CAST(PARSENAME(@strIP,4) as bigint)? + 65536 * PARSENAME(@strIP,3) + 256 * PARSENAME(@strIP,2) + PARSENAME(@strIP,1)) RETURN @intIPNum END
When we have this function we can create better performing table for storing data about countries and IPs. It’s possible to use city list from MaxMind. We will prepare the import procedure to be able import new database at anytime, when newer version is released.
CREATE TABLE [dbo].[GeoIPCountry]( [startIp] [varchar](20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [endIp] [varchar](20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [startIpNum] [bigint] NOT NULL, [endIpNum] [bigint] NOT NULL, [countryCode] [char](4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [country] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, CONSTRAINT [PK_GeoIPCountry] PRIMARY KEY CLUSTERED ( [startIpNum] ASC, [endIpNum] ASC ) )
And add few lines at the end of [GeoIPCountry_Import] procedure.
TRUNCATE TABLE dbo.GeoIPCountry INSERT INTO dbo.GeoIPCountry (startIp, endIp, startIpNum, endIpNum, countryCode, country) SELECT startIp, endIp, staging.IPtoInt(startIp), staging.IPtoInt(endIp), countryCode, country FROM [dbo].[GeoIPCountry_Import]
We prepared everything what is needed for this. I even created table with most frequent IPs which help a lot, because the site had 90% of visitors from one country. This table contains IPs for this country an make selecting much more faster with less records in table. This solution was more than 10 times faster than table with IPs. And if you need to parse 100,000 records each day, it will help a lot.
SELECT IP, CASE WHEN GeoIPCountry.Country IS NOT NULL THEN GeoIPCountry.Country ELSE 'NA' END AS Country FROM WebLog WITH (NOLOCK) LEFT JOIN staging.GeoIPCountry WITH (NOLOCK) ON GeoIPCountry.startIPNum <= staging.IPtoInt(WebLog .IPNum) AND GeoIPCountry.endIPNum >= staging.IPtoInt(WebLog .IPNum)
I’m not really sure why, but there is faster to use GeoIPCountry.startIPNum <= IP AND GeoIPCountry.endIPNum >= IP then to use BETWEEN. If anyone will be interested in the cached solution or wants codes for all part I can provide them.