WEB Advent 2009 / GeoIP Wrangling

When making the Analog holding page, we wanted to add a touch of personality to make the experience warmer and more welcoming. One of our ideas was to use IP-based geolocation to create a personalized greeting based on each visitor’s location. The task of making this fell to me, and I wanted to share some tidbits about the implementation and what I discovered along the way.

The source I used for GeoIP data was MaxMind. This company has maintained a comprehensive IP-to-location database for a while, and it offers a number of different versions, some of which are free. The GeoIP City database, unfortunately, was “available to established businesses only” and cost $370 per site license, so I decided to go with the free, and less accurate, alternative, GeoLite City.

MaxMind advertises GeoLite City as having the accuracy of “over 99.5% on a country level and 79% on a city level for the US within a 25-mile radius.” What this means is that in the US, 18% of the IPs are resolved inaccurately — more than 25 miles from the true location — and 3% are not covered. For the UK, the numbers are worse; 37% of IPs are resolved inaccurately, and 9% are not covered. I experienced this inaccuracy firsthand after Jon Gibbins’s location was determined to be in London, even though he lives in Barnstaple, about 100 miles away. Surprisingly, less developed countries sometimes have better coverage. In Kazakhstan, for example, only 15% of the IPs are resolved inaccurately, and a mere 1% are not covered. To channel Borat, “much success!”

Regardless, this was “good enough.” The next step was to decide how to handle the data. GeoLite City is available in binary and CSV formats. Using the CSV version requires importing it into some sort of database and performing IP range queries, and this is something I wanted to avoid, both for speed of development and to save memory on the server. The binary version has to be used with the libgeoip library provided by MaxMind. Thankfully, there was already a PHP extension in PECL for this library, complete with documentation. My only concern was performance, because the binary IP data file was 28 MB, and reading it from disk every time would quickly become prohibitive. After perusing the libgeoip source code, I discovered that it supports shared memory caching via mmap(), but the GeoIP extension wasn’t making use of this. So, I patched the extension and built a custom version.

The data provided by GeoLite City contains country, region, area code, metro code, city, and postal code. The region indicates the state (in the US) or the province (in Canada). For other countries, the region is a two-letter code derived from the FIPS 10-4 standard, used by such publications as the CIA’s World Factbook. For example, certain parts of London, UK, will have the region H9. I decided to avoid using the regions outside of the US and Canada, because they don’t necessarily correspond to known geographical boundaries. Imagine greeting someone, “It looks like you're in or near London, H9.” That wouldn’t exactly have the human touch I wanted.

The data also contains the latitude and longitude of the approximate location, and this is what I used for the distance calculations. The easiest way to calculate the distance between two points is via the spherical law of cosines, a well-known formula that gives the distance along the great circle between the points. Just to be sure, I looked up a couple of other formulas. The haversine formula used to be preferred when computational precision was limited, but the simpler cosine one gives good results, even for distances as small as 1 meter. And, although it presumes the Earth is a sphere (it’s not really, of course; its radius varies from 6,378 km at the equator to 6,357 km at the poles), the accuracy is still better than 3 m in 1 km. In other words, good enough. If more precision is needed, Vincenty’s formulae can be used instead.
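For illustration, here is a minimal PHP sketch of the spherical law of cosines. The function name, the constant, and the choice of miles are assumptions made for this example, not the code that runs on the site.

```php
<?php
// Mean Earth radius in miles. An assumed constant: the Earth is only
// approximately a sphere, so any single radius is a compromise.
const EARTH_RADIUS_MILES = 3958.8;

/**
 * Great-circle distance in miles between two points given in decimal
 * degrees, computed with the spherical law of cosines.
 */
function great_circle_distance(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $phi1 = deg2rad($lat1);
    $phi2 = deg2rad($lat2);
    $deltaLambda = deg2rad($lon2 - $lon1);

    $cosine = sin($phi1) * sin($phi2)
            + cos($phi1) * cos($phi2) * cos($deltaLambda);

    // Rounding error can push the value just outside [-1, 1],
    // where acos() would return NAN, so clamp it first.
    $cosine = max(-1.0, min(1.0, $cosine));

    return EARTH_RADIUS_MILES * acos($cosine);
}

// Oxford to Bristol, using approximate city-centre coordinates.
$miles = great_circle_distance(51.75, -1.26, 51.45, -2.59);
```

With these approximate coordinates, the result comes out at roughly 60 miles, which matches the figure in the sample greeting further down.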

With distance calculations in hand, the next task was grouping people who happened to be near one another. For example, both Alan Colville and Jon Tan live in Bristol, UK, so it makes little sense to describe only one of them as being closest. Wanting to keep things simple, I decided against such algorithms as k-means clustering, which requires pre-selecting the number of clusters. I needed to group only the people closest to the visitor, so I calculated the distance to each person, picked the closest one, then calculated the distances from the closest person to the others, filtering out those above the threshold, which was 50 miles in our case. This threshold was fairly arbitrary, but it worked well for our purposes.
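The grouping step described above could be sketched roughly as follows. The function names, the `[latitude, longitude]` array shape, and the sample coordinates are all assumptions made for illustration; only the 50-mile threshold comes from the text.

```php
<?php
// Great-circle distance in miles between two [latitude, longitude] pairs
// (spherical law of cosines, assuming a mean Earth radius of 3958.8 miles).
function distance_miles(array $a, array $b): float
{
    [$lat1, $lon1] = [deg2rad($a[0]), deg2rad($a[1])];
    [$lat2, $lon2] = [deg2rad($b[0]), deg2rad($b[1])];
    $cosine = sin($lat1) * sin($lat2) + cos($lat1) * cos($lat2) * cos($lon2 - $lon1);

    return 3958.8 * acos(max(-1.0, min(1.0, $cosine)));
}

/**
 * Returns the names of the person closest to the visitor plus everyone
 * within $thresholdMiles of that person. $people maps name => [lat, lon].
 */
function closest_group(array $visitor, array $people, float $thresholdMiles = 50.0): array
{
    // Step 1: find the single closest person to the visitor.
    $anchor = null;
    $best = INF;
    foreach ($people as $name => $location) {
        $distance = distance_miles($visitor, $location);
        if ($distance < $best) {
            $best = $distance;
            $anchor = $name;
        }
    }
    if ($anchor === null) {
        return []; // nobody to group
    }

    // Step 2: keep everyone near that person. The anchor is always kept,
    // because its distance to itself is zero.
    $group = [];
    foreach ($people as $name => $location) {
        if (distance_miles($people[$anchor], $location) <= $thresholdMiles) {
            $group[] = $name;
        }
    }

    return $group;
}

// Approximate, illustrative coordinates.
$people = [
    'Alan'   => [51.45, -2.59],   // Bristol
    'Jon T.' => [51.45, -2.59],   // Bristol
    'Jon G.' => [51.08, -4.06],   // Barnstaple
    'Andrei' => [37.77, -122.42], // San Francisco
];
$group = closest_group([51.75, -1.26], $people); // visitor in Oxford
// $group is ['Alan', 'Jon T.']
```

Measuring the second pass from the closest person rather than from the visitor keeps the group geographically tight, which is what lets the greeting name a single city for it.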

Once I had the closest group of people, I could generate a unique portion of the greeting that varies based upon how far the group is from the visitor. For example, if you’re less than 25 miles from me, you should see “nearby, in San Francisco”; if it’s between 25 and 100 miles, the message becomes something like “not far, in San Francisco, 50 miles away from you,” and so on. For an added touch, if the distance is less than 25 miles, and the city of the visitor matches that of the group, the message says something like, “also in San Francisco!” Another personal touch we added was to show the location of everyone, not just the closest group. For example, if you’re in Oxford, UK, the message you’ll see is:
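The distance bands could be expressed along these lines. This is only a sketch: the function name is hypothetical, rounding to the nearest 10 miles is an assumption, and the wording beyond 100 miles is a placeholder, since only the first two bands are spelled out above.

```php
<?php
/**
 * Builds the distance-dependent part of the greeting for a group of people.
 * $visitorCity may be null when the lookup returned no city.
 */
function proximity_phrase(float $miles, string $groupCity, ?string $visitorCity): string
{
    if ($miles < 25) {
        // Same city as the visitor gets the friendlier wording.
        return $visitorCity === $groupCity
            ? sprintf('also in %s!', $groupCity)
            : sprintf('nearby, in %s', $groupCity);
    }

    if ($miles <= 100) {
        // Assumption: round to the nearest 10 miles so the figure
        // does not suggest more precision than GeoIP can offer.
        $rounded = (int) (round($miles / 10) * 10);

        return sprintf('not far, in %s, about %d miles from you', $groupCity, $rounded);
    }

    // Placeholder for the farther bands, which are not described here.
    return sprintf('in %s', $groupCity);
}
```

For example, `proximity_phrase(61.0, 'Bristol', 'Oxford')` yields “not far, in Bristol, about 60 miles from you”.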

“By the way, with some GeoIP guesswork, it looks like you’re in or near Oxford, UK. Alan and Jon are not far, in Bristol, about 60 miles from you. Jon is in Barnstaple. Andrei and Chris are in the US.”

Some of the country names provided by the database are a bit strange, especially when used in conversation. South Korea, for example, is “Korea, Republic of”; the British Virgin Islands are “Virgin Islands, British”; and the Vatican City is “Holy See (Vatican City State).” Clearly, some massaging was in order, and we also wanted to condense “United States” to “the US” and such. While implementing this, I also added the ability to add other touches of personality based on the country. If we see that you’re in either the British or US Virgin Islands, for example, we add “We’re jealous!” Visitors from Iceland will see “Andrei and Chris love that place.” If you’re in Ireland, we’ll add, “Did you know that Alan’s originally from Tipperary?” All of this helps create a more natural greeting, and we inject these tidbits into different places in the greeting depending upon the context.
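One simple way to do this kind of massaging is with a pair of lookup tables, sketched below. The constant and function names are hypothetical, and keying the asides by ISO 3166 country code is an assumption; the names and messages themselves are the ones mentioned above.

```php
<?php
// Database country name => conversational form.
const FRIENDLY_COUNTRY_NAMES = [
    'Korea, Republic of'            => 'South Korea',
    'Virgin Islands, British'       => 'the British Virgin Islands',
    'Holy See (Vatican City State)' => 'the Vatican City',
    'United States'                 => 'the US',
];

// Two-letter country code => extra remark for the greeting.
const COUNTRY_ASIDES = [
    'VG' => "We're jealous!",                                       // British Virgin Islands
    'VI' => "We're jealous!",                                       // US Virgin Islands
    'IS' => 'Andrei and Chris love that place.',                    // Iceland
    'IE' => "Did you know that Alan's originally from Tipperary?",  // Ireland
];

// Falls back to the database name when no friendlier form is listed.
function friendly_country_name(string $databaseName): string
{
    return FRIENDLY_COUNTRY_NAMES[$databaseName] ?? $databaseName;
}

// Returns null when there is nothing special to say about the country.
function country_aside(string $countryCode): ?string
{
    return COUNTRY_ASIDES[strtoupper($countryCode)] ?? null;
}
```

Keeping the data in plain arrays means new names or remarks can be added without touching the code that assembles the greeting.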

There were some gotchas and corner cases that became evident during development. I uncovered them by testing various IPs and countries. The database simply does not have any location information for IPs such as private networks or certain reserved blocks (2.2.2.2, for example). In case someone’s lucky enough to have one of these IPs, we just say “Where in the world are you?”, which sounds particularly apropos if the visitor is coming from a block reserved for the Defense Intelligence Agency. Another thing that baffled me initially was that some IPs return latitude and longitude, but no city or region. After encountering a few of these, I noticed that the coordinates were always the same, 38°N, 97°W. The ever-helpful Google Maps showed that the location was smack dab in the middle of Kansas. A few ponderous moments later, I realized this was the approximate geographical center of the US, rounded to whole degrees, which just happens to land near Potwin, Kansas. Case closed. Apparently all (unlabeled) roads lead to Kansas. For IPs like these, the best thing we could say was “It looks like you’re in the US.”
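These fallbacks, together with the earlier decision to show regions only for the US and Canada, could be handled roughly like this. The function name is hypothetical, and the array keys (`city`, `region`, `country_code`, `country_name`) are assumed to mirror the record returned by the GeoIP extension, with `country_name` already massaged into its conversational form.

```php
<?php
/**
 * Turns a lookup record into the opening line of the greeting.
 * Pass null when the lookup produced nothing at all.
 */
function describe_location(?array $record): string
{
    // No data: private networks, reserved blocks, failed lookups.
    if ($record === null || empty($record['country_name'])) {
        return 'Where in the world are you?';
    }

    $country = $record['country_name'];
    $city = $record['city'] ?? '';

    // Country-level match only, such as the "middle of Kansas" records.
    if ($city === '') {
        return "It looks like you're in {$country}.";
    }

    // Regions are only meaningful to readers in the US and Canada;
    // elsewhere they are FIPS 10-4 codes such as H9.
    $region = $record['region'] ?? '';
    $showRegion = $region !== ''
        && in_array($record['country_code'] ?? '', ['US', 'CA'], true);

    // After a city name, "Oxford, UK" reads better than "Oxford, the UK".
    $suffix = $showRegion ? $region : preg_replace('/^the /', '', $country);

    return "It looks like you're in or near {$city}, {$suffix}.";
}
```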

Using the source IP of the request may not always be what you want. For example, if the client jumps through one or more proxies, $_SERVER['REMOTE_ADDR'] will be the IP of the last proxy along the way. Properly configured proxies will use the X-Forwarded-For header to list all the hops:

X-Forwarded-For: client, proxy1, proxy2

So, check for the presence of this header, and extract the first IP address in the list if it exists. Relying on this header for authentication is a bad idea, because it can be easily forged, but it suffices for our use.
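A minimal sketch of that check follows. The function name is hypothetical; it takes the server array as a parameter (normally `$_SERVER`) so it can be exercised without a web request, and it validates the header value because anyone can forge it.

```php
<?php
/**
 * Picks the most likely client IP: the first hop listed in
 * X-Forwarded-For if it is a valid address, otherwise REMOTE_ADDR.
 * Returns null when neither holds a valid IP.
 */
function client_ip(array $server): ?string
{
    $candidates = [];

    if (!empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header reads "client, proxy1, proxy2"; the client comes first.
        $hops = explode(',', $server['HTTP_X_FORWARDED_FOR']);
        $candidates[] = trim($hops[0]);
    }

    if (!empty($server['REMOTE_ADDR'])) {
        $candidates[] = $server['REMOTE_ADDR'];
    }

    foreach ($candidates as $candidate) {
        // Discard anything that is not a well-formed IP address.
        if (filter_var($candidate, FILTER_VALIDATE_IP) !== false) {
            return $candidate;
        }
    }

    return null;
}

// Typical use inside a request: $ip = client_ip($_SERVER);
```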

The GeoIP extension worked well, except that the geoip_record_by_name() function generates a notice for certain IPs it can’t work with, like 127.0.0.1. The only way to avoid the notice is to check the IP against a list before passing it to the function, or just do something that I rarely advocate — silence the function with the @ operator.
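The list-based alternative might look something like the sketch below. The function name and the set of blocks are assumptions (loopback plus the common private IPv4 ranges), not an exhaustive list of the addresses that trigger the notice.

```php
<?php
// IPv4 blocks to skip, as [network, prefix length] pairs.
const UNRESOLVABLE_BLOCKS = [
    ['127.0.0.0', 8],    // loopback
    ['10.0.0.0', 8],     // private
    ['172.16.0.0', 12],  // private
    ['192.168.0.0', 16], // private
];

/**
 * Returns true when an IPv4 address is worth passing to the GeoIP lookup.
 */
function is_lookup_candidate(string $ip): bool
{
    $address = ip2long($ip);
    if ($address === false) {
        return false; // not a valid dotted-quad IPv4 address
    }

    foreach (UNRESOLVABLE_BLOCKS as [$network, $prefix]) {
        // Build the netmask, then compare the network portions.
        $mask = -1 << (32 - $prefix);
        if (($address & $mask) === (ip2long($network) & $mask)) {
            return false;
        }
    }

    return true;
}
```

With a guard like this in place, `geoip_record_by_name()` would only be called when `is_lookup_candidate($ip)` returns true, and the `@` operator could be dropped for the addresses the list covers.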

We considered using the browser-based Geolocation API as the primary method of detecting a visitor’s location, falling back to IP-based geolocation if necessary. However, the browser support for this is still somewhat inconsistent; it worked for me only about half of the time in Firefox 3.5, and the location was less accurate than the one from MaxMind. The results may be different for you, depending on the density of the Wi-Fi networks and other factors. The biggest factor that influenced our decision to stick to IP-based geolocation is that browser-based geolocation requires people to opt in, which creates a jarring experience.

Currently, the locations of Analog members are stored in a static array, primarily because we wanted to get the holding page up quickly, and we weren’t planning to travel for another few weeks. In the future, I plan to obtain the current location of each member of Analog either from Dopplr or from geotagged tweets.

The location-based greeting on analog.coop is not a particularly impressive bit of code, but it does demonstrate a nice way to add a touch of personality. I spent more time tweaking the wording of the greeting than writing code, but I think this is what gives it personality.

Give geolocation a try for one of your own projects, and remember these lessons:

  • Buy the full GeoIP City database if accuracy is paramount.
  • Test with many different IPs to identify corner cases.
  • Use a more precise distance formula, such as Vincenty’s, if you need even better accuracy.

I hope to see more location-aware sites in 2010. Until then, happy holidays! I’ll be sipping Oskar Blues Ten Fidy and dreaming of the Virgin Islands.
