Research Report by Pehr Hovey

Page history last edited by Pehr Hovey 10 years, 5 months ago


Research Report: A Geocoding Best Practices Guide


By Pehr Hovey, @LitPlus Twitter Visualization Project


  1. Abstract

    Geocoding is the process of converting an imprecise place name, such as a street address, into precise coordinates on the earth. Software systems use street data and interpolation to make a best guess at where an address lies in physical space. This process is inherently error-prone, which can have negative consequences for quantitative research based on geocoded data. Daniel Goldberg describes these issues, along with best practices for minimizing coding error and adjusting for imprecision in location.


  2. Description

    Geocoding Best Practices is a free book published online by Daniel Goldberg, a PhD student at the University of Southern California. The book considers geocoding in the context of scientific research rather than consumer mapping applications. It was published on behalf of the North American Association of Central Cancer Registries (NAACCR). Goldberg notes in the introduction that public health researchers are large consumers of geocoded data, which motivated this publication, but most of its lessons apply to any field that uses locative data as the basis for drawing research conclusions.


    Goldberg states upfront that the book is intended to act as a comprehensive reference rather than a practical tutorial. Geocoding Best Practices is divided into several sections that cover different aspects of the geocoding process and data lifecycle, each aimed at a particular audience. The first section is a detailed discussion of the function and purpose of geocoding, which serves as a starting point for people new to the field. He defines geocoding as “…the act of transforming aspatial locationally descriptive text into a valid spatial representation using a predefined process” (5). Though we are used to thinking of location data on the earth only as latitude/longitude coordinates, this book covers a much larger variety of location systems, most of which are used in the scientific community.


    Throughout the book, Goldberg discusses the pitfalls inherent in using geocoded data in epidemiology. When tracking cancer rates over time and space, it is important to have reliable geographic data. One study highlighted in the book drew definite conclusions about cancer likelihood based on how close someone lives to a freeway. Upon further analysis, it was discovered that up to 24% of the data points varied widely in freeway distance when geocoded using two different systems. This could have a dramatic effect on the validity of the study's conclusions.


    The middle of the book gets into the finer-grained details of address formatting conventions, which are important to the success of the process. Geocoding can be straightforward when the address is well formed, but if there are misspellings or missing components (such as the ZIP code), the system has to guess which address was meant before it can geolocate it. This issue is particularly worrisome in clinical studies, as some patients may write their information incorrectly or it may be misread during data entry. If two different geocoding systems apply different heuristics to transform a bad address into a usable one, the same input can yield vastly different results (even different cities or states, depending on the missing information).


    To help the process, Goldberg lays out suggestions for address normalization, which he discusses on page 45: “Address normalization is the process of identifying the component parts of an address such that they may be transformed into a desired format. This first step is critical to the cleaning process. Without identifying which piece of text corresponds to which address attribute, it is impossible to subsequently transform them between standard formats or use them for feature matching.” Examples of address components are shown in this table from page 45:

    Later sections of the book discuss the issue of recognizing and coping with geocoder inaccuracy. Even if the input data is well formatted, there will always be some addresses that map improperly to the real world. Examples include new addresses that are not yet in the reference dataset, as well as discrepancies in physical lot size. Most geocoders use interpolation to guess where on a street block an address is located. In many jurisdictions, the numerical street addresses are not evenly distributed along a block, especially if one property parcel is very large and the rest are small. In these instances, the geocoder returns a location point, but the point is incorrect. Another case is where the geocoded point lies close to the boundary between two property parcels. If the application of the data depends on which parcel the point is in, the inherent margin of error could mean that the point is in fact located in the adjacent parcel. The graphic below from page 106 depicts this sort of misclassification:



    These kinds of subtle inaccuracies can undermine the validity of studies without the researchers realizing it. Some geocoders (including Google and Yahoo) return a ‘precision’ value that indicates whether the geolocation succeeded at the address level (very precise) or only at a coarser level (city, state, etc.). Researchers can take this into account when drawing conclusions that depend on the precision of geolocation. Since most geocoding services do not reveal their proprietary algorithms, there is not necessarily much a researcher can do to anticipate which addresses might not geocode properly. To combat this lack of transparency, the GIS Research Laboratory at USC, of which the book's author is a member, is developing the WebGIS open source geocoding platform. Their goal is to provide low-cost, transparent geocoding services to academic researchers. It is implied that this service will implement the best practices contained in this book.
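
    The address-range interpolation described earlier in this section can be sketched as follows. This is a minimal illustration, not Goldberg's algorithm or the method of any particular geocoder; the StreetSegment structure, its field names, and the sample coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StreetSegment:
    """One street block and its address range (hypothetical structure)."""
    low_number: int               # lowest house number on the block
    high_number: int              # highest house number on the block
    start: Tuple[float, float]    # (lat, lon) at the low-numbered end
    end: Tuple[float, float]      # (lat, lon) at the high-numbered end


def interpolate_address(segment: StreetSegment, house_number: int) -> Tuple[float, float]:
    """Estimate a position by assuming house numbers are evenly spaced along the block."""
    if not segment.low_number <= house_number <= segment.high_number:
        raise ValueError("house number is outside the segment's address range")

    span = segment.high_number - segment.low_number
    # A single-number block has no range to interpolate over; use the midpoint.
    fraction = 0.5 if span == 0 else (house_number - segment.low_number) / span

    lat = segment.start[0] + fraction * (segment.end[0] - segment.start[0])
    lon = segment.start[1] + fraction * (segment.end[1] - segment.start[1])
    return (lat, lon)


# Sample block numbered 100 to 200, running due east (made-up coordinates).
block = StreetSegment(100, 200, start=(34.0, -118.0), end=(34.0, -117.0))
estimate = interpolate_address(block, 150)
print(estimate)
```

    House number 150 is placed at the midpoint of the block regardless of actual parcel sizes. If one parcel occupied most of the block, the true location could be far from this estimate, which is the kind of error the book warns about.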


  3. Commentary

    Goldberg’s book is important to our project because it lays out the many ways that the geocoding process can be error-prone, and ways to cope with the inevitable reduction in certainty. For our project, we have no control over the geocoding process since we are relying on public geocoding services provided by Google or Yahoo. To delve into the inner workings of an interpolating geocoding software system is definitely beyond the scope of our project.


    Instead, this knowledge of the limitations of geocoding will help guide us towards applications of our data that would not be badly undermined by a lack of certainty. We especially have to plan for fuzzy geographic data, since the vast majority of tweets are not yet geotagged at the tweet level.

    Since the Twitter geotagging API is only a few months old, it is estimated that less than 1% of tweets are being geotagged. Where a tweet does not carry geographic data, we have to estimate the location from the sender’s location string, which can be given in any free-text format. The location is often a rough city/state declaration or a latitude/longitude pair set from the sender's phone, but occasionally it is something more abstract, like “paradise USA”.
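
    A first pass over the sender's location string could separate coordinate pairs from free text before anything is sent to a geocoder. The sketch below is only an illustration of that idea, not the project's actual code; the function name, the category labels, and the "iPhone:" prefix in the sample input are assumptions.

```python
import re

# Two decimal numbers separated by a comma at the end of the string,
# optionally preceded by a client prefix such as "iPhone:" (assumed format).
_COORD_PATTERN = re.compile(
    r"(?<![\d.])(-?\d{1,3}(?:\.\d+)?)\s*,\s*(-?\d{1,3}(?:\.\d+)?)\s*$"
)


def classify_location(text):
    """Return a (kind, value) pair: 'coordinates', 'free_text', or 'empty'."""
    cleaned = text.strip()
    if not cleaned:
        return ("empty", None)

    match = _COORD_PATTERN.search(cleaned)
    if match:
        lat, lon = float(match.group(1)), float(match.group(2))
        # Only accept values that fall inside valid latitude/longitude bounds.
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            return ("coordinates", (lat, lon))

    # Everything else still needs a geocoder, and may not resolve at all.
    return ("free_text", cleaned)


for sample in ["iPhone: 34.4125,-119.8481", "Santa Barbara, CA", "paradise USA", "  "]:
    print(repr(sample), "->", classify_location(sample))
```

    Strings classified as free text, such as “paradise USA”, would still have to be passed to a geocoding service and may come back at city-level precision or not at all.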


    At the time of writing, our data gathering system has 19,714 tweets, of which only 75 came with geotagged data. The rest had to be estimated from the user’s location (85 could not be located at all). It is clear that we cannot rely on fine-grained location and should instead consider tweets at the city level or above. Hence our final visualization and hypothesis should lean towards qualitative, broad-stroke observations and avoid quantitative conclusions.
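
    The proportions behind these counts can be worked out directly. The short calculation below uses only the three figures quoted above; it assumes that the 85 unlocatable tweets are counted among the non-geotagged remainder, which is one reading of the sentence.

```python
total_tweets = 19_714
geotagged = 75
unlocatable = 85

# Assumption: every tweet that is neither geotagged nor unlocatable
# was placed using the sender's profile location string.
estimated_from_profile = total_tweets - geotagged - unlocatable


def share(count, total):
    """Percentage of the total represented by count."""
    return 100.0 * count / total


print(f"geotagged:              {share(geotagged, total_tweets):.2f}%")
print(f"estimated from profile: {share(estimated_from_profile, total_tweets):.2f}%")
print(f"could not be located:   {share(unlocatable, total_tweets):.2f}%")
```

    Under that assumption, roughly 0.38% of the collected tweets are geotagged, consistent with the estimate of less than 1%, while about 99.19% depend on the profile location string.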

    Even with potentially large margins of error, geocoded data still presents a multitude of opportunities for visualizing and interpreting large amounts of tweet data.


  4. Resources for Further Study

    Daniel W. Goldberg, A Geocoding Best Practices Guide, University of Southern California, GIS Research Laboratory, Los Angeles, California, November 2008. Available at http://www.naaccr.org/filesystem/pdf/Geocoding_Best_Practices.pdf


    USC WebGIS open source Geocode platform



    Google Geocoding services



    Yahoo Geocoding services


