
Geoparsing is the process of identifying geographic references in text and linking geospatial locations to these references so that the text can be accessed through spatial retrieval methods and is suitable for spatial analysis. Geoparsing is used to add geospatial locations to written text, oral discourse, and legacy scientific data in which locations were recorded only as placename references. Applications include the processing of enterprise technical documents, intelligence surveillance, and unlocking a treasure trove of biological specimen and observation data heretofore not suitable for geospatial analysis.

The process, also known as toponym resolution, is based on linguistic analysis of text strings, looking for proper names in a context that indicates the likelihood that the name is a placename. For example, the capitalized word Cleveland can be identified as a potential placename on the basis of adjacent words and phrases, such as in, near, and south of; without such cues, it might instead be the surname of a U.S. president (Grover Cleveland). These candidate names are submitted to a gazetteer lookup process. When a match is made to a single gazetteer entry, the associated information from the gazetteer can be linked to the text. The context of the proper names is used both to flag the name as a possible placename and to refine the meaning of the phrase containing the placename. For example, “in Cleveland” and “25 miles south of Cleveland” indicate different locations. The geoparsing software can use such information to assign a geospatial location derived from the geospatial footprint specified in a gazetteer entry, modified by any offset expressed in terms of distance, direction, and units of measure.
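The candidate-flagging step described above can be illustrated with a minimal Python sketch. It is not how any particular geoparser works: the cue list, the regular expression, and the function name `find_candidates` are assumptions made for illustration, and real systems rely on much richer linguistic analysis.

```python
import re

# Hypothetical pattern: an optional offset ("25 miles"), a spatial cue
# ("in", "near", or "<direction> of"), then a capitalized word.
PATTERN = re.compile(
    r"(?:(?P<distance>\d+(?:\.\d+)?)\s+(?P<units>miles|kilometers|km)\s+)?"
    r"\b(?:in|near|(?P<direction>north|south|east|west) of)\s+"
    r"(?P<name>[A-Z][a-z]+)"
)

def find_candidates(text):
    """Return candidate placenames with any distance/direction offset."""
    candidates = []
    for m in PATTERN.finditer(text):
        offset = None
        # An offset is recorded only when both a distance and a direction appear.
        if m.group("distance") and m.group("direction"):
            offset = (float(m.group("distance")), m.group("units"),
                      m.group("direction"))
        candidates.append({"name": m.group("name"), "offset": offset})
    return candidates
```

With this sketch, `find_candidates("She works in Cleveland.")` yields one candidate with no offset, `find_candidates("The site is 25 miles south of Cleveland.")` yields a candidate with the offset `(25.0, "miles", "south")`, and `find_candidates("President Cleveland signed the bill.")` yields nothing because no spatial cue precedes the name.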

In many cases, more than one gazetteer entry is a potential match for the candidate proper name. There are several ways to refine the matching process. For example, if the text surrounding the name contains a type term, such as lake or mountains, or if a general location for the place has been named, such as a country or state, these clues can be added to the gazetteer lookup process. So, if the text that has references to “Cleveland” also references “Ohio” prominently or frequently, then the assumption can be made that the “Cleveland” reference is the city in Ohio rather than some other populated place named “Cleveland,” such as “Cleveland, New York.”
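As a rough illustration of the refinement described above, the following Python sketch picks among several gazetteer entries by counting how often each entry's state is mentioned in the surrounding text. The tiny in-memory `GAZETTEER`, its field names, and the `disambiguate` function are hypothetical; a real gazetteer holds far more entries and attributes, and type terms such as lake or mountains could be scored in a similar way.

```python
# Hypothetical gazetteer: two populated places that share one name.
GAZETTEER = [
    {"name": "Cleveland", "state": "Ohio", "type": "populated place"},
    {"name": "Cleveland", "state": "New York", "type": "populated place"},
]

def disambiguate(name, text, gazetteer=GAZETTEER):
    """Pick the entry whose state is mentioned most often in the text.

    Returns None when there is no match or the context does not
    single out one entry.
    """
    matches = [entry for entry in gazetteer if entry["name"] == name]
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    # Score each entry by the number of times its state appears in the text.
    scored = [(text.count(entry["state"]), entry) for entry in matches]
    best_score = max(score for score, _ in scored)
    best = [entry for score, entry in scored if score == best_score]
    # Ambiguous if no state is mentioned or several entries tie.
    if best_score == 0 or len(best) > 1:
        return None
    return best[0]
```

Returning `None` on a tie, rather than guessing, leaves the decision to a later step or to a human reviewer; that choice is part of the sketch, not a requirement of the method.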

Confidence in geoparsing results is limited by several factors. The lexical analysis itself is not perfect when applied to unstructured text. The quality of the gazetteer is also a factor in terms of the completeness of its coverage, the inclusion of alternate forms of the placenames, and the accuracy and detail of its geospatial information. In some cases, the gazetteer itself might include confidence levels for its data, especially when covering ancient features where descriptive information is contradictory or incomplete. When the textual reference is of the form “25 miles south of Cleveland,” the actual location can be estimated only to be within a specified area south of the coordinates given for Cleveland. For these reasons, geoparsing results are often accompanied by an indication of confidence. One method is to assign a point and a radius, with the length of the radius indicating the confidence level.
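The point-and-radius idea can be sketched in Python as follows. Everything here is an assumption made for illustration: the conversion of about 69 miles per degree of latitude is a rough approximation, the function names are hypothetical, and the rule that sets the radius to half the offset distance is an arbitrary placeholder rather than an established standard.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough approximation, adequate for a sketch

def apply_offset(lat, lon, distance_miles, direction):
    """Shift a point by a distance in one of four cardinal directions."""
    lat_sign = {"north": 1, "south": -1}.get(direction, 0)
    lon_sign = {"east": 1, "west": -1}.get(direction, 0)
    # A degree of longitude shrinks with the cosine of the latitude.
    miles_per_degree_lon = MILES_PER_DEGREE_LAT * math.cos(math.radians(lat))
    new_lat = lat + lat_sign * distance_miles / MILES_PER_DEGREE_LAT
    new_lon = lon + lon_sign * distance_miles / miles_per_degree_lon
    return new_lat, new_lon

def point_radius(lat, lon, distance_miles, direction, uncertainty_fraction=0.5):
    """Return an offset point plus a radius expressing uncertainty.

    uncertainty_fraction is a hypothetical tuning parameter: a larger
    radius signals lower confidence in the estimated location.
    """
    new_lat, new_lon = apply_offset(lat, lon, distance_miles, direction)
    return {
        "lat": new_lat,
        "lon": new_lon,
        "radius_miles": distance_miles * uncertainty_fraction,
    }

# Approximate coordinates for Cleveland, Ohio, used only as sample input.
estimate = point_radius(41.4993, -81.6944, 25, "south")
```

Here `estimate` describes a point south of the input coordinates with a radius of 12.5 miles. A production system would more likely derive the radius from the extent of the gazetteer footprint and the precision of the wording.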

...
