Wednesday, October 23, 2013

Geocoding Sand Mines

Goals and Objectives
The goal of this exercise is to explore the basics of downloading data, geocoding with ArcGIS, and analyzing the results for error. The exercise consisted of the following objectives:
  1. Download list of mines from the WisconsinWatch website.
  2. Geocode the mines using the ESRI address locator.
  3. Geocode the mines with PLSS if need be.
  4. Compare results with classmates.
Methods
The first step in this exercise was to obtain the locational data of mines in Wisconsin. Professor Hupy directed us to the investigative journalism website WisconsinWatch, which provided an Excel file with address records for over 130 sand mines in Wisconsin. The data was split up among the students in the class so that each student had 14 mines to geocode, with overlap among the students so that we could check for potential errors. Figure 1 below displays the data that I was required to geocode.

Figure 1. Raw location data for 14 sand mines in Wisconsin. Notice how inconsistent the Facility Address column is.
Geocoding is the process of using address data to assign records to geographic locations. However, ArcGIS cannot accurately geocode messy data like that in the Facility Address column in Figure 1; geocoding requires consistent, organized input. To prove this point, I first geocoded the data above as is, and I wasn't surprised by the outcome. Only two points were in Wisconsin, several were in Canada, and points even appeared in Italy, South America, and Australia. Because the GIS software didn't have good data to work with, it placed the points using whatever keywords it managed to find.
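A quick way to catch stray points like these is to test each geocoded coordinate against a rough bounding box for the state. The sketch below is only an illustration: the box limits are approximate values that I am assuming, the sample points are made up, and a box this coarse also covers parts of neighboring states, so it only catches the obvious misses.

```python
# Approximate bounding box for Wisconsin in decimal degrees.
# These limits are an assumption for illustration, not a precise boundary.
WI_BOUNDS = {"min_lat": 42.4, "max_lat": 47.4, "min_lon": -93.0, "max_lon": -86.2}


def in_wisconsin_bbox(lat, lon, bounds=WI_BOUNDS):
    """Return True if the point falls inside the rough state bounding box."""
    return (bounds["min_lat"] <= lat <= bounds["max_lat"]
            and bounds["min_lon"] <= lon <= bounds["max_lon"])


def flag_suspect_points(points):
    """Return the ids of points that fall outside the bounding box."""
    return [pid for pid, lat, lon in points if not in_wisconsin_bbox(lat, lon)]


# Hypothetical geocoding output: (id, latitude, longitude)
sample = [
    ("mine_a", 44.8, -91.5),   # western Wisconsin
    ("mine_b", 41.9, 12.5),    # near Rome, Italy
    ("mine_c", -33.9, 151.2),  # near Sydney, Australia
]

print(flag_suspect_points(sample))  # ['mine_b', 'mine_c']
```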

Using Google Earth and the ESRI address locator, I used the address descriptions that were given to manually locate and update the addresses. I first located each mine by analyzing aerial images in Google Earth, then I used ESRI to select the address nearest to the mine's location. I then recorded the updated information in a new Excel spreadsheet, to be geocoded once all the addresses were normalized. It was necessary to use Google Earth because it has much more recent aerial images, which revealed the locations of mines that weren't even present in ESRI's aerial images.

Some of the mines only had PLSS information for the Facility Address (see the last row in Figure 1). The PLSS system breaks the state up into a grid of squares, organized by Township (34N in the above example), Range (11W in the above example), and Section (28 in the above example). In that example, the mine is located in the southwest quarter of the southwest quarter of Section 28 of Township 34 North, Range 11 West. Given this information, and by using a PLSS grid supplied from the department's online database, it was easy to find the mine's location on the aerial image. After this process, I normalized the data in the new Excel spreadsheet by adding City, Address, and Postal columns and filling in the new addresses as I located them with ESRI. Figure 2 shows the completed Excel table.

Figure 2. Normalized address data for the same 14 mines. The updated addresses are much cleaner and easier to locate than the previously supplied location information.
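To make the PLSS description discussed earlier more concrete, here is a minimal sketch that splits a description into its parts. The text format "SW SW S28 T34N R11W" is a shorthand I am assuming for illustration; the actual records may be written differently. The acreage figure relies on the nominal section size of 640 acres (one square mile), so a quarter of a quarter section works out to 40 acres.

```python
import re

# Hypothetical shorthand: zero or more quarter labels (smallest first),
# then section, township, and range, e.g. "SW SW S28 T34N R11W".
PLSS_PATTERN = re.compile(
    r"^(?P<quarters>(?:(?:NE|NW|SE|SW)\s+)*)"
    r"S(?P<section>\d{1,2})\s+"
    r"T(?P<township>\d+)(?P<tdir>[NS])\s+"
    r"R(?P<range>\d+)(?P<rdir>[EW])$"
)


def parse_plss(text):
    """Parse the assumed PLSS shorthand into its components."""
    match = PLSS_PATTERN.match(text.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized PLSS description: {text!r}")
    section = int(match.group("section"))
    if not 1 <= section <= 36:
        raise ValueError("A township contains sections 1 through 36")
    quarters = match.group("quarters").split()
    return {
        "quarters": quarters,
        "section": section,
        "township": (int(match.group("township")), match.group("tdir")),
        "range": (int(match.group("range")), match.group("rdir")),
        # Each quarter label divides the nominal 640-acre section by four.
        "acres": 640 / 4 ** len(quarters),
    }


result = parse_plss("SW SW S28 T34N R11W")
print(result["township"], result["range"], result["section"], result["acres"])
# (34, 'N') (11, 'W') 28 40.0
```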
After updating the addresses, the software was able to geocode the mines with no problems. Our entire class then shared our geocoded shapefiles so that we could analyze our results and find potential errors. I first merged all the shapefiles that were not created by me, in order to access my classmates' results. From this new shapefile of my classmates' mines, I selected each of the mines that I had also geocoded, creating another new shapefile. With this shapefile and the shapefile of my geocoded mines, I used the Point Distance tool in ArcMap to create a table of the distance from each of my mines to each of the mines that my classmates mapped. Figure 3 shows a portion of this table. I then used the data from this table to check for errors between my classmates and me.

Figure 3. The results of the Point Distance tool. There were 224 rows (16 mines for each of my 14 mines). The table is unnecessarily redundant.
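The all-pairs table described above can be imitated outside ArcMap. The sketch below is a stand-in, not the tool itself: it uses the haversine great-circle formula on latitude and longitude, whereas Point Distance works in the coordinate system of the data. The column names follow the table above (INPUT_FID, NEAR_FID, DISTANCE), and the coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def point_distance_table(input_points, near_points):
    """Build one row per (input point, near point) pair."""
    rows = []
    for i, (lat1, lon1) in enumerate(input_points):
        for j, (lat2, lon2) in enumerate(near_points):
            rows.append({
                "INPUT_FID": i,
                "NEAR_FID": j,
                "DISTANCE": haversine_m(lat1, lon1, lat2, lon2),
            })
    return rows


# Hypothetical coordinates: 2 of my mines against 3 classmate mines.
my_mines = [(44.80, -91.50), (45.10, -91.20)]
class_mines = [(44.80, -91.50), (45.10, -91.25), (44.50, -90.90)]

table = point_distance_table(my_mines, class_mines)
print(len(table))  # 6 rows: every input point paired with every near point
```

With 14 input points and 16 near points, the same function would produce the 14 x 16 = 224 rows seen in Figure 3, which is why most of the table is not useful on its own.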
Because most of the data in this table was useless, I once again had to organize it in Excel. First I sorted the INPUT and Distance columns from lowest to highest, so that the rows stayed grouped by input mine and, within each group, the closest mines were at the top. It was then easy to pick which of my classmates' mines were closest to my mines. I created a new Excel table with this information (see Figure 4), adding a Kilometers column for easier reading.

Figure 4. Table of my mines (INPUT_FID) paired with the nearest of my classmates' mines (NEAR_FID), with the distance between the two in meters and kilometers.
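The sort-and-pick step above amounts to keeping the smallest distance for each input mine. Here is a minimal sketch of that reduction, assuming rows shaped like the Point Distance output (INPUT_FID, NEAR_FID, DISTANCE in meters). The KILOMETERS field name and the sample distances are made up for illustration.

```python
def nearest_per_input(rows):
    """Keep the closest NEAR_FID for each INPUT_FID and add kilometers."""
    best = {}
    for row in rows:
        fid = row["INPUT_FID"]
        # Replace the stored row only when this one is strictly closer.
        if fid not in best or row["DISTANCE"] < best[fid]["DISTANCE"]:
            best[fid] = row
    return [
        {**best[fid], "KILOMETERS": round(best[fid]["DISTANCE"] / 1000, 3)}
        for fid in sorted(best)
    ]


# Hypothetical distances in meters.
rows = [
    {"INPUT_FID": 0, "NEAR_FID": 0, "DISTANCE": 5200.0},
    {"INPUT_FID": 0, "NEAR_FID": 1, "DISTANCE": 12.0},
    {"INPUT_FID": 1, "NEAR_FID": 0, "DISTANCE": 3400.0},
    {"INPUT_FID": 1, "NEAR_FID": 1, "DISTANCE": 43000.0},
]

for pair in nearest_per_input(rows):
    print(pair)
# {'INPUT_FID': 0, 'NEAR_FID': 1, 'DISTANCE': 12.0, 'KILOMETERS': 0.012}
# {'INPUT_FID': 1, 'NEAR_FID': 0, 'DISTANCE': 3400.0, 'KILOMETERS': 3.4}
```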
Results
Overall, my classmates and I were fairly consistent with the mines that we mapped. Figure 5 shows the completed map of my mines compared to those of my classmates. This map and Figure 4 above both support this claim. The nearest two mines were only 8 meters apart, while the farthest were 19 km apart. That outlier is easy to see on the map below. It could have happened either because I identified an incorrect mine or because my classmates failed to identify the correct one.

Figure 5. A map of our assigned sand mines. 
Discussion
After all this work, it is easy to imagine the sources of error between my mines and those of my classmates. It is clear that each of us was fairly precise in locating our mines, as there appear to be clear pairs for each mine. However, the accuracy for these mines varies depending on the operator. Any error in this activity is most likely an operational error, that is, an error created by inevitable variation or mistakes on the part of the operator of the software or hardware. I would say that our errors are due to inevitable variations in feature classification, image analysis, and attribute data input. There will always be errors in image analysis; what one person may identify as a small mining operation, another person may identify as a large farm. When it comes to aerial imagery, it isn't always clear what an object is, and sometimes that causes errors. During the geocoding process, errors due to variation in feature classification and attribute data input are also unavoidable. Sometimes there are several addresses to choose from, and it is up to the analyst to make a decision; not everyone will make the same decision. Errors like this cannot be prevented entirely; it is up to the analyst to be aware of their presence and either fix them or take them into account when using the data.

Conclusion
It was no easy hike getting from that first sloppy address table to a complete map of geocoded mines. That said, I feel like my classmates and I did a good job with the mines that we were assigned. The map shows that each mine has a clear partner (they aren't just scattered about), and aside from the 19 km outlier, each mine is within 6 km of the other mine in its pair. I was surprised to see such precise results after considering the unavoidable errors.

