Telling apart good and bad algorithms for geographic text visualization

This project was born of my interest in cartographic visualization of text data, including geotagged social media, records of consumer preferences, political opinions, medical symptoms, etc. To make these records “mappable”, they are typically filtered, cleaned, and summarized by an algorithm. Naturally, one would want to use an algorithm that does a good job. In computer science, such “goodness” is often tested against a standardized challenge with a known (and often human-made) answer, but in cartography such standardized tasks are rare. The goal of this project was to address this exact issue — to build a “ground truth” dataset against which automated tools for data visualization can be tested.

The first step is to define a task — that is, a data analysis challenge — to present to both human analysts and the algorithm. The task I picked for this project, shown in the figure below, is to identify regions on a map that stand out in terms of their mix of differently-colored dots. This task is inherently geographic without being too complex, and it translates well to real-life questions such as “do regions differ in terms of voter concerns?” or “do the reported symptoms vary by region?”

To use a realistic dataset, yet avoid the many potential sources of visual bias, I’ve generated a few hundred of these maps by sampling geolocated surname data from the Ohio electoral rolls. As you can see in the figure below, the resulting maps (with dot colors corresponding to distinct family names, e.g. “Jones”) look quite different from each other, which is great from the standpoint of randomized counterbalancing (i.e. avoiding visual bias)! I then put together an Amazon Mechanical Turk experiment with 150 participants, asking each person to look at six of these maps and to mark the regions that stand out in terms of their color composition.
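For concreteness, here is a minimal sketch of that map-generation step, assuming the voter file has already been reduced to a table with surname, latitude, and longitude columns; the file name, column names, and sample sizes below are illustrative, not the ones actually used.

```python
# A minimal sketch of the map-generation step. The file name, column names
# (surname, lat, lon), and sample sizes are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

voters = pd.read_csv("ohio_voters.csv")  # assumed columns: surname, lat, lon

# Pick a handful of common surnames and sample a fixed number of records of
# each, so that every generated map has a comparable number of dots.
names = voters["surname"].value_counts().head(4).index
dots = (voters[voters["surname"].isin(names)]
        .groupby("surname")
        .sample(n=500, random_state=42))

fig, ax = plt.subplots(figsize=(6, 6))
for name, group in dots.groupby("surname"):
    ax.scatter(group["lon"], group["lat"], s=4, label=name)
ax.set_aspect("equal")
ax.legend(title="Surname")
plt.show()
```

Repeating this with different surname subsets and random seeds yields the few hundred visually distinct maps.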

The answers from the AMT participants, once organized and tallied up, constitute the sought-after “ground truth” dataset — a record of how a person, on average, would solve this particular analytical challenge. We can now compare these results to those obtained by various algorithms, in the hope of finding one that does as well as a human analyst, but at a fraction of the cost. Since I had all of these components in place, I thought I’d push the project just a little further by actually testing one promising algorithm against my ground-truth dataset (I used a spatialized version of the venerable tf-idf statistic).
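The tallying itself is simple aggregation; a sketch of it might look like the following, assuming the raw responses were stored one row per participant and grid cell (the column names and the 0.5 cut-off are illustrative assumptions, not the project’s actual choices).

```python
# A sketch of tallying the AMT answers into a ground-truth table. The input
# layout (one row per participant-and-cell, with a boolean "marked" column)
# and the 0.5 cut-off are illustrative assumptions.
import pandas as pd

responses = pd.read_csv("amt_responses.csv")  # assumed: worker_id, map_id, cell_id, marked

ground_truth = (responses
                .groupby(["map_id", "cell_id"])["marked"]
                .mean()                        # fraction of analysts who marked the cell
                .rename("mark_rate")
                .reset_index())

# Call a cell "standing out" if at least half of the analysts who saw it marked it.
ground_truth["stands_out"] = ground_truth["mark_rate"] >= 0.5
```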

To make things easy for my algorithm, each map was cut into small square regions, and the algorithm then scored each region according to how much it “stands out” in terms of its surname (i.e. color) composition. These scores, for the regions of just one map, are shown below as a jittered scatter plot, with higher scores given to the more unusual regions. In the same chart, the regions that the human analysts thought “stand out” are highlighted in orange and shifted upwards a bit, in the hope that some clear pattern would emerge.
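To make the idea concrete, here is one way a tf-idf-style score could be computed on a square grid. This is a sketch of the general approach, not necessarily the exact spatialized variant used in the project; the `dots` table is assumed to have surname, lat, and lon columns, as in the earlier sketch.

```python
# One way to compute a tf-idf-style "stands out" score per grid cell. This is
# a sketch of the general idea, not necessarily the exact spatialized variant
# used in the project. `dots` is the per-map table of colored points, assumed
# to have surname, lat, and lon columns.
import numpy as np
import pandas as pd

def grid_scores(dots, n_cells=10):
    # Assign every dot to a square cell of an n_cells x n_cells grid.
    dots = dots.copy()
    dots["cx"] = pd.cut(dots["lon"], n_cells, labels=False)
    dots["cy"] = pd.cut(dots["lat"], n_cells, labels=False)

    # counts[cell, surname] = number of dots of that surname in the cell
    counts = (dots.groupby(["cx", "cy", "surname"])
              .size()
              .unstack(fill_value=0))

    tf = counts.div(counts.sum(axis=1), axis=0)            # within-cell term frequency
    idf = np.log(len(counts) / (counts > 0).sum(axis=0))   # rarer surnames weigh more

    # Score each cell by its most over-represented surname.
    return (tf * idf).max(axis=1).rename("score")
```

Here each grid cell plays the role of a “document” and each surname the role of a “term”, so a cell scores high when it is dominated by a surname that is rare elsewhere on the map.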

Quite clearly, there is plenty of agreement between the humans and the algorithm — the higher the algorithmic score, the more likely the region is to be marked as interesting by an analyst! I’ve run a more elaborate statistical analysis (a logistic regression, sketched below) on the entire dataset to confirm that this is not a fluke, and the results hold — which means I can automate some of my data wrangling needs with only a modest chance of embarrassment. Of course, the next logical step is to test not one but a dozen promising algorithms against this task, but that’s for another project.
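For reference, that confirmatory check could look roughly like this, assuming the per-cell scores and the human labels have been assembled into one table with map_id, cell_id, score, and stands_out columns (the names follow the earlier sketches, and the actual analysis may have included additional terms).

```python
# A sketch of the confirmatory check: a logistic regression of the human
# "stands out" labels on the algorithmic score, pooled over every map. The
# table layout (map_id, cell_id, score, stands_out) is an assumption, and the
# original analysis may have included additional terms (e.g. per-map effects).
import statsmodels.formula.api as smf

data = ground_truth.merge(scores, on=["map_id", "cell_id"])
data["stands_out"] = data["stands_out"].astype(int)

model = smf.logit("stands_out ~ score", data=data).fit()
print(model.summary())  # a positive, significant coefficient on `score`
                        # indicates human/algorithm agreement
```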