Mapping Word Frequencies on Twitter using Python

Python Markdown for Jack Grieve and David Jurgens. 2019. Mapping word frequencies on Twitter using R and Python. Workshop presented at NWAV 48, University of Oregon, 2019-10-10.

David Jurgens, University of Michigan, jurgens@umich.edu

2019-10-03

1 Summary

This markdown presents the Python code for our workshop at NWAV 48 on Mapping Word Frequencies on Twitter using R and Python. The original R markdown was prepared by Jack Grieve; this parallel Python version was prepared by David Jurgens. The point of the workshop is to show how to map word frequency in R and Python and to explain why a linguist might want to do this. We’ll be working with a dataset based on a corpus of American geolocated tweets. For more information on the conference and the workshop see https://nwav48.uoregon.edu/workshops/.

2 Packages

This is the full set of Python packages used in this markdown. For this workshop we’ll be working mostly with the geopandas and PySAL packages, which are typically the easiest way in Python to work with spatial data, plot it, and perform spatial statistics.

In [1]:
# This first line makes sure that our plots will show up 
# in the jupyter notebook if you want to run this code
%matplotlib inline
import geopandas as gpd
import pandas as pd
import pysal
from collections import defaultdict
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.pylab import cm
from pysal.explore.esda.getisord import G_Local
from pysal.explore.esda.moran import Moran
from pysal.lib.weights import KNN
import tqdm
import random
import mapclassify

# This lets us do slightly fancier LaTeX-based captions in plots
plt.rc('text', usetex=True)
/Users/jurgens/anaconda3/lib/python3.7/site-packages/pysal/model/spvcm/abstracts.py:10: UserWarning: The `dill` module is required to use the sqlite backend fully.
  from .sqlite import head_to_sql, start_sql

3 US Counties Map Template

3.1 US County Map

First, we read in the US counties map from a shapefile using geopandas. A shapefile is a kind of file that describes the geometry of objects (e.g., their latitude and longitude). You can download the counties shapefile here.

In [2]:
county_map = gpd.read_file('UScounties.shp')

A geopandas dataframe is just like a regular pandas dataframe (or an R dataframe, if you know those), so any regular pandas operation also works here. Let's take a look at what's in this dataframe using head(), which shows the first five rows of data. We can see that we have regular columns plus a special geometry column that describes each county's shape; geopandas knows how to interpret this column in order to do spatial operations, which will make our life easier later.

In [3]:
county_map.head()
Out[3]:
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS geometry
0 Lake of the Woods Minnesota 27 077 27077 POLYGON ((-95.34283 48.54668, -95.34105 48.715...
1 Ferry Washington 53 019 53019 POLYGON ((-118.85163 47.94956, -118.84846 48.4...
2 Stevens Washington 53 065 53065 POLYGON ((-117.43883 48.04412, -117.54219 48.0...
3 Okanogan Washington 53 047 53047 POLYGON ((-118.97209 47.93915, -118.97406 47.9...
4 Pend Oreille Washington 53 051 53051 POLYGON ((-117.43858 48.99992, -117.03205 48.9...

Let's plot the county map.

In [4]:
county_map.plot()
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a279b34a8>

For visual convenience, we'll filter out Alaska and Hawaii so we're left with just the continental US (apologies to those of you from these states!).

In [5]:
# Drop Alaska and Hawaii to just be the continental US
county_map = county_map[(county_map['STATE_NAME'] != 'Alaska') 
                          & (county_map['STATE_NAME'] != 'Hawaii')]
In [6]:
county_map.plot()
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a27a56358>

In addition, we can re-project the map. In this case we’ll use a Lambert equal-area projection, which makes our map look more professional. It’s nice to use a different projection system so we get a less flat map than we’re used to seeing for the US; the Albers projection is another standard choice. We do this by defining the projection in the PROJ.4 system (don’t worry about the details...I don’t entirely understand them myself---but thankfully geopandas does!) and then transforming the spatial polygon map. In practice, we can use one of many standard projections or define our own to best highlight a particular geographic region.

I've included one below that might be of use. We'll use the same setup as before, but before calling plot(), we first call to_crs() on the dataframe itself, which changes the coordinate reference system to the new projection (essentially changing where a particular lat/lon value gets placed on the map).

In [7]:
# This is our fancy projection
projection = "+proj=laea +lat_0=30 +lon_0=-95"
county_map = county_map.to_crs(projection)
In [8]:
county_map.plot()
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a27d1b630>
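As an aside, if you'd prefer the Albers projection mentioned above, a PROJ.4 string along these lines should work (the standard parallels here are the usual contiguous-US settings, included purely as an illustration):

# Albers equal-area conic with typical contiguous-US parameters (illustrative)
albers = "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"
# county_map.to_crs(albers).plot()   # uncomment to compare with the Lambert map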

We can also get rid of that pesky bounding box by turning the axis off of the plot, which is the last prettification step we'll do for now. To do this, we'll (1) create a figure object with Matplotlib and turn its axis markers off using ax.axis('off') and (2) tell geopandas to plot the map on that figure.

In [9]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_map.plot(ax=ax)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a27bf2780>

3.3 Linguistic Data

Now, let’s read in the dataset, which can be downloaded here. The dataset contains each county's FIPS code, a five-digit numeric code that uniquely identifies it. Pandas can sometimes get confused and read this in as an "int" type, which drops leading zeros (e.g., 06037 becomes 6037 and no longer matches the map's FIPS strings) and causes a few subtle bugs later, so we're going to manually specify that the column should be treated as a string using dtype={'FIPS': str} when loading the data.

In [10]:
word_freq_df = pd.read_csv('US_CNTY_3076_TWIT_10K_WORD.txt', sep=' ', dtype={'FIPS': str})

This dataset consists of the relative frequency per million words of the top 10,000 most common words (defined as strings of case-insensitive alphabetic characters plus hyphens), measured across the 3,076 counties in the contiguous US, based on a multi-billion-word corpus of American geolocated Twitter data collected between October 2013 and November 2014 (see Huang et al. 2016, Grieve et al. 2018).

For example, consider the values for the last few rows of the dataset. The county name is in column 1 and the county's FIPS code is in column 2.

In [11]:
word_freq_df.tail()
Out[11]:
county_ID FIPS a a. aa aaliyah aaron ab abandoned abbey ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
3071 wyoming,sweetwater 56037 17786.28 0.0 6.44 3.22 12.88 6.44 3.22 0.00 ... 6.44 6.44 0.0 28.97 12.88 35.41 6.44 12.88 16.09 12.88
3072 wyoming,teton 56039 18783.86 5.2 10.40 0.00 20.80 2.60 15.60 23.39 ... 7.80 10.40 7.8 18.20 5.20 54.59 0.00 7.80 13.00 10.40
3073 wyoming,uinta 56041 17720.52 0.0 0.00 0.00 14.66 14.66 0.00 58.63 ... 0.00 0.00 0.0 29.31 29.31 29.31 0.00 0.00 0.00 29.31
3074 wyoming,washakie 56043 19774.68 0.0 14.75 0.00 14.75 14.75 0.00 0.00 ... 0.00 0.00 0.0 29.49 0.00 29.49 0.00 0.00 44.24 0.00
3075 wyoming,weston 56045 20534.10 0.0 0.00 0.00 0.00 27.31 0.00 0.00 ... 0.00 0.00 0.0 0.00 0.00 54.61 0.00 0.00 0.00 0.00

5 rows × 10002 columns

We can also get a quick summary of the data with describe().

In [12]:
word_freq_df.describe()
Out[12]:
a a. aa aaliyah aaron ab abandoned abbey abby abc ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
count 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 ... 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000
mean 17686.561661 4.036619 8.119798 4.479535 31.770312 18.642633 5.292051 3.490634 14.144178 10.266057 ... 10.606531 4.899717 4.677120 15.390471 8.407350 39.698365 3.514844 15.925293 4.380306 91.983557
std 1806.721774 50.617120 24.687379 10.843293 54.384890 43.162770 10.050835 8.150068 23.625402 19.273659 ... 33.785532 14.547293 9.880733 25.511622 13.369255 35.477047 5.215757 23.753035 10.352628 4898.997368
min 2196.160000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 16837.517500 0.000000 0.000000 0.000000 13.637500 3.700000 0.000000 0.000000 2.537500 3.760000 ... 1.407500 0.000000 0.000000 7.597500 0.000000 26.610000 0.000000 5.270000 0.000000 0.000000
50% 17721.195000 1.460000 3.460000 1.130000 24.195000 9.020000 3.800000 1.225000 9.460000 8.410000 ... 5.355000 2.405000 2.710000 13.725000 6.875000 36.950000 2.755000 11.350000 3.100000 1.620000
75% 18596.060000 3.512500 6.772500 5.630000 37.350000 16.712500 6.060000 4.275000 17.610000 13.060000 ... 8.672500 5.722500 5.172500 19.170000 10.555000 46.462500 4.762500 19.862500 5.190000 4.782500
max 55189.400000 2770.080000 446.890000 241.440000 1936.620000 875.660000 297.420000 169.070000 547.950000 898.110000 ... 683.880000 607.560000 220.390000 1203.370000 316.290000 731.140000 110.380000 575.620000 329.380000 271710.000000

8 rows × 10000 columns

Before moving on, it’s worth quickly explaining how this dataset was generated. First, each tweet in the corpus was grouped by the county in which it occurred, based on the longitude and latitude in its metadata. To normalise for variation in sample size, which reflects variation in county population, the relative frequency of each word in each county was computed by dividing its frequency by the total number of word tokens in that county and then multiplying by 1 million. It is crucial to normalise; otherwise you just get population maps.
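As a rough sketch of that normalisation step (the counties and raw token counts below are invented purely for illustration; the real corpus was processed the same way at a much larger scale):

# Illustrative raw counts only -- not values from the actual Twitter corpus
raw_counts = {'minnesota,lake of the woods': {'i': 1240, 'the': 980},
              'washington,ferry': {'i': 310, 'the': 270}}
total_tokens = {'minnesota,lake of the woods': 59000, 'washington,ferry': 17500}

for county, counts in raw_counts.items():
    # relative frequency per million words
    per_million = {w: c / total_tokens[county] * 1_000_000 for w, c in counts.items()}
    print(county, per_million)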

3.4 Joining the Data

Now we can join the linguistic dataset to the geopandas map object based on the two sets of county IDs, which in the US are known as FIPS codes. For those of you who know databases or SQL, pandas (and therefore geopandas) lets us easily do inner joins on the data frames.

In [13]:
county_word_freq_df = pd.merge(county_map, word_freq_df, on='FIPS', how='inner')

After joining the data, we'll reset the coordinate reference system (CRS) to the one we want so everything keeps plotting the way we expect.

In [14]:
county_word_freq_df = county_word_freq_df.to_crs(projection)

We can see that the county_word_freq_df data frame now contains roughly 10,000 more columns than the county_map data frame: one per word.

In [15]:
county_word_freq_df.shape
Out[15]:
(3075, 10007)

As a sanity check, let's take a look at the first few rows to confirm that it now has the relative frequencies for each word (as separate columns).

In [16]:
county_word_freq_df.head()
Out[16]:
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS geometry county_ID a a. aa ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
0 Lake of the Woods Minnesota 27 077 27077 POLYGON ((-25655.892 2049834.670, -25444.188 2... minnesota,lake of the woods 21027.84 0.00 0.00 ... 0.00 0.0 0.0 33.89 0.00 16.94 33.89 16.94 0.0 16.94
1 Ferry Washington 53 019 53019 POLYGON ((-1775417.832 2196570.086, -1758130.5... washington,ferry 12828.09 0.00 0.00 ... 337.58 0.0 0.0 0.00 0.00 84.39 0.00 0.00 0.0 0.00
2 Stevens Washington 53 065 53065 POLYGON ((-1670527.465 2182361.609, -1678023.8... washington,stevens 16853.05 0.00 0.00 ... 51.16 0.0 0.0 38.37 123.64 21.32 0.00 4.26 0.0 0.00
3 Okanogan Washington 53 047 53047 POLYGON ((-1784425.905 2197600.293, -1783913.3... washington,okanogan 18685.00 17.85 4.46 ... 4.46 0.0 0.0 17.85 4.46 22.31 0.00 13.38 0.0 0.00
4 Pend Oreille Washington 53 051 53051 POLYGON ((-1641330.622 2285156.266, -1612439.6... washington,pend oreille 15708.26 0.00 21.40 ... 64.20 0.0 0.0 21.40 0.00 21.40 0.00 0.00 0.0 0.00

5 rows × 10007 columns

4 Mapping Word Frequencies

Now we can map the relative frequencies of any of these 10,000 words across the 3,076 counties in the contiguous US.

4.1 Simple Maps

Let’s first make a quick map for the relative frequency of ‘i’, the most common word in the corpus.

To make the map, we'll tell geopandas which column to use to colour the map. This makes a choropleth map, where the colour of each shape is based on the value of an associated variable. Here, we just add the argument column='i' to our plot() command on the county_word_freq_df dataframe.

In [17]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a281611d0>

Not much there, but that’s because the distribution is highly skewed. Let's take a look at the frequency distribution. Since every geopandas dataframe is also a pandas dataframe, we can use its built-in plotting features to show histograms (among other things). Here, let's show a histogram of the per-county frequency of "i" with 40 bins.

In [18]:
county_word_freq_df.i.hist(bins=40)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a28724940>

There's clearly an outlier here:

In [19]:
county_word_freq_df['i'].describe()
Out[19]:
count      3075.000000
mean      44810.952039
std        6332.354688
min        3573.250000
25%       42300.065000
50%       45716.950000
75%       48359.530000
max      192438.870000
Name: i, dtype: float64

By default, colour breaks are assigned by dividing the range between the minimum and maximum values into equally spaced intervals, so most values end up bunched into a few colour bands. Geopandas includes a scheme argument for plot() that tells it how to decide which values get mapped to which colours. We can simulate the default behaviour more explicitly by adding the scheme='equal_interval' argument to plot(), which uses equally sized bins with discrete colours.

In [20]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True,
                         scheme='equal_interval')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2b210a58>

The basic alternative (there are other, more complex approaches) is to define these colour breaks across unequal intervals so that each colour band has roughly an equal number of locations, by setting scheme='quantiles'.

In [21]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True, scheme='quantiles')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2b117518>

This approach clearly shows the patterns better, at least in this case, because of the skewed distribution, but do note the differences in the legends.

Here are a couple more maps of common words in both styles, plus maps using Fisher-Jenks breaks, a more complex approach that falls somewhere between these two extremes.

In [22]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('and', fontsize=28)
county_word_freq_df.plot(ax=ax, column='and', scheme='equal_interval', legend=True)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2afea908>
In [23]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('and', fontsize=28)
county_word_freq_df.plot(ax=ax, column='and', legend=True,
                                            scheme='quantiles')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2b1e1908>
In [24]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('and', fontsize=28)
county_word_freq_df.plot(ax=ax, column='and', legend=True,
                                            scheme='fisher_jenks')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2e8f9ac8>
In [25]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('but', fontsize=28)
county_word_freq_df.plot(ax=ax, column='but', scheme='equal_interval', legend=True)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a6c2eb8>
In [26]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('but', fontsize=28)
county_word_freq_df.plot(ax=ax, column='but', legend=True,
                                            scheme='quantiles')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a63ab38>
In [27]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('but', fontsize=28)
county_word_freq_df.plot(ax=ax, column='but', legend=True,
                                            scheme='fisher_jenks')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a790dd8>

The clear patterns in the usage of these function words, for example the complementary patterns for these two coordinators, are surprising. We will argue in the second half of this workshop that this is important information for our understanding of language variation and change that cannot be accessed through the analysis of alternations, as is standard in variationist sociolinguistics. But first we’ll look at map formatting and how to do basic spatial analysis.

Note, however, that plotting this way using quantile breaks doesn’t always work, in particular when we try to map lower-frequency words like ‘battlefield’, which have lots of 0 values.

In [28]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('battlefield', fontsize=28)
county_word_freq_df.plot(ax=ax, column='battlefield', scheme='equal_interval',
                         legend=True)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2aa332b0>
In [29]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('battlefield', fontsize=28)
county_word_freq_df.plot(ax=ax, column='battlefield', scheme='quantiles',
                         legend=True)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a29a64be0>
In [30]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('battlefield', fontsize=28)
county_word_freq_df.plot(ax=ax, column='battlefield', legend=True,
                                            scheme='fisher_jenks')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a29b52860>

This is all a bit problematic, since we really want to map all words consistently and informatively. The simplest solution in terms of code is probably to use Fisher-Jenks breaks (the complication there is explaining how they work), but we can also use quantile-style breaks for all cases if we set the class intervals ourselves, rather than letting the geopandas plotting function do it for us, so let’s look at how to do that next.
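To see exactly why quantile breaks misbehave for sparse words, we can also ask mapclassify for the breaks directly and compare classifiers. A quick sketch (note that FisherJenks can be slow on 3,000+ counties; mapclassify also offers FisherJenksSampled as a faster approximation):

# With mostly-zero data, several quantile cut-offs collapse onto the same value
print(mapclassify.Quantiles(county_word_freq_df.battlefield, k=5))

# Fisher-Jenks instead picks breaks that minimise within-class variance
print(mapclassify.FisherJenks(county_word_freq_df.battlefield, k=5))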

4.2 Color Palettes

First, we need to take control of the colour palette. For example, we can specify a simple white through to purple palette.

In [31]:
# Note: typically the color palettes in Matplotlib are called color maps (cmap for short)
cmap = plt.get_cmap('Purples')

And then re-plot the map with that palette by specifying cmap=cmap.

In [32]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('i', fontsize=28)
county_word_freq_df.plot(ax=ax, column='i', scheme='fisher_jenks', legend=True, cmap=cmap)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2aa70588>

We can also use a blue to red palette that runs through white.

In [33]:
cmap = mpl.colors.LinearSegmentedColormap.from_list(
        'Custom cmap', ["steelblue", "white", "tomato"])
In [34]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('i', fontsize=28)
county_word_freq_df.plot(ax=ax, column='i', scheme='fisher_jenks', legend=True, cmap=cmap)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a51b550>

Or we can do something more complex – for better or worse.

In [35]:
# Create the color scale; note we need an even number of colors to avoid
# having white as a middle color that's not exactly "middle" in value
cmap = mpl.colors.LinearSegmentedColormap.from_list(
        'Custom cmap', [(.05, .90, .65),
                        (.85, .45, .05),
                        (.10, .20, .70),
                        (.65, .05, .75)])
In [36]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('i', fontsize=28)
county_word_freq_df.plot(ax=ax, column='i', scheme='quantiles', legend=True, cmap=cmap)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a31fd68>

4.3 Class Intervals

Now we can specify how the observed values of a variable are divided into class intervals and then link those intervals to the bands of our colour palette.

To do this, we first specify how many intervals and what style of breaks we want, and then find those class intervals; in this case, 5 quantile breaks for the word ‘i’. Under the hood, geopandas uses the mapclassify library to divide up the data according to different strategies. We can invoke each strategy directly to see where it sets its cut-off points.

In [37]:
mapclassify.Quantiles(county_word_freq_df.i, k=5)
Out[37]:
                    Quantiles                    
 
  Lower                Upper                Count
=================================================
             x[i] <=  41366.236               615
 41366.236 < x[i] <=  44469.782               615
 44469.782 < x[i] <=  46804.648               615
 46804.648 < x[i] <=  48904.748               615
 48904.748 < x[i] <= 192438.870               615

Second, we define a colour palette with as many colours as we need. Here we build a five-band purple palette explicitly from hex codes, similar to the built-in 'Purples' palette we used earlier.

In [87]:
cmap = mpl.colors.LinearSegmentedColormap.from_list(
    'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8", "#5D478B"], 5)

# This defines how we want separate our regions by color
bounds = list(mapclassify.Quantiles(county_word_freq_df.i, k=5).bins)

# This controls how the colors are scaled using the above bounds
norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)

Now we can plot using our custom colour bounds. Note that since we defined our own colour boundaries, we don't get a colour bar or legend automatically; we'll fix this in Section 4.5.

In [88]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap=cmap, scheme='quantiles', 
                         ax=ax, column='i', linewidth=0.3, edgecolor='0.8')
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2d61a3c8>

4.4 Map Formatting

Let’s now look at various ways of tweaking the appearance of our maps, using the word ‘i’ as an example.

First, we set up the colour palette again (the same five-colour purple palette as before); the quantile class intervals themselves are recomputed for whichever word we plot via the scheme='quantiles' argument. For now we’ll just work with this map of ‘i’.

In [39]:
cmap = mpl.colors.LinearSegmentedColormap.from_list(
    'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8", "#5D478B"], 5)

Next, we can change the width of the county borders by setting linewidth=0.3, which is especially useful because the counties are so small in the East. We'll also manually specify the colour of the borders using edgecolor='black'.

In [40]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap=cmap, scheme='quantiles', 
                         ax=ax, column='i', linewidth=0.3, edgecolor='black')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2b21ca90>

We can also draw the county borders in a lighter colour. In the west, where there are many counties without any occurrences of a given word, light lines help make sense of the space. Here we use the edgecolor='0.8' argument to draw them as light gray. In matplotlib you can specify colours in many ways, and we've used three in this tutorial (names, hex codes, and grayscale fractions) to give you a few examples.

In [41]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap='Purples', scheme='quantiles', 
                         ax=ax, column='i', linewidth=0.3, edgecolor='0.8')
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2b2c8ac8>
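As an aside, those three ways of writing a colour are interchangeable in matplotlib; a quick sketch using to_rgba, which converts any valid colour specification to an RGBA tuple:

print(mpl.colors.to_rgba('black'))    # a named colour
print(mpl.colors.to_rgba('#5D478B'))  # a hex code (the darkest purple in our palette)
print(mpl.colors.to_rgba('0.8'))      # a grayscale fraction, i.e. light gray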

We might also want to superimpose state borders in a different colour and/or width. To do this, we first have to load a state-level geopandas dataframe from a shapefile. You can download the states shapefile here or get it from the included data.

We can plot both states and counties together by creating a single figure object with Python's main plotting library, matplotlib, and telling geopandas to plot the two layers on it. Here, we'll use different colours and line widths for the edges to highlight the different boundaries. There are lots of options to the plot command that you can use to fine-tune your figures.

Let's plot the states map file just as a sanity check.

In [42]:
us_states = gpd.read_file('states.shp')
us_states = us_states[(us_states['STATE_NAME'] != 'Alaska') 
            & (us_states['STATE_NAME'] != 'Hawaii')]

projection = "+proj=laea +lat_0=30 +lon_0=-95"
us_states = us_states.to_crs(projection)
us_states.plot()
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2bbe0b00>

Now let's plot the states and counties together. We don't want the states' interiors to be drawn over the counties, so we'll use facecolor="none" to indicate that the states shouldn't be filled.

In [43]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap='Purples', scheme='quantiles', 
                         ax=ax, column='i', linewidth=0.3, edgecolor='0.8')

us_states.plot(facecolor="none", ax=ax, linewidth=0.4, edgecolor='0')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2a3a3710>

4.5 Legend

Additionally, let’s look at modifying the legend and title by drawing a colour bar directly on the figure, using the class intervals and colours we defined.

In [44]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')

cmap = mpl.colors.LinearSegmentedColormap.from_list(
    'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8",
                    "#5D478B"], 5)

county_word_freq_df.plot(cmap=cmap, scheme='quantiles', 
                         ax=ax, column='i', linewidth=0.3, edgecolor='0.8')

us_states.plot(facecolor="none", ax=ax, linewidth=0.4, edgecolor='0')

# The tick marks we want for our map.  We'll pull these directly from
# the mapclassify code.
bounds = list(mapclassify.Quantiles(county_word_freq_df.i, k=5).bins)

# This controls how the colors are scaled using the above bounds
norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)

# This sets where the colorbar goes and its size
cbax = fig.add_axes([0.15, 0.25, 0.25, 0.02])

cb2 = mpl.colorbar.ColorbarBase(cbax, cmap=cmap,
                                norm=norm, 
                                ticks=bounds,
                                orientation='horizontal')

cb2.set_label(r'\Huge{i}\\\large{per million words}')
cb2.ax.xaxis.set_label_position('top')
cb2.ax.xaxis.label.set_multialignment('center')
plt.setp(plt.getp(cb2.ax.axes, 'xticklabels'), fontsize='small')
Out[44]:
[None, None, None, None, None, None, None, None, None, None]

4.6 Mapping Function

Finally, since this is all a lot of code, and since we want to be able to map any word from the dataset easily and consistently, we’ll put it all into a single mapping function, so we just have to specify the word and the map title.

In [45]:
def plot_freq_map(word, title):
    '''Plots a US county choropleth of the word's relative frequency from
       county_word_freq_df, using quantile breaks, the purple palette,
       state borders, and a colour bar legend.
    '''
 
    fig, ax = plt.subplots(1, figsize=(11,8.5))
    ax.axis('off')


    cmap = mpl.colors.LinearSegmentedColormap.from_list(
        'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8",
                        "#5D478B"], 5)

    county_word_freq_df.plot(cmap=cmap, scheme='quantiles', 
                             ax=ax, column=word, linewidth=0.3, edgecolor='0.8')

    us_states.plot(facecolor="none", ax=ax, linewidth=0.4, edgecolor='0')

    # The tick marks we want for our map
    bounds = list(mapclassify.Quantiles(county_word_freq_df[word], k=5).bins)

    # This controls how the colors are scaled using the above bounds
    norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)

    # This sets where the colorbar goes and its size
    cbax = fig.add_axes([0.15, 0.25, 0.25, 0.02])

    cb2 = mpl.colorbar.ColorbarBase(cbax, cmap=cmap,
                                    norm=norm, 
                                    ticks=bounds,
                                    orientation='horizontal')

    cb2.set_label(r'\Huge{%s}\\\large{%s}' % (word, title))
    cb2.ax.xaxis.set_label_position('top')
    cb2.ax.xaxis.label.set_multialignment('center')
    plt.setp(plt.getp(cb2.ax.axes, 'xticklabels'), fontsize='small')    
 
plot_freq_map('i', 'per million words')
In [46]:
plot_freq_map('the', 'per million words')
In [47]:
plot_freq_map('and', 'per million words')
In [48]:
plot_freq_map('but', 'per million words')
In [49]:
plot_freq_map('oregon', 'per million words')
In [50]:
plot_freq_map('fuck', 'per million words')
In [51]:
plot_freq_map('sneakers', 'per million words')
/Users/jurgens/anaconda3/lib/python3.7/site-packages/mapclassify/classifiers.py:95: UserWarning: Warning: Not enough unique values in array to form k classes
  UserWarning)
/Users/jurgens/anaconda3/lib/python3.7/site-packages/mapclassify/classifiers.py:96: UserWarning: Warning: setting k to 4
  Warn('Warning: setting k to %d' % k_q, UserWarning)
/Users/jurgens/anaconda3/lib/python3.7/site-packages/mapclassify/classifiers.py:95: UserWarning: Warning: Not enough unique values in array to form k classes
  UserWarning)
/Users/jurgens/anaconda3/lib/python3.7/site-packages/mapclassify/classifiers.py:96: UserWarning: Warning: setting k to 4
  Warn('Warning: setting k to %d' % k_q, UserWarning)
In [52]:
plot_freq_map('battlefield', 'per million words')

5 Spatial Autocorrelation Analysis

Now let’s look at how to conduct spatial autocorrelation analyses. The basic purpose of these methods is to identify regional patterns in the map of a quantitative variable, like any of our word maps. There are two basic types of spatial autocorrelation analyses (global and local), which we’ll look at in turn, but first we need to learn how to define a spatial weights matrix, which is the foundation for both global and local measures, as well as for various other methods of spatial analysis.

5.1 Spatial Weights Matrix

A spatial weights matrix (SWM) is a big location-by-location matrix that specifies the spatial relationship between every pair of locations in a regional dataset. Basically, nearby locations are given a stronger weight than distant locations (often 1 for nearby locations and 0 for distant locations). These values are then used to weight the comparisons between locations in spatial autocorrelation analyses.

5.1.1 Finding Nearest Neighbours

There are various ways to build an SWM. We’ll use what’s known as a “k-nearest neighbour” SWM, which is a very common and simple approach, and essentially the default in PySAL. The k refers to how many nearest neighbours of each location you want to include.

As a sanity check, let's look at the centroids of the counties.

In [53]:
# We'll create a new Geopandas dataframe where each county gets mapped to its centroid
county_centroids_df = county_word_freq_df[['geometry']].copy()
county_centroids_df.geometry = county_word_freq_df.centroid
In [54]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_centroids_df.plot(ax=ax, marker='*', color='red', markersize=1)
county_word_freq_df.plot(ax=ax, facecolor="none", linewidth=0.4, edgecolor='0')
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2d51da58>

5.1.2 Building the Spatial Weights Matrix

Geopandas doesn't know how to build the nearest-neighbour data on its own, but our second library, PySAL, does! Even better, since a geopandas dataframe has a spatial geometry column describing where the counties are, we can use our dataframe as-is with PySAL to compute the nearest neighbours.

Here, we'll use the KNN class (k-nearest neighbours) and choose 25 neighbours using the k=25 argument. We'll explore what happens if we choose other values of k later.

Now we can identify the 25 nearest neighbours for each location and give each of these pairings a weight of 1 (or 1/n), and all other pairings a weight of 0. Note that this SWM is non-symmetric: even if location A is one of the nearest neighbours of location B, location B is not necessarily one of the nearest neighbours of location A.

In [55]:
listws25 = KNN.from_dataframe(county_word_freq_df, k=25)
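If you want to peek inside the weights object, something like the sketch below prints the first few neighbours (as row positions in the dataframe) and their binary weights for one county, and uses asymmetry() to confirm that the k-nearest-neighbour relation is not symmetric. Exactly how the counties are keyed can vary by libpysal version, so treat this as an illustrative check:

# Neighbours and weights of the first county in the dataframe
print(listws25.neighbors[0][:5])
print(listws25.weights[0][:5])

# Pairs where A lists B as a neighbour but B does not list A
print(len(listws25.asymmetry()), "asymmetric pairs")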

5.1.3 Varying Spatial Weights

The issue we skipped is how to select the number of nearest neighbours. There is no simple rule of thumb, which is unfortunate to some extent, since it places a greater burden on us to make a principled choice, but it also gives us more flexibility, kind of like adjusting the resolution on a microscope. And if the value is set at a reasonable level (say, with the number of nearest neighbours around 10% of the total locations or less; see Grieve 2017), it shouldn’t make too much difference. Anyway, this is all pretty abstract, so let’s generate a couple more spatial weights matrices (for 5 and 50 nearest neighbours) and then see how that changes our analyses later on.

In [56]:
listws05 = KNN.from_dataframe(county_word_freq_df, k=5)
listws50 = KNN.from_dataframe(county_word_freq_df, k=50)

5.2 Global Spatial Autocorrelation

Now that we have our SWMs, we can conduct global spatial autocorrelation analyses, specifically global Moran’s I. Our goal is to test whether the values of a variable show an overall regional pattern, as opposed to being randomly distributed across space. Basically, Moran’s I compares the value of the variable around each location to see whether nearby locations (as defined by the SWM) tend to have similar values (clustering), different values (dispersal), or unrelated values (randomness). Moran’s I generally ranges from -1 to +1, with clustering being associated with positive values, dispersal with negative values, and randomness with values around 0. In dialectology, we don’t really find much dispersal (e.g. a checkerboard pattern), so in general we’re just interested in measuring the strength of clustering.

In [57]:
mi = Moran(county_word_freq_df.i, listws25)
print("Moran's I:", mi.I) 
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm) 
Moran's I: 0.2253822031683297
Moran's p-value (under normality assumption): 0.0000000000000000000
In [58]:
mi = Moran(county_word_freq_df.the, listws25)
print("Moran's I:", mi.I) 
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm) 
Moran's I: 0.26306523916603974
Moran's p-value (under normality assumption): 0.0000000000000000000
In [59]:
# NOTE: 'and' is a python keyword so we have to use the other way of looking up values
mi = Moran(county_word_freq_df['and'], listws25)
print("Moran's I:", mi.I) 
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm) 
Moran's I: 0.30729811359522646
Moran's p-value (under normality assumption): 0.0000000000000000000
In [60]:
mi = Moran(county_word_freq_df.oregon, listws25)
print("Moran's I:", mi.I) 
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm) 
Moran's I: 0.32614344285868935
Moran's p-value (under normality assumption): 0.0000000000000000000
In [61]:
mi = Moran(county_word_freq_df.now, listws25)
print("Moran's I:", mi.I) 
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm) 
Moran's I: 0.06443194223076998
Moran's p-value (under normality assumption): 0.0000000000000000000
In [62]:
mi = Moran(county_word_freq_df.sneakers, listws25)
print("Moran's I:", mi.I) 
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm) 
Moran's I: 0.29900107494088407
Moran's p-value (under normality assumption): 0.0000000000000000000
In [63]:
mi = Moran(county_word_freq_df.battlefield, listws25)
print("Moran's I:", mi.I) 
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm) 
Moran's I: 0.010774342612320652
Moran's p-value (under normality assumption): 0.0247682802860265117

Notably, given the p-values, we can see that these words all show non-random regional patterns (as can plainly be seen by inspecting the maps), which is generally true of the word maps in this dataset. We can also use the Moran’s I scores to rank maps by the strength of the regional patterns they show.
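To get a feel for what a genuinely random map looks like under this test, a quick sketch: shuffle the county values for one word and recompute Moran's I. The shuffled version should come out near zero with a non-significant p-value.

import numpy as np

# Destroy the regional pattern by randomly permuting the values of 'i' across counties
shuffled = np.random.permutation(county_word_freq_df['i'].values)
mi_shuffled = Moran(shuffled, listws25)
print("Moran's I (shuffled):", mi_shuffled.I)
print("Moran's p-value (under normality assumption):", mi_shuffled.p_norm)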

5.2.1 Bonus Feature: Ranking words by Moran's I

To calculate Moran’s I for the word maps, we loop through the base dataset word by word and store the results in a new data frame, including the word, the Moran’s I value, and the associated p-value. In principle we could do this for all 10,000 words and for each of our three SWMs to see how that choice changes the results, but to keep the runtime down, the code below uses the 25-nearest-neighbour SWM and a random sample of 100 words.

In [64]:
# NOTE: we'll convert this to a pandas dataframe after
moran_swear = defaultdict(list)

# Let's pick 100 random words
words = list(county_word_freq_df.columns[7:])
random.shuffle(words)
words = words[:100]

for word in tqdm.tqdm_notebook(words):
    moran_swear['NAME'].append(word)
    
    mi = Moran(county_word_freq_df[word], listws25)
    morans_I_25 = round(mi.I, 3)
    z_25 =  mi.p_norm
    moran_swear['I_25nn'].append(morans_I_25)
    moran_swear['p_25nn'].append(z_25)

    
moran_swear = pd.DataFrame(moran_swear) 
moran_swear.sort_values('I_25nn').head()

Out[64]:
NAME I_25nn p_25nn
63 weapon -0.010 0.050787
43 mixing -0.002 0.665449
22 brake -0.001 0.960017
57 forecast -0.000 0.974963
39 std 0.001 0.865913
In [65]:
moran_swear.sort_values('I_25nn').tail()
Out[65]:
NAME I_25nn p_25nn
8 mo 0.306 0.0
68 wtf 0.309 0.0
18 fa 0.394 0.0
71 hella 0.566 0.0
27 iowa 0.620 0.0

From this sample you can see that most maps show significant spatial clustering, though a handful of words with Moran’s I near zero do not. Varying the SWM also tends not to make much difference: we get highly significant results across the board (even if we correct alpha). Moran’s I and the associated p-values are often strongest for SWMs based on 5 nearest neighbours, but it is not generally the case that Moran’s I will be maximised by looking at the fewest possible neighbours (i.e. k = 2), or that Moran’s I will be maximised for all the variables in a regional linguistic dataset by SWMs based on the same number of neighbours, so this result shouldn’t be overinterpreted.

5.3 Local Spatial Autocorrelation Analysis

Next, we can run a local spatial autocorrelation analysis. Whereas global Moran’s I provides a single value reflecting the overall amount of spatial clustering in a map, local Getis-Ord Gi* provides a value for each location indicating the degree to which it is part of a high-value or low-value cluster, assigning a positive z-score to locations that are part of high-value clusters and a negative z-score to locations that are part of low-value clusters.

Crucially, these values can then be mapped across the locations to identify areas of high and low values (hot-spot maps), essentially generating smoothed maps that isolate the underlying regional signal in each variable while ignoring non-regional variability. Local spatial autocorrelation analysis is therefore very useful in dialectology, as it basically automates the process of mapping isoglosses.

To conduct the local spatial autocorrelation analysis, we generate Getis-Ord Gi* z-scores for each location for each variable, using the 25-nearest-neighbour SWM.

In [66]:
getis_ord_g = pysal.explore.esda.G_Local(county_word_freq_df['i'], listws25, transform='B')
In [67]:
getis_ord_g.Zs
Out[67]:
array([-2.18410436, -3.2377711 , -4.4314233 , ...,  3.34761178,
        0.83175678, -5.58241857])
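Since Zs is just a NumPy array of z-scores, we can, for example, count how many counties fall into significant hot or cold spots for 'i' at the conventional ±1.96 cut-offs (a quick sketch):

n_hot = (getis_ord_g.Zs > 1.96).sum()    # counties in significant high-value clusters
n_cold = (getis_ord_g.Zs < -1.96).sum()  # counties in significant low-value clusters
print(n_hot, "hot-spot counties,", n_cold, "cold-spot counties")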

Let's make a function that plots a word's Getis-Ord Gi* z-scores for all the counties.

In [68]:
def plot_spatial_corr_map(word, neighbors=listws25):
    '''Plots a US hot-spot map of the word's Getis-Ord Gi* z-scores, computed
       over county_word_freq_df with the given spatial weights matrix.
    '''
    
    fig, ax = plt.subplots(1, figsize=(11,8.5))
    
    gol = pysal.explore.esda.G_Local(county_word_freq_df[word], neighbors, transform='B')
    county_word_freq_df['z-score'] = gol.Zs

    # Create the color scale; note we need an even number of colors to avoid
    # having white as a middle color that's not exactly "middle" in value
    cmap = mpl.colors.LinearSegmentedColormap.from_list(
        'Custom cmap', ["steelblue", "white", "tomato"], 10)

    county_word_freq_df.plot(cmap=cmap, ax=ax, column='z-score', 
                                      linewidth=0.8, edgecolor='0.8')
    tmp = us_states.plot(facecolor="none", ax=ax, 
                         linewidth=0.4, edgecolor='0')


    # The tick marks we want for our map
    bounds = [min(county_word_freq_df['z-score']), -2.33, -1.96, -1.28, -.66, 0, 0.66,
              1.28, 1.96, 2.33, max(county_word_freq_df['z-score'])]

    # This controls how the colors are scaled using the above bounds
    norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)

    # This sets where the colorbar goes and its size
    cbax = fig.add_axes([0.15, 0.25, 0.25, 0.02])

    cb2 = mpl.colorbar.ColorbarBase(cbax, cmap=cmap,
                                    norm=norm, 
                                    ticks=bounds,
                                    orientation='horizontal')


    cb2.set_label(r'\Huge{%s}\\\large{Getis-Ord $G_i^*$ z-score}' % word)
    cb2.ax.xaxis.set_label_position('top')
    cb2.ax.xaxis.label.set_multialignment('center')
    plt.setp(plt.getp(cb2.ax.axes, 'xticklabels'), fontsize='small')


    # NOTE: this is kind of a hack so that we only show specific labels 
    # in the color bar legend, which otherwise becomes crowded
    #
    # NOTE2: depending on your matplotlib configuration, negative tick labels
    # may be printed with a proper minus sign ('−') rather than the keyboard
    # '-' character, in which case you would need to .replace('-', '−') in
    # the labels below
    labels_we_want_to_show = [('%0.2f' % min(bounds)), 
                                  '%0.2f' % max(bounds), 
                                  '0.00', '1.96', '-1.96']
    # Using LaTeX for captions means matplotlib adds the latex math-mode markers
    labels_we_want_to_show = ['$'+x+'$' for x in labels_we_want_to_show]

    # Hide every tick label except the handful we listed above
    for label in cb2.ax.xaxis.get_ticklabels():
        if label.get_text() not in labels_we_want_to_show:
            label.set_visible(False)
    # Turn off matplotlib's bounding box
    tmp = ax.axis('off')
    
plot_spatial_corr_map('i')
In [69]:
plot_spatial_corr_map('the')
In [70]:
plot_spatial_corr_map('and')
In [71]:
plot_spatial_corr_map('but')
In [72]:
plot_spatial_corr_map('oregon')
In [73]:
plot_spatial_corr_map('snow')
In [74]:
plot_spatial_corr_map('fuck')
In [75]:
plot_spatial_corr_map('sneakers')
In [76]:
plot_spatial_corr_map('battlefield')
In [77]:
plot_spatial_corr_map('and', neighbors=listws05)
In [78]:
plot_spatial_corr_map('and', neighbors=listws25)
In [79]:
plot_spatial_corr_map('and', neighbors=listws50)
In [80]:
plot_spatial_corr_map('oregon', neighbors=listws05)
In [81]:
plot_spatial_corr_map('oregon', neighbors=listws25)
In [82]:
plot_spatial_corr_map('oregon', neighbors=listws50)
In [83]:
plot_spatial_corr_map('gtf', neighbors=listws25)
In [84]:
plot_spatial_corr_map('hurricane', neighbors=listws25)

6 Conclusion

That’s the end of the tutorial. Please let us know if you have any questions or feedback. In addition to this Python code and the dataset, please also see the OSF repository (https://osf.io/e2f63/) for the matching R code by Jack Grieve and our slides for the rest of this workshop, where we argue for the value of this type of research in dialectology and sociolinguistics, as well as additional code. And if you’re interested in learning more about these types of methods and implementing them in R, please see Jack's upcoming book with Martijn Wieling, ‘Regional Dialectology: Quantitative Approaches in R’.