Mapping Word Frequencies on Twitter using Python

Python Markdown for Jack Grieve and David Jurgens. 2019. Mapping word frequencies on Twitter using R and Python. Workshop presented at NWAV 48, University of Oregon, 2019-10-10.

David Jurgens, University of Michigan, jurgens@umich.edu

2019-10-03

1 Summary

This markdown presents the Python code for our workshop at NWAV 48 on Mapping Word Frequencies on Twitter using R and Python. The original R markdown was prepared by Jack Grieve; this parallel Python version was prepared by David Jurgens. The point of the workshop is to show how to map word frequencies in R and Python and to explain why a linguist might want to do this. We’ll be working with a dataset based on a corpus of American geolocated tweets. For more information on the conference and the workshop see https://nwav48.uoregon.edu/workshops/.

2 Packages

This is the full set of Python packages used in this markdown. For this workshop we’ll be working mostly with the geopandas and PySAL packages, which are typically the easiest way to read, plot, and analyse spatial data in Python and to perform spatial statistics.
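If you don't already have these packages, something along these lines should work from inside Jupyter (the exact package list is an assumption; geopandas in particular may need extra system dependencies depending on your platform):

In [ ]:
# Install the dependencies from within a Jupyter cell
# (or drop the leading ! and run in a shell)
!pip install geopandas pysal mapclassify matplotlib pandas tqdm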

In [1]:
# This first line makes sure that our plots will show up 
# in the jupyter notebook if you want to run this code
%matplotlib inline
import geopandas as gpd
import pandas as pd
import pysal
from collections import defaultdict
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.pylab import cm
from pysal.explore.esda.getisord import G_Local
from pysal.explore.esda.moran import Moran
from pysal.lib.weights import KNN
import tqdm
import random
import mapclassify

# This lets us do slightly fancier LaTeX-based captions in plots
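# (note: usetex requires a working LaTeX installation on your system;
# comment out the next line if you don't have one)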
plt.rc('text', usetex=True)

3 US Counties Map Template

3.1 US County Map

First, we read in the US counties map using geopandas from a shapefile, a kind of file that describes the geometry of objects (e.g., their latitude and longitude). You can download this counties shapefile here.

In [2]:
county_map = gpd.read_file('UScounties.shp')

A geopandas GeoDataFrame is just like a regular pandas dataframe (or an R data frame, if you know those), so any regular pandas operation also works here. Let's take a look at what's in this dataframe using head(), which shows the first five rows of data. We can see that we have regular columns plus a special geometry column that describes each county's shape; geopandas knows how to interpret this column in order to do spatial operations, which will make our life easier later.

In [3]:
county_map.head()
Out[3]:
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS geometry
0 Lake of the Woods Minnesota 27 077 27077 POLYGON ((-95.34283 48.54668, -95.34105 48.715...
1 Ferry Washington 53 019 53019 POLYGON ((-118.85163 47.94956, -118.84846 48.4...
2 Stevens Washington 53 065 53065 POLYGON ((-117.43883 48.04412, -117.54219 48.0...
3 Okanogan Washington 53 047 53047 POLYGON ((-118.97209 47.93915, -118.97406 47.9...
4 Pend Oreille Washington 53 051 53051 POLYGON ((-117.43858 48.99992, -117.03205 48.9...
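
Since this is an ordinary dataframe, standard pandas operations apply directly. As a quick sketch, here's a one-liner that counts how many counties each state has (output omitted):

In [ ]:
# Count the number of county rows per state
county_map['STATE_NAME'].value_counts().head()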

Let's plot the county map.

In [4]:
county_map.plot()
Out[4]:
[Map of all US counties, including Alaska and Hawaii]

For visual convenience, we'll filter out Alaska and Hawaii so the map shows just the continental US (apologies to those of you from these states!).

In [5]:
# Drop Alaska and Hawaii to just be the continental US
county_map = county_map[(county_map['STATE_NAME'] != 'Alaska') 
                          & (county_map['STATE_NAME'] != 'Hawaii')]
In [6]:
county_map.plot()
Out[6]:
[Map of the counties of the contiguous US]

3.2 Map Projection

In addition, we can re-project the map. It’s nice to use a different projection system so that we get something other than the flat map we’re used to seeing for the US. Here we’ll use a Lambert azimuthal equal-area projection, which can make our map look more professional; the Albers equal-area projection is another standard choice. We do this by defining the projection in the PROJ.4 system (don’t worry about the details...I don’t entirely understand them myself---but thankfully geopandas does!) and then transforming the spatial polygon map. In practice, we can use one of many standard projections or define our own to best highlight a particular geographic region.

I've included one below that might be of use. We'll use the same setup as before, but before calling plot(), we first call to_crs() on the dataframe itself, which changes its coordinate reference system to the new projection (essentially, changing where a particular lat/lon value gets placed on the map).

In [7]:
# This is our fancy projection
projection = "+proj=laea +lat_0=30 +lon_0=-95"
county_map = county_map.to_crs(projection)
In [8]:
county_map.plot()
Out[8]:
[Map of the contiguous US counties in the Lambert equal-area projection]
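
If you'd prefer an Albers projection instead, here's a sketch using the standard "USA Contiguous Albers Equal Area Conic" parameters (a common choice for the lower 48; adjust the parameters to highlight other regions):

In [ ]:
# An Albers equal-area alternative for the contiguous US
albers = "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"
county_map.to_crs(albers).plot()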

We can also get rid of that pesky bounding box by turning off the plot's axes, which is the last prettification step we'll do for now. To do this, we (1) create a figure with Matplotlib and turn its axis markers off using ax.axis('off'), and (2) tell geopandas to plot the map on that figure.

In [9]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_map.plot(ax=ax)
Out[9]:
[The projected county map with the bounding box and axes removed]
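
If you want to keep the finished map, Matplotlib's savefig will write the figure to disk (the filename below is just an example):

In [ ]:
# Save the map; vector formats like PDF stay sharp at any zoom level
fig.savefig('county_map.pdf', bbox_inches='tight')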

3.3 Linguistic Data

Now, let’s read in the dataset, which can be downloaded here. The dataset identifies each county by its FIPS code, a five-digit numeric code that uniquely identifies it. Python can sometimes get confused and read this column in as an "int" type, which causes a few subtle bugs later (e.g., dropping leading zeros), so we're going to manually specify that the column should be treated as a string using dtype={'FIPS': str} when loading the data.

In [10]:
word_freq_df = pd.read_csv('US_CNTY_3076_TWIT_10K_WORD.txt', sep=' ', dtype={'FIPS': str})

This dataset consists of the relative frequencies per million words of the top 10,000 most common words (defined as strings of case-insensitive alphabetic characters plus hyphens), measured across the 3,076 counties in the contiguous US, based on a multi-billion-word corpus of American geolocated Twitter data collected between October 2013 and November 2014 (see Huang et al. 2016, Grieve et al. 2018).

For example, consider the values for the last few counties in the dataset. The county name is in column 1 and the county's FIPS code is in column 2.

In [11]:
word_freq_df.tail()
Out[11]:
county_ID FIPS a a. aa aaliyah aaron ab abandoned abbey ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
3071 wyoming,sweetwater 56037 17786.28 0.0 6.44 3.22 12.88 6.44 3.22 0.00 ... 6.44 6.44 0.0 28.97 12.88 35.41 6.44 12.88 16.09 12.88
3072 wyoming,teton 56039 18783.86 5.2 10.40 0.00 20.80 2.60 15.60 23.39 ... 7.80 10.40 7.8 18.20 5.20 54.59 0.00 7.80 13.00 10.40
3073 wyoming,uinta 56041 17720.52 0.0 0.00 0.00 14.66 14.66 0.00 58.63 ... 0.00 0.00 0.0 29.31 29.31 29.31 0.00 0.00 0.00 29.31
3074 wyoming,washakie 56043 19774.68 0.0 14.75 0.00 14.75 14.75 0.00 0.00 ... 0.00 0.00 0.0 29.49 0.00 29.49 0.00 0.00 44.24 0.00
3075 wyoming,weston 56045 20534.10 0.0 0.00 0.00 0.00 27.31 0.00 0.00 ... 0.00 0.00 0.0 0.00 0.00 54.61 0.00 0.00 0.00 0.00

5 rows × 10002 columns

We can also get a quick statistical summary of the data.

In [12]:
word_freq_df.describe()
Out[12]:
a a. aa aaliyah aaron ab abandoned abbey abby abc ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
count 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 ... 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000
mean 17686.561661 4.036619 8.119798 4.479535 31.770312 18.642633 5.292051 3.490634 14.144178 10.266057 ... 10.606531 4.899717 4.677120 15.390471 8.407350 39.698365 3.514844 15.925293 4.380306 91.983557
std 1806.721774 50.617120 24.687379 10.843293 54.384890 43.162770 10.050835 8.150068 23.625402 19.273659 ... 33.785532 14.547293 9.880733 25.511622 13.369255 35.477047 5.215757 23.753035 10.352628 4898.997368
min 2196.160000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 16837.517500 0.000000 0.000000 0.000000 13.637500 3.700000 0.000000 0.000000 2.537500 3.760000 ... 1.407500 0.000000 0.000000 7.597500 0.000000 26.610000 0.000000 5.270000 0.000000 0.000000
50% 17721.195000 1.460000 3.460000 1.130000 24.195000 9.020000 3.800000 1.225000 9.460000 8.410000 ... 5.355000 2.405000 2.710000 13.725000 6.875000 36.950000 2.755000 11.350000 3.100000 1.620000
75% 18596.060000 3.512500 6.772500 5.630000 37.350000 16.712500 6.060000 4.275000 17.610000 13.060000 ... 8.672500 5.722500 5.172500 19.170000 10.555000 46.462500 4.762500 19.862500 5.190000 4.782500
max 55189.400000 2770.080000 446.890000 241.440000 1936.620000 875.660000 297.420000 169.070000 547.950000 898.110000 ... 683.880000 607.560000 220.390000 1203.370000 316.290000 731.140000 110.380000 575.620000 329.380000 271710.000000

8 rows × 10000 columns

Before moving on, it’s worth quickly explaining how this dataset was generated. First, each tweet in the corpus was assigned to the county in which it occurred, based on the longitude and latitude in its metadata. To normalise for variation in sample size, which reflects variation in county population, the relative frequency of each word in each county was computed by dividing its frequency by the total number of word tokens in that county and then multiplying by 1 million. It is crucial to normalise; otherwise you just get population maps.
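
As a minimal sketch of that normalisation step, assuming a hypothetical raw_counts dataframe with one row per county and one column of raw token counts per word (this is an illustration, not the actual pipeline used to build the workshop dataset):

In [ ]:
# raw_counts is hypothetical: rows = counties, columns = raw word counts
tokens_per_county = raw_counts.sum(axis=1)   # total tokens in each county
per_million = raw_counts.div(tokens_per_county, axis=0) * 1e6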

3.4 Joining the Data

Now we can join the linguistic dataset to the geopandas map object based on the two sets of county IDs, which in the US are known as FIPS codes. For those of you who know databases or SQL, pandas (and therefore geopandas) lets us easily do inner joins on the data frames.

In [13]:
county_word_freq_df = pd.merge(county_map, word_freq_df, on='FIPS', how='inner')

After joining the data, we'll reset the coordinate reference system (CRS) to the one we want so everything keeps plotting the way we expect.

In [14]:
county_word_freq_df = county_word_freq_df.to_crs(projection)

We can see that the county_word_freq_df data frame now contains roughly 10,000 more columns than the county_map data frame (one per word, plus the county_ID column).

In [15]:
county_word_freq_df.shape
Out[15]:
(3075, 10007)
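
Note that the merged frame has 3,075 rows while the frequency data covered 3,076 counties, so one county failed to match during the join. If you're curious which one, a left join with indicator=True flags unmatched rows (a quick diagnostic sketch):

In [ ]:
# Find frequency-data counties whose FIPS code has no match in the map
check = pd.merge(word_freq_df[['county_ID', 'FIPS']], county_map[['FIPS']],
                 on='FIPS', how='left', indicator=True)
check[check['_merge'] == 'left_only']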

As a sanity check, let's take a look at the first few rows to confirm that each county now has both its geometry and the relative frequencies for each word (as separate columns).

In [16]:
county_word_freq_df.head()
Out[16]:
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS geometry county_ID a a. aa ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
0 Lake of the Woods Minnesota 27 077 27077 POLYGON ((-25655.892 2049834.670, -25444.188 2... minnesota,lake of the woods 21027.84 0.00 0.00 ... 0.00 0.0 0.0 33.89 0.00 16.94 33.89 16.94 0.0 16.94
1 Ferry Washington 53 019 53019 POLYGON ((-1775417.832 2196570.086, -1758130.5... washington,ferry 12828.09 0.00 0.00 ... 337.58 0.0 0.0 0.00 0.00 84.39 0.00 0.00 0.0 0.00
2 Stevens Washington 53 065 53065 POLYGON ((-1670527.465 2182361.609, -1678023.8... washington,stevens 16853.05 0.00 0.00 ... 51.16 0.0 0.0 38.37 123.64 21.32 0.00 4.26 0.0 0.00
3 Okanogan Washington 53 047 53047 POLYGON ((-1784425.905 2197600.293, -1783913.3... washington,okanogan 18685.00 17.85 4.46 ... 4.46 0.0 0.0 17.85 4.46 22.31 0.00 13.38 0.0 0.00
4 Pend Oreille Washington 53 051 53051 POLYGON ((-1641330.622 2285156.266, -1612439.6... washington,pend oreille 15708.26 0.00 21.40 ... 64.20 0.0 0.0 21.40 0.00 21.40 0.00 0.00 0.0 0.00

5 rows × 10007 columns

4 Mapping Word Frequencies

Now we can map the relative frequencies of any of these 10,000 words across the 3,076 counties in the contiguous US.

4.1 Simple Maps

Let’s first make a quick map for the relative frequency of ‘i’, the most common word in the corpus.

To make the map, we tell geopandas which column to use to colour the map, which produces a choropleth map (a map where the colour of each shape is based on the value of an associated variable). Here, we just include the argument column='i' in our plot() command on the county_word_freq_df dataframe.

In [17]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True)
Out[17]:
[Choropleth map of the relative frequency of 'i' by county, on a continuous colour scale]

Not much there, but that’s because the distribution here is highly skewed. Let's take a look at the frequency distribution. Since every geopandas dataframe is also a pandas dataframe, we can use its built-in plotting features to show histograms (among other things). Here, let's show a histogram with 40 bins of the frequency of "i" by county.

In [18]:
county_word_freq_df.i.hist(bins=40)
Out[18]:
[Histogram of the relative frequency of 'i' across counties, 40 bins]

There's clearly an outlier here:

In [19]:
county_word_freq_df['i'].describe()
Out[19]:
count      3075.000000
mean      44810.952039
std        6332.354688
min        3573.250000
25%       42300.065000
50%       45716.950000
75%       48359.530000
max      192438.870000
Name: i, dtype: float64
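
To see which county is driving that enormous maximum, idxmax() returns the row label of the largest value:

In [ ]:
# Look up the county with the highest relative frequency of 'i'
county_word_freq_df.loc[county_word_freq_df['i'].idxmax(),
                        ['NAME', 'STATE_NAME', 'i']]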

By default, colour breaks are assigned by splitting the range between the minimum and maximum into equally spaced intervals and giving each interval a distinct colour, so most values end up bunched into the same few bands. Geopandas includes a scheme argument for plot() that tells it how to decide which values get mapped to which colours. We can simulate the default behaviour more explicitly by adding the scheme='equal_interval' argument to plot(), which uses equal-sized bins with discrete colours.

In [20]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True,
                         scheme='equal_interval')
Out[20]:
[Choropleth map of 'i' with equal-interval colour breaks]

The basic alternative (there are other, more complex approaches) is to define these colour breaks across unequal intervals so that each colour band contains roughly the same number of locations, by setting scheme='quantiles'.

In [21]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True, scheme='quantiles')
Out[21]:
[Choropleth map of 'i' with quantile colour breaks]
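
The number of colour bands and the colour scheme can also be tuned. As a final sketch, here's the same map with seven quantile bins and the viridis colormap (k and cmap are standard geopandas plotting arguments):

In [ ]:
fig, ax = plt.subplots(1, figsize=(11, 8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True,
                         scheme='quantiles', k=7, cmap='viridis')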