Python Markdown for Jack Grieve and David Jurgens. 2019. Mapping word frequencies on Twitter using R and Python. Workshop presented at NWAV 48, University of Oregon, 2019-10-10.
David Jurgens, University of Michigan, jurgens@umich.edu
2019-10-03
This markdown presents the Python code for our workshop at NWAV 48 on Mapping Word Frequencies on Twitter using R and Python. The original R code was prepared by Jack Grieve; this parallel Python version was prepared by David Jurgens. The point of the workshop is to show how to map word frequency in R and Python and to explain why a linguist might want to do this. We'll be working with a dataset based on a corpus of American geolocated tweets. For more information on the conference and the workshop see https://nwav48.uoregon.edu/workshops/.
This is the full set of python packages used in this markdown. For this workshop we’ll be working mostly with the Geopandas and PySAL packages in Python. These are typically the easiest ways to work with spatial data, plot it, and perform spatial statistics.
# This first line makes sure that our plots will show up
# in the jupyter notebook if you want to run this code
%matplotlib inline
import geopandas as gpd
import pandas as pd
import pysal
from collections import defaultdict
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.pylab import cm
from pysal.explore.esda.getisord import G_Local
from pysal.explore.esda.moran import Moran
from pysal.lib.weights import KNN
import tqdm
import random
import mapclassify
# This lets us do slightly fancier LaTeX-based captions in plots
plt.rc('text', usetex=True)
First, we read in the US counties map from a shapefile using geopandas. A shapefile is a kind of file that describes the geometry of objects (e.g., their latitude and longitude). You can download this counties shapefile here.
county_map = gpd.read_file('UScounties.shp')
A geopandas dataframe is just like a regular pandas dataframe (or an R dataframe, if you know those), so any regular pandas operation also works here. Let's take a look at what's in this dataframe using head(), which shows the first five rows of data. We can see that we have regular columns plus a special geometry column that describes each shape; geopandas knows how to interpret this column in order to do spatial operations, which will make our life easier later.
county_map.head()
Let's plot the county map.
county_map.plot()
For visual convenience, we'll filter out Alaska and Hawaii so we're left with just the contiguous US (apologies to those of you from these states!).
# Drop Alaska and Hawaii to just be the continental US
county_map = county_map[(county_map['STATE_NAME'] != 'Alaska')
& (county_map['STATE_NAME'] != 'Hawaii')]
county_map.plot()
In addition, we can re-project the map. It's nice to use a proper projection system so we get a less flat map than the one we're used to seeing for the US; the Albers projection is a standard choice, but in this case we'll use a Lambert azimuthal equal-area projection, which can make our map look more professional. We do this by defining the projection in the PROJ.4 system (don't worry about the details... I don't entirely understand them myself, but thankfully geopandas does!) and then transforming the spatial polygon map. In practice, we can use one of many standard projections or define our own to best highlight a particular geographic region.
I've included one below that might be of use. We'll use the same setup as before, but before calling plot(), we first call to_crs() on the dataframe itself, which changes the coordinate reference system to the projection (essentially, changing where a particular lat/lon value gets placed on a map).
# This is our fancy projection
projection = "+proj=laea +lat_0=30 +lon_0=-95"
county_map = county_map.to_crs(projection)
county_map.plot()
We can also get rid of that pesky bounding box by turning the axis off of the plot, which is the last prettification step we'll do for now. To do this, we'll (1) create a figure object with Matplotlib and turn its axis markers off using ax.axis('off'), and (2) tell geopandas to plot the map on that figure.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_map.plot(ax=ax)
Now, let's read in the dataset, which can be downloaded here. The dataset contains each county's FIPS code, a five-digit numeric code that uniquely identifies it. Pandas can sometimes get confused and read this in as an "int" type, which causes a few subtle bugs later, so we're going to manually specify that the column should be treated as a string using dtype={'FIPS': str} when loading the data.
word_freq_df = pd.read_csv('US_CNTY_3076_TWIT_10K_WORD.txt', sep=' ', dtype={'FIPS': str})
This dataset consists of the relative frequencies per million words of the top 10,000 most common words (defined as strings of case-insensitive alphabetic characters plus hyphens) measured across the 3,076 counties in the contiguous US, based on a multi-billion word corpus of American geolocated Twitter data collected between October 2013 and November 2014 (see Huang et al. 2016, Grieve et al. 2018).
For example, let's look at the last few rows: the county name is in column 1 and the county's FIPS code is in column 2.
word_freq_df.tail()
We can also get a quick summary of the data.
word_freq_df.describe()
Before moving on, it's worth quickly explaining how this dataset was generated. First, each tweet in the corpus was grouped by the county in which it occurred, based on the longitude and latitude in its metadata. To normalise for variation in sample size, which reflects variation in county population, the relative frequency of each word in each county was computed by dividing its frequency by the total number of word tokens in that county and then multiplying by 1 million. It is crucial to normalise; otherwise you just get population maps.
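To make the normalisation concrete, here is a small sketch with made-up numbers (the raw counts aren't part of the workshop data; the counties, words, and counts below are purely hypothetical):
# Toy example of the normalisation step: rows are counties, columns are words,
# cells are raw token counts (made-up numbers, not the real data)
raw_counts = pd.DataFrame({'i': [120000, 3000], 'the': [90000, 4500], 'yinz': [300, 0]},
                          index=['County A', 'County B'])
# In the real dataset the denominator is each county's total token count across
# all words; here the three toy words stand in for that total
tokens_per_county = raw_counts.sum(axis=1)
# Relative frequency per million words
rel_freq = raw_counts.div(tokens_per_county, axis=0) * 1e6
rel_freq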
Now we can join the linguistic dataset to the geopandas map dataframe based on the two sets of county IDs, which in the US are known as FIPS codes. For those of you who know databases or SQL, pandas (and therefore geopandas) lets us easily do inner joins on data frames.
county_word_freq_df = pd.merge(county_map, word_freq_df, on='FIPS', how='inner')
After joining the data, we'll reset the coordinate reference system (CRS) to the one we want so everything keeps plotting the way we expect
county_word_freq_df = county_word_freq_df.to_crs(projection)
We can see that the county_word_freq_df data frame now contains 10,000 more columns than the county_map data frame.
county_word_freq_df.shape
As a sanity check, let's take a look at the first few rows to see that it has the relative frequencies for each word (as separate columns).
county_word_freq_df.head()
Now we can map the relative frequencies of any of these 10,000 words across the 3,076 counties in the contiguous US.
Let’s first make a quick map for the relative frequency of ‘i’, the most common word in the corpus.
To make the map, we'll tell geopandas which column to use to color the map, which makes a choropleth map (where the color of each shape is based on the value of an associated variable). Here, we just include the argument column='i' in our plot() command on the county_word_freq_df dataframe.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True)
Not much there, but that's because the distribution here is highly skewed. Let's take a look at the frequency distribution. Since each geopandas dataframe is also a pandas dataframe, we can use its built-in plotting features to show histograms (among other things). Here, let's show a histogram with 40 bins of the frequency of "i" by county.
county_word_freq_df.i.hist(bins=40)
There's clearly an outlier here:
county_word_freq_df['i'].describe()
The colour breaks here are assigned by splitting the range between the minimum and maximum values into equally spaced intervals and giving each interval a distinct colour, so most values are bunched together. Geopandas includes a scheme argument for plot() that tells geopandas how to decide which values get mapped to which colors. We can simulate the default behavior more explicitly by adding the scheme='equal_interval' argument to plot(), which will use equally sized bins with discrete colors.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True,
scheme='equal_interval')
The basic alternative (there are other, more complex approaches) is to define these colour breaks across unequal intervals so that each colour band has roughly an equal number of locations, by setting scheme='quantiles'
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True, scheme='quantiles')
This approach clearly shows the patterns better, at least in this case, because of the skewed distribution, but do note the differences in the legends.
Here are a couple more maps for common words in both styles, plus using Fisher-Jenks style breaks, which is a more complex approach that falls somewhere in between these two extremes.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('and', fontsize=28)
county_word_freq_df.plot(ax=ax, column='and', scheme='equal_interval', legend=True)
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('and', fontsize=28)
county_word_freq_df.plot(ax=ax, column='and', legend=True,
scheme='quantiles')
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('and', fontsize=28)
county_word_freq_df.plot(ax=ax, column='and', legend=True,
scheme='fisher_jenks')
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('but', fontsize=28)
county_word_freq_df.plot(ax=ax, column='but', scheme='equal_interval', legend=True)
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('but', fontsize=28)
county_word_freq_df.plot(ax=ax, column='but', legend=True,
scheme='quantiles')
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('but', fontsize=28)
county_word_freq_df.plot(ax=ax, column='but', legend=True,
scheme='fisher_jenks')
The clear patterns in the usage of these function words, for example the complementary patterns for these two coordinators, are surprising. We will argue in the second half of this workshop that this is important information for our understanding of language variation and change that cannot be accessed through the analysis of alternations, as is standard in variationist sociolinguistics, but first we'll look at map formatting and how to do basic spatial analysis.
Note, however, that plotting this way using quantile breaks doesn't always work, in particular when we try to map lower-frequency words like 'battlefield', which have lots of 0 values.
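To see why, we can check how many counties never use 'battlefield' at all; this is just a quick check using ordinary pandas operations.
# Fraction of counties in which 'battlefield' never occurs; with this many zeros,
# several of the quantile break points collapse onto the same value
(county_word_freq_df['battlefield'] == 0).mean()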
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('battlefield', fontsize=28)
county_word_freq_df.plot(ax=ax, column='battlefield', scheme='equal_interval',
legend=True)
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('battlefield', fontsize=28)
county_word_freq_df.plot(ax=ax, column='battlefield', scheme='quantiles',
legend=True)
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('battlefield', fontsize=28)
county_word_freq_df.plot(ax=ax, column='battlefield', legend=True,
scheme='fisher_jenks')
This is all a bit problematic, since we really want to map all words consistently and informatively. The simplest solution in terms of code is probably to use Fisher-Jenks breaks (the complication there is explaining how they work), but we can also use quantile-style breaks for all cases if we set the class intervals ourselves, rather than letting the geopandas plotting function do it for us, so let's look at how to do that next.
First, we need to take control of the colour palette. For example, we can specify a simple white through to purple palette.
# Note: typically the color palettes in Matplotlib are called color maps (cmap for short)
cmap = plt.get_cmap('Purples')
And then re-plot the map with that palette, by specifying cmap=cmap
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('i', fontsize=28)
county_word_freq_df.plot(ax=ax, column='i', scheme='fisher_jenks', legend=True, cmap=cmap)
We can also use a blue to red palette that runs through white.
cmap = mpl.colors.LinearSegmentedColormap.from_list(
'Custom cmap', ["steelblue", "white", "tomato"])
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('i', fontsize=28)
county_word_freq_df.plot(ax=ax, column='i', scheme='fisher_jenks', legend=True, cmap=cmap)
Or we can do something more complex – for better or worse.
# Create the color scale; note we need an even number of colors to avoid
# having white as a middle color that's not exactly "middle" in value
cmap = mpl.colors.LinearSegmentedColormap.from_list(
'Custom cmap', [(.05, .90, .65),
(.85, .45, .05),
(.10, .20, .70),
(.65, .05, .75)])
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
ax.set_title('i', fontsize=28)
county_word_freq_df.plot(ax=ax, column='i', scheme='quantiles', legend=True, cmap=cmap)
Now we can specify how the observed values of a variable are divided into class intervals and then link those intervals to the bands of our colour palette.
To do this, we first specify how many intervals and the style of breaks we want and then find those class intervals, in this case 5 quantile breaks for the word ‘i’. Under the hood, geopandas uses the mapclassify library to divide up the data according to different strategies. We can directly invoke each strategy to see where it sets its cut-off points
mapclassify.Quantiles(county_word_freq_df.i, k=5)
Second, we extract as many colours as we need from a colour palette; here we'll define a five-step purple palette like the one we used earlier.
cmap = mpl.colors.LinearSegmentedColormap.from_list(
'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8", "#5D478B"], 5)
# This defines how we want separate our regions by color
bounds = list(mapclassify.Quantiles(county_word_freq_df.i, k=5).bins)
# This controls how the colors are scaled using the above bounds
norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)
Now we can plot using our custom color bounds. Note that since we defined our own color boundaries, we don't have a color bar and legend. We'll fix this later in 4.5
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap=cmap, scheme='quantiles',
ax=ax, column='i', linewidth=0.3, edgecolor='0.8')
Let’s now look at various ways of tweaking the appearance of our maps, using the word ‘i’ as an example.
First, we set up the class intervals and the colour palette. If we switch words we must redo this step, but for now we'll just work with this map. Note that we're still using the purple colour palette defined earlier.
cmap = mpl.colors.LinearSegmentedColormap.from_list(
'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8", "#5D478B"], 5)
Next, we can change the width of the county borders by setting linewidth=0.3, which is especially useful because the counties are so small in the East. We'll also manually specify the color of the edges using edgecolor='black'
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap=cmap, scheme='quantiles',
ax=ax, column='i', linewidth=0.3, edgecolor='black')
We can also change the colour of the county borders. In the west, where there are many counties without any mentions of the word, light lines can help make sense of the space, so we use the edgecolor='0.8' argument to draw the borders in light grey. In matplotlib you can specify colors in many ways, and we've used three in this tutorial (names, hex codes, and fractions of black) to give you a few examples.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap='Purples', scheme='quantiles',
ax=ax, column='i', linewidth=0.3, edgecolor='0.8')
We might also want to superimpose state borders here in a different colour and/or width. To do this we first have to generate a state-level geopandas dataframe from a shapefile. You can download the states shapefile here or get it from the included data.
We can plot both states and counties together by creating a single "figure" object using Python's main plotting library, matplotlib, and telling geopandas to plot the two shapefiles on it. Here, we'll use different colors and line widths for the edges to highlight the different boundaries. There are lots of options to the plot command that you can use to fine-tune your figures.
Let's plot the states map file just as a sanity check.
us_states = gpd.read_file('states.shp')
us_states = us_states[(us_states['STATE_NAME'] != 'Alaska')
& (us_states['STATE_NAME'] != 'Hawaii')]
projection = "+proj=laea +lat_0=30 +lon_0=-95"
us_states = us_states.to_crs(projection)
us_states.plot()
Now let's plot the states and counties together. We don't want the states' interiors to be drawn over the counties, so we'll use facecolor="none" to indicate that the states shouldn't be filled.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(cmap='Purples', scheme='quantiles',
ax=ax, column='i', linewidth=0.3, edgecolor='0.8')
us_states.plot(facecolor="none", ax=ax, linewidth=0.4, edgecolor='0')
Additionally, let's look at modifying the legend and title by drawing them in directly, using the class intervals and colours we defined and figure coordinates to place the colour bar.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
cmap = mpl.colors.LinearSegmentedColormap.from_list(
'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8",
"#5D478B"], 5)
county_word_freq_df.plot(cmap=cmap, scheme='quantiles',
ax=ax, column='i', linewidth=0.3, edgecolor='0.8')
us_states.plot(facecolor="none", ax=ax, linewidth=0.4, edgecolor='0')
# The tick marks we want for our map. We'll pull these directly from
# the mapclassify code.
bounds = list(mapclassify.Quantiles(county_word_freq_df.i, k=5).bins)
# This controls how the colors are scaled using the above bounds
norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)
# This sets where the colorbar goes and its size
cbax = fig.add_axes([0.15, 0.25, 0.25, 0.02])
cb2 = mpl.colorbar.ColorbarBase(cbax, cmap=cmap,
norm=norm,
ticks=bounds,
orientation='horizontal')
cb2.set_label(r'\Huge{i}\\\large{per million words}')
cb2.ax.xaxis.set_label_position('top')
cb2.ax.xaxis.label.set_multialignment('center')
plt.setp(plt.getp(cb2.ax.axes, 'xticklabels'), fontsize='small')
Finally, since this is all a lot of code, and since we want to be able to map all words from the dataset easily and consistently, we'll put all this code into a single mapping function, so we just have to specify the word and the map title.
def plot_freq_map(word, title):
    '''Plots a US choropleth map of the word's relative frequency, using the
    data in the county_word_freq_df geopandas dataframe, quantile breaks,
    and our purple palette; the title is drawn under the word on the colour bar.
    '''
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
cmap = mpl.colors.LinearSegmentedColormap.from_list(
'Custom cmap', ["#FFFFFF", "#D6D1E2", "#AEA3C4", "#8575A8",
"#5D478B"], 5)
county_word_freq_df.plot(cmap=cmap, scheme='quantiles',
ax=ax, column=word, linewidth=0.3, edgecolor='0.8')
us_states.plot(facecolor="none", ax=ax, linewidth=0.4, edgecolor='0')
# The tick marks we want for our map
bounds = list(mapclassify.Quantiles(county_word_freq_df[word], k=5).bins)
# This controls how the colors are scaled using the above bounds
norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)
# This sets where the colorbar goes and its size
cbax = fig.add_axes([0.15, 0.25, 0.25, 0.02])
cb2 = mpl.colorbar.ColorbarBase(cbax, cmap=cmap,
norm=norm,
ticks=bounds,
orientation='horizontal')
cb2.set_label(r'\Huge{%s}\\\large{%s}' % (word, title))
cb2.ax.xaxis.set_label_position('top')
cb2.ax.xaxis.label.set_multialignment('center')
plt.setp(plt.getp(cb2.ax.axes, 'xticklabels'), fontsize='small')
plot_freq_map('i', 'per million words')
plot_freq_map('the', 'per million words')
plot_freq_map('and', 'per million words')
plot_freq_map('but', 'per million words')
plot_freq_map('oregon', 'per million words')
plot_freq_map('fuck', 'per million words')
plot_freq_map('sneakers', 'per million words')
plot_freq_map('battlefield', 'per million words')
Now let's look at how to conduct spatial autocorrelation analyses. The basic purpose of these methods is to identify regional patterns in the map for a quantitative variable, like any of our word frequency maps. There are two basic types of spatial autocorrelation analyses (global and local), which we'll look at in turn, but first we need to learn how to define a spatial weights matrix, which is the foundation for both global and local measures, as well as various other methods for spatial analysis.
A spatial weights matrix (SWM) is a big location-by-location matrix that specifies the spatial relationship between every pair of locations in a regional dataset. Basically, nearby locations are given a stronger weight than distant locations (often 1 for nearby locations and 0 for distant locations). These values are then used to weight the comparisons between locations in spatial autocorrelation analyses.
There are various ways to build an SWM. We'll use what's known as a "k-nearest neighbour" SWM, which is a very common and simple approach and essentially the default in PySAL. The k part refers to how many nearest neighbours you want to include.
As a sanity check, let's look at the centroids of the counties
# We'll create a new Geopandas dataframe where each county gets mapped to its centroid
county_centroids_df = county_word_freq_df[['geometry']].copy()
county_centroids_df.geometry = county_word_freq_df.centroid
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_centroids_df.plot(ax=ax, marker='*', color='red', markersize=1)
county_word_freq_df.plot(ax=ax, facecolor="none", linewidth=0.4, edgecolor='0')
Geopandas doesn't know how to build the nearest-neighbour data on its own, but our second library, PySAL, does! Even better, since a geopandas dataframe has spatial geometry describing where the counties are, we can use our dataframe as-is with PySAL to compute the nearest neighbours.
Here, we'll use the KNN class (k-Nearest Neighbors) and choose 25 neighbors using the k=25 argument. We'll explore what happens if we choose other values of k later too.
Now we can identify the nearest neighbours for each location (let's use 25 for now, including the location itself) and then give each of these pairings a weight of 1 (or 1/n), and all other pairings a weight of 0. We don't have to assign a non-zero weight to the pairing of each location with itself, but doing so leads to more intuitive results later on. Note that this SWM is non-symmetric: even if location A is one of the nearest neighbours of location B, location B is not necessarily one of the nearest neighbours of location A.
listws25 = KNN.from_dataframe(county_word_freq_df, k=25)
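As a quick sanity check, we can peek inside the resulting weights object (this assumes the KNN object exposes the standard PySAL weights attributes n, neighbors, and weights, with each county keyed by its row index in county_word_freq_df):
print(listws25.n)                 # number of counties in the SWM
print(listws25.neighbors[0][:5])  # a few of county 0's 25 nearest neighbours
print(listws25.weights[0][:5])    # their weights (1 for each neighbour, i.e. binary)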
The issue we skipped is how to select the number of nearest neighbours. There is no simple rule of thumb, which is unfortunate to some extent, since it places a greater burden on us to make a principled choice, but it also gives us more flexibility, kind of like adjusting the resolution on a microscope. Plus, if the values are set at a reasonable level (say, with the number of nearest neighbours around 10% of the total locations or less, see Grieve 2017), it shouldn't make too much difference. Anyway, this is all pretty abstract, so let's generate a couple more spatial weights matrices (for 5 and 50 nearest neighbours) and then we'll see how that changes our analyses later on.
listws05 = KNN.from_dataframe(county_word_freq_df, k=5)
listws50 = KNN.from_dataframe(county_word_freq_df, k=50)
Now that we have our SWMs, we can conduct global spatial autocorrelation analyses, specifically global Moran's I. Our goal is to test if the values of a variable show an overall regional pattern, as opposed to being randomly distributed across space. Basically, Moran's I compares the values of the variable around each location to see if nearby locations (as defined by the SWM) tend to have similar values (clustering), different values (dispersal), or unrelated values (randomness). Moran's I generally ranges from -1 to +1, with clustering being associated with positive values, dispersion with negative values, and randomness with values around 0. In dialectology, we don't really find much dispersion (e.g. a checkerboard pattern), so in general we're just interested in measuring the strength of clustering.
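For reference, the standard definition of global Moran's I for a variable $x$ measured at $n$ locations, with spatial weights $w_{ij}$, is
$$ I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2} $$
where $\bar{x}$ is the mean of $x$ across all locations. We don't need to compute any of this by hand; PySAL's Moran class does it for us, along with the significance test.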
mi = Moran(county_word_freq_df.i, listws25)
print("Moran's I:", mi.I)
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm)
mi = Moran(county_word_freq_df.the, listws25)
print("Moran's I:", mi.I)
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm)
# NOTE: 'and' is a python keyword so we have to use the other way of looking up values
mi = Moran(county_word_freq_df['and'], listws25)
print("Moran's I:", mi.I)
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm)
mi = Moran(county_word_freq_df.oregon, listws25)
print("Moran's I:", mi.I)
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm)
mi = Moran(county_word_freq_df.now, listws25)
print("Moran's I:", mi.I)
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm)
mi = Moran(county_word_freq_df.sneakers, listws25)
print("Moran's I:", mi.I)
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm)
mi = Moran(county_word_freq_df.battlefield, listws25)
print("Moran's I:", mi.I)
print("Moran's p-value (under normality assumption): %0.19f" % mi.p_norm)
Notably, given the p-values, we can see that they all show non-random regional patterns (as can be plainly seen by inspecting the maps), which is generally true of word maps in this dataset at least. We can also use these Moran’s I scores to rank maps in terms of the strength of the regional patterns they show.
To calculate Moran's I for the word maps, we loop through the base dataset word by word and then store the results in a new data frame, including the word, the Moran's I value, and the associated p-value, which we calculate under the assumption of randomness (i.e. we do not assume normality), using a one-tailed greater-than test, since we're not looking for dispersion. We're also going to run three different tests per word, one for each SWM, so we can look at how that changes our results. To keep the workshop quick, we'll run this on a random sample of 100 of the 10,000 words rather than all of them.
# NOTE: we'll convert this to a pandas dataframe after
moran_results = defaultdict(list)
# To keep the workshop quick, sample 100 random words rather than all 10,000
words = list(county_word_freq_df.columns[7:])
random.shuffle(words)
words = words[:100]
for word in tqdm.tqdm_notebook(words):
    moran_results['NAME'].append(word)
    # Compute Moran's I under each of our three spatial weights matrices,
    # using a one-tailed test under the randomisation assumption
    for k, swm in [(5, listws05), (25, listws25), (50, listws50)]:
        mi = Moran(county_word_freq_df[word], swm, two_tailed=False)
        moran_results['I_%dnn' % k].append(round(mi.I, 3))
        moran_results['p_%dnn' % k].append(mi.p_rand)
moran_results = pd.DataFrame(moran_results)
moran_results.sort_values('I_25nn').head()
moran_results.sort_values('I_25nn').tail()
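Since we computed Moran's I under all three SWMs, we can also do a quick comparison (just a small sketch using the columns created above) of which neighbourhood size gives the strongest clustering for each sampled word:
# For how many of the sampled words is Moran's I highest under each SWM?
moran_results[['I_5nn', 'I_25nn', 'I_50nn']].idxmax(axis=1).value_counts()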
From this you can see that all the maps show significant spatial clustering. We can also see that varying the SWM doesn't make that much difference here: we get highly significant results across the board (even if we correct alpha). For the most part, Moran's I and the associated p-value are strongest here for SWMs based on 5 nearest neighbours, but it is not generally the case that Moran's I will be maximised by looking at the smallest number of neighbours (i.e. n = 2), or that Moran's I will be maximised for all the variables in a regional linguistic dataset by SWMs based on the same number of neighbours, so this result shouldn't be overinterpreted.
Next, we can run a local spatial autocorrelation analysis. Whereas global Moran's I provides a single value reflecting the overall amount of spatial clustering in a map, local Getis-Ord Gi* provides a value for each location indicating the degree to which it is part of a high- or low-value cluster, assigning a positive z-score to locations that are part of high-value clusters and a negative z-score to locations that are part of low-value clusters.
Crucially, these values can then be mapped across the locations to identify areas of high and low values (hot spot maps), essentially generating smoothed maps that isolate the underlying regional signal in each variable, while ignoring non-regional variability. Local spatial autocorrelation analysis is therefore very useful in dialectology, as it basically automates the process of mapping isoglosses.
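For reference, the underlying (unstandardised) Getis-Ord $G_i^*$ statistic is simply the share of the overall total that falls within location $i$'s neighbourhood (including $i$ itself):
$$ G_i^* = \frac{\sum_j w_{ij} x_j}{\sum_j x_j} $$
What PySAL reports, and what we map below, is the z-score of this quantity under the null hypothesis of no spatial association; again, we don't need to compute this ourselves.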
To conduct the local spatial autocorrelation analysis, we generate Getis-Ord Gi* z-scores for each location for each variable, using the 25 nearest neighbour SWM.
getis_ord_g = pysal.explore.esda.G_Local(county_word_freq_df['i'], listws25, transform='B')
getis_ord_g.Zs
Let's make a function that will plot a word's Getis-Ord Gi* z-scores for all the counties.
def plot_spatial_corr_map(word, neighbors=listws25):
    '''Plots a US hot-spot map of the word's Getis-Ord Gi* z-scores, computed
    from county_word_freq_df using the given spatial weights matrix.
    '''
fig, ax = plt.subplots(1, figsize=(11,8.5))
gol = pysal.explore.esda.G_Local(county_word_freq_df[word], neighbors, transform='B')
county_word_freq_df['z-score'] = gol.Zs
# Create the color scale; note we need an even number of colors to avoid
# having white as a middle color that's not exactly "middle" in value
cmap = mpl.colors.LinearSegmentedColormap.from_list(
'Custom cmap', ["steelblue", "white", "tomato"], 10)
county_word_freq_df.plot(cmap=cmap, ax=ax, column='z-score',
linewidth=0.8, edgecolor='0.8')
tmp = us_states.plot(facecolor="none", ax=ax,
linewidth=0.4, edgecolor='0')
# The tick marks we want for our map
bounds = [min(county_word_freq_df['z-score']), -2.33, -1.96, -1.28, -.66, 0, 0.66,
1.28, 1.96, 2.33, max(county_word_freq_df['z-score'])]
# This controls how the colors are scaled using the above bounds
norm = mpl.colors.BoundaryNorm(bounds, cmap.N, clip=True)
# This sets where the colorbar goes and its size
cbax = fig.add_axes([0.15, 0.25, 0.25, 0.02])
cb2 = mpl.colorbar.ColorbarBase(cbax, cmap=cmap,
norm=norm,
ticks=bounds,
orientation='horizontal')
cb2.set_label(r'\Huge{%s}\\\large{Getis-Ord $G_i^*$ z-score}' % word)
cb2.ax.xaxis.set_label_position('top')
cb2.ax.xaxis.label.set_multialignment('center')
plt.setp(plt.getp(cb2.ax.axes, 'xticklabels'), fontsize='small')
# NOTE: this is kind of a hack so that we only show specific labels
# in the color bar legend, which otherwise becomes crowded
#
# NOTE2: Matplotlib doesn't use the '-' character found on your
# keyboard when printing negative numbers; it instead uses a
# specific negation character, which is pasted into the examples below
# .replace('-', '−')
labels_we_want_to_show = [('%0.2f' % min(bounds)),
'%0.2f' % max(bounds),
'0.00', '1.96', '-1.96']
# Using LaTeX for captions means matplotlib adds the latex math-mode markers
labels_we_want_to_show = ['$'+x+'$' for x in labels_we_want_to_show]
for label in cb2.ax.xaxis.get_ticklabels()[::]:
if label.get_text() not in labels_we_want_to_show:
#print(label.get_text())
label.set_visible(False)
pass
# Turn off matplotlib's bounding box
tmp = ax.axis('off')
plot_spatial_corr_map('i')
plot_spatial_corr_map('the')
plot_spatial_corr_map('and')
plot_spatial_corr_map('but')
plot_spatial_corr_map('oregon')
plot_spatial_corr_map('snow')
plot_spatial_corr_map('fuck')
plot_spatial_corr_map('sneakers')
plot_spatial_corr_map('battlefield')
plot_spatial_corr_map('and', neighbors=listws05)
plot_spatial_corr_map('and', neighbors=listws25)
plot_spatial_corr_map('and', neighbors=listws50)
plot_spatial_corr_map('oregon', neighbors=listws05)
plot_spatial_corr_map('oregon', neighbors=listws25)
plot_spatial_corr_map('oregon', neighbors=listws50)
plot_spatial_corr_map('gtf', neighbors=listws25)
plot_spatial_corr_map('hurricane', neighbors=listws25)
That’s the end of the tutorial. Please let us know if you have any questions or feedback. In addition to this python code and the dataset, please also see the OSF repository (https://osf.io/e2f63/) for the matching R code by Jack Grieve and our slides for the rest of this workshop, where we argue for the value of this type of research in dialectology and sociolinguistics, as well as additional code. And if you’re interested in learning more about these types of methods and implementing them in R, please see Jack's upcoming book with Martijn Wieling ‘Regional Dialectology: Quantitative Approaches in R’.