Python Markdown for Jack Grieve and David Jurgens. 2019. Mapping word frequencies on Twitter using R and Python. Workshop presented at NWAV 48, University of Oregon, 2019-10-10.
David Jurgens, University of Michigan, jurgens@umich.edu
2019-10-03
This markdown presents the Python code for our workshop at NWAV 48 on Mapping Word Frequencies on Twitter using R and Python. The original R markdown was prepared by Jack Grieve; the parallel Python code presented here was prepared by David Jurgens. The point of the workshop is to show how to map word frequency in R and Python and to explain why a linguist might want to do this. We’ll be working with a dataset based on a corpus of American geolocated tweets. For more information on the conference and the workshop see https://nwav48.uoregon.edu/workshops/.
This is the full set of Python packages used in this markdown. For this workshop we’ll be working mostly with the geopandas and PySAL packages, which are typically the easiest way to work with spatial data in Python, plot it, and perform spatial statistics.
# This first line makes sure that our plots will show up
# in the jupyter notebook if you want to run this code
%matplotlib inline
import geopandas as gpd
import pandas as pd
import pysal
from collections import defaultdict
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.pylab import cm
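# Spatial statistics tools from PySAL (used later in this markdown)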
from pysal.explore.esda.getisord import G_Local
from pysal.explore.esda.moran import Moran
from pysal.lib.weights import KNN
import tqdm
import random
import mapclassify
# This lets us do slightly fancier LaTeX-based captions in plots
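# (note: usetex requires a working LaTeX installation; set it to
#  False if you don't have LaTeX on your machine)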
plt.rc('text', usetex=True)
First, we read in the US counties map using geopandas from a shapefile, which is a kind of file that describes the geometry of objects (e.g., their latitude and longitude). You can download this counties shapefile here.
county_map = gpd.read_file('UScounties.shp')
A geopandas dataframe is just like a regular pandas dataframe (or an R dataframe if you know that), so any regular pandas operation also works here. Let's take a look at what's in this dataframe using head(), which shows the first five rows of data. We can see that we have regular columns and a special geometry column that describes each shape; geopandas knows how to interpret this column in order to do spatial operations, which will make our life easier later.
county_map.head()
Let's plot the county map.
county_map.plot()
For visual convenience, we'll filter out Alaska and Hawaii so it's just the continental US (apologies to those of you from these states!).
# Drop Alaska and Hawaii to just be the continental US
county_map = county_map[(county_map['STATE_NAME'] != 'Alaska')
& (county_map['STATE_NAME'] != 'Hawaii')]
county_map.plot()
In addition, we can re-project the map. In this case we’ll use a Lambert equal-area projection, which can make our map look more professional. It’s nice to use a different projection system so we get a less flat map than we’re used to seeing for the US; the Albers projection is another standard choice. We do this by defining the projection in the PROJ.4 system (don’t worry about the details...I don’t entirely understand them myself---but thankfully geopandas does!) and then transforming the spatial polygon map. In practice, we can use one of many standard projections or define our own to best highlight a particular geographic region. I've included one below that might be of use. We'll use the same setup as before, but before calling plot(), we first call to_crs() on the dataframe itself, which changes the coordinate reference system to the projection (essentially, changing where a particular lat/lon value gets placed on a map).
# This is our fancy projection
projection = "+proj=laea +lat_0=30 +lon_0=-95"
county_map = county_map.to_crs(projection)
county_map.plot()
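Since the Albers projection was mentioned above as another standard choice, here is a minimal sketch of what using it would look like. The parameter values below are the commonly used contiguous-US Albers settings, not part of the original workshop materials.
# Hypothetical alternative: Albers equal-area projection for the contiguous US
albers = "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"
county_map.to_crs(albers).plot()
Either projection is fine for our purposes; the rest of this markdown sticks with the Lambert projection defined above.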
We can also get rid of that pesky bounding box by turning the axis off of the plot, which is the last prettification step we'll do for now. To do this, we'll (1) create a figure object with Matplotlib and turn its axis markers off using ax.axis('off') and (2) tell geopandas to plot the map on that figure.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_map.plot(ax=ax)
Now, let’s read in the dataset, which can be downloaded here. The dataset contains each county's FIPS code, a five-digit numeric code that uniquely identifies it. Python can sometimes get confused and read this in as an "int" type, which causes a few subtle bugs later, so we're going to manually specify that the column should be treated as a string using dtype={'FIPS': str} when loading the data.
word_freq_df = pd.read_csv('US_CNTY_3076_TWIT_10K_WORD.txt', sep=' ', dtype={'FIPS': str})
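To see why the string dtype matters, here is a small illustrative sketch (the county is real, but the snippet is just for demonstration): FIPS codes can start with a leading zero, which an integer column silently drops.
# e.g., Autauga County, AL has the FIPS code '01001'
print(int('01001'))                   # -> 1001: the leading zero is lost
print(str(int('01001')) == '01001')   # -> False: a join on FIPS would quietly fail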
This dataset consists of the relative frequency per million words of the top 10,000 most common words (where a word is defined as a string of case-insensitive alphabetic characters plus hyphens), measured across the 3,076 counties in the contiguous US, based on a multi-billion-word corpus of American geolocated Twitter data collected between October 2013 and November 2014 (see Huang et al. 2016, Grieve et al. 2018).
For example, consider the values for the last few rows. The county name is in column 1 and the FIPS code for the county is in column 2.
word_freq_df.tail()
We can also get a quick summary of the data:
word_freq_df.describe()
Before moving on, it’s worth quickly explaining how this dataset was generated. First, each tweet in the corpus was grouped by the county in which it occurred, based on the longitude and latitude in its metadata. To normalise for variation in sample size, which reflects variation in county population, the relative frequency of each word in each county was computed by dividing its frequency by the total number of word tokens in that county and then multiplying by 1 million. It is crucial to normalise; otherwise you just get population maps.
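As a concrete sketch of that normalisation (the counts below are invented for illustration):
# Suppose a county produced 2,500,000 word tokens in total,
# and the word 'i' occurred 75,000 times in that county
word_count = 75_000
total_tokens = 2_500_000
rel_freq = word_count / total_tokens * 1_000_000
print(rel_freq)  # 30000.0 occurrences per million words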
Now we can join the linguistic dataset to the geopandas map object based on the two sets of county IDs, which in the US are known as FIPS codes. For those of you who know databases or SQL, pandas (and therefore geopandas) lets us easily do inner joins on the data frames.
county_word_freq_df = pd.merge(county_map, word_freq_df, on='FIPS', how='inner')
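As a quick sanity check on the inner join (a sketch; the exact numbers depend on your copies of the files), we can compare row counts, since only counties present in both tables survive the join:
# Rows in the map, the word frequency table, and the joined result
print(len(county_map), len(word_freq_df), len(county_word_freq_df))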
After joining the data, we'll reset the coordinate reference system (CRS) to the one we want so everything keeps plotting the way we expect.
county_word_freq_df = county_word_freq_df.to_crs(projection)
We can see that the county_word_freq_df data frame now contains 10,000 more columns than the county_map data frame.
county_word_freq_df.shape
As a sanity check, let's take a look at the first few rows to see that the dataframe has the relative frequencies for each word (as separate columns).
county_word_freq_df.head()
Now we can map the relative frequencies of any of these 10,000 words across the 3,076 counties in the contiguous US.
Let’s first make a quick map for the relative frequency of ‘i’, the most common word in the corpus.
To make the map, we'll tell geopandas which column to use to color the map, which makes a choropleth map (a map where the color of each shape is based on the value of an associated variable). Here, we just include the argument column='i' in our plot() command on the county_word_freq_df dataframe.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True)
Not much there, but that’s because the distribution here is highly skewed. Let's take a look at the frequency distribution. Since each geopandas dataframe is also a pandas dataframe, we can use its built-in plotting features to show histograms (among other things). Here, let's try showing a histogram with 40 bins for the frequency of 'i' by county.
county_word_freq_df.i.hist(bins=40)
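The long tail is hard to see on a linear scale; as an optional sketch, the same histogram with a logarithmic y-axis makes the extreme values easier to spot:
# Same histogram, but with a log-scaled y-axis
county_word_freq_df['i'].hist(bins=40, log=True)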
There's clearly an outlier here:
county_word_freq_df['i'].describe()
By default, colour breaks are assigned by dividing the range between the min and max value into equally spaced intervals, each with its own colour, so most values end up bunched together in a few bands. Geopandas includes a scheme argument for plot() that tells geopandas how to decide which values get mapped to which colors. We can simulate the default behavior more explicitly by adding the scheme='equal_interval' argument to plot(), which will use equal-sized bins with discrete colors.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True,
scheme='equal_interval')
The basic alternative (there are other, more complex approaches) is to define these colour breaks across unequal intervals so that each colour band has roughly an equal number of locations, by setting scheme='quantiles'.
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True, scheme='quantiles')
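If you want to see exactly where those quantile breaks fall, the mapclassify package (which geopandas uses under the hood for the scheme argument) can compute them directly. A minimal sketch with five classes:
# Quantile class breaks for 'i': five bands with roughly
# equal numbers of counties in each
breaks = mapclassify.Quantiles(county_word_freq_df['i'], k=5)
print(breaks)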