Mapping Word Frequencies on Twitter using Python

Python Markdown for Jack Grieve and David Jurgens. 2019. Mapping word frequencies on Twitter using R and Python. Workshop presented at NWAV 48, University of Oregon, 2019-10-10.

David Jurgens, University of Michigan, jurgens@umich.edu

2019-10-03

1 Summary

This markdown presents the Python code for our workshop at NWAV 48 on Mapping Word Frequencies on Twitter using R and Python. The original R markdown was prepared by Jack Grieve; this parallel Python version was prepared by David Jurgens. The point of the workshop is to show how to map word frequencies in R and Python and to explain why a linguist might want to do this. We’ll be working with a dataset based on a corpus of American geolocated tweets. For more information on the conference and the workshop see https://nwav48.uoregon.edu/workshops/.

2 Packages

This is the full set of Python packages used in this markdown. For this workshop we’ll be working mostly with the geopandas and PySAL packages, which are typically the easiest way to read, plot, and analyse spatial data in Python and to perform spatial statistics.
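If you don't already have these packages, something along these lines should work from inside Jupyter (the exact package list is an assumption; geopandas in particular may need extra system dependencies depending on your platform):

In [ ]:
# Install the dependencies from within a Jupyter cell
# (or drop the leading ! and run in a shell)
!pip install geopandas pysal mapclassify matplotlib pandas tqdm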

In [1]:
# This first line makes sure that our plots will show up 
# in the jupyter notebook if you want to run this code
%matplotlib inline
import geopandas as gpd
import pandas as pd
import pysal
from collections import defaultdict
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.pylab import cm
from pysal.explore.esda.getisord import G_Local
from pysal.explore.esda.moran import Moran
from pysal.lib.weights import KNN
import tqdm
import random
import mapclassify

# This lets us do slightly fancier LaTeX-based captions in plots
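# (note: usetex requires a working LaTeX installation on your system;
# comment out the next line if you don't have one)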
plt.rc('text', usetex=True)

3 US Counties Map Template

3.1 US County Map

First, we read in the US counties map using geopandas from a shapefile, a kind of file that describes the geometry of objects (e.g., their latitude and longitude). You can download this counties shapefile here.

In [2]:
county_map = gpd.read_file('UScounties.shp')

A geopandas GeoDataFrame is just like a regular pandas dataframe (or an R data frame, if you know those), so any regular pandas operation also works here. Let's take a look at what's in this dataframe using head(), which shows the first five rows of data. We can see that we have regular columns plus a special geometry column that describes each county's shape; geopandas knows how to interpret this column in order to do spatial operations, which will make our life easier later.

In [3]:
county_map.head()
Out[3]:
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS geometry
0 Lake of the Woods Minnesota 27 077 27077 POLYGON ((-95.34283 48.54668, -95.34105 48.715...
1 Ferry Washington 53 019 53019 POLYGON ((-118.85163 47.94956, -118.84846 48.4...
2 Stevens Washington 53 065 53065 POLYGON ((-117.43883 48.04412, -117.54219 48.0...
3 Okanogan Washington 53 047 53047 POLYGON ((-118.97209 47.93915, -118.97406 47.9...
4 Pend Oreille Washington 53 051 53051 POLYGON ((-117.43858 48.99992, -117.03205 48.9...
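
Since this is an ordinary dataframe, standard pandas operations apply directly. As a quick sketch, here's a one-liner that counts how many counties each state has (output omitted):

In [ ]:
# Count the number of county rows per state
county_map['STATE_NAME'].value_counts().head()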

Let's plot the county map.

In [4]:
county_map.plot()
Out[4]:
[Map of all US counties, including Alaska and Hawaii]

For visual convenience, we'll filter out Alaska and Hawaii so the map shows just the continental US (apologies to those of you from these states!).

In [5]:
# Drop Alaska and Hawaii to just be the continental US
county_map = county_map[(county_map['STATE_NAME'] != 'Alaska') 
                          & (county_map['STATE_NAME'] != 'Hawaii')]
In [6]:
county_map.plot()
Out[6]:
[Map of the counties of the contiguous US]

3.2 Map Projection

In addition, we can re-project the map. It’s nice to use a different projection system so that we get something other than the flat map we’re used to seeing for the US. Here we’ll use a Lambert azimuthal equal-area projection, which can make our map look more professional; the Albers equal-area projection is another standard choice. We do this by defining the projection in the PROJ.4 system (don’t worry about the details...I don’t entirely understand them myself---but thankfully geopandas does!) and then transforming the spatial polygon map. In practice, we can use one of many standard projections or define our own to best highlight a particular geographic region.

I've included one below that might be of use. We'll use the same setup as before, but before calling plot(), we first call to_crs() on the dataframe itself, which changes its coordinate reference system to the new projection (essentially, changing where a particular lat/lon value gets placed on the map).

In [7]:
# This is our fancy projection
projection = "+proj=laea +lat_0=30 +lon_0=-95"
county_map = county_map.to_crs(projection)
In [8]:
county_map.plot()
Out[8]:
[Map of the contiguous US counties in the Lambert equal-area projection]
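
If you'd prefer an Albers projection instead, here's a sketch using the standard "USA Contiguous Albers Equal Area Conic" parameters (a common choice for the lower 48; adjust the parameters to highlight other regions):

In [ ]:
# An Albers equal-area alternative for the contiguous US
albers = "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"
county_map.to_crs(albers).plot()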

We can also get rid of that pesky bounding box by turning off the plot's axes, which is the last prettification step we'll do for now. To do this, we (1) create a figure with Matplotlib and turn its axis markers off using ax.axis('off'), and (2) tell geopandas to plot the map on that figure.

In [9]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_map.plot(ax=ax)
Out[9]:
[The projected county map with the bounding box and axes removed]
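
If you want to keep the finished map, Matplotlib's savefig will write the figure to disk (the filename below is just an example):

In [ ]:
# Save the map; vector formats like PDF stay sharp at any zoom level
fig.savefig('county_map.pdf', bbox_inches='tight')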

3.3 Linguistic Data

Now, let’s read in the dataset, which can be downloaded here. The dataset identifies each county by its FIPS code, a five-digit numeric code that uniquely identifies it. Python can sometimes get confused and read this column in as an "int" type, which causes a few subtle bugs later (e.g., dropping leading zeros), so we're going to manually specify that the column should be treated as a string using dtype={'FIPS': str} when loading the data.

In [10]:
word_freq_df = pd.read_csv('US_CNTY_3076_TWIT_10K_WORD.txt', sep=' ', dtype={'FIPS': str})

This dataset consists of the relative frequencies per million words of the top 10,000 most common words (defined as strings of case-insensitive alphabetic characters plus hyphens), measured across the 3,076 counties in the contiguous US, based on a multi-billion-word corpus of American geolocated Twitter data collected between October 2013 and November 2014 (see Huang et al. 2016, Grieve et al. 2018).

For example, consider the values for the last few counties in the dataset. The county name is in column 1 and the county's FIPS code is in column 2.

In [11]:
word_freq_df.tail()
Out[11]:
county_ID FIPS a a. aa aaliyah aaron ab abandoned abbey ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
3071 wyoming,sweetwater 56037 17786.28 0.0 6.44 3.22 12.88 6.44 3.22 0.00 ... 6.44 6.44 0.0 28.97 12.88 35.41 6.44 12.88 16.09 12.88
3072 wyoming,teton 56039 18783.86 5.2 10.40 0.00 20.80 2.60 15.60 23.39 ... 7.80 10.40 7.8 18.20 5.20 54.59 0.00 7.80 13.00 10.40
3073 wyoming,uinta 56041 17720.52 0.0 0.00 0.00 14.66 14.66 0.00 58.63 ... 0.00 0.00 0.0 29.31 29.31 29.31 0.00 0.00 0.00 29.31
3074 wyoming,washakie 56043 19774.68 0.0 14.75 0.00 14.75 14.75 0.00 0.00 ... 0.00 0.00 0.0 29.49 0.00 29.49 0.00 0.00 44.24 0.00
3075 wyoming,weston 56045 20534.10 0.0 0.00 0.00 0.00 27.31 0.00 0.00 ... 0.00 0.00 0.0 0.00 0.00 54.61 0.00 0.00 0.00 0.00

5 rows × 10002 columns

We can also get a quick statistical summary of the data.

In [12]:
word_freq_df.describe()
Out[12]:
a a. aa aaliyah aaron ab abandoned abbey abby abc ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
count 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 ... 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000 3076.000000
mean 17686.561661 4.036619 8.119798 4.479535 31.770312 18.642633 5.292051 3.490634 14.144178 10.266057 ... 10.606531 4.899717 4.677120 15.390471 8.407350 39.698365 3.514844 15.925293 4.380306 91.983557
std 1806.721774 50.617120 24.687379 10.843293 54.384890 43.162770 10.050835 8.150068 23.625402 19.273659 ... 33.785532 14.547293 9.880733 25.511622 13.369255 35.477047 5.215757 23.753035 10.352628 4898.997368
min 2196.160000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 16837.517500 0.000000 0.000000 0.000000 13.637500 3.700000 0.000000 0.000000 2.537500 3.760000 ... 1.407500 0.000000 0.000000 7.597500 0.000000 26.610000 0.000000 5.270000 0.000000 0.000000
50% 17721.195000 1.460000 3.460000 1.130000 24.195000 9.020000 3.800000 1.225000 9.460000 8.410000 ... 5.355000 2.405000 2.710000 13.725000 6.875000 36.950000 2.755000 11.350000 3.100000 1.620000
75% 18596.060000 3.512500 6.772500 5.630000 37.350000 16.712500 6.060000 4.275000 17.610000 13.060000 ... 8.672500 5.722500 5.172500 19.170000 10.555000 46.462500 4.762500 19.862500 5.190000 4.782500
max 55189.400000 2770.080000 446.890000 241.440000 1936.620000 875.660000 297.420000 169.070000 547.950000 898.110000 ... 683.880000 607.560000 220.390000 1203.370000 316.290000 731.140000 110.380000 575.620000 329.380000 271710.000000

8 rows × 10000 columns

Before moving on, it’s worth quickly explaining how this dataset was generated. First, each tweet in the corpus was assigned to the county in which it occurred, based on the longitude and latitude in its metadata. To normalise for variation in sample size, which reflects variation in county population, the relative frequency of each word in each county was computed by dividing its frequency by the total number of word tokens in that county and then multiplying by 1 million. It is crucial to normalise; otherwise you just get population maps.
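
As a minimal sketch of that normalisation step, assuming a hypothetical raw_counts dataframe with one row per county and one column of raw token counts per word (this is an illustration, not the actual pipeline used to build the workshop dataset):

In [ ]:
# raw_counts is hypothetical: rows = counties, columns = raw word counts
tokens_per_county = raw_counts.sum(axis=1)   # total tokens in each county
per_million = raw_counts.div(tokens_per_county, axis=0) * 1e6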

3.4 Joining the Data

Now we can join the linguistic dataset to the geopandas map object based on the two sets of county IDs, which in the US are known as FIPS codes. For those of you who know databases or SQL, pandas (and therefore geopandas) lets us easily do inner joins on the data frames.

In [13]:
county_word_freq_df = pd.merge(county_map, word_freq_df, on='FIPS', how='inner')

After joining the data, we'll reset the coordinate reference system (CRS) to the one we want so everything keeps plotting the way we expect.

In [14]:
county_word_freq_df = county_word_freq_df.to_crs(projection)

We can see that the county_word_freq_df data frame now contains roughly 10,000 more columns than the county_map data frame (one per word, plus the county_ID column).

In [15]:
county_word_freq_df.shape
Out[15]:
(3075, 10007)
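
Note that the merged frame has 3,075 rows while the frequency data covered 3,076 counties, so one county failed to match during the join. If you're curious which one, a left join with indicator=True flags unmatched rows (a quick diagnostic sketch):

In [ ]:
# Find frequency-data counties whose FIPS code has no match in the map
check = pd.merge(word_freq_df[['county_ID', 'FIPS']], county_map[['FIPS']],
                 on='FIPS', how='left', indicator=True)
check[check['_merge'] == 'left_only']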

As a sanity check, let's take a look at the first few rows to confirm that each county now has both its geometry and the relative frequencies for each word (as separate columns).

In [16]:
county_word_freq_df.head()
Out[16]:
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS geometry county_ID a a. aa ... zip zoe zoey zombie zombies zone zoned zoo zoom zumba
0 Lake of the Woods Minnesota 27 077 27077 POLYGON ((-25655.892 2049834.670, -25444.188 2... minnesota,lake of the woods 21027.84 0.00 0.00 ... 0.00 0.0 0.0 33.89 0.00 16.94 33.89 16.94 0.0 16.94
1 Ferry Washington 53 019 53019 POLYGON ((-1775417.832 2196570.086, -1758130.5... washington,ferry 12828.09 0.00 0.00 ... 337.58 0.0 0.0 0.00 0.00 84.39 0.00 0.00 0.0 0.00
2 Stevens Washington 53 065 53065 POLYGON ((-1670527.465 2182361.609, -1678023.8... washington,stevens 16853.05 0.00 0.00 ... 51.16 0.0 0.0 38.37 123.64 21.32 0.00 4.26 0.0 0.00
3 Okanogan Washington 53 047 53047 POLYGON ((-1784425.905 2197600.293, -1783913.3... washington,okanogan 18685.00 17.85 4.46 ... 4.46 0.0 0.0 17.85 4.46 22.31 0.00 13.38 0.0 0.00
4 Pend Oreille Washington 53 051 53051 POLYGON ((-1641330.622 2285156.266, -1612439.6... washington,pend oreille 15708.26 0.00 21.40 ... 64.20 0.0 0.0 21.40 0.00 21.40 0.00 0.00 0.0 0.00

5 rows × 10007 columns

4 Mapping Word Frequencies

Now we can map the relative frequencies of any of these 10,000 words across the 3,076 counties in the contiguous US.

4.1 Simple Maps

Let’s first make a quick map for the relative frequency of ‘i’, the most common word in the corpus.

To make the map, we tell geopandas which column to use to colour the map, which produces a choropleth map (a map where the colour of each shape is based on the value of an associated variable). Here, we just include the argument column='i' in our plot() command on the county_word_freq_df dataframe.

In [17]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True)
Out[17]:
[Choropleth map of the relative frequency of 'i' by county, on a continuous colour scale]

Not much there, but that’s because the distribution here is highly skewed. Let's take a look at the frequency distribution. Since every geopandas dataframe is also a pandas dataframe, we can use its built-in plotting features to show histograms (among other things). Here, let's show a histogram with 40 bins of the frequency of "i" by county.

In [18]:
county_word_freq_df.i.hist(bins=40)
Out[18]:
[Histogram of the relative frequency of 'i' across counties, 40 bins]

There's clearly an outlier here:

In [19]:
county_word_freq_df['i'].describe()
Out[19]:
count      3075.000000
mean      44810.952039
std        6332.354688
min        3573.250000
25%       42300.065000
50%       45716.950000
75%       48359.530000
max      192438.870000
Name: i, dtype: float64
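
To see which county is driving that enormous maximum, idxmax() returns the row label of the largest value:

In [ ]:
# Look up the county with the highest relative frequency of 'i'
county_word_freq_df.loc[county_word_freq_df['i'].idxmax(),
                        ['NAME', 'STATE_NAME', 'i']]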

By default, colour breaks are assigned by splitting the range between the minimum and maximum into equally spaced intervals and giving each interval a distinct colour, so most values end up bunched into the same few bands. Geopandas includes a scheme argument for plot() that tells it how to decide which values get mapped to which colours. We can simulate the default behaviour more explicitly by adding the scheme='equal_interval' argument to plot(), which uses equal-sized bins with discrete colours.

In [20]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True,
                         scheme='equal_interval')
Out[20]:
[Choropleth map of 'i' with equal-interval colour breaks]

The basic alternative (there are other, more complex approaches) is to define these colour breaks across unequal intervals so that each colour band contains roughly the same number of locations, by setting scheme='quantiles'.

In [21]:
fig, ax = plt.subplots(1, figsize=(11,8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True, scheme='quantiles')
Out[21]:
[Choropleth map of 'i' with quantile colour breaks]
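
The number of colour bands and the colour scheme can also be tuned. As a final sketch, here's the same map with seven quantile bins and the viridis colormap (k and cmap are standard geopandas plotting arguments):

In [ ]:
fig, ax = plt.subplots(1, figsize=(11, 8.5))
ax.axis('off')
county_word_freq_df.plot(ax=ax, column='i', legend=True,
                         scheme='quantiles', k=7, cmap='viridis')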