
Tag Archives: DATA 712

Fangraphs Seasonal Pitching Data, 2000 – 2019

This data set contains seasonal pitching data from the MLB from 2000 to 2019.

Download a Zip Archive of the Data and Script (includes CSV and RDS)

Click here for an explanation of the variables

Wrangling Operation

The operation requires the following packages, particularly Bill Petti’s excellent baseballr package for wrangling MLB data:

# Download the Baseball R Package:
# devtools::install_github(repo = "BillPetti/baseballr")
library(baseballr)
library(dplyr)    # for arrange()


First, I create a table of player identifiers from the Chadwick Baseball Bureau using get_chadwick_lu() in baseballr. These identifiers will help users merge this data table to other baseball data.

# Player Identifiers
dat.playerid <- get_chadwick_lu()
dat.playerid <- dat.playerid[c(1:7,13:15,19,25:28)]
saveRDS(dat.playerid, "Player Identifiers.RDS")
identifiers <- readRDS("Player Identifiers.RDS")

Next, I download seasonal MLB pitching performance data from Fangraphs through fg_pitch_leaders():

# Scraping Pitching Data
for (i in 2000:2019){
  temp <- fg_pitch_leaders(i, i, league = "all", qual = "n", ind = 1)
  assign(paste0("fg_pitch_", i), temp)
}

As these are all identical versions of the same data table, just representing different years, I can stack them together using rbind():

dat.pit <- fg_pitch_2000
for (i in 2001:2019){
  temp <- get(paste0("fg_pitch_", i))
  temp <- rbind(dat.pit, temp)
  assign("dat.pit", temp)
}
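The get()/assign() loop works, but the same stacking can be done in one step with do.call() over a list of tables. A toy sketch with made-up mini-tables:

```r
# Two toy "seasonal" tables with identical columns
yr1 <- data.frame(Season = 2000, Name = c("A", "B"), W = c(10, 12))
yr2 <- data.frame(Season = 2001, Name = c("A", "B"), W = c(11, 9))

# Stack every table in the list with a single rbind call
stacked <- do.call(rbind, list(yr1, yr2))
nrow(stacked)  # 4: both seasons combined
```

The same pattern scales to twenty seasons: build the list with mget(paste0("fg_pitch_", 2000:2019)) and pass it to do.call(rbind, ...).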

Rename the identifier in the Fangraphs table so that it matches the Chadwick Bureau identifier data. I then merge the two sets so that the pitching data can more readily be combined with other sources.

names(dat.pit)[1] <- "key_fangraphs"

dat.pit <- merge(dat.pit, identifiers, by = "key_fangraphs")

# Clean Up Data Types and Sort
dat.pit$key_fangraphs <- as.numeric(dat.pit$key_fangraphs)
dat.pit$Season <- as.numeric(dat.pit$Season)
dat.pit <- arrange(dat.pit, Name, Season)

# Write data
write.csv(dat.pit, "Fangraphs Pitching Leaders 2000 - 2019.csv")
saveRDS(dat.pit, "Fangraphs Pitching Leaders 2000 - 2019.RDS")

# Clean up memory 
rm(list=ls(pattern = "fg_pitch"))

Photo Credit. Derivative work by Amineshaker, from Wahkeenah’s Image:Nolan_Ryan_in_Atlanta.jpg. Public Domain, https://commons.wikimedia.org/w/index.php?curid=5022538

Fangraphs Seasonal Batting Data, 2000 – 2019

This data set contains seasonal batting data from the MLB from 2000 to 2019.

Download a Zip Archive of the Data and Script (includes CSV and RDS)

Click here for an explanation of the variables

Wrangling Operation

The operation requires the following packages, particularly Bill Petti’s excellent baseballr package for wrangling MLB data:

# Download the Baseball R Package:
# devtools::install_github(repo = "BillPetti/baseballr")
library(baseballr)
library(dplyr)    # for arrange()


First, I create a table of player identifiers from the Chadwick Baseball Bureau using get_chadwick_lu() in baseballr. These identifiers will help users merge this data table to other baseball data.

# Player Identifiers
dat.playerid <- get_chadwick_lu()
dat.playerid <- dat.playerid[c(1:7,13:15,19,25:28)]
saveRDS(dat.playerid, "Player Identifiers.RDS")
identifiers <- readRDS("Player Identifiers.RDS")

Next, I download seasonal MLB batting leader tables from Fangraphs through fg_bat_leaders():

# Scraping Batting Data
for (i in 2000:2019){
  temp <- fg_bat_leaders(i, i, league = "all", qual = "n", ind = 1)
  assign(paste0("fg_bat_", i), temp)
}

As these are all identical versions of the same data table, just representing different years, I can stack them together using rbind():

dat.bat <- fg_bat_2000
for (i in 2001:2019){
  temp <- get(paste0("fg_bat_", i))
  temp <- rbind(dat.bat, temp)
  assign("dat.bat", temp)
}

Rename the identifier in the Fangraphs table so that it matches the Chadwick Bureau identifier data. I then merge the two sets so that the batting data can more readily be combined with other sources.

names(dat.bat)[1] <- "key_fangraphs"

temp <- merge(dat.bat, identifiers, by = "key_fangraphs")

# Clean Up Data Types and Sort
temp$key_fangraphs <- as.numeric(temp$key_fangraphs)
temp$Season <- as.numeric(temp$Season)
temp <- arrange(temp, Name, Season)

# Write data
write.csv(temp, "Fangraphs Batting Leaders 2000 - 2019.csv")
saveRDS(temp, "Fangraphs Batting Leaders 2000 - 2019.RDS")

# Clean up memory 
rm(list=ls(pattern = "fg_bat"))

Using the Data

To load the data in your analysis:

# To load the CSV
dat.bat <- read.csv(paste0(data.directory, "Fangraphs Batting Leaders 2000 - 2019.csv"))

# To load the RDS
dat.bat <- readRDS(paste0(data.directory, "Fangraphs Batting Leaders 2000 - 2019.RDS"))

Here, data.directory is the path (ending in a slash) to the folder containing the data files.

An Introduction to Longitudinal Analysis

What is “Longitudinal Data”?

Longitudinal data follows subjects over multiple time periods. It is often contrasted with “cross-sectional data”, which measures subjects at a single point in time.

The data object dat (below) offers an example of longitudinal attendance data for Major League Baseball teams.

# To install baseballr package:
# library(devtools)
# install_github("BillPetti/baseballr")


team <- c("TOR", "NYY", "BAL", "BOS", "TBR",
          "KCR", "CLE", "DET", "MIN", "CHW",
          "LAA", "HOU", "SEA", "OAK", "TEX",
          "NYM", "PHI", "ATL", "MIA", "WSN",
          "MIL", "STL", "PIT", "CHC", "CIN",
          "SFG", "LAD", "ARI", "COL", "SDP")
for (i in team){
  temp <- team_results_bref(i, 2019)
  temp <- temp[c(1, 2, 3, 4, 5, 18)]
  assign(paste0("temp.dat.", i), temp)
}

temp.dat <- temp.dat.TOR
for (i in team[-1]){
  temp <- get(paste0("temp.dat.", i))
  temp.1 <- rbind(temp.dat, temp)
  assign("temp.dat", temp.1)
}
dat <- temp.dat

dat$Date <- paste0(dat$Date, ", 2019")   #Append the year to the date strings

#Drop rows whose dates contain parentheses (doubleheader game markers)
dat <- subset(dat, !(grepl("(", dat$Date, fixed = TRUE)))

dat$Date <- as.Date(parse_date(dat$Date, default_tz = ""))   #parse_date() from the parsedate package
head(dat, 10)
## # A tibble: 10 x 6
##       Gm Date       Tm    H_A   Opp   Attendance
##    <dbl> <date>     <chr> <chr> <chr>      <dbl>
##  1     1 2019-03-28 TOR   H     DET        45048
##  2     2 2019-03-29 TOR   H     DET        18054
##  3     3 2019-03-30 TOR   H     DET        25429
##  4     4 2019-03-31 TOR   H     DET        16098
##  5     5 2019-04-01 TOR   H     BAL        10460
##  6     6 2019-04-02 TOR   H     BAL        12110
##  7     7 2019-04-03 TOR   H     BAL        11436
##  8     8 2019-04-04 TOR   A     CLE        10375
##  9     9 2019-04-05 TOR   A     CLE        12881
## 10    10 2019-04-06 TOR   A     CLE        18429
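The date conversion above can also be sketched in base R. Assuming the raw Baseball Reference dates look like "Thursday, Mar 28" (a hypothetical example) before the year is appended:

```r
# Hypothetical raw date strings, weekday and month in English
raw <- c("Thursday, Mar 28", "Friday, Mar 29")

# Append the year, then parse with an explicit (locale-dependent) format string
with_year <- paste0(raw, ", 2019")
parsed <- as.Date(with_year, format = "%A, %b %d, %Y")
format(parsed)  # "2019-03-28" "2019-03-29"
```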

Our data set includes game number, date, home/away status, opponent, and attendance for the 2019 season. The data come from Baseball Reference, downloaded using the excellent baseballr package. To see how I downloaded and prepared this data for analysis, download this page’s Markdown file here.

Basic Terminology

Some terminology:

  • Units refer to the individual subjects that we are following across time. In the above data, our units are baseball teams.
  • Periods refer to the time periods in which the subjects were observed. Above, our periods are dates.
  • Cross-sections refer to comparisons across units within the same time period. A cross-section of our data set would only include attendance data for a single day.
  • Time series refer to data series pertaining to the same unit over time. Were our data set composed of only one team’s attendance data, it could be said to contain only one time series.
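These terms can be made concrete with a toy panel (the win totals are illustrative):

```r
# A toy longitudinal set: two units (teams) over three periods (seasons)
dat.toy <- data.frame(
  team   = rep(c("NYM", "NYY"), each = 3),
  season = rep(2017:2019, times = 2),
  wins   = c(70, 77, 86, 91, 100, 103)
)

# A cross-section: all units in a single period
cross_section <- subset(dat.toy, season == 2019)

# A time series: a single unit across all periods
time_series <- subset(dat.toy, team == "NYM")
```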

How is Longitudinal Data Useful?

Without longitudinal data, we are left to work with cross-sectional snapshots. A snapshot might tell us that 44,424 people came to see the Mets’ 2019 home opener, or that fans made 2,412,887 visits to Citi Field that year.

Longitudinal data allows us to assess the effects of changes in our units of analysis and in the environments in which they operate. Unpacking phenomena over time gives you new vantage points and bases for comparison:


dat.nym <- subset(dat, Tm == "NYM" & H_A == "H")   #Mets home games
dat.nym$Date.P <- as.POSIXct(dat.nym$Date)  
#Converting Date into POSIXct format.  See below.

ggplot(dat.nym, aes(x = Date.P, y = Attendance)) + 
  geom_col() +
  scale_x_datetime(breaks = date_breaks("7 days"), labels = date_format(format = "%m/%d")) +
  xlab("Date") + ylab("Attendance") + 
  scale_y_continuous(breaks = seq(0,45000,5000), label = comma) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("New York Mets Home Game Attendance, 2019")


It allows us to unpack time effects specifically, as with this table of mean attendance by day of the week:

dat.nym$day <- factor(as.POSIXlt(dat.nym$Date.P)$wday,
                      levels = 0:6,   #POSIXlt weekdays run 0-6, starting Sunday
                      labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday"))
temp <- aggregate(Attendance ~ day, data = dat.nym, mean)
temp$Attendance = round(temp$Attendance, 0)
names(temp)[1] <- "Day"
##         Day Attendance
## 1    Monday      22184
## 2   Tuesday      27660
## 3 Wednesday      27406
## 4  Thursday      30734
## 5    Friday      30826
## 6  Saturday      36087
## 7    Sunday      32914

Or mean attendance by month:

dat.nym$month <- factor(as.POSIXlt(dat.nym$Date.P)$mon,   #mon runs 0-11, starting January
                      labels = c("April", "May", "June",
                                 "July", "August", "September"))
temp <- aggregate(Attendance ~ month, data = dat.nym, mean)
temp$Attendance = round(temp$Attendance, 0)
names(temp)[1] <- "Month"
##       Month Attendance
## 1     April      28613
## 2       May      28244
## 3      June      30943
## 4      July      35780
## 5    August      34135
## 6 September      26831

And, of course, the finer-grained data allows us to test more relationships, like mean attendance by opponent. Here are the top five teams:

temp <- aggregate(Attendance ~ Opp, data = dat.nym, mean)
temp$Attendance = round(temp$Attendance, 0)
names(temp)[1] <- "Opponent"
temp <- temp[order(-temp$Attendance),]
rownames(temp) <- 1:18
head(temp, 5)
##   Opponent Attendance
## 1      NYY      42736
## 2      LAD      35627
## 3      PIT      35565
## 4      CHC      35511
## 5      WSN      34885

And this finer-grained data allows us to develop models that incorporate consideration of time:

dat$month <- factor(as.POSIXlt(dat$Date)$mon,   #mon runs 0-11, starting January
                      labels = c("March", "April", "May", "June",
                                 "July", "August", "September"))
dat$day <- factor(as.POSIXlt(dat$Date)$wday,
                      levels = 0:6,   #wday runs 0-6, starting Sunday
                      labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday"))
summary(lm(Attendance ~ month + day, dat))
## Call:
## lm(formula = Attendance ~ month + day, data = dat)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28160  -8806    -18   8168  30107 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     32092.0     1075.6  29.836  < 2e-16 ***
## monthApril      -3977.2     1108.0  -3.590 0.000334 ***
## monthMay        -3146.8     1102.0  -2.855 0.004316 ** 
## monthJune        -839.5     1100.5  -0.763 0.445598    
## monthJuly         136.7     1110.2   0.123 0.902040    
## monthAugust     -1428.9     1100.5  -1.298 0.194190    
## monthSeptember  -2943.5     1103.9  -2.666 0.007694 ** 
## dayTuesday      -5247.4      634.8  -8.266  < 2e-16 ***
## dayWednesday    -5023.0      550.1  -9.131  < 2e-16 ***
## dayThursday     -4592.4      554.6  -8.281  < 2e-16 ***
## dayFriday       -3461.8      593.3  -5.834 5.76e-09 ***
## daySaturday       167.4      541.2   0.309 0.757108    
## daySunday        3257.9      536.2   6.076 1.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 10660 on 4713 degrees of freedom
## Multiple R-squared:  0.09668,    Adjusted R-squared:  0.09438 
## F-statistic: 42.03 on 12 and 4713 DF,  p-value: < 2.2e-16

Special Considerations with Longitudinal Data

If you are getting into analyzing longitudinal data, there are three things to know from the outset:

  1. Special wrangling operations, particularly understanding how unit-time identifiers work and how to perform the data transformations commonly used in longitudinal analysis.
  2. Standard methods for describing this kind of data.
  3. Special modeling considerations when using longitudinal data in regressions.

Each topic will be discussed in this module.

Post to WordPress through R Markdown

Blogs are a great communications tool. Finding a way to post to WordPress through R Markdown can be a nice efficiency hack. This post documents a method for generating WordPress posts through RStudio. It uses the RWordPress package by Duncan Temple Lang. This tutorial is basically a scaled-down version of Peter Bumgartner’s excellent tutorial.

At the time of posting, both that and this tutorial are distributed using a CC-BY-NC-SA license.

Set Up Your Session

Installing the RWordPress and XMLRPC Packages

The RWordPress package is distributed through GitHub and can be installed using install_github() from the devtools package. You also need to install Lang’s XMLRPC package:

# Install the GitHub-hosted packages
# ("duncantl" is Duncan Temple Lang's account)
# install.packages("devtools")
devtools::install_github("duncantl/RWordPress")
devtools::install_github("duncantl/XMLRPC")
Load Necessary Packages

In addition to RWordPress and XMLRPC, we need knitr, a suite of operations for generating reports:

library(RWordPress)
library(XMLRPC)
library(knitr)
Create / Load a Script with Your Credentials

To post to WordPress, your login and password must be available to the script. Posting or circulating a Markdown file with your WordPress login credentials creates a huge web security risk. On the other hand, we teach you to be open about your coding so that others can review, replicate, and build on it. How do you reconcile the two?

Bumgartner’s solution is to add the following variables to your .Rprofile file using this code:

options(WordPressLogin = c(yourUserName = 'yourPassword'),
        WordPressURL = 'yourWordPressURL')

Do not delete the quotation marks in this template. For ‘yourWordPressURL’, be sure to add “https://” before and “/xmlrpc.php” after your site address. So, for the site “joeblog.com”, you would enter “https://joeblog.com/xmlrpc.php”.

Once you have generated and stored the file, you can source it in any Markdown file that you intend to post. I called my file “BlogCreds.R”:

source("E:/My Documents/BlogCreds.R") 
#Or wherever you store the file

Set Images to Post to imgur.com

To save time and resources, post your images to the open online site Imgur.com. To do so, add these knitr options to your script:

# To upload images to imgur.com
opts_knit$set(upload.fun = imgur_upload, base.url = NULL)  

# Set default image dimensions
# Change them to change the dimensions of figures
opts_chunk$set(fig.width = 5, fig.height = 5, cache = TRUE)

Generate a Polished html Report

Create a Markdown file that renders your report in html. Render it locally to something sufficiently polished, and then move on to the next step.

I will download and insert an image into the report, in case someone needs to do that:

#Download the image; image_url stands in for the source URL,
#which the original post does not preserve
download.file(image_url, destfile = "BBCard.jpg", mode = "wb")

Then, OUTSIDE the chunk, enter this line into the main text of the document:

![Baseball Card](BBCard.jpg)


Post Your Information

In another chunk, set your post title as the variable postTitle, and define fileName as the name of the Markdown file that you are using to generate the post:

postTitle = "Your Title"
fileName = "Your-Markdown-File.Rmd"

Post using the knit2wp command, whose options include:

  • the Markdown file name (option: input)
  • the post title (option: title)
  • whether the post should be published upon rendering (option: publish; set TRUE to publish directly – not advisable)
  • the action to perform; for a new post, set it to “newPost” (option: action)
  • the post’s categories; use the slugs designated on your site (option: categories)
  • the post’s tags; use the slugs designated on your site (option: tags)
postID <- knit2wp(
        input = fileName, 
        title = postTitle, 
        publish = FALSE,
        action = "newPost",
        categories = c("Insert Slugs here"),
        tags = c("insert slugs here"),
        mt_excerpt = "Insert excerpt here"
)

Clean It Up on WordPress

From here, you will have a draft post on your WordPress site. I find WordPress to have a far more user-friendly interface to complete the last steps of a post.

Remember: resist the temptation to do everything in R if it costs a lot of time. Yes, it would be cool if you could. The problem is that no one but data nerds will appreciate it, and there are not a lot of data nerds out there.


To post to WordPress through R Markdown, start with this setup to your script. Remember that the source() line points to a file with your credentials. For this post, it is:

library(RWordPress)
library(XMLRPC)
library(knitr)
source("E:/My Documents/BlogCreds.R")   #Credentials file
opts_knit$set(upload.fun = imgur_upload, base.url = NULL)
postTitle = "Post to WordPress via R Markdown"
fileName = "Using-RWordPress.Rmd"

postID <- knit2wp(
        input = fileName, 
        title = postTitle, 
        publish = FALSE,
        action = "newPost",
        categories = c("analytics"),
        tags = c("communications", "Markdown"),
        mt_excerpt = "Publish a Markdown report direct to your WordPress blog"
)

Write your post until it is polished, then publish by running this code. I recommend setting eval = F on this chunk so that it does not run when you render the Markdown file into html; once the file renders, execute the code on its own, not as part of a larger operation.

Then, polish the piece in WordPress.

Steps in a Data Analysis Job

Part of teaching data analytics involves giving students an idea of the steps involved in performing analytical work. Here is my take on the steps. Please offer your comments below. The goal is to prepare young people to do better work when they go out into the field.

1. Know Your Client

On whose behalf am I performing this analysis?

The first step in a data analysis job is pinning down the identity of your client. Ultimately, data analysis is about creating information. If you are going to do that well, you need some sense of your audience’s technical background, their practical interests, their preferred mode of reasoning or communicating, and much else.

2. Start With a Clear Understanding of the Question

What information does my client want? Why do they want this information? How do they plan to use it, and is this something that I want to be a part of? Can I answer it using the resources or tools at my disposal, or do I at least have a clear idea of which resources or tools are required to get the job done?

You do not want to take on an analysis job that lacks clear informational goals. You may have to work with a client to pin down those goals. Doing so requires that you identify factual information that would substantially influence your client’s practical decision-making. In addition, it is best to avoid (or at least avoid taking the lead on) jobs that you can’t handle. It is probably also a good idea to avoid clients who are not genuinely interested in or influenced by your findings, or who begin with an answer already in mind.

Many clients are not highly familiar with data analysis, and may not be highly numerate. It is your professional responsibility to ensure that they understand what data analysis can and cannot do.

3. What Am I Supposed to be Examining?

What are my objects of analysis?  Which qualities, behaviors, or outcomes am I assessing?  Why?

These kinds of questions point students towards working hard to pin down the research design that guides their study. Researchers should have a clear sense of whose personal qualities or behaviors need to be examined, and exactly what about them is relevant to the client’s decision-making problem or goal. If done well, the researcher should have a clear sense of the theoretical propositions or concepts that are most crucial to the client’s goals, and how to find analyzable data to examine them.

4. Assess and Acquire Data

How can I get data on these units’ characteristics or behaviors? How do I get access to them? What is the quality of the data’s sample (i.e., how representative is the data)? Do these measures look reliable and valid?

All analyses should make some assessment of the data, using the tools that you acquired in your methods class. As the saying goes: Garbage In, Garbage Out. Don’t bother bringing a battery of high-powered analytical tools to a crap data set.

5. Clean and Prepare Data

Secure data, correct errors, refine measurements, assess missingness, and identify possible outliers.

This is the process of turning the raw, messy data that one often encounters into a clean, tidy set that is amenable to the analytical operations of your toolkit.

6. Implement Analytical Operations.

Implement analytical procedures designed to extract specific types of information from data.

Over the course of the semester, we will study a range of tools to extract information from data.

7. Interpret Analytical Operations

Convert statistical results into natural-language explanations, and assess their implications for the client’s thinking.

The computer does the job of processing the data. Your job is to interface that output with the problem being confronted by your client. In the interest of making them a partner, it helps to develop the skill of translating complicated results into cognitively-accessible, but still rigorous and correct, explanations that everyone can understand.

You empower clients by giving them access to what you see in the statistical results. Win their confidence on this front, and they will work with you again in the future.

8. Communication

Find ways to convey information in a manner that resonates with the client.

These are the skills that will keep you from being someone’s basement tech.

Photo Credit. United States Office Of Scientific Research And Development. National Defense Research Committee. Probability and Statistical Studies in Warfare Analysis. Washington, D.C.: Office of Scientific Research and Development, National Defense Research Committee, Applied Mathematics Panel, 1946. Pdf. https://www.loc.gov/item/2009655233/.

Quick Tutorial: American Community Survey

The American Community Survey is a workhorse survey series in demography, geography, and many other fields. This post provides a walk-through of an analysis that I conducted with the 2018 release of the American Community Survey’s five-year sample. Its examples will focus on assessing the prevalence and migration patterns of New York State’s elderly population. For a more complicated analysis that conducts these comparisons across states, review the Markdown file associated with my recent research note on the topic.

The American Community Survey (ACS) is a large-scale, high-quality running survey program run by the United States Census Bureau. It samples thousands of U.S. households annually through multiple survey projects, collecting information on their residences, demographics, household finances, and other topics. The set is a widely-used basis for estimating more detailed information that is not captured in the U.S. Census, and for estimating changes in the U.S. population that have occurred since the Census.

We begin by starting our script as a fresh session:

#Start Session

#Clear Memory
rm(list = ls())

#Set directory to folder where I plan to store the data
directory <- "E:/Dropbox/Research/ACS/2018/5-Year"

# Load Packages
library(curl)    #For downloading data
library(survey)  #Package for survey analysis

Downloading the Data

The data are distributed through a structured folder and file naming system. The survey data we wish to access are distributed through the same web folder: https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/

Web folder where this ACS set is distributed. Note the patterned file names.

The data are distributed as state-specific CSV files, each demarcated with the standard two-letter state abbreviation. A vector of these abbreviations is stored in R’s base datasets as state.abb, though in upper case, so they must be converted to lower case to match the file names.
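As a sketch, the naming pattern yields URLs like the following, built from the web folder above:

```r
# Build one state's person-file URL from the folder's naming pattern
base_url <- "https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/"
st <- tolower(state.abb)[1]                     # "al" (Alabama)
person_zip <- paste0(base_url, "csv_p", st, ".zip")
person_zip  # ends in "csv_pal.zip"
```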

I ran the operation as a loop:

states <- tolower(state.abb)   #Convert *state.abb* to lower-case

for (i in states){
  #Part 1: Download and Unzip Person Files
  #Create state directory
  dir.create(paste0(directory, "/", i))
  setwd(paste0(directory, "/", i))
  #Create objects with file URL & destination
  #based on state abbreviation
  source_url <- paste0("https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/csv_p",
           i, ".zip")
  dest_url <- paste0(directory, "/", i, "/csv_p", i, ".zip")
  #Download and unzip; unzip() returns the extracted file paths
  curl_download(source_url, dest_url)
  temp.nam <- unzip(paste0("csv_p", i, ".zip"))
  temp.nam <- grep("\\.csv$", temp.nam, value = TRUE)
  temp.dat <- read.csv(temp.nam)
  #Save data as RDS
  saveRDS(temp.dat, file = paste0("acs2018_5yp_", i, ".RDS"))
  #Part 2: Download & Unzip Household Files
  #Same as Part 1, but applied to csv_h* files
  source_url <- paste0("https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/csv_h",
           i, ".zip")
  dest_url <- paste0(directory, "/", i, "/csv_h", i, ".zip")
  curl_download(source_url, dest_url)
  temp.nam <- unzip(paste0("csv_h", i, ".zip"))
  temp.nam <- grep("\\.csv$", temp.nam, value = TRUE)
  temp.dat <- read.csv(temp.nam)
  temp.dat$STATE <- i
  saveRDS(temp.dat, file = paste0("acs2018_5yh_", i, ".RDS"))
}

The result is that I have all the state-specific household and person data files stored in our data directory. In this example, we will work with the Alabama person set:

DAT.AL <- readRDS("acs2018_5yp_al.RDS")

Set Up Survey Object

We are going to use the survey package to analyze these data (documentation here). The package author, Thomas Lumley, has written an excellent introduction to survey analysis using this package: Complex Surveys: A Guide to Analysis Using R (2010, Wiley).

The first step in the process of analyzing these data is to create the survey design object, which combines the respondents’ answers along with information that helps us make more sophisticated population estimates from our sample data. The information required to set up this set’s survey object is in its readme file. The code that I’m using is built on Anthony Damico’s scripts. Note that I am creating an object for a specific state survey. My strategy is to get state-level estimates through state-specific objects, and then consolidate them after estimation.

ACS_DESIGN_AL <- svrepdesign(
        weight = ~PWGTP ,              #Person's weight
        repweights = 'PWGTP[0-9]+' ,   #Replicate person weighting variables
        scale = 4 / 80 ,               #See page 15 of the readme
        rscales = rep( 1 , 80 ) ,      #Scales constant across 80 replicates
        mse = TRUE ,
        type = 'JK1' ,                 #Jackknife, for unstratified samples
        data = DAT.AL                  #Using Alabama data in this analysis
)

And now we can engage the data.

Creating or Changing Variables

Use the update() function to add or change variables in the survey object. For example, I create a variable that designates respondents as “elderly” if they are older than 65:

ACS_DESIGN_AL <- update(ACS_DESIGN_AL,          #Update the object 
                     ELDERLY = factor(AGEP>65)  #Creating a variable
)

Refer to this set’s data dictionary for variable names, explanations, and codes.


Use subset() to subset your survey sample:

ACS_ELDERLY_AL <- subset(ACS_DESIGN_AL, ELDERLY == TRUE)

The new object restricts its estimates to elderly Alabamans, defined here as older than 65.

Parameter Estimates

Continuous Univariate

We use the AGEP variable, which measures respondents’ ages. The variable is top-coded at 99 years.

Use svymean() to estimate a population mean.

#Mean age:
svymean( ~ AGEP , ACS_DESIGN_AL )
##        mean     SE
## AGEP 38.995 0.0105

So, the mean age of Alabamans, as suggested by this data, is 38.995 years, with a standard error of about 0.01 on that estimate.

#Median age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.5 , na.rm = TRUE )
## Statistic:
##      AGEP
## q0.5   38
## SE:
##           AGEP
## q0.5 0.2511995
#10th percentile age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.1 , na.rm = TRUE )
## Statistic:
##      AGEP
## q0.1    8
## SE:
##           AGEP
## q0.1 0.2511995

The median Alabaman is 38 years old. The 10th percentile in this distribution is 8 years old.

Discrete Univariate

For discrete variables, svymean() returns the proportion in each category.

#Proportion elderly
svymean( ~ ELDERLY , ACS_DESIGN_AL )
##                mean    SE
## ELDERLYFALSE 0.8504 3e-04
## ELDERLYTRUE  0.1496 3e-04

An estimated 14.96% of Alabama is “elderly”, which we defined as older than 65.

If the operation seems to be giving you a mean rather than a proportion, then respecify the variables as a factor using factor().
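A minimal sketch of that behavior, using a tiny made-up design with equal weights:

```r
library(survey)

# Five respondents, equal weights; x is a 0/1 indicator
df  <- data.frame(x = c(0, 0, 1, 1, 1), w = rep(1, 5))
des <- svydesign(ids = ~1, weights = ~w, data = df)

svymean(~x, des)          # numeric variable: returns the mean, 0.6
svymean(~factor(x), des)  # factor: returns a proportion for each level
```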

Discrete-Discrete Bivariate

Use svyby() in conjunction with svymean() for cross-tabs. In the operation svyby(x, y), the results will express proportions of “x” across categories of “y”. This command looks at the proportion of ELDERLY by categories of SEX:

#Elderly by sex
svyby( ~ ELDERLY , ~ SEX , ACS_DESIGN_AL , svymean )
##   SEX ELDERLYFALSE ELDERLYTRUE          se1          se2
## 1   1    0.8665133   0.1334867 0.0003242783 0.0003242783
## 2   2    0.8352724   0.1647276 0.0004199916 0.0004199916

The results suggest that 13.3% of males in Alabama are elderly. This figure is 16.5% for females.

#Elderly by sex
svyby( ~ factor(SEX) , ~ ELDERLY , ACS_DESIGN_AL , svymean )
##       ELDERLY factor(SEX)1 factor(SEX)2          se1          se2
## FALSE   FALSE    0.4932625    0.5067375 0.0003111538 0.0003111538
## TRUE     TRUE    0.4319345    0.5680655 0.0007504035 0.0007504035

So 56.8% of the elderly are female, whereas only 50.7% of the non-elderly population is female.

Discrete-Continuous Bivariate

Use svyby() for the mean score of a continuous variable across the categories of a discrete one.

#Mean age by sex
svyby( ~ AGEP , ~ SEX , ACS_DESIGN_AL , svymean , na.rm = TRUE )
##   SEX     AGEP         se
## 1   1 37.78940 0.01966696
## 2   2 40.12691 0.01833421

The operation can also render the percentile estimates in conjunction with svyquantile():

#Median age by sex
svyby( 
    ~ AGEP , 
    ~ SEX , 
    ACS_DESIGN_AL ,
    svyquantile , 
    0.5 ,
    ci = TRUE ,
    keep.var = TRUE ,
    na.rm = TRUE
)
##   SEX V1        se
## 1   1 37 0.2511995
## 2   2 40 0.2511995

So the median Alabaman male is 37 years old, whereas the median Alabaman female is 40.

Continuous-Continuous Bivariate

The survey package does not include an operation for calculating correlations, but there is one in the jtools package: use svycor() for correlations in survey data. For example, AGEP and personal income (PINCP):

library(jtools)   #Remember, svycor is not from survey package
svycor(~PINCP + AGEP, na.rm = T,  design = ACS_DESIGN_AL)
##       PINCP AGEP
## PINCP  1.00 0.17
## AGEP   0.17 1.00

Personal income and age have a correlation of 0.17. For a recommended extended tutorial on svycor(), see this post by Jacob Long.

Linear Regression

For an OLS regression of a continuous variable on predictors, use svyglm() with default settings. Below, I use SEX and WKHP (usual weekly hours worked) to estimate PINCP (personal income):

model.1 <- svyglm(PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
summary(model.1)
## Call:
## svyglm(formula = PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18908.20     834.36   22.66   <2e-16 ***
## SEX         -13474.16     317.23  -42.47   <2e-16 ***
## WKHP          1154.15      15.89   72.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for gaussian family taken to be 2.746482e+14)
## Number of Fisher Scoring iterations: 2

This model suggests a baseline income of $18,908, a $13,474 earnings penalty for women, and an additional $1,154 in income for each usual weekly hour worked.

Logit Regression

For a logit:

model.2 <- svyglm(ELDERLY ~ WKHP, family = binomial(link = "logit"), ACS_DESIGN_AL)
summary(model.2)
## Call:
## svyglm(formula = ELDERLY ~ WKHP, family = binomial(link = "logit"), 
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.289704   0.043399  -29.72   <2e-16 ***
## WKHP        -0.046834   0.001216  -38.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 121606.1)
## Number of Fisher Scoring iterations: 6

Obviously, we’re not positing a causal model here. The model predicts a 0.047 decrease in the logged odds of being elderly for each additional hour of work performed per week. This converts to an odds ratio of exp(−0.047) ≈ 0.95: the odds of being elderly fall by roughly 5% per additional hour worked.
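The odds-ratio conversion can be verified directly from the reported coefficient:

```r
# Convert the WKHP logit coefficient to an odds ratio
b_wkhp <- -0.046834
exp(b_wkhp)         # 0.954: each additional hour multiplies the odds by ~0.95
exp(b_wkhp) - 1     # about a 4.6% decrease in the odds per hour
```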

Federal Spending, Poverty, and Population

This analysis examines the distribution of total federal spending by state.

Distribution of Federal Spending

We begin with a look at the distribution of total expenditures by state. The distribution of expenditures is far more equal than that of federal revenues: even the states with particularly high expenditures receive only roughly double the per capita spending of the lowest-spending states.

Five states receive particularly high levels of federal transfers. Foremost among them are the states of the national capital region – Virginia ($18,678 per person) and Maryland ($18,505) – which receive a disproportionate share of federal spending. Spending is also particularly high in the two non-contiguous states: Alaska ($18,046) and Hawaii ($15,213). Finally, New Mexico ($16,852) appears to enjoy particularly high spending. The remainder of the states receive between $7,800 (Utah) and $15,012 (Vermont) per person.

Figure 1 (below) presents a map depicting this distribution, excluding the four extraordinary capital-region and non-contiguous states. A full table with precise estimates is appended to this report.

Figure 1: Per Capita Federal Spending by State, 2018. National capital region and non-contiguous states excluded

Although per capita expenditures seem more equitable in comparison to the wild variance in federal revenues, these differences in per capita spending amount to a lot of money in the aggregate. A thousand-dollar difference in per capita spending is not chump change. Were New Jersey ($10,304 per capita expenditures received) to receive the same spending levels as Missouri ($12,787), more than $22 billion in additional spending would be infused into its economy annually. Such spending amounts to an additional 3.6% of gross state product annually.
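The back-of-the-envelope arithmetic behind the New Jersey figure, assuming a 2018 New Jersey population of roughly 8.9 million:

```r
# Aggregate value of the Missouri-New Jersey per capita spending gap
gap <- 12787 - 10304       # $2,483 per person
gap * 8.9e6                # about $22.1 billion, assuming ~8.9M residents
```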

My exploration of the data suggests that the most powerful and parsimonious explanation of inter-state differences in federal spending is presented below in Table 1. Diagnostics suggest that, in addition to the capital region states (MD and VA), Utah and Alaska are also outliers. Alaska has very high spending relative to what the model would predict, and Utah is an outlier for low federal expenditures.

Regression predicting per capita federal spending on poverty rates and logged population. 2018 data.
Table 1: Regression Predicting Per Capita Federal Spending, 2018

This model suggests that, at a baseline, states receive $19,145 per person, assuming zero population and zero poor. Moreover, it suggests that states receive an extra $245 per person for each additional percentage point of the population living below the poverty line. This poverty premium amounts to an additional $1,862 for New Hampshire at 7.6%, the state with the lowest rate. In Mississippi, which leads the nation at 19.8%, the poverty premium to federal spending amounted to an additional $4,851.
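The poverty-premium figures follow directly from the $245 coefficient:

```r
# Poverty premium implied by the $245-per-percentage-point coefficient
245 * 7.6    # New Hampshire (lowest poverty rate): $1,862
245 * 19.8   # Mississippi (highest poverty rate): $4,851
```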

The model also predicts a population penalty to government spending that amounts to about $7 per person for each additional percent higher population1. For example, in 2018, New York State’s population of 19,542,209 was 547% that of Connecticut’s 3,572,665. The population penalty levied against New York State is $11,675 per person, compared to $10,493 in Connecticut; New York’s larger population thus results in an additional spending penalty of $1,182 per person.2 In other words, our analysis shows that federal spending is systematically lower in larger states.
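Because the model uses logged population, the New York–Connecticut penalty gap can be roughly reproduced from the $7-per-log-point coefficient:

```r
# Population penalty: ~$7 per 1% (log-point) of additional population
log_points <- 100 * log(19542209 / 3572665)  # NY is ~170 log-points larger than CT
7 * log_points                               # ~ $1,190 per person, close to the reported $1,182 gap
```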

This penalty amounts to considerable stimulus for small states. The predicted difference in federal expenditures between Wyoming (558 thousand people) and California (40 million people) amounts to a 6800% population difference, and a predicted difference of $2,937 less spending per person.3 The federal government’s penchant for favoring small states provides them with a considerable economic boost. Were California to receive Wyoming’s small-state funding advantage, it would receive on the order of $116 billion in additional spending to the California economy annually.

The Poverty Premium and Population Penalty

This analysis of federal expenditures suggests that the distribution of federal spending across states is influenced by at least two important factors: population size and the prevalence of poverty. This vantage point offers us more insight as to why federal policies shift money away from the union’s donor states. Analysts and commentators routinely point to taxes as a source of these balance-of-payment deficits, and the comparatively wide variation in tax remittances suggests that taxes are the biggest factor. However, unequal spending is also a factor. The union’s donor states – California, New York, New Jersey, Massachusetts, Illinois, and Minnesota – have comparatively low poverty rates and larger populations.

This “population penalty” in spending suggests that the unequal distribution of power in the federal government that favors small states is being leveraged to provide surplus economic transfers to small states. Metropolitan New York has around 21 million people and is represented by six Senators and a few dozen House members. The union’s 16 lowest population states4 also amount to about 21 million people but are represented by 32 Senators and many more Representatives. This structural disadvantage in power over federal government decisions is translated into economic transfers.

Photo Credit. Cary, William De La Montagne, Artist. Emigrants to the West / W.M. Cary. The United States, ca. 1880. [Published 1881] Photograph. https://www.loc.gov/item/92505911/.

Federal Redistribution between U.S. States, 2018

This analysis examines how federal taxation and expenditures result in net inflows or outflows to different U.S. states. The analysis uses our Data on U.S. Federal Balance of Payments, 2018 data set. Figure 1 (below) describes these inter-state money flows resulting from federal policies. A table with precise estimates is also presented at the end of this post.

Per Capita Federal Balance of Payments by State, 2018

The data suggests that six (maybe seven) states transferred considerable income to other states in 2018: New Jersey, Massachusetts, New York, Minnesota, Illinois, and California. An additional six states are within $500 in per capita transfers. The majority of U.S. states receive thousands of dollars in per capita transfers from our six “donor states”. These differences are examined through data and analyses detailed in my Federal Redistribution project materials. They describe how a great deal of this redistribution is the result of larger tax payments from states that house the metro areas containing the country’s higher-income households and businesses. However, there is also inequality in federal spending that is less rooted in the prevalence of poverty or some other need for redistribution, and more in smaller states’ use of political power within our union.

Appendix: Per Capita Balance of Payments by State, 2018

State            Per Capita Balance of Payment
New Mexico       $10,671
West Virginia    $8,536
South Carolina   $5,139
North Carolina   $2,926
Rhode Island     $2,672
North Dakota     $1,157
New Hampshire    $304
South Dakota     $242
New York         -$1,363
New Jersey       -$2,792

Photo Credit. Henle, Fritz, photographer. Las Vegas, Nevada. Transmission towers and transformers redistributing power from Boulder Dam to Basic Magnesium Incorporated, which produces huge quantities of the lightest of all metals for aircraft and other wartime manufacturing. Dec. Photograph. Retrieved from the Library of Congress, <www.loc.gov/item/2017866539/>.

Combined Statistical Areas

What are Combined Statistical Areas?

A Combined Statistical Area (CSA) is a large geographic region composed of localities (townships, towns, cities) that are substantially sustained by, and dependent on, a common urban center. The concept of a Combined Statistical Area captures our everyday sense of a major city’s “greater metropolitan area”.

For example, New York’s CSA stretches out to about Princeton in the southwest, to the end of I-80 at the NJ/PA border, up through Kingston going north, east to New Haven along the mainland, and across Long Island. Within that area, communities are substantially populated by people who commute to New York City, or to a larger city like Newark, New Haven, White Plains, or New Brunswick, whose own population is substantially connected to the City.

Map of U.S. CSAs

The map below depicts the United States’ CSAs, as defined in 2013. Click here to download a high-quality version.

This map depicts the U.S.'s designated Combined Statistical Areas for 2013.
US Combined Statistical Areas, 2013.
Source: Bureau of the Census, U.S. Department of Commerce, Public Domain

Largest CSAs

The table below lists the United States’ ten largest Combined Statistical Areas, using data from the Census Bureau (see citation below). R code here.

New York-Newark, NY-NJ-CT-PA                      22,589,036
Los Angeles-Long Beach, CA                        18,711,436
Chicago-Naperville, IL-IN-WI                      9,825,325
Washington-Baltimore-Arlington, DC-MD-VA-WV-PA    9,814,928
San Jose-San Francisco-Oakland, CA                9,665,887
Boston-Worcester-Providence, MA-RI-NH-CT          8,287,710
Dallas-Fort Worth, TX-OK                          8,057,796
Houston-The Woodlands, TX                         7,253,193
Philadelphia-Reading-Camden, PA-NJ-DE-MD          7,209,620
Miami-Port St. Lucie-Fort Lauderdale, FL          6,889,936


A comma-separated values file with demographic information about these areas can be downloaded here. This set’s codebook is here. Cite it as:

Census Bureau (2020) "Annual Resident Population Estimates and Estimated Components of Resident Population Change for Combined Statistical Areas and Their Geographic Components: April 1, 2010 to July 1, 2019" Downloaded 12/9/2020 from <https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/cbsa-est2019-alldata.csv>

Labeling Exhibits in R Markdown

This tutorial describes how to label exhibits in your Markdown file. We will caption tables, figures, and images. This tutorial shows quick-and-dirty methods. Download the Markdown file used to generate this post here.


Imagine I want to convert this data frame named “table.raw” to a polished table in a report that I am generating using Markdown.

##            city  pop
## 1      New York 22.6
## 2   Los Angeles 18.7
## 3       Chicago  9.8
## 4          D.C.  9.8
## 5 San Francisco  9.6

Were I to generate the table using the kable() command from the knitr package, I could implement the caption using the caption option. You can format the table using the kableExtra package (click here for tips on formatting kable() tables). I will add the caption “Major Metro Area Populations”:

library(knitr)                                                     #Load packages
kable(table.raw,                                                   #Specify object presented in table
      col.names = c("City", "Pop."),                               #Setting column labels
      caption = "Table 1: Major Metro Area Populations") %>%       #Table caption inserted here,
                                                                   #Pipe at end of line to invoke next line:
  kable_styling(bootstrap_options = "striped", full_width = F)     #Additional formatting options
Table 1: Major Metro Area Populations
City Pop.
New York 22.6
Los Angeles 18.7
Chicago 9.8
D.C. 9.8
San Francisco 9.6


I use the “table.raw” data to generate a figure using the package ggplot2. Note that I use the reorder() operation when defining the x-axis, in order to sort cities by population size rather than alphabetical order.

To caption figures, you can set figure captions through the fig.cap option when you set your Markdown chunk.

```{r, echo = T, warning = F, fig.cap = "Major Metro Area Populations (in M)"}
ggplot(table.raw, aes(x = reorder(city, -pop), y = pop)) + geom_bar(stat = 'identity')
```

Figure 1: Major Metro Area Populations (in M)


Captioning images is easy for a basic image insertion. There are other methods that allow for finer adjustments (e.g., click here).1 Note that you enter this command in the main body of a Markdown file, not in a chunk.

![Figure 2: Nice pic, even better caption!](Picture.jpg)

Figure 2: Nice pic, even better caption!
