Home » Posts tagged 'Survey Analysis'

Tag Archives: Survey Analysis

Quick Tutorial: American Community Survey

The American Community Survey is a workhorse survey series in demography, geography, and many other fields. This post provides a walk-through of an analysis that I conducted with the 2018 release of the American Community Survey’s five-year sample. Its examples will focus on assessing the prevalence and migration patterns of New York State’s elderly population. For a more complicated analysis that conducts these comparisons across states, review the Markdown file associated with my recent research note on the topic.

The American Community Survey (ACS) is a large-scale, high-quality running survey program run by the United States Census Bureau. It samples thousands of U.S. households annually through multiple survey projects, collecting information on their residences, demographics, household finances, and other topics. The set is a widely-used basis for estimating more detailed information that is not captured in the U.S. Census, and for estimates changes in the U.S. population that have occurred since the Census.

We begin by starting our script as a fresh session:

#Start Session

#Clear Memory

#Set directory to folder where I plan to store the data
directory <- "E:/Dropbox/Research/ACS/2018/5-Year"

# Load Packages
library(curl)    #For downloading data
library(survey)  #Package for survey analysis

Downloading the Data

The are distributed through a structured folder and file naming system. The survey data we wish to access is distributed through the same web folder: https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/

Web folder where this ACS set is distributed. Note the patterned file names.

The data are distributed as state-specific CSV files. Each state-level file is demarcated using a standard two-letter state naming system. This series is stored in R’s base package as state.abb, but as lower-case characters.

I ran the operation as a loop:

states <- tolower(state.abb)   #Convert *state.abb* to lower-case

for (i in states){
  #Part 1: Downalod and Unzip Person Files
  #Create state directory
  dir.create(paste0(directory,"/", i))
  setwd(paste0(directory, "/", i))
  #Create objects with file URL & destination
  #based on state abbreviation
  source_url <- 
           i, ".zip")
  dest_url <- paste0(directory, "/", i, "/csv_p", i, ".zip")
  #Download and unzip
  curl_download(source_url, dest_url)
  unzip(paste0("csv_p", i, ".zip"))
  temp.nam <- list.files(pattern = "\\.csv$")
  temp.dat <- read.csv(temp.nam)
  #Save data as RDS in main directory
  saveRDS(temp.dat, file = paste0("acs2018_5yp_", i, ".RDS"))
  #Part 2: Download & Unzip Household Files
  #Same as Part 1, but applied to csv_h* files
  setwd(paste0(directory, "/", i))
  source_url <- 
           i, ".zip")
  dest_url <- paste0(directory, "/", i, "/csv_h", i, ".zip")
  curl_download(source_url, dest_url)
  unzip(paste0("csv_p", i, ".zip"))
  temp.nam <- list.files(pattern = "\\.csv$")
  temp.dat <- read.csv(temp.nam)
  temp.dat$STATE <- i
  saveRDS(temp.dat, file = paste0("acs2018_5yh_", i, ".RDS"))

The result is that I have all the state-specific household and person data files stored in our data directory. In this example, we will work with the Arkansas person set:

DAT.AL <- readRDS("acs2018_5yp_al.RDS")

Set Up Survey Object

We are going to use the survey package to analyze these data (documentation here). The package author, Thomas Lumley, has written an excellent introduction to survey analysis using this package: Complex Surveys: A Guide to Analysis Using R (2010, Wiley).

The first step in the process of analyzing these data is to create the survey design object, which combines the respondents’ answers along with information that helps us make more sophisticated population estimates from our sample data. The information required to set up this set’s survey object is in its readme file. The code that I’m using is built on Anthony Damico’s scripts. Note that I am creating an object for a specific state survey. My strategy is to get state-level estimates through state-specific objects, and then consolidate them after estimation.

        weight = ~PWGTP ,              #Person's weight
        repweights = 'PWGTP[0-9]+' ,   #Replicate person weighting variables
        scale = 4 / 80 ,               #See page 15
        rscales = rep( 1 , 80 ) ,      
              #Scales constant across 80 replicates
        mse = TRUE ,
        type = 'JK1' ,                 
              #Jacknife pop corr, for unstratified samples
        data = DAT.AL                  
              #Using Alabama data in this analysis

And now we can engage the data.

Creating or Changing Variables

Use the update() function to add or change variables in the survey object. For example, I create a variable that designates respondents as “elderly” if they are at least 65 years old:

ACS_DESIGN_AL <- update(ACS_DESIGN_AL,   #Update the object 
                     ELDERLY = factor(AGEP>65)  #Creating a variable

Refer to this set’s data dictionary for variable names, explanations, and codes.


Use subset() to subset your survey sample:


The new object restricts its estimates to elderly Alabamans, defined as age 65+.

Parameter Estimates

Continuous Univariate

Using the AGEP variable, which measures respondents’ ages. The variable is top-coded at 99 years.

Use svymean() for a parameter mean.

#Mean age:
svymean( ~ AGEP , ACS_DESIGN_AL )
##        mean     SE
## AGEP 38.995 0.0105

So, the mean age of Alabamans, as suggested by this data, is 38.995 years old, with a standard error of 0.01 on that estimate.

#Median age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.5 , na.rm = TRUE )
## Statistic:
##      AGEP
## q0.5   38
## SE:
##           AGEP
## q0.5 0.2511995
#10th percentile age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.1 , na.rm = TRUE )
## Statistic:
##      AGEP
## q0.1    8
## SE:
##           AGEP
## q0.1 0.2511995

The median Alabaman is 38 years old. The 10th percentile in this distribution is 8 years old.

Discrete Univariate

FOr discrete variables, svymean() works.

#Proporotion elderly
svymean( ~ ELDERLY , ACS_DESIGN_AL )
##                mean    SE
## ELDERLYFALSE 0.8504 3e-04
## ELDERLYTRUE  0.1496 3e-04

An estimated 14.96% of Albama is “elderly”, which we defined as over 65 years old.

If the operation seems to be giving you a mean rather than a proportion, then respecify the variables as a factor using factor().

Discrete-Discrete Bivariate

Use svyby() in conjunction with svymean() for cross-tabs. In the operation svyby(x, y), the results will express proportions of “x” across categories of “y”. This command looks at the proportion of ELDERLY by categories of SEX:

#Elderly by sex
svyby( ~ ELDERLY , ~ SEX , ACS_DESIGN_AL , svymean )
##   SEX ELDERLYFALSE ELDERLYTRUE          se1          se2
## 1   1    0.8665133   0.1334867 0.0003242783 0.0003242783
## 2   2    0.8352724   0.1647276 0.0004199916 0.0004199916

The results suggest that 13.3% of males in Alabama are elderly. This figure is 16.5% for females.

#Elderly by sex
svyby( ~ factor(SEX) , ~ ELDERLY , ACS_DESIGN_AL , svymean )
##       ELDERLY factor(SEX)1 factor(SEX)2          se1          se2
## FALSE   FALSE    0.4932625    0.5067375 0.0003111538 0.0003111538
## TRUE     TRUE    0.4319345    0.5680655 0.0007504035 0.0007504035

So 56.8% of the elderly are female, whereas only 50.7% of the non-elderly population is female.

Discrete-Continuous Bivariate

Use svyby() for the mean score of a continuous variable across the categories of a discrete one.

#Mean age by sex
svyby( ~ AGEP , ~ SEX , ACS_DESIGN_AL , svymean , na.rm = TRUE )
##   SEX     AGEP         se
## 1   1 37.78940 0.01966696
## 2   2 40.12691 0.01833421

The operation can also render the percentile estimates in conjunction with svyquantile():

#Median age by sex
    ~ AGEP , 
    ~ SEX , 
    svyquantile , 
    0.5 ,
    ci = TRUE ,
    keep.var = TRUE ,
    na.rm = TRUE
##   SEX V1        se
## 1   1 37 0.2511995
## 2   2 40 0.2511995

So the median Alabaman male is 37 years old, whereas the median Alabaman female is 40.

Continuous-Continuous Bivariate

The survey package does not include an operation for calculating correlations. There is such an operatino in the jtools package. Use svycor() for correlations in survey data. For example, AGEP and personal income (PINCP)

library(jtools)   #Remember, svycor is not from survey package
svycor(~PINCP + AGEP, na.rm = T,  design = ACS_DESIGN_AL)
##       PINCP AGEP
## PINCP  1.00 0.17
## AGEP   0.17 1.00

Personal income and age have a correlatino of 0.17. For a recommended extended tutorial on svycor(), see this post by Jacob Long.

Linear Regression

For an OLS regression of a continuous variable on predictors, use svyglm() with default settings. Below, I use SEX and WKHP (hours worked last week) to estimate PINP (personal income):

model.1 <- svyglm(PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
## Call:
## svyglm(formula = PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18908.20     834.36   22.66   <2e-16 ***
## SEX         -13474.16     317.23  -42.47   <2e-16 ***
## WKHP          1154.15      15.89   72.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for gaussian family taken to be 2.746482e+14)
## Number of Fisher Scoring iterations: 2

This model suggests a baseline income of $18,908, a $13,474 earnings penalty to women, and an additional $1,154 in income per hour worked last week.

Logit Regression

For a logit:

model.2 <- svyglm(ELDERLY ~ WKHP, family = binomial(link = "logit"), ACS_DESIGN_AL)
## Call:
## svyglm(formula = ELDERLY ~ WKHP, family = binomial(link = "logit"), 
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.289704   0.043399  -29.72   <2e-16 ***
## WKHP        -0.046834   0.001216  -38.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 121606.1)
## Number of Fisher Scoring iterations: 6

Obviously, we’re not positing a causal model here. The model predicts a -0.04 decrease in the logged odds of being elderly for each addition hour of work performed in the previous week. This converts into $ (-0.04) = 0.96 $ the odds of being elderly per additional hour worked. In other words a 0.96 – 1= -0.04 = 4% $ decrease in odds per additional hour.

Studying Household Finance for Your Thesis

Household finance is a big topic. There is much to learn. There will be a lot you don’t know after studying the topic for years. In fact, there’s a lot about household finance that no one knows for sure.

When getting started with a thesis or empirical paper on the topic, it is a smart move to narrow your focus. Focus on one facet of household finance, and build out from there in subsequent projects.

Why Study Household Finance?

  1. It matches your personal interest and aptitudes.  The most important reason to study this topic is that talking about money interests you, and you have a natural aptitude for it. 
  2. Good topic for those seeking work in marketing or finance.  Marketing and finance are two major industries in New York City.  Develop your knowledge of people’s money as part of your attempt to enter these industries.

Focus on What?

I mean “focus” in three ways. Focus on:

  1. Playing the role of clarifying our collective discussion by generating and interpreting empirics. What distinguishes our program, and your training with us, is a philosophical dedication to the primacy of probing and verifying factual propositions. We do not strongly emphasize philosophy or theory or politics. In any discussion that you confront in your online research or literature review, remember that your strength is to be the one who comes in with data and helps the group establish which assumed facts are likely true or false.
  2. A particular facet of households’ finances. Choose some facet of income, expenditures, assets, and debts. See below for more.
  3. A particular group. I recommend waiting until you see some empirics before making that choice. If, for example, you develop a whole study to explain why college grads spend more on books, and then you find that they don’t spend more on books, then it’s going to result in lost time and possibly a late thesis.

Which Facet of Household Finances?

For your thesis, I recommend that you focus on a particular facet of households’ income, expenditures, savings, assets, debt, or wealth. To learn more about these categories, read the first few chapters of my 2017 book (free to download here).


Incomes are concerned with money flowing into a household. Just about any survey tries to capture some kind of income measure, but a household finance specialist would dive into the weeds of how people earn money. Households receive income as a result of work wages, transfers from family members or friends, government payments, investment returns, and much else.

I am most familiar with the Survey of Consumer Finances as a means of fine-grained analysis for U.S. household incomes. The Panel Survey of Income Dynamics is another very well-regarded set.


Expenditures concern where households expend money. Note that expending (spending) money is not the same as investing. For U.S. households, the Consumer Expenditure Survey is the best gold standard fine-grained data set. It parses household spending into a wide range of categories — from food to apparel to shelter to entertainment to healthcare and much else, and drilled down to narrow product categories.


Savings represent the unspent portion of a household’s income. It is income less savings. These savings are channeled into asset categories in discussions of wealth (below). The Survey of Consumer Finances offers a view of how Americans save their money.


Assets are property that can be exchanged for money or used to make money. This includes houses, financial investments, vehicles, retirement savings, real estate investments, intellectual property rights, and other property. Focus on the Survey of Consumer Finances.


Debt is money that is owed to others. Common debt vehicles include credit cards, student loans, medical loans, installment purchase loans, car loans, and mortgages. Focus on Survey of Consumer Finances

Net Worth / Wealth

Net worth is assets less debt. It is the value of a household’s owned property were its debts to be instantaneously paid off through asset liquidation. Focus on Survey of Consumer Finances

Pick a Focus

I recommend focusing on one of the above topics. You decide how narrow or broad. The Consumer Expenditure Survey will allow you to go as narrow as figuring out how much money people spend on toothpaste, or who spends the most on it. You can go as broad as asking how much households typically spend on all their healthcare in a year.

Get some ideas, and then return to your professor and colleagues to discuss next steps.

Studying Household Finance at QC MADASR

A household finance thesis might be a good option for a student who is interested in pursuing professional work in fields that engage marketing, finance, or economics. I do supervise theses on these topics, but require that students meet with me to ensure that this type of work is both practical and fits in their career strategy. To schedule a meeting, contact me at joseph.cohen@qc.cuny.edu.

Need help with the Commons? Visit our
help page
Send us a message