Performing basic geo spatial analysis with US Census data.

Part 1: Geo-spatial analysis with US Census data

The goal of this document is to show basic geo-spatial analysis in R. These plots or similar version of these plots were used to perform a basic corporate diligence looking at the general workforce metrics including the SAHIE insurance data and net migration. Together these can help provide an overall understanding of the average demographics of the population within a 45 mile radius (~1-1.5 hr by car) from the capital. This document is meant for education purposes only.

Document summary:

Performing basic geo-spatial analysis with US Census data.
Written by: Pete Skach and Eric W. Olle
CC-BY-SA license for the r-markdown document.
GPLv3 for the code.
No Warranty Provided
Created: Oct 2022
Edited for blog: October 25, 2022

The GitHub for this project contains the R-markdown and the data used to generate this blog. The data and the markdown document are CC-BY-SA.

Loading the packages required.

One item needed for the used prior to using the US Census API packages (tidycensus or ) is a API key. Please see the Kyle Walkers Tidycensus page for more information.

Once the packages are loaded the next step is to start on the data collection, plots and finally making a full figure.

Looking at net migration on a county level

In this first call the function get_estimates will be used to obtain the county level information on net migration to (+) or from (-) a county.

Once the basic migration data has been obtained from the US census summary tables, this can be summarized to create an insert into the graph. The county level spatial information will be used for other plots. The inclusion of county-line information with census-tract data allow for an easy comparison between the graphs.

sum_df <- net_migration %>%
    select(-geometry) %>%
    data.frame() %>%
    summarize(min = round(min(value),2), 
              max = round(max(value), 2), 
            mean = round(mean(value),2), 
            median = round(median(value),2),
            sd =  round(sd(value),2))

knitr::kable(sum_df)
minmaxmeanmediansd
-24.7213.66-5.35-4.856.51
net_migration %>%
    ggplot(aes(fill = value)) + 
    geom_sf(color = "white") +
    theme_void() +
    geom_point(color = "Black",
               aes(x = -81.626500, y = 38.347410, group = NULL)) +
    geom_polygon(data =  charls$circle, aes(lon, lat), inherit.aes=FALSE, color = "black", alpha = 0) +
    scale_fill_viridis_c(option = "turbo")  +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    labs(title = "Migration Rate per 100,000",
         fill = "Migration") +
    annotate(geom = "table", x = -83, y  = 36.5, label = list(sum_df), 
             vjust = 0, hjust = 0, size = 2) -> migration

migration
WV net migratoin
Rate of migration

This plot shows the overall migration to (+) and from (-) the different counties in WV. The summary table shows the overall distribution of the county migration.

Overall workforce participation

The next plot will concentration on the number of people between 18-64 that are employed full-time as a ratio to the total population to estimate workforce participation. This is different from unemployment.

## Getting data from the 2016-2020 5-year ACS
## Using FIPS code '54' for state 'WV'
WV_employ  %>%
    ggplot(aes(fill = pct_labor_partic)) + 
    geom_sf(color = NA) +
    geom_sf(data = net_migration, color = "white", fill = NA, inherit.aes = FALSE) + 
    theme_void() +
    scale_fill_viridis_c(option = "turbo") +
    geom_point(color = "black",
               aes(x = -81.626500, y = 38.347410, group = NULL), alpha = .2) +
    geom_polygon(data =  charls$circle, aes(lon, lat), inherit.aes=FALSE, color = "black", alpha = 0) +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    labs(title = "Probability in the workforce.",
         fill = "Participation") -> wkfrce_partic

wkfrce_partic
Workforce participation 18-64

The above plot shows the overall full-time workforce participation. This data looks at eh 18-64 year old demographics and may underestimate total participation in the workforce. In some of the southern WV counties, there is a higher than average participation in the 65 and older group.

Income data

The next graph will look at the median income on a census tract level.

## Getting data from the 2016-2020 5-year ACS
## Using FIPS code '54' for state 'WV'

Once the income data is obtained next is to plot in a format similar to above.

WV_income %>%
  ggplot(aes(fill = estimate)) +
  geom_sf(color = NA) +
  geom_sf(data = net_migration, color = "white", fill = NA, inherit.aes = FALSE) +
  theme_void() +
  geom_point(color = "black",
               aes(x = -81.626500, y = 38.347410, group = NULL), alpha = .2) +
  geom_polygon(data =  charls$circle, aes(lon, lat), inherit.aes=FALSE, color = "black", alpha = 0) +
  scale_fill_viridis_c(option = "turbo") +
  theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
  labs(title = "Median Income.") -> median_income

median_income
median income
Median Income by census tract

This shows the median income by census tract with county-line shown in white. On average, WV is below the national median.

Ratio of Individual versus Employer provided health insurance

Finally, the final graph before making a aggregated figure is to look at the ratio of individual provided versus employer provided health insurance coverage. This can be an indicator for a large number of part-time jobs that provide enough income to not qualify for medicaid but do not provide health insurance benefits.

WV_healt_ins <- get_acs(
  state = "WV",
  geography = "tract",
  variables = c(Employer = "C24050_001E",
                Ind_purch  = "C27005_001E"),
  geometry = TRUE,
  output = "wide",  
  year = 2020
)
## Getting data from the 2016-2020 5-year ACS
## Using FIPS code '54' for state 'WV'

This time the ggplots ability to calculate the ratio by row will be used.

WV_healt_ins %>%
    ggplot(aes(fill = Ind_purch/Employer)) + 
    geom_sf(color = NA) +
    geom_sf(data = net_migration, color = "white", fill = NA, inherit.aes = FALSE) + 
    theme_void() +
    geom_point(color = "black",
               aes(x = -81.626500, y = 38.347410, group = NULL), alpha = 0.2) +
    geom_polygon(data =  data.frame(charls), aes(circle.lon, circle.lat), inherit.aes=FALSE, color = "black", alpha = 0) +
    scale_fill_viridis_c(option = "turbo")  +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
        labs(title = "Insurance rate Individual/Employer.",
         fill = "Ind/Emp") -> ins_source

ins_source
ration of ind/emp Health coverage
Ratio of Individual to Employer provided health insurance

The above plot shows that on average there is a higher rate of individual provide health insurance in the southern side of the state.

Putting four plots together.

Finally, the plots will be arrange into a single plot using ggpubr.

 ggarrange(median_income, wkfrce_partic, ins_source, migration,   
           nrow = 2, ncol =2, align = "none") %>%
  annotate_figure(., bottom = text_grob("Data source 2020 5-year ACS. Circle is 45 mi from Charleston"))
4 panel figure
Four panel plot of the data

Conclusion

The tidycensus package combined with ggplot2 can be a useful tool for basic geospatial analysis. In the next blog, how to make a pseudo-3d plot and stack the figures will be shown.

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 20.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.39       RWordPress_0.2-3 ggpmisc_0.5.0    ggpp_0.4.5      
##  [5] magrittr_2.0.3   ggpubr_0.4.0     tigris_1.6.1     forcats_0.5.1   
##  [9] stringr_1.4.0    dplyr_1.0.9      purrr_0.3.4      readr_2.1.2     
## [13] tidyr_1.2.0      tibble_3.1.7     ggplot2_3.3.6    tidyverse_1.3.1 
## [17] tidycensus_1.2.2 geosphere_1.5-14
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7       fs_1.5.2           sf_1.0-7           lubridate_1.8.0   
##  [5] httr_1.4.3         gh_1.3.0           tools_4.2.1        backports_1.4.1   
##  [9] utf8_1.2.2         rgdal_1.5-32       R6_2.5.1           KernSmooth_2.23-20
## [13] DBI_1.1.3          colorspace_2.0-3   withr_2.5.0        sp_1.5-0          
## [17] tidyselect_1.1.2   gridExtra_2.3      curl_4.3.2         compiler_4.2.1    
## [21] cli_3.3.0          rvest_1.0.2        quantreg_5.93      SparseM_1.81      
## [25] xml2_1.3.3         labeling_0.4.2     scales_1.2.0       classInt_0.4-7    
## [29] proxy_0.4-27       rappdirs_0.3.3     digest_0.6.29      foreign_0.8-82    
## [33] pkgconfig_2.0.3    highr_0.9          dbplyr_2.2.1       rlang_1.0.4       
## [37] readxl_1.4.0       XMLRPC_0.3-1       rstudioapi_0.13    farver_2.1.1      
## [41] generics_0.1.3     jsonlite_1.8.0     car_3.1-0          RCurl_1.98-1.8    
## [45] s2_1.0.7           Matrix_1.5-1       Rcpp_1.0.9         munsell_0.5.0     
## [49] fansi_1.0.3        abind_1.4-5        lifecycle_1.0.1    yaml_2.3.5        
## [53] stringi_1.7.8      carData_3.0-5      MASS_7.3-58.1      grid_4.2.1        
## [57] maptools_1.1-4     crayon_1.5.1       lattice_0.20-45    haven_2.5.0       
## [61] cowplot_1.1.1      splines_4.2.1      hms_1.1.1          pillar_1.7.0      
## [65] uuid_1.1-0         markdown_1.1       ggsignif_0.6.3     XML_3.99-0.10     
## [69] wk_0.6.0           reprex_2.0.1       glue_1.6.2         evaluate_0.15     
## [73] modelr_0.1.8       vctrs_0.4.1        tzdb_0.3.0         MatrixModels_0.5-0
## [77] cellranger_1.1.0   gtable_0.3.0       assertthat_0.2.1   xfun_0.31         
## [81] mime_0.12          broom_1.0.0        gitcreds_0.1.2     e1071_1.7-11      
## [85] rstatix_0.7.0      class_7.3-20       survival_3.4-0     viridisLite_0.4.0 
## [89] units_0.8-0        ellipsis_0.3.2

References

Aphalo P (2022). ggpp: Grammar Extensions to ‘ggplot2’. R package version 0.4.5,
https://CRAN.R-project.org/package=ggpp.

Aphalo P (2022). ggpmisc: Miscellaneous Extensions to ‘ggplot2’. R package version 0.5.0,
https://CRAN.R-project.org/package=ggpmisc.

Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3,
https://CRAN.R-project.org/package=magrittr.

Kassambara A (2020). ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version
0.4.0, https://CRAN.R-project.org/package=ggpubr.

Hijmans R (2021). geosphere: Spherical Trigonometry. R package version 1.5-14,
https://CRAN.R-project.org/package=geosphere.

Walker K (2022). tigris: Load Census TIGER/Line Shapefiles. R package version 1.6.1,
https://CRAN.R-project.org/package=tigris.

Walker K, Herman M (2022). tidycensus: Load US Census Boundary and Attribute Data as
‘tidyverse’ and ‘sf’-Ready Data Frames
. R package version 1.2.2,
https://CRAN.R-project.org/package=tidycensus.

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686,
https://doi.org/10.21105/joss.01686

Leave a Reply