educationdata R RepositoryMixture Modeling for Discipline Based Education Researchers (MM4DBER) is an NSF funded training grant to support STEM Education scholars in integrating mixture modeling into their research.
https://github.com/UrbanInstitute/education-data-package-r
https://educationdata.urban.org/documentation/index.html#data_sources
https://www.isbe.net/Documents/Civil-Rights-Data-Collection-Terms.pdf
What is a data.frame (shorthand; df)?
A data.frame is a specific type of R-object
(an object that we save into the R Global Environment; see
upper-right window pane). When we save data.frames using
functions frome the tidyverse package they are saved as
tibbles which is just a special format of
df.
What is an R function? E.g, read_csv()
R functions are the tools which are designed to conduct specific data
science tasks such as reading CSV files into R. They work by having the
user provide 1 or more inputs called arguments. Arguments
are separated within R functions by commas (,). R functions
can also do things like estimate statistical models lm()
and calculate equations sum(). This is a bit confusing
because in the statistics the same term is used to refer to a
statistical or mathematical operations (e.g., f(x) = m(x) + b).
R packages contain pre-made functions, or you can write your own.
What are arguments within R functions?
write_csv() functions first
argument is where you input the data.frame
which you want to use to write a CSV file frombullying indicators & save
it within your R-projectInstall & load packages
NOTE: The educationdata R package is the location in
which this data was downloaded from
*We will not need it for this tutorial but if you want to access this large education repository (see link)
#install.packages("educationdata")
library(educationdata)
library(tidyverse)
library(here)
library(naniar)
library(janitor)
library(naniar)
library(gt)
library(lubridate)
library(glue)
Step 1: Load data from repository source (we will skip this step - it takes to long!):
educationdata R package using the
function get_education_data()level,
source, topic, subtopic,
filters)data sub-folder)# bullydis17 <- get_education_data(
# level = 'schools',
# source = 'crdc',
# topic = 'harassment-or-bullying',
# subtopic = list('disability', 'sex'),
# filters = list(year = 2017),
# add_labels = TRUE)
Step 2: Create a CSV file for easy access in your data
directory (again- we will skip this step):
write_csv() function (so you
don’t have to reload it every time)data.frame R-object to read
into the functionhere function to specify the
location & name of the CSV file you createhere::here() is excessively verbose,
just here() would also work. In some instances if you want
to make it clear to future readers such as your future-self
which R package the here() function comes from you can use
this long-format (in this case it is obvious and so
redundant/unnecessary to include).# write_csv(bullydis17, here::here("data", "crdc_bully2017.csv"))
Step 3: Read in the data
bully17_raw <- read_csv(here::here("data", "crdc_bully2017.csv"))
Step 4: Always check your data
tabyl() from the package
janitor() to take a quick look at variable frequencies# a. View the `data.frame` (click on the R-object in your `Environment`)
# b. Check frequencies using tabyl()... Does the data make sense?
tabyl(teach_staff$law_enforcement_ind)
tabyl(teach_staff$counselors_fte)
tabyl(teach_staff$psychologists_fte)
tabyl(teach_staff$social_workers_fte)
# You could also use `summary()`, `psych::describe()`, or many other functions to look at summary stats...
# I like to look at important variables 1-by-1 using tabyl() when doing my initial quick check for weird values and missing codes
staff indicators & save it
(repeat steps 1-3)Step 1: Load data from repository source
# teach_staff <- get_education_data(
# level = 'schools',
# source = 'crdc',
# topic = 'teachers-staff',
# filters = list(year = 2017),
# add_labels = TRUE)
Step 2: Create a CSV file
# write_csv(teach_staff, here::here("data", "teach_staff17.csv"))
Step 3: Read in the data
staff17_raw <- read_csv(here::here("data", "teach_staff17.csv"))
Step 4: Always check your data
# a. View the `data.frame` (click on the R-object in your `Environment`)
# b. Check frequencies using tabyl()... Does the data make sense?
tabyl(teach_staff$law_enforcement_ind)
tabyl(teach_staff$counselors_fte)
tabyl(teach_staff$psychologists_fte)
tabyl(teach_staff$social_workers_fte)
bullying dataStep 1. Let’s see what we are working with
Take a look at the data stratified by state (fips)
tabyl(bully17_raw$fips)
Step 2. Tidying up the bully17_raw data
bully17_raw data. Notice how each
observation is repeated 9 times (hint: check crdc_id)race, sex, disability)bully17_raw) & create
a new data.frame named bully_version1filter() function to select rows where the
three variables race, sex, and
disability have the variable level Total%>%) we will use the
select() function to select columns by label
(crdc_id, leaid, ncessch, fips) and then also select any
columns that contain the character string “report” and “disc”. Because
we want to select all variables with these two
character stemsbully_version1 <- bully17_raw %>%
filter(race == "Total", sex == "Total", disability == "Total") %>%
select(crdc_id, leaid, ncessch, fips,
contains("report"),
contains("disc"))
Let’s check to see if we got the right variables & do a quick frequency scan
# a. View the `data.frame` (click on the R-object in your `Environment`)
# b Check frequencies using tabyl()
tabyl(bully_version1$students_report_harass_dis)
Step 3. Extract the target sample (Arizona)
I chose Arizona state for the target sample- I wanted to use a smaller sub-sample for demonstration purposes (i.e., LCA estimation)
data.frame named
bully_version2 building off of our previous df named
bully_version1filter() function to filter observations where
fips == "Arizona"bully_version2 <- bully_version1 %>%
filter(fips == "Arizona")
Quick check
# a. View the `data.frame` (click on the R-object in your `Environment`)
# b Check frequencies using tabyl()
tabyl(bully_version2$students_report_harass_dis)
Step 4. Dichotomize indicator variables and clean up variable names
contains() to select variables that match a
character stringmutate_at() & case_when()
dichotomize the variable set (threshold; >=1 to 1, all else to
0)rename_all() & string_replace() to
change variable stem across the variable setrelocate()bully_version3 <- bully_version2 %>%
mutate_at(vars(contains("students")),
~ case_when(
. >= 1 ~ 1,
. == 0 ~ 0 )) %>%
rename_all(~stringr::str_replace(.,"students_report_harass_","report_")) %>%
rename_all(~stringr::str_replace(.,"students_disc_harass_","discip_")) %>%
relocate(crdc_id, fips, leaid,
ends_with("dis"),
ends_with("race"),
ends_with("sex"))
Take a look at our cleaned candidate indicators (frequencies)
# OOOPS, the code is 'broken' & needs fixing....
# Hold down `option` key and change all six rows to the correct df name
tabyl(bully_data$report_dis)
tabyl(bully_data$report_race)
tabyl(bully_data$report_sex)
tabyl(bully_data$discip_dis)
tabyl(bully_data$discip_race)
tabyl(bully_data$discip_sex)
data.frame named
staff_version1from the original datastaff_version1 <- staff17_raw %>%
select(crdc_id, leaid, ncessch, fips,
contains("counselor"), contains("psych"), contains("social"),
contains("nurse"),contains("security"),contains("law"))
Quick check
tabyl(staff_version1$social_workers_fte)
tabyl(staff_version1$psychologists_fte)
tabyl(staff_version1$counselors_fte)
tabyl(staff_version1$law_enforcement_fte)
# Hmmm... What to do about non-integer values (i.e., part-time staff / average number of school staff)?
# Recall- observation unit is at the school-level (so these are averages)
Extract the target sample (Arizona)
staff_version2 <- staff_version1 %>%
filter(fips == "Arizona")
Dichotomize & rename variables
staff_version3 <- staff_version2 %>%
mutate_at(vars(contains("teacher"),
contains("counselor"), contains("psych"), contains("social"),
contains("nurse"),contains("security"),law_enforcement_fte),
~ case_when(
. >= .1 ~ 1,
. == 0 ~ 0 )) %>%
rename_all(~stringr::str_replace(.,"_enforcement","")) %>%
rename_all(~stringr::str_replace(.,"_workers","")) %>%
rename_all(~stringr::str_replace(.,"_guard","")) %>%
rename_all(~stringr::str_replace(.,"ologists",""))
crdc_id (school
ID) and leaid (district ID)bully_staff_combined <- bully_version3 %>%
full_join(staff_version3) %>%
select(-fips, -law_ind) %>%
relocate(crdc_id, ncessch, leaid)
Extra step: Save the combined LCA indicator data
write_csv(bully_staff_combined, here::here("data", "CRDC_BullyStaff_AZ17.csv"))
educationdata R
package in a previous stepread_csv() & here()
package as in previous stepsOPTION-select method)# Read the final LCA indicator data created in "Task 3" back into our environment (we need this one)
lca_indicators <- read_csv(here::here("data", "CRDC_BullyStaff_AZ17.csv")) # LCA indicators (bully + staff)
assess18 <- read_csv(here::here("data_aux", "assessments18.csv")) # candidate distals 18'
gradrate18 <- read_csv(here::here("data_aux", "gradrate18.csv")) # candidate distals 18'
meps18 <- read_csv(here::here("data_aux", "meps18.csv")) # candidate distals 18'
discipline17 <- read_csv(here("data_aux", "discipline17.csv")) # candidate covariates 17'
finance17 <- read_csv(here("data_aux", "finance17.csv")) # candidate covariates 17'
offenses17 <- read_csv(here("data_aux", "offenses17.csv")) # candidate covariates 17'
offerings17 <- read_csv(here("data_aux", "offerings17.csv")) # candidate covariates 17'
restraint17 <- read_csv(here("data_aux", "restraint17.csv")) # candidate covariates 17'
math_sci17 <- read_csv(here("data_aux", "math_science_enroll17.csv")) # candidate covariates 17'
directory17 <- read_csv(here("data_aux", "directory17.csv")) # candidate covariates 17'
assess_AZ <- assess18 %>%
filter(fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
contains("midpt"))
grad_AZ <- gradrate18 %>%
filter(disability == "Total", econ_disadvantaged == "Total", foster_care == "Total",
race == "Total", homeless == "Total", fips == "Arizona") %>%
group_by(ncessch) %>%
slice(1) %>%
ungroup() %>%
select(leaid, ncessch, # county & school IDs
contains("low"),
contains("mid"),
contains("high"))
meps_AZ <- meps18 %>%
filter(fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
contains("pct"),
contains("ptl"))
discipline_AZ <- discipline17 %>%
filter(race == "Total", sex == "Total", disability == "Total", lep == "All students",
fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
contains("student"),
contains("expulsion"),
contains("transfer"))
offenses_AZ <- offenses17 %>%
filter(fips == "Arizona") %>%
select(-fips,-year,-crdc_id,-homicide_ind)
offerings_AZ <- offerings17 %>%
filter(fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
contains("class"),
contains("taught"),
-contains("single"))
restraint_AZ <- restraint17 %>%
filter(disability == "Total",fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
contains("instances"))
math_sci_AZ <- math_sci17 %>%
filter(race == "Total", sex == "Total", disability == "Total", lep == "All students",
fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
contains("enrl"))
finance_AZ <- finance17 %>%
filter(fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
contains("salaries"),
contains("aides"),
contains("expend"), administration_fte)
directory_AZ <- directory17 %>%
filter(fips == "Arizona") %>%
select(leaid, ncessch, # county & school IDs
school_type,school_level,
urban_centric_locale,
enrollment,teachers_fte,
lunch_program,title_i_schoolwide,
title_i_eligible)
Quick check
tabyl(directory_AZ$urban_centric_locale)
tabyl(directory_AZ$school_type)
tabyl(directory_AZ$school_level)
tabyl(directory_AZ$lunch_program)
tabyl(directory_AZ$title_i_schoolwide)
tabyl(directory_AZ$teachers_fte)
crcd_data <- lca_indicators %>%
full_join(directory_AZ) %>%
full_join(assess_AZ) %>%
rename_all(~stringr::str_replace(.,"_test_pct_prof_midpt","_test")) %>%
rename_all(~stringr::str_replace(.,"title_i","title1")) %>%
rename_all(~stringr::str_replace(.,"_schoolwide","_s")) %>%
rename_all(~stringr::str_replace(.,"_eligible","_e")) %>%
rename_all(~stringr::str_replace(.,"_centric_locale","_rural")) %>%
mutate_at(vars(contains("lunch_program")),
~ case_when(
. == "No" ~ 0,
. == "Yes participating without using any Provision or the CEP" ~ 1,
. == "Yes under Provision 2" ~ 1,
. == "Yes under Provision 3" ~ 1,
. == "Yes under the Community Eligibility Provision (CEP)" ~ 1)) %>%
replace_with_na(replace = list(math_test = c(-3))) %>%
replace_with_na(replace = list(read_test = c(-3))) %>%
replace_with_na(replace = list(title1_s = c("Missing/not reported"))) %>%
replace_with_na(replace = list(title1_e = c("Missing/not reported"))) %>%
mutate_at(vars(contains("title")),
~ case_when(
. == "No" ~ 0,
. == "Yes" ~ 1))
Check covariate frequencies
tabyl(crcd_data$teachers_fte)
tabyl(crcd_data$title1_e)
tabyl(crcd_data$lunch_program)
tabyl(crcd_data$read_test)
tabyl(crcd_data$math_test)
Check distal outcome distribution using density plots
crcd_data %>%
ggplot(aes(x=read_test)) +
geom_density() +
theme_light()
crcd_data %>%
ggplot(aes(x=math_test)) +
geom_density() +
theme_light()
write_csv(crcd_data, here::here("data", "CRDC_Aux_AZ17.csv"))
crcd_3step <- crcd_data %>%
select(-school_type,-school_level,-urban_rural)
write_csv(crcd_3step, here::here("data", "crcd_aux_data.csv"))