Understanding the Limitations of Your Data: At-Risk Versus Low-Income

A while ago, I described how decisions about how to treat your data can have significant effects on your results–effects that can be as profound as, if not greater than, p-value tweaking. So a preliminary analysis of the effects of charters and income in D.C. provides a very clear example of what I meant (since I imagine that, for some, the genomics stuff was pretty unfamiliar).

Before I get started, I want to make clear that this isn’t close to my ideal analysis: ultimately, I would want to look at individual students. If I had to look at school-level trends, I would want to do something similar to what Bruce Baker did in his analyses of New Jersey schools. So consider this an example that should raise questions about how these analyses categorize students and what that might mean, not an attempt at offering a conclusive ‘answer’ to the charter school issue.

Anyway, let’s get to some data. G.F. Brandenburg put together the following dataset:

To answer this question, I used some recent data. I just found out that the DC City Council has begun requiring that schools enumerate the number of students who are officially At-Risk. They define this as students who are:

“homeless, in the District’s foster care system, qualify for Temporary Assistance for Needy Families (TANF) or the Supplemental Nutrition Assistance Program (SNAP), or high school students that are one year older, or more, than the expected age for the grade in which the students are enrolled.” (That last group is high school students who have been held back at least one time at some point in their school career.)

So, it’s a simple (but tedious) affair for me to plot the percentage of such at risk students, at each of the roughly 200 publicly-funded schools in Washington, DC, versus the average percentage of students who were proficient or advanced in math and reading on the 2014 DC-CAS….

I took the average of the percentage of students ‘passing’ the DC-CAS in math and in reading as the proficiency rate.

Like I mentioned, not ideal, but useful for making a point about analysis. To Brandenburg’s data, I also added a column with the percentage of students at each school (where available) who are low-income–that is, who qualify for free or reduced lunch. Economically, this is a much broader group of students (up to 185% of the poverty line or $36,612 annual income for a household of three). You can download the data I used here.

Anyway, when the percentage of students who are low-income is plotted against the average percentage of students who were proficient or advanced, there is a significant charter school effect: a charter school, on average, has the same effect as a fifteen-point reduction in the percentage of low-income students (for every additional percentage point of low-income students, proficiency drops by 0.52 percentage points). The model has an R-squared of 0.489*, and the p-values for the two independent variables are highly significant (< 10e-4):

Average percent proficient in reading and math, 2014 by %low income
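To make the shape of that model concrete, here is a minimal sketch of a two-predictor least-squares fit (proficiency against percent low-income plus a charter indicator). Everything in it is an assumption for illustration: the schools are made up, the intercept of 80 is arbitrary, and the charter bump of 7.8 points is simply what a fifteen-point reduction works out to at 0.52 points per point (15 × 0.52). The helper names `solve_linear_system` and `fit_ols` are hypothetical, and this is not the code behind the analysis above.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Zero out this column in every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations; an intercept is added."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical schools, NOT the D.C. data: (% low-income, is_charter).
schools = [(20, 0), (30, 1), (40, 1), (50, 0), (60, 0), (80, 1), (95, 0), (99, 1)]

# Proficiency generated exactly from the assumed relationship:
# intercept 80 (arbitrary), -0.52 per point of low-income, +7.8 for charters.
proficiency = [80 - 0.52 * low + 7.8 * charter for low, charter in schools]

intercept, slope, charter_effect = fit_ols(schools, proficiency)

# Express the charter coefficient as an equivalent change in % low-income.
equivalent_points = charter_effect / -slope
print(round(slope, 2), round(charter_effect, 2), round(equivalent_points, 1))
```

Dividing the charter coefficient by the low-income slope is what turns "charters add some number of proficiency points" into "charters look like a school with that many fewer low-income students."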

When the same analysis is performed except using At-Risk students instead of low-income, the charter school effect essentially vanishes (charter schools increase the average percentage of proficient or advanced students by less than three percentage points, but this variable is not statistically significant, with p = 0.154). The model has an R-squared of 0.638, and the effect of At-Risk students is much greater than before, with every additional percentage point of At-Risk students decreasing proficiency by 0.76 percentage points:

Average percent proficient in reading and math, 2014 by % at risk

While D.C. public and charter schools don’t differ in the percentage of low-income students, they do differ in the percentage of At-Risk students, with charters having fewer such students:
Proportion of At-Risk students in charter and regular public schools. The heavy bar in the middle is the median, and the box edges are the 25th and 75th percentiles, with the “T’s” representing the maximum and minimum values. Over half of the regular public schools have more At-Risk students than 75 percent of the charter schools.
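The quantities in that caption (median, 25th and 75th percentiles, minimum, maximum) and the "over half" comparison can be computed along these lines. This is only a sketch: the two lists are invented percentages, not the D.C. figures, and `five_number_summary` is a hypothetical helper name. It uses the standard library's `statistics.quantiles` with the inclusive method, which is one of several common quartile conventions.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return min, quartiles, and max, as drawn in a box-and-whisker plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


# Invented % At-Risk values for illustration only (NOT the D.C. data).
charter_at_risk = [10, 20, 30, 40, 50, 60, 70]
public_at_risk = [30, 50, 65, 75, 80, 90, 95]

charter_summary = five_number_summary(charter_at_risk)
public_summary = five_number_summary(public_at_risk)

# Share of regular public schools above the charter 75th percentile.
above = [v for v in public_at_risk if v > charter_summary["q3"]]
share_above = len(above) / len(public_at_risk)

print(charter_summary)
print(public_summary)
print(f"Public schools above charter 75th percentile: {share_above:.0%}")
```

If the public-school median sits above the charter 75th percentile, then more than half of the public schools exceed three quarters of the charters, which is the comparison the caption makes.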

So, does this mean TEH SCIENTISMZ ARE FALSE!? No, what this does mean, however, is that all of the ‘pre-processing’ steps can dramatically influence analysis outcomes. Simply by changing what we mean by ‘low-income’ (that is, kids whose lives are pretty miserable), we can get a fundamentally different result: we can move from charters having no significant effect to having a large one. It also means that maybe we need to revisit some of the charter school analyses, using better independent variables. A child in a home with earnings of $35,000/yr isn’t demographically equivalent to one in a household with annual earnings of $10,000/yr–and shouldn’t be treated as such.
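A toy construction shows how this can happen mechanically. In the sketch below, proficiency is made to depend only on the At-Risk percentage, with no charter effect at all; charters and public schools get identical low-income percentages, but charters are assigned fewer At-Risk students. All of the numbers are assumptions chosen for illustration (the 0.76 slope echoes the figure above, the intercept of 90 and the 10- and 30-point gaps are arbitrary), and `ancova_fit` is a hypothetical helper that fits one continuous predictor plus a 0/1 group indicator using within-group deviations.

```python
from statistics import mean


def ancova_fit(x, y, group):
    """Least-squares fit of y = a + b*x + c*group, where group is 0 or 1.

    The slope b is the pooled within-group slope; c is the gap between
    group means of y left over after adjusting for the gap in x.
    """
    sxy = sxx = 0.0
    means = {}
    for g in (0, 1):
        xs = [xi for xi, gi in zip(x, group) if gi == g]
        ys = [yi for yi, gi in zip(y, group) if gi == g]
        mx, my = mean(xs), mean(ys)
        means[g] = (mx, my)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
        sxx += sum((xi - mx) ** 2 for xi in xs)
    b = sxy / sxx
    c = (means[1][1] - means[0][1]) - b * (means[1][0] - means[0][0])
    return b, c


# Invented data: both sectors share the same % low-income values...
low_income = [50, 60, 70, 80, 90, 100] * 2
charter = [0] * 6 + [1] * 6
# ...but charters are assumed to enroll fewer At-Risk students.
at_risk = [low - (30 if ch else 10) for low, ch in zip(low_income, charter)]
# Proficiency depends ONLY on % At-Risk: no true charter effect.
proficiency = [90 - 0.76 * ar for ar in at_risk]

slope_low, charter_low = ancova_fit(low_income, proficiency, charter)
slope_risk, charter_risk = ancova_fit(at_risk, proficiency, charter)

print(f"low-income model: charter coefficient = {charter_low:.2f}")
print(f"At-Risk model:    charter coefficient = {charter_risk:.2f}")
```

In this construction the low-income model credits charters with a sizeable bump that disappears once the At-Risk measure is used, even though no charter effect was built in. It says nothing about which story is true in D.C.; it only shows that the choice of poverty measure can manufacture or erase the effect.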

As I wrote at the beginning, this shouldn’t be viewed as anything close to definitive. But I think there’s something to the criticism that how we classify students might affect our conclusions regarding various interventions, including charters (e.g., the CREDO study).

Understanding how your data ‘pre-processing’ affects your analysis is critical.

An aside: The third figure also demonstrates that D.C. charters and public schools do not have equivalent student bodies. Not even close.

*R-squared assesses the extent to which the model as a whole can account for the data. For example, an R-squared of 0.5 means that 50% of the total variation in whatever you’re measuring (in this case, the percentage of students who are proficient or advanced) can be accounted for by the variables in the model.
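For readers who want the arithmetic, R-squared is one minus the ratio of the variation the model leaves unexplained to the total variation around the mean. A minimal sketch, with made-up observed and predicted values and a hypothetical function name:

```python
def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


# Made-up proficiency rates and model predictions, for illustration only.
observed = [20, 40, 60, 80]
predicted = [30, 40, 60, 70]

r2 = r_squared(observed, predicted)
print(round(r2, 3))
```

A model that predicts every school perfectly scores 1, and one that just predicts the overall average for every school scores 0.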
