Group vs. individual uses of data

Andrew Gelman notes that, on the subject of value-added assessments of teachers, “a skeptical consensus seems to have arisen…” How did we get here?

Value-added assessments grew out of the push to measure educational success through standardized tests. Looking at raw test scores alone isn't fair, because some teachers work in better schools or teach better-prepared students. The solution was to look at how much a teacher's students improve compared with other teachers' students. Wikipedia has a fairly good summary here.
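
In its simplest form, that means regressing this year's scores on last year's across the whole city and crediting each teacher with the average amount by which her students beat the prediction. Here is a deliberately stripped-down sketch of that idea with invented numbers (real models, including New York's, control for far more than prior scores):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 50 teachers, each with one class of 30 students.  All numbers
# below are invented purely for illustration.
n_teachers, class_size = 50, 30
prior = rng.normal(650, 40, (n_teachers, class_size))       # last year's scores
true_effect = rng.normal(0, 5, n_teachers)                   # unobserved teacher quality
current = (100 + 0.85 * prior                                # this year's scores
           + true_effect[:, None]
           + rng.normal(0, 25, (n_teachers, class_size)))

# City-wide regression of current score on prior score.
slope, intercept = np.polyfit(prior.ravel(), current.ravel(), 1)
predicted = intercept + slope * prior

# A teacher's "value added" is how far her students beat the prediction, on average.
value_added = (current - predicted).mean(axis=1)
print(np.corrcoef(true_effect, value_added)[0, 1])           # positive, but noisy
```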

Back in February New York City released (over the opposition of teachers’ unions) the value-added scores of some 18,000 teachers. Here’s coverage from the Times on the release and reactions.

Gary Rubinstein, an education blogger, has done some analysis of the data contained in the reports and published five posts so far: part 1, part 2, part 3, part 4, and part 5. He writes:

For sure the ‘reformers’ have won a battle and have unfairly humiliated thousands of teachers who got inaccurate poor ratings. But I am optimistic that this will be looked at as one of the turning points in this fight. Up until now, independent researchers like me were unable to support all our claims about how crude a tool value-added metrics still are, though they have been around for nearly 20 years. But with the release of the data, I have been able to test many of my suspicions about value-added.

I suggest reading his analysis in full, or at least the first two parts.

For me one early take-away from this — building off comments from Gelman and others — is that an assessment might be a useful tool for improving education quality overall, while simultaneously being a very poor metric for individual performance. When you’re looking at 18,000 teachers you might be able to learn what factors lead to test score improvement on average, and use that information to improve policies for teacher education, recruitment, training, and retention. But that doesn’t mean one can necessarily use the same data to make high-stakes decisions about individual teachers.
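
A back-of-the-envelope simulation makes the asymmetry concrete. Every number below is invented and chosen only to illustrate the scale of the issue: the city-wide average is pinned down very precisely, while an individual teacher's score, and especially her percentile rank, is dominated by noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers: 18,000 teachers whose true effects differ modestly,
# each measured with the noise a single year of one class's tests provides.
n = 18_000
true_effect = rng.normal(0, 1, n)
year1 = true_effect + rng.normal(0, 2, n)   # noisy single-year estimate
year2 = true_effect + rng.normal(0, 2, n)   # the same teachers, measured again

# Group use: the city-wide average is estimated very precisely.
print("standard error of the city-wide mean:", year1.std() / np.sqrt(n))

# Individual use: one teacher's score is mostly noise, so percentile ranks
# jump around from year to year.
pct1 = year1.argsort().argsort() / n * 100
pct2 = year2.argsort().argsort() / n * 100
print("median year-to-year change in percentile:", np.median(np.abs(pct1 - pct2)))
```

The exact figures depend entirely on the assumed noise levels, but the contrast between the precision of the aggregate and the instability of the individual ranking does not.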

Comments

  1. Jacob Hartog

    There are a host of arguments against these models in general: Jesse Rothstein’s paper on the biases introduced by dynamic tracking, in which he showed that a 5th-grade teacher’s value-added score was correlated with students’ progress in 4th grade, was perhaps the most dramatic.
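
    A toy version of that falsification test shows the mechanism (made-up numbers, and a caricature of the real analysis): if students are tracked into 5th-grade classrooms partly on the basis of their 4th-grade gains, the 5th-grade classroom ends up "predicting" gains that happened before the students ever walked in.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Students are tracked into 5th-grade classrooms partly on the basis of
    # their 4th-grade gains (all numbers invented for illustration).
    n_students, class_size = 3000, 30
    gain_4th = rng.normal(0, 1, n_students)
    tracking_score = gain_4th + rng.normal(0, 1, n_students)

    order = np.argsort(tracking_score)            # fill classrooms in tracked order
    teacher_5th = np.empty(n_students, dtype=int)
    teacher_5th[order] = np.arange(n_students) // class_size

    # Each 5th-grade classroom's mean 4th-grade gain: under random assignment
    # the spread would be about 1/sqrt(30), but tracking makes it far larger,
    # so the 5th-grade teacher "predicts" gains that predate her.
    class_means = [gain_4th[teacher_5th == t].mean()
                   for t in range(n_students // class_size)]
    print("spread of classroom means of past gains:", np.std(class_means))
    print("expected under random assignment:", 1 / np.sqrt(class_size))
    ```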

    What I think has been inadequately discussed is the use of individual specifications, rather than the zone of agreement across a broad swath of specifications, to assess teachers’ efficacy. For example, the model used by NYCDOE doesn’t just control for a student’s prior-year test score (which I think everyone can agree is a good idea). It also assumes that different demographic groups will learn different amounts in a given year, and it assigns a school-level random effect. The result, as was much ballyhooed when the data were released, is that the average teacher rating for a given school comes out roughly the same whether the school is performing well or terribly. The headline from this was “excellent teachers are spread evenly across the city’s schools,” rather than “the specification of these models assumes that excellent teachers are spread evenly across the city’s schools.”
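
    A crude sketch shows why (invented numbers, with simple demeaning by school as a stand-in for what a fitted school-level random effect does): once each school's mean is absorbed, every teacher is compared only with colleagues in the same building, so the per-school average of the ratings is pushed toward zero by construction.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Invented numbers: one school staffed with genuinely stronger teachers
    # than the other.
    n_teachers, class_size = 20, 25
    true_quality = {"strong_school": rng.normal(0.5, 0.1, n_teachers),
                    "weak_school":   rng.normal(0.0, 0.1, n_teachers)}

    # Student gains: teacher quality plus classroom noise.
    gains = {school: q[:, None] + rng.normal(0, 0.3, (n_teachers, class_size))
             for school, q in true_quality.items()}
    city_mean = np.mean([g.mean() for g in gains.values()])

    for school, g in gains.items():
        va_vs_city = g.mean(axis=1) - city_mean      # no school-level effect
        va_vs_school = g.mean(axis=1) - g.mean()     # school-level effect absorbed
        print(f"{school:>13}: mean rating vs. city = {va_vs_city.mean():+.2f}, "
              f"vs. own school = {va_vs_school.mean():+.2f}")
    ```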

    To be partisan for a moment, imagine assessing the efficacy of basketball players with a multilevel model that imposed a team-level random effect: we might easily ‘discover’ that the average player on the Charlotte Hornets was as good as the average player on the Chicago Bulls, when really (a) that is an artifact of the model design, and (b) good players are what makes a team good, just as good teachers are by and large what makes a school good.

    If I were more sophisticated, I’d try to extend Rothstein’s paper to show that dynamic sorting of teachers into high- and low-functioning schools messes up the models just as badly as dynamic sorting of students does.

    I should add that, as someone who taught in NYC schools for 8 years, I see nothing wrong with measurement itself. The pre-existing observation and evaluation system was completely terrible, and it is totally reasonable to combine evaluations with test-based measurement in making decisions. But analytically privileging test results over other forms of observation, let alone assigning a percentile to the coefficient from a single (remarkably complex) model and publishing it in the newspapers, is absolutely bonkers.

    I’d appreciate anyone’s comments showing me where I’m wrong.

    P.S. Rubinstein wrote a very funny and valuable book called “Reluctant Disciplinarian” about his early years teaching, that’s very much worth a read if you’re going to teach K-12. I’m not totally sold on his blog, though.

    P.P.S. While all of us are somewhat comfortable with the idea that Google or Netflix uses inscrutable algorithms to decide which search results or movies we want to see, there really is something funny about using inscrutable algorithms as a chief means of workplace evaluation. Rather than teachers being “behind the curve” of private industry in this regard, teacher value-added models are attempting a Taylorization of white-collar work that has not previously been implemented. There are ample reasons this is happening now, and happening first to teachers: kids’ test scores offer a large and relatively homogeneous data source, school districts are large enough and statistical computing power cheap enough to make these analyses relatively easy to implement, and, perhaps most of all, the political and economic climate has made teachers a promising target for reduced professional privilege and autonomy. Will we all, within and outside of education, find our work product reduced to a formula almost none of us understand? Perhaps the robots will be grading us even before they take our jobs.

    An interesting if somewhat “look at those crazy equations!” discussion of these issues appeared soon after NYC implemented its value-added model, and prior to the release of ratings to the public:
    http://www.nytimes.com/2011/03/07/education/07winerip.html?pagewanted=all

    P.P.P.S. I can see much more of an argument for using these measures to evaluate teachers than I can for releasing them to the papers. What possible benefit is there to telling a 3rd grader, midway through the year, that his teacher is below average? Talk about shifting the locus of control away from the student…


