What's the Value of Growth Measures?

K-12 Testing

FairTest Examiner - July 2007

The upcoming reauthorization of ESEA/NCLB will almost certainly expand state options to use "growth" or "value-added" testing models. Tracking student learning across the grades certainly makes sense. But the current "growth" models all rely on standardized test scores as the sole measurement tool. FairTest supports growth models if they incorporate multiple sources of evidence of student learning. Without that, they will perpetuate narrow teaching to the test. They must also be used properly. In this article, measurement and research expert Jerry Bracey discusses other limitations and unknowns in the current "growth" models.

Evaluating Value-Added
By Gerald W. Bracey

In the current version of No Child Left Behind, all children in grades 3-8 and one grade in high school are tested once each year. The students who enter third grade next year must score higher than those who entered this year in order for the school/district to make Adequate Yearly Progress (AYP). Because next year's third graders are different children than this year's, they might well differ in achievement levels for reasons that have nothing to do with curriculum or instruction.

A value-added model of evaluation attempts to avoid this problem by tracking individual students as they progress through the grades. The models test students at the beginning of the year and at the end. The change in test scores over the year is the "value" that has been added to the child's achievement level. The question then becomes: how much of this added value does the teacher (or school) account for (as opposed to what is added by parents, community, etc.)?
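The gain-score arithmetic at the heart of these models can be sketched in a few lines. Everything below is hypothetical and illustrative: the teacher names and scale scores are invented, and real systems such as EVAAS layer far more elaborate statistical machinery on top of this basic subtraction.

```python
# Minimal sketch of the gain-score arithmetic behind value-added models.
# Scores and teacher names are hypothetical; real systems add complex
# statistical adjustments on top of this basic idea.
from statistics import mean

# (fall score, spring score) pairs for students in two classrooms
classrooms = {
    "Teacher A": [(410, 455), (398, 430), (425, 460)],
    "Teacher B": [(400, 418), (415, 428), (390, 410)],
}

def average_gain(pairs):
    """Average spring-minus-fall gain: the 'value' added over the year."""
    return mean(spring - fall for fall, spring in pairs)

for teacher, pairs in classrooms.items():
    print(teacher, average_gain(pairs))
```

Everything the model then attributes to "Teacher A" versus "Teacher B" rides on these simple score differences; the points below about test content, scale properties, and nonrandom assignment all concern what such differences can legitimately mean.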

The idea of value-added models (VAM) or Value Added Assessment (VAA) has created a stir in a number of places and on occasion, has been oversold. Research has provided some good evidence on the capabilities and limits of VAA. Here is a summary of key findings:

  • VAA makes more sense than the current successive-cohorts system for determining AYP. It makes more sense to follow kids over time, although if the goal remains 100% proficiency the whole operation remains nuts.
  • VAA is circular: it defines effective teachers as those who raise test scores, then uses test score gains to determine who's an effective teacher.
  • Value-added might not give stable results. An article by J. R. Lockwood and colleagues in the Spring 2007 issue of the Journal of Educational Measurement finds that, using a test of mathematical procedures, they could generate a list of effective teachers. Using a test of mathematical problem solving, they could also generate a list of effective teachers. But the lists were not the same. There was more variability in effectiveness within teachers than across teachers. People have assumed that the trait of being "effective" had some stability within a given teacher. This research calls that into question.
  • The calculations for VAMs are simple subtractions from year to year. But what if mathematics in one year is mostly about fractions and decimals and the next year mostly about geometry and statistics? Does subtracting year one's score from year two's make any sense? Achievement occurs in different dimensions as a student moves through a school career. Problems such as this have yet to be attended to by the modelers.
  • Aside from William Sanders and his Tennessee Value Added Assessment System (TVAAS), those working in VAA (Henry Braun, Howard Wainer, Dan McCaffrey, Dale Ballou, J. R. Lockwood, Haggai Kupermintz, for example) acknowledge that it cannot permit causal inferences about individual teachers. At best, it is a first step toward identifying teachers who might need additional professional development or low performing schools in need of technical assistance.
  • The model also presumes that the teacher "effect" persists: like a diamond, it lasts undiminished forever. This has not been independently demonstrated.
  • VAA is regressive in that it reinforces the idea that schools consist of teachers in boxes with 25 or so kids. Sanders claims his technique can deal with team-taught classes, but even if that is true it misses the dynamic of schools. As Kupermintz put it, "The TVAAS model represents teacher effects as independent, additive and linear. Educational communities that value collaborations, team teaching, interdisciplinary curricula and promote student autonomy and active participation may find [it of little use]. It regards teachers as independent actors and students as passive recipients of teacher 'effects'…" In fact, as class size gets smaller, the TVAAS makes it harder for a teacher to look outstanding or ineffectual. [Author's note: TVAAS is now known as EVAAS. T was for Tennessee where the model was developed and first applied. E is for Educational].
  • Sanders' model (and others) improperly assumes that educational tests form equal-interval scales, but they do not. No amount of finagling with item response theory will fix that. On a thermometer, a true equal-interval scale, the amount of heat needed to get from 10 degrees to 11 is the same as that needed to go from 110 to 111. On a test, it might require very different amounts of "achievement" to get from one point to another on different parts of the scale. Sanders believes that using normal curve equivalents (NCEs) cures this. But NCEs are just transformations of percentiles. They have not been validated by any external criteria as representing equal amounts of "achievement" across their range.
  • And, perhaps most crucially, it presumes that students and teachers are randomly assigned to classes and overlooks that they are not. Many people choose a school by choosing where to live, and within districts they sometimes choose a school other than the neighborhood school. Teachers with seniority often get to choose what school or what classes they teach. They don't usually choose hard-to-teach kids or low-performing schools. And parents exert pressure: here in Alexandria, VA, parents fight to get their kids into Pat Welsh's high school writing classes. Big changes in test scores might well reflect these deviations from randomness as much as anything teachers do in their classrooms. Value-added models typically act as if deviations from random assignment aren't important. They are.
  • Finally, value-added is being oversold. At the Battelle for Children website, one can read, "Combining value-added analysis and improved high school assessments will lead to improved high school graduation rates, increased rigor in academic content, high college going rates, less college remediation and increased teacher accountability." How many validity studies support these assertions? The answer, without doubt, is zero.
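The point above about normal curve equivalents can be made concrete. NCEs are defined as a linear rescaling of the normal z-score (NCE = 50 + 21.06z, with the constant chosen so that percentiles 1, 50, and 99 map to NCEs 1, 50, and 99). The sketch below, using only the Python standard library, shows that an identical percentile gain translates into very different NCE gains at different points on the scale; the specific percentile values compared are illustrative.

```python
# Sketch of the percentile-to-NCE conversion discussed above.
# NCE = 50 + 21.06 * z, where z is the normal deviate for the percentile;
# the constant 21.06 pins percentiles 1 and 99 to NCEs 1 and 99.
from statistics import NormalDist

def nce(percentile):
    z = NormalDist().inv_cdf(percentile / 100)
    return 50 + 21.06 * z

# The same 10-percentile gain, at the middle versus the top of the scale:
middle_gain = nce(60) - nce(50)   # roughly 5 NCE points
top_gain = nce(99) - nce(89)      # roughly 23 NCE points
```

Because the NCE scale stretches identical percentile moves by different amounts, treating equal NCE gains as equal amounts of "achievement" is precisely the assumption that, as the article notes, has never been validated against any external criterion.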

Gerald W. Bracey is an independent researcher, policy analyst, and writer who authors the Research column for Phi Delta Kappan. He blogs at www.huffingtonpost.com/gerald-bracey

The following quotations are from articles that support the points made in "Evaluating Value-Added."

We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to the variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers.
Across a range of model specifications, estimated VAM teacher effects were extremely sensitive to the achievement outcome used to create them.
- J. R. Lockwood et al., Journal of Educational Measurement, Spring 2007

In practice, psychometricians usually act as if ability scores are on an equal-unit scale (or in technical terms, an "interval" scale). But this is merely an assumption of convenience. As prominent psychometricians have pointed out, many of the usual procedures for comparing achievement gains yield meaningless results if the ability scales lack the (equal interval) property.

Do we possess [an equal interval scale] for difficulty, or are we merely able to determine the order of difficulty, assigning higher numbers to items judged to be harder?
The latter account of the matter is the correct one. And because ability is measured on the same scale as difficulty, the same holds true of it.
- Dale Ballou, Education Next, 2002

There are a number of technical concerns, which lead many measurement specialists to counsel caution. These concerns are not just academic niceties: they are central to the proposed uses of VAM, especially those related to accountability. Indeed, given the complexity of academic settings, we may never be satisfied that VAM can be used to appropriately partition the causal effects of the teacher, the school and the student on measured changes in standardized test scores.
- Henry Braun and Howard Wainer, "Value-added Modeling," in Handbook of Statistics, 2007. Emphasis added

The TVAAS model represents teacher effects as independent, additive, and linear. Educational communities that value collaborations, team teaching, interdisciplinary curricula, and promote student autonomy and active participation in educational decisions may find little use for such information. A model that regards teachers as isolated, independent actors and students as passive recipients of teacher "effects" may not be adequate in some contexts. When the fit between the model and the phenomenon it seeks to represent is poor, validity is threatened.
- Haggai Kupermintz, Educational Evaluation and Policy Analysis, Fall 2003

Rather than attempting to conduct a study of the validity of estimated VAA school or teacher effects, which are likely to be impossible to obtain because of nonrandom assignment of students to schools and classrooms and the difficulty in distinguishing between school and teacher contributions to learning, Rubin and colleagues suggest estimating the effect of having a VAA program on student achievement.
- Daniel McCaffrey and Laura Hamilton, Unpublished paper, March 20, 2007