High Stakes Tests Do Not Improve Student Learning

A FairTest Report by Monty Neill, Acting Executive Director
January 1998

A common assumption of standards-and-tests-based school reform is that high-stakes testing, such as requiring students to pass an exam to obtain a high school diploma, will produce improved learning outcomes. This view is embodied in the grading formula used in Quality Counts (1998), the recent Education Week report in which states receive points for having high-stakes tests.

If the assumption were true, students in states with high-stakes tests should perform better than other students on neutral measures of academic performance. But independent evidence shows just the opposite is true: students in states without high-stakes tests perform better than those in states with them. While this correlation by itself does not “prove” that high-stakes testing damages student learning, it certainly contradicts the underlying assumption of those who advocate test-driven instruction.

A FairTest review of published data from the National Assessment of Educational Progress (NAEP) reveals that students were less likely to reach a level of “proficient” or higher on the NAEP math or reading tests in states which had mandatory high school graduation tests. Those states also had more students who failed to reach NAEP’s “basic” level. In addition, states with high school graduation tests were less likely to show statistically significant improvement in their students’ scores than were states without such tests.

In essence, then, proponents of high-stakes tests expect states to adopt the assessment strategies used in the weakest performing states. Given the NAEP data, there is no reason to believe such a reform strategy will achieve success.

Math Performance

FairTest compared the results from the NAEP 1996 grades 4 and 8 math tests (Quality Counts, 1998) with whether a state required its students to pass a test to obtain a high school diploma in 1994-95 and 1995-96 (Bond, et al., 1996a, 1996b). The data reveal a strong and clear negative relationship between having a mandatory high school graduation test (HSGT) and having an above-average percentage of students reach the “proficient” level or better. An equally clear positive relationship appears between having a high-stakes graduation test and having a below-average percentage of students attain even the “basic” level on the math tests. (The four levels on NAEP assessments are “advanced,” “proficient,” “basic,” and “below basic.”) That is, states with high-stakes graduation tests have fewer students who reach “proficient” and more students who fail to reach the “basic” level on NAEP math tests.

Forty-three states participated in the 1996 grade 4 NAEP math test (Quality Counts, 1998). At the national level, the proportion of students reaching at least the “proficient” level was 20 percent. Of the 24 states in which 20% or more of the students reached “proficient” or higher, only 5 — about one-fifth — had a HSGT in 1994-96. Of the 19 states which had fewer than 20% of the students reach “proficiency,” 11 — over half — had a HSGT. Rank-ordered, only two of the top 15 states had a HSGT.

On the other hand, of the 18 states which equaled or exceeded the national average of 38% of students failing to reach “basic,” 12 (or two-thirds) had a HSGT. But of the 25 states with a below-average share of students falling below the “basic” level, only five (or one-fifth) had a HSGT.
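
These comparisons reduce to simple proportions. The following sketch (in Python, purely illustrative and not a reconstruction of FairTest’s actual analysis) computes the share of HSGT states within each performance group from the aggregate counts reported above; the same tabulation applies to the grade 8 and reading figures below.

    # Illustrative only: the share of states with a high school graduation
    # test (HSGT) in each performance group, computed from the aggregate
    # counts reported above for the 1996 grade 4 NAEP math test.

    def hsgt_share(hsgt_states, total_states):
        """Fraction of a group of states that had a HSGT."""
        return hsgt_states / total_states

    # States at or above vs. below the national average of 20% reaching
    # "proficient" or higher:
    print(f"above average: {hsgt_share(5, 24):.0%}")   # ~21%, about one-fifth
    print(f"below average: {hsgt_share(11, 19):.0%}")  # ~58%, over half

    # States at or worse than vs. better than the national average of 38%
    # failing to reach "basic":
    print(f"worse on below-basic:  {hsgt_share(12, 18):.0%}")  # ~67%, two-thirds
    print(f"better on below-basic: {hsgt_share(5, 25):.0%}")   # 20%, one-fifth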

If high-stakes tests really helped improve learning, then one might expect the impact to be more visible in grade 8 than grade 4, particularly since graduation tests are commonly given in grade 10. This is not the case.

Forty states participated in the 1996 grade 8 NAEP math test (Quality Counts, 1998). Ranked by percentage of students reaching “proficient” or higher, none of the top 17 states had a HSGT. (See Table 1.) But of the 22 states scoring below the national average on this measure, 13 had a HSGT.

Similarly, among the 21 states equaling or exceeding the national average percentage of students scoring below “basic,” 14 (or two-thirds) had a HSGT. None of the 19 states which did better than the national average at bringing all students to at least the “basic” level had a HSGT. (See Table 1.)

Quality Counts also cited states which made statistically significant gains in their math scores. Of the 14 state-level cases of significant gains in NAEP math scores from 1992 to 1996, only 4, or 28.5%, were in states with a HSGT. Overall, 17 of 50 states, or 34%, had a HSGT in this time period. That is, states with HSGTs were less likely to make statistically significant gains on NAEP than were states without HSGTs. This certainly does not support the claim that states with high-stakes tests will improve most quickly.
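
As a quick check, again in an illustrative sketch rather than the report’s own computation, the same counts also yield the gain rate within each group of states. (Note that the 14 “state-level cases” may count a state at more than one grade, so the per-state rates below are approximate.)

    # Illustrative only: rates of statistically significant NAEP math
    # gains, 1992-1996, derived from the counts reported in the text.
    gains_hsgt, gains_total = 4, 14      # 4 of the 14 gaining cases had a HSGT
    states_hsgt, states_total = 17, 50   # 17 of 50 states had a HSGT

    # The text's framing: HSGT share among gainers vs. among all states.
    print(f"HSGT share among gaining states: {gains_hsgt / gains_total:.1%}")    # 28.6%
    print(f"HSGT share among all states:     {states_hsgt / states_total:.1%}")  # 34.0%

    # An equivalent framing: the gain rate within each group of states.
    gains_no_hsgt = gains_total - gains_hsgt      # 10 gains in non-HSGT states
    states_no_hsgt = states_total - states_hsgt   # 33 states without a HSGT
    print(f"gain rate, HSGT states:     {gains_hsgt / states_hsgt:.1%}")        # 23.5%
    print(f"gain rate, non-HSGT states: {gains_no_hsgt / states_no_hsgt:.1%}")  # 30.3%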

Reading Performance

The 1994 NAEP reading assessment yielded scores for 39 participating states (Campbell, et al., 1996). FairTest compared the test results with whether a state had a HSGT in 1993-94 (Roeber, et al., 1994). (See Table 2.)

Of the 19 states which had 28% or more students reach “proficient” or higher (the national average; Quality Counts, 1998), two had a HSGT in 1993-94. None of the top ten states had a HSGT. Of the 20 states which performed below the national average, 14 had a HSGT.

On the other hand, of the 17 states which did better than the national average of 41% of students failing to reach “basic,” only one had a HSGT. Of the 22 states which performed equal to or worse than that average, 15 had a HSGT.

As for improvement, there was none compared with 1992. Of the 37 states participating in both 1992 and 1994, 8 had statistically significant declines, and 4 of those had HSGTs (Campbell, et al., 1996, p. 25). Thus, while states without a HSGT were more likely to make statistically significant gains in math at grade 4, states with a HSGT were more likely to show a statistically significant decline in reading at grade 4.

Analysis and Conclusions

NAEP results do not support the claim that having high-stakes tests leads to higher educational quality. Rather, it appears that proponents have based their rationale for high-stakes testing on ideology, not evidence. This false rationale, in turn, is being used to engineer a sweeping change in test use in the U.S. According to FairTest’s recent survey, Testing Our Children (Neill, et al., 1997), at least five more states plan to introduce graduation tests — even though there is no evidence they will help improve student learning.

Proponents of high-stakes testing have argued further that it is the students who have done least well in school who will benefit most from test-driven change. However, on all three NAEP exams, few or none of the states with HSGTs bettered the national average for the proportion of students below the “basic” level. These same low-performing states have also done a weaker job of significantly improving their outcomes.

Why does test-driven instruction fail? It may well be that when teachers teach to one, inevitably narrow test, they do not provide students with the rich education they deserve. Inflated test scores, the famous “Lake Wobegon” effect in which virtually all states were “above average” on norm-referenced tests, have been well documented (Haladyna, et al., 1991; Linn, et al., 1990; Shepard, 1990). The test score gains which appear on the state test may not show up on other assessments whose content is not drilled, such as NAEP. In practice, then, the use of tests for accountability can actually undermine, or at least inhibit, real improvement in student achievement, because such tests narrow curriculum and instruction (Madaus, 1988; Madaus, et al., 1992; Smith, 1991).

Support for test-driven reform is not found in NAEP’s analysis of the 1994 reading assessment (Campbell, et al., 1996). Lower NAEP scores are associated with relying primarily on basal readers instead of trade books; with using a worksheet almost every day rather than less than weekly; with reading less material; and with students having less opportunity to discuss their own interpretations of what they read (pp. 68ff). That is, lower scores are associated with practices commonly found to accompany test-driven schooling (Smith, 1991). Similar results have been found in an analysis of the Third International Mathematics and Science Study (Stigler and Hiebert, 1997).

It is rather strange that a strategy which lacks demonstrated success, and which is based on the practice of the educationally weakest region of the U.S. (the South, where most states have HSGTs), should have emerged as the primary national strategy for improving schooling. It is a strategy which reinforces drill-and-kill instructional methods that both national and international assessments have shown not to work.

Like any seller of a “new and improved” product, proponents of test-driven “reform” maintain that this will all change with new tests. However, as FairTest’s detailed evaluation of assessment programs in all 50 states, Testing Our Children, has shown, the so-called new tests are, for the most part, more of the same old thing (Neill, et al., 1997). Passing scores have been set higher, but the tests themselves are mostly unchanged.

Additionally, while many states claim that their new tests are based on new state standards, this too is misleading. In many cases, the tests were developed before the standards. In many others, important parts of the standards, particularly the parts which expect students to learn to think in the various subject areas, are simply not tested (Neill, et al., 1997).

States with HSGTs are the least likely to have moved away from total or near-total reliance on multiple-choice testing. Multiple-choice tests are poor measures of whether a student can think and solve problems in a subject area. NAEP tests have become half or more non-multiple-choice, which may explain in part why states with high-stakes tests perform less well on those exams. Unlike NAEP, Clinton’s proposed national exams will be three-quarters multiple-choice or fill-in-the-blank. Rather than developing truly better assessments, most states are making only minor revisions, and states with high-stakes tests are making the fewest changes.

In sum, NAEP results show that test-driven change is a profoundly mistaken approach to improving schooling. States should look at the actual results of this approach, not the ideological claims of test proponents, and conclude that fewer tests, with lower stakes, are the preferred approach.

Note

Robert Schaeffer and Jennifer Griffis of the FairTest staff participated in writing this report.

Bibliography

Bond, L., Roeber, E., and Braskamp, D. (1996a, Fall). Trends in State Student Assessment Programs. Washington, D.C.: Council of Chief State School Officers.

Bond, L., Braskamp, D., and Roeber, E. (1996b, May). The Status of State Student Assessment Programs in the United States. Washington, D.C.: Council of Chief State School Officers.

Campbell, J., Donahue, P., Reese, C., and Phillips, G. (1996). NAEP 1994 Reading Report Card for the Nation and the States. Washington, D.C.: National Center for Education Statistics, U.S. Department of Education.

Haladyna, T., Nolen, S., and Haas, N. (1991). Raising Standardized Achievement Test Scores and the Origins of Test Score Pollution. Educational Researcher, 20(5), 2-7.

Linn, R., Graue, M. E., and Sanders, N. M. (1990). Comparing State and District Results to National Norms: The Validity of the Claims that “Everyone Is Above Average.” Educational Measurement: Issues and Practice, 9(3), 5-14.

Madaus, G. F. (1988). The Influence of Testing on the Curriculum. In Tanner, L. N. (Ed.), Critical Issues in the Curriculum, 83-121. 87th Yearbook of the National Society for the Study of Education, Part I. Chicago: University of Chicago Press.

Madaus, G. F., West, M. M., Harmon, M. C., Lomax, R. G., and Viator, K. A. (1992). The Influence of Testing on Teaching Math and Science in Grades 4-12 (SPA8954759). Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation, and Educational Policy.

Neill, M., and the Staff of FairTest. (1997). Testing Our Children: A Report Card on State Assessment Systems. Cambridge, MA: National Center for Fair & Open Testing.

Quality Counts (1998). Washington, D.C.: Editorial Projects in Education.

Roeber, E., Bond, L., and van der Ploeg, A. (1994). State Student Assessment Programs Database, 1993-1994. Washington, D.C.: Council of Chief State School Officers.

Shepard, L. (1990). Inflated Test Score Gains: Is the Problem Old Norms or Teaching the Test? Educational Measurement: Issues and Practice, 9(3), 15-22.

Smith, M. L. (1991). Put to the Test: The Effects of External Testing on Teachers. Educational Researcher, 20(5), 8-11.

Stigler, J., and Hiebert, J. (1997). Understanding and Improving Classroom Mathematics Instruction: An Overview of the TIMSS Video Study. Phi Delta Kappan, 79(1), 14-21.