No Safety in Numbers

K-12 Testing

A national spate of test-scoring errors by the country’s three largest test publishers -- CTB/McGraw Hill, Harcourt Brace, and Riverside -- should send a resounding message to policy makers and the public: a single standardized test should never be relied upon to make an important decision about students or schools.


The most dramatic test-maker foul-up occurred this fall in New York City, where the CTBS "Terra Nova" exam was used for the first time to determine placement in summer school and grade promotion. A scoring mistake by CTB/McGraw Hill caused 8,600 students to be mistakenly ordered to attend summer school, at the end of which over 3,000 were told they would have to repeat a grade. Based largely on CTBS scores, School Chancellor Rudy Crew removed several district superintendents last spring.


However, the test scores of thousands of children, as well as districtwide averages, were actually higher than reported. Crew has since called for an independent review of the test company’s procedures and lauded the improvement shown in schools’ test scores. Students who in fact met the test-score requirements will now be promoted. However, Crew did not reinstate any of the removed district superintendents. The city is currently revising its promotion policy (see related story).


But Crew’s fury, the fate of thousands of children and the threat of an independent review apparently did not motivate the company to clean up its act. Less than two months later, CTB/McGraw Hill goofed again, printing the state’s eighth-grade math test scores on the report forms for the reading exam. District personnel across the state received forms describing how students performed in reading, not math. The error was caught only after the reports had been distributed to all districts. A company spokesperson commented that “there were no inaccuracies in the scores,” attributing the flaw to a printing error.


Scoring Errors
The NYC error -- application of the wrong norming table to translate students’ raw scores into comparison or percentile scores -- resulted in incorrect results being released in at least five other jurisdictions across the country: Nevada, South Carolina, Indiana, Wisconsin and Tennessee. In Nevada, schools were erroneously placed on probation due to the faulty scores. CTB has also admitted to previous scoring errors in Missouri and Florida.


Earlier, the company had made a similar error in tabulating scores on a statewide test administered to 3rd, 6th, 8th and 10th graders in Indiana (see Examiner, Winter ‘98-99). That error was caught by the Fort Wayne school superintendent, who ultimately convinced state education officials to order an official review. As in New York City, the test manufacturer initially maintained there was no mistake. Now, the state superintendent of education and one gubernatorial candidate are calling for a suspension of the test and an investigation of the company’s quality-control procedures.


Scoring mistakes were not confined to tests made by CTB/McGraw Hill. Earlier this year, California state officials caught a major error by Harcourt Brace Educational Measurement (HBEM). The company had miscategorized a quarter million English-speaking students as “English language learners,” artificially inflating that group’s average scores. The faulty figures were used by opponents of bilingual education to support the state’s anti-bilingual education law. Other errors were later found in the scores of 190,000 students in year-round schools. The state has fined HBEM $1.1 million for mismanagement.


The contractor for Washington’s statewide test, Riverside Publishing of Chicago, and a subcontractor, NCS of Iowa, recently accepted responsibility for mis-scoring 410,000 student essays on last spring’s test. Misinterpretations of the grading guidelines by scorers were blamed for the mistakes, along with a tendency for readers to “drift” from the established grading standards. Riverside estimated that rescoring the essays will cost $600,000.


Measurement Error
While most reported errors can be traced to human blunders, human error does not account for the measurement error inherent in all standardized tests. In a new report on testing reliability, Stanford University statistician David R. Rogosa says his calculations show that two students with identical “real achievement” have a strong chance of scoring more than 10 percentile points apart on the same standardized test. For two 9th graders who are at the 45th percentile, there is a 57 percent chance of this discrepancy, Rogosa says; in 4th grade reading, the probability is 42 percent.
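The kind of divergence Rogosa describes can be illustrated with a simple simulation. This is a hypothetical sketch, not his actual model: the standard error of measurement and the noise distribution here are assumed values chosen only to show the mechanism, so the resulting probability is illustrative rather than a reproduction of his figures.

```python
# Illustrative Monte Carlo sketch: two students share the same "real
# achievement" (the 45th percentile), but each observed score adds
# independent random noise, so their reported percentile ranks diverge.
import math
import random

def normal_cdf(z):
    """Standard normal CDF, used to convert a z-score to a percentile."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(42)

TRUE_Z = -0.126   # z-score of the 45th percentile (both students' true level)
SEM = 0.5         # hypothetical standard error of measurement, in z units
TRIALS = 100_000

far_apart = 0
for _ in range(TRIALS):
    # Each observed percentile = identical true ability + independent noise
    p1 = 100 * normal_cdf(TRUE_Z + random.gauss(0, SEM))
    p2 = 100 * normal_cdf(TRUE_Z + random.gauss(0, SEM))
    if abs(p1 - p2) > 10:
        far_apart += 1

print(f"P(>10 percentile points apart): {far_apart / TRIALS:.2f}")
```

With these assumed parameters the simulated probability lands well above half, which makes the policy point concrete: even with no scoring blunder at all, two equally able students can routinely receive scores far enough apart to put one above and one below a high-stakes cut score.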


When “cut scores” -- minimum test performance requirements -- are used to make high-stakes decisions such as grade promotion, such differences can have a dramatic effect upon the lives of students. In New York City, students scoring just a fraction below the 15th percentile cut-off score were ordered to attend summer school. Measurement error is a major psychometric reason why the joint Standards for Educational and Psychological Testing states, “In elementary or secondary education, a decision or characterization that will have a major impact on a test taker should not automatically be made on the basis of a single test score” (Standard 8.12).


CTB, HBEM and Riverside all acknowledge this limitation of tests and advise score users not to base important decisions on results from their tests and to use other sources of information. Once the error was caught in New York City, for example, CTB prominently displayed a warning on its Web site advising that “no single test can ascertain whether all educational goals are being met.” However, FairTest knows of no case in which a K-12 test publisher has refused to sell its products to a district or state which misuses the test to make high-stakes decisions.


The poor track record of testing companies in providing accurate scoring combined with the inherent limitations of all standardized tests provides strong evidence that extra measures are needed to safeguard children from damage caused by test misuse. Public protection starts with limiting the role of tests in decisions such as student promotion and graduation.


Unfortunately, test makers whose products are used to determine the work or educational opportunities of millions of people a year face no independent scrutiny. An independent federal agency, the equivalent of the Food and Drug Administration, or a non-profit, like the publisher of Consumer Reports, should exist to ensure that the development and scoring of tests meet high standards of accuracy, reliability and validity.