In the case against the Texas TAAS test (see related story), plaintiffs presented research showing that standard test-construction methods build racial bias into the test. The judge concluded he “cannot quarrel” with that finding. Groups concerned about civil rights should use this argument to oppose the use of tests to make high-stakes decisions, such as school graduation or grade promotion.
According to a declaration by Prof. Martin Shapiro of Emory University, who is both a lawyer and a psychologist, Texas uses “point-biserial correlations” to decide which field-tested questions to keep and which to discard as the test is assembled. Items with high point-biserial correlations are those answered correctly mostly by test-takers who score high on the test overall. Items which many low-scoring students also get right have lower correlations.
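For readers unfamiliar with the statistic, the usual textbook form of the point-biserial correlation for a single item is given here. The symbols are introduced for illustration and do not come from the declaration: M_1 and M_0 are the mean total scores of test-takers who answered the item correctly and incorrectly, s_X is the standard deviation of all total scores, and p is the proportion who answered correctly.

```latex
r_{pb} = \frac{M_1 - M_0}{s_X}\,\sqrt{p\,(1 - p)}
```

The numerator makes the point directly: the wider the score gap between those who get the item right and those who get it wrong, the higher the correlation.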
To obtain higher consistency (and hence technical reliability) on the test, Texas follows the typical practice of using items with the highest correlation values. This procedure means that, among items covering the same material, the ones with the greatest gaps between high and low scorers will be used. Because minority group students typically perform less well on the test as a whole, the items that best separate high scorers from low scorers also tend to be the ones that most separate white students from minority students. The effort to increase reliability therefore also increases bias against minorities.
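The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure Texas or its contractor actually ran: the function names, the uncorrected total score (each item counts toward its own total), and the tiny data set are all hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item, totals):
    """Correlation between one item's 0/1 answers and total test scores.

    Uses r = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are the
    mean totals of those who got the item right and wrong.
    """
    right = [t for x, t in zip(item, totals) if x == 1]
    wrong = [t for x, t in zip(item, totals) if x == 0]
    sd = pstdev(totals)
    if not right or not wrong or sd == 0:
        return 0.0  # statistic is undefined; treated as zero (assumption)
    p = len(right) / len(item)
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))


def select_items(responses, keep):
    """Return indices of the `keep` items with the highest correlations.

    `responses` holds one list of 0/1 answers per student.
    """
    totals = [sum(student) for student in responses]
    n_items = len(responses[0])
    r = {
        j: point_biserial([student[j] for student in responses], totals)
        for j in range(n_items)
    }
    return sorted(r, key=r.get, reverse=True)[:keep]


# Made-up field-test data: 4 students, 3 items.
responses = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
chosen = select_items(responses, keep=2)
```

In this toy data, the second item is answered correctly by both low scorers, so it has the lowest correlation and is the one dropped.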
According to other research, items which facilitate ranking and sorting are often items which, perhaps unintentionally, factor non-school learning and social background into the questions. Such items help create consistency in test results, but they often are based on the experiences of white middle-to-upper class children, who also typically have access to a stronger academic education.
This test assembly approach was developed in large part to help obtain consistency on tests designed to rank and sort students, such as IQ tests, the SAT, or national, norm-referenced achievement tests (NRTs). While procedures which reinforce racial and other biases should not be used in any case, tests such as TAAS now rely on point-biserial correlations, even though the TAAS supposedly is not intended to sort students but to determine whether they have met specified levels of achievement. By using this method of item selection, the TAAS, like many other state “criterion-referenced” or “standards-based” exams, is actually constructed to resemble an NRT.
This common test development procedure exacerbates the existing inequities of schooling. When used in high-stakes testing, item selection by point-biserial correlation helps ensure that at least some students who know the material and ought to pass the tests do not. Those students are overwhelmingly low-income students, students of color, students for whom English is a second language, or students with special needs.
Alternatives do exist. One is to use different item selection procedures. For example, the “Golden Rule” bias reduction process involves selecting, from within a content or skill area, the items with the smallest racial gaps (see Examiner, Winter 1991-92). Given more than two racial groups to balance, however, this might not be a simple process.
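The idea behind this kind of selection can be sketched as follows. It is a simplified illustration for two groups only, not the actual Golden Rule procedure: the function name, field names, and proportions are all hypothetical, and a real process would need an agreed rule for handling more than two groups.

```python
from collections import defaultdict


def smallest_gap_items(items, per_area):
    """Within each content area, keep the items with the smallest gap
    in proportion correct between two groups."""
    by_area = defaultdict(list)
    for item in items:
        by_area[item["area"]].append(item)

    chosen = []
    for area in sorted(by_area):
        ranked = sorted(
            by_area[area],
            key=lambda it: abs(it["p_group_a"] - it["p_group_b"]),
        )
        chosen.extend(it["id"] for it in ranked[:per_area])
    return chosen


# Made-up field-test results: proportion answering correctly in each group.
pool = [
    {"id": "q1", "area": "fractions", "p_group_a": 0.80, "p_group_b": 0.60},
    {"id": "q2", "area": "fractions", "p_group_a": 0.70, "p_group_b": 0.65},
    {"id": "q3", "area": "fractions", "p_group_a": 0.75, "p_group_b": 0.50},
    {"id": "g1", "area": "geometry", "p_group_a": 0.60, "p_group_b": 0.50},
    {"id": "g2", "area": "geometry", "p_group_a": 0.55, "p_group_b": 0.52},
]
selected = smallest_gap_items(pool, per_area=1)
```

Because items are compared only with others from the same content area, the test still covers the same material; what changes is which of the interchangeable items is chosen.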
Another alternative would be to reject the underlying assumption of “unidimensionality,” on which point-biserial correlations rest. Under this assumption, test-makers treat test-taker performance as something that can be described by a single underlying ability. The assumption has two major problems. First, it holds that performance on the many sub-topics in a subject, such as algebra and geometry in math, is due to one ability rather than to multiple and possibly different abilities.
Second, standards in various subjects assume students will learn cognitively complex skills. Unidimensionality in test items, however, assumes test-takers employ only one cognitive process to solve a problem, rather than multiple modes of thinking which can interact in varied ways. Items are also rarely written to include multiple aspects of a subject. For example, a math test specification for the National Assessment of Educational Progress “portrays a student’s performance on any individual mathematics item as the pairing of a single process with a single content category” (Paul Nichols and Brenda Sugrue, “The Lack of Fidelity Between Cognitively Complex Constructs and Conventional Test Development Practice,” Educational Measurement: Issues and Practice, Summer 1999). Thus, typical test-development procedures contradict the standards on which many new tests are intended to be based.
Human development and use of cognitive processes can vary based on social and cultural background. If a test assumes unidimensionality of cognitive processes and the “acceptable” processes include only some culturally-based approaches, then the test becomes culturally biased. Bias review techniques which utilize point-biserial correlations will be unable to detect this flaw. Further, to the extent that tests affect curriculum and instruction, test-development procedures can undermine multi-cultural approaches to teaching and learning.
In sum, Dr. Shapiro’s review of the construction of the TAAS test reveals very serious problems in typical test construction methods. These problems can impact teaching and learning for all students but have the most damaging effects on students from low-income and minority-group backgrounds.