Instructionally Supportive Assessment: A Reply to the ISA Commission Report

by Monty Neill, Ed.D., Executive Director, FairTest

Five major national education organizations convened the Commission on Instructionally Supportive Assessment (2001; see References for links), chaired by James Popham and including prominent measurement and education experts. The Commission's report, Building Tests To Support Instruction and Accountability: A Guide for Policymakers, offers some sound advice which, if followed, would substantially improve state testing programs. However, the Commission also makes a fundamental error in maintaining that large-scale state exams can play a primary role in providing instructionally supportive assessment. The Commission's recommendations are therefore not adequate for the goal of developing an assessment system that can support high-quality assessment and accountability – though the recommendations are valuable for any standardized testing that might be included within a broader assessment system.
The Introduction to the report explains its goals and purposes, including language such as:
"We accepted the coalition's request for assistance because we believe the increased focus on educational testing offers an exceptional opportunity to create assessments that can help the nation's children learn better. If tests help teachers do a better job in the classroom, then they will truly be instructionally supportive. Moreover, we believe that such assessments can provide policymakers with the kind of meaningful evidence needed to satisfy today's educational accountability demands. Consequently, we have written this report specifically for state policymakers to help them establish educational policies that will lead to the development of tests supportive of both instruction and accountability."

The Commission appears to have looked at "instructionally supportive assessment" from the point of view of testing. We could readily envision a different commission, focusing on how to improve instruction and what makes a high-quality school, and from these deriving public reporting and "accountability" provisions that might produce quite a different proposal. Below, we offer an alternative conception of accountability, rooted in the thinking of practitioners and educational reform activists and researchers who collaborated to draft a Call for an Authentic Statewide Assessment System.

In the introduction, the Commission's terminology bounces between "assessment" and "test." While these words are sometimes used interchangeably, they should not be: a test is one form of assessment, and treating the two as interchangeable obscures an essential distinction, one that is vital to understanding the limits of tests, especially large-scale tests, and the possibilities of assessment.
The Commission writes: "Our report contains nine requirements that must be met to ensure a responsible educational assessment system for the improvement of learning."
In our view, many of the Commission requirements would contribute toward a helpful, responsible system, but they remain inadequate in fundamental ways and also are misleading in some crucial aspects.
The report assumes not only the existence of, but also the need for and value of having state-centralized, standardized testing as the primary, largely controlling, measure of student learning and even, it would appear, the primary means of formative assessment (that is, of assessment intended to shape subsequent instruction for the particular student(s)). However, because of the serious limits inherent in any feasible large-scale, standardized testing, such tests should not play the primary role in measuring learning outcomes and cannot adequately play much of a role at all in formative assessment - which is the more important assessment task (Black and Wiliam, 1998). (The report itself does not use the term formative assessment, but the "Illustrative Language for an RFP To Build Tests to Support Instruction and Accountability" that accompanies the report does, albeit only in passing and within the confines of a statewide Request for Proposals (RFP) process.)
Rather than state exams playing the central role, classroom based assessment must occupy the central place in the process of assisting and evaluating student learning and should provide the core information for evaluating school progress.
Before continuing, we would point out in support of the Commission's proposals that: 1) they can be used to critique existing state tests in a powerful fashion; and 2) in the context of a looming federal mandate that all states test all students in grades 3-8 in reading and math, which in most states will mean more but worse testing, the requirements could help educators, parents and others pressure the state to make the state tests at least less damaging. Thus, despite fundamental flaws in the Commission's recommendations, in the unfortunate political context of the moment the recommendations can make a positive contribution.
The bulk of the document is its nine "requirements" for state testing programs. I will analyze most of them, utilizing brief quotes from the Commission (2001) report. The FairTest response relies heavily on two documents: the Call for an Authentic Statewide Assessment System of the Massachusetts Coalition for Authentic Reform in Education (CARE; 1999), which provides an alternative model of a state assessment program; and the Principles and Indicators for Student Assessment Systems of the National Forum on Assessment (1995), which were signed by three of the conveners of the Commission report and which provide detailed criteria for developing an instructionally supportive assessment system.
Requirement 1: A state's content standards must be prioritized to support effective instruction and assessment.
Comment: The report provides no discussion of the role standards should play in schools. While almost all states now have standards, that does not mean this is the best way to proceed to reform schools to attain high-quality education for all (see Kohn, 1999, and Ohanian, 1999, for critiques). There is no discussion as to whether or to what extent standards should be mandatory or voluntary, or the extent to which schools or districts should have flexibility to adapt or modify state standards.
The Commission argues that the standards are too many and should be prioritized. CARE, in discussing the Massachusetts standards, recommends that state standards be brief and limited. Massachusetts had begun its standards development process by writing a brief statement of a Common Core of Learning that all students should attain. CARE concluded that the Core is too brief and general - but that standards should be far closer in length and specificity to the Core than to the typical enormously detailed (and impossible to teach) standards of most states. CARE argues that building curriculum and instruction beyond a basic core, as all schools should do, should be a school matter. Standards can vary and still be high quality. This would not preclude organizations or even states from developing exemplary standards that schools can select from - there is no need for every school to reinvent every wheel, though there is a pressing need for the educators in all schools to engage in continuing conversation and thought about standards.
The Commission recommendation to prioritize the standards does address the same problem by calling for "a limited, manageable set of standards." The Commission then makes a potentially fundamental error: "The purpose of this prioritization is to identify a small number of content standards, suitable for large-scale assessment, that represent the most important or enduring skills and knowledge students need to learn in school." The statement says that only standards that are suitable for large-scale assessment should be prioritized, and it implies that those standards that are suitable for large-scale assessment "represent the most important or enduring skills and knowledge." Both propositions are educationally dangerous, at least in the context of any plausible or feasible large-scale testing program.
To make this point clear, we first need to recognize the limits of a large-scale testing program in practice. We will use one of the longest state tests, the Massachusetts Comprehensive Assessment System (MCAS), touted by Achieve as a model, to exemplify the limits. First, a test cannot take too many hours. At more than 20 hours, the MCAS tests in grades 8 and 10 stretch (if not break) the limits of reasonable testing time – too much time on testing raises the hackles not only of teachers, but of parents as well. (Massachusetts reduced testing in grade 4 by moving tests to other grades to address the length issue in that grade. Also, the MCAS is not "comprehensive": it really is only tests, with no other assessments save for a portfolio as an alternative assessment for students with disabilities who cannot take the tests even with accommodations.) The 20-plus hours cover four subjects: English Language Arts (with a writing sample); math; science and technology; and history. Each test has both multiple-choice and open-response (written) items; in many cases, the open-response items require fairly lengthy responses – which is why the testing time is so long.
Despite this inordinate amount of testing time, significant aspects of the standards are not evaluated. I will take language arts as the example. (The problems are worse in science and history, perhaps less so in math, which might be a subject that can be assessed somewhat adequately using a large-scale assessment, in ways that cannot be done in any other subject area.) In language arts classes in high school, most of the time is taken up (or should be) by reading and discussing literature - novels, short stories, plays and poetry. But the test does not, indeed cannot, assess anything substantive or concrete about literature because the state does not have a state curriculum mandating the reading of specific works. (This is not a recommendation to do so.) Thus, most of what students actually do in class is not assessed by the state test. Meanwhile, in 2000 about one in six of the multiple-choice questions asked students to identify whether a phrase was a metaphor or a simile. Not an irrelevant aspect of understanding literature – but surely not one-sixth of the important knowledge and skills. It is simply infeasible, absent a state curriculum, to have large-scale tests in language arts be rooted in actual literature, and the major consequence is to have mostly questions of "skills," some important and some trivial, that may be difficult questions but are usually not intellectually substantive.

Let us also assume that in an ELA class students will engage in some forms of in-depth and extended works, perhaps through writing or perhaps through other means of communication. But doing in-depth reports or projects is beyond the realistic capacity of state exams.

Compared to those of most states, the writing sample in the MCAS ELA test is lengthy. In 2000, the grade 10 prompt asked the writer to identify a "character, other than the main character" in a work of literature, and to "explain why this character is important." However, the scoring guide for writing is all about form: idea development and conventions. But to really know if the development of the argument makes sense, it helps to know the character and thus the work on which the writing sample is based – which is not feasible unless students are all writing about the same few characters. The privileging of form over substance is ironic given the heavily conservative impetus behind the standards movement, but it appears to be the inescapable result of a large-scale generic test.
The writing also is generic in the sense that one prompt is deemed adequate for assessing a great range of students. But some will understand or identify with the prompt more than will others, and some students are better at the "school game" of responding to things one has no interest in, or of manufacturing an ersatz, temporary "interest" for the sake of the test. Research into student "resistance to schooling" suggests that it is middle to upper class students who best understand and can play this game (Neill, 1995). Additionally, Linda Mabry (1999) has lucidly dissected the dangers of "teaching the prompt" rather than teaching writing - another consequence of standardized tests that are important (as the Commission understands the tests will be).
Thus, due to the inherent limitations of large-scale testing, standards should not be prioritized on the basis of measurement feasibility using such instruments. Much of what is truly important will not be measured and as a result runs the risk of not being taught.
For ELA in general, many kinds of student work can be and are done and assessed in classrooms by teachers – far wider and deeper in scope than can be done with standardized tests. This work should be the primary basis for any evaluation of student learning in ELA - and in other subjects (or multi-disciplinary work).
Since such an approach makes vastly more educational sense, we must ask why it is rarely treated as an important, indeed central, form of assessment by states. The answer, fundamentally, is that teachers are not trusted. However, it is not possible to avoid teachers, to create "teacher proof" instruction that has any real educational quality. Thus, in the end teachers must be trusted. That said, some students leave school having been poorly educated – for many different reasons. It is important to determine why and to take action to prevent such problems. One reason can be that teachers are not good at their work. In any event, checking up on school quality makes sense. The CARE proposal is, at root, an assessment system which focuses on student work and teacher evaluations of that work as the primary means of checking up.
In the Illustrative RFP, a model assessment task for history proposes multiple methods: an essay, an oral presentation, a short open answer, and a fairly trivial multiple-choice question (trivial in light of what else is expected). In this instance, the teacher – or at least a person at the school – would have to score the oral response. If that door is opened, clearly teachers could score essays locally as well. This approach would certainly allow a richer array of assessments. However, within this model the assessments remain "large scale" in that all students must engage in the same tasks and all are scored using a common rubric. If there are to be a large number of such tasks, the time and logistics will become a nightmare. (An effort to accomplish this in Britain failed largely for these reasons.) If the state uses only a few tasks, the generalizability from the tasks to the broader knowledge domain will be weak. Again, the only reasonable solution is to place classroom assessment in the center.
So: the first danger in the Commission report is narrowing the prioritized standards to fit what can be measured by large-scale tests, when such tests cannot adequately assess the standards if the standards are any good. If there are to be state standards, they should be brief and essential, and they must be selected and prioritized without any concern for whether they can be assessed using large-scale assessments. Rather than assessment technology driving the standards – as is now the case in practice and would appear to be the case under the Commission's recommendations – the standards would be driven by decisions about what is most important to learn. (This will of course be a very difficult educational and political job, perhaps not feasible at this time at the state level without ensuring continuous educational warfare over the content of the standards.)
The second danger is that whatever standards may be chosen, if state exams are the primary means of evaluation, then in practice only the standards that can be measured with the state test will "count." The example of MCAS shows that this, too, leads to educational trivialization as teachers teach to the test and as "professional development" of teachers is tailored to preparing students for the tests.
The Commission does recognize that focusing on a few standards could lead to an undesirable narrowing of the curriculum. The solutions they suggest at Requirements 4 and 5 are helpful, but again not sufficient or adequate, as we will discuss below.

Requirement 2: A state's high-priority content standards must be clearly and thoroughly described so that the knowledge and skills students need to demonstrate competence are evident.

This is sound advice for any important standards used by schools, not only the ones prioritized by the state. Of course, if only standards amenable to large-scale exams are prioritized, then only skills and knowledge assessable with standardized tests will be prioritized, with the narrowing consequences discussed above. Not only should the descriptions be "educator friendly," they should be understandable by students, and plentiful and varied examples should be available of the kinds of work that indicate that students have met the learning goals embodied in the standards.
Requirement 3: The results of a state's assessment of high-priority content standards should be reported standard-by-standard for each student, school, and district.
If we really want to know how well students are doing, whether by standard or some other way, we will need more than the relatively few items that will be on a state exam, particularly if we want enough information to actually guide instruction. And that information has to be promptly available, which will not happen with a large-scale assessment, particularly one which includes open-ended questions that must be scored by people. Which is to say that even with several items per standard, which might produce statistically reliable results, the results will not be very useful instructionally because they are not helpful for formative assessment – neither sufficiently detailed nor sufficiently timely.
Further, good teachers already know what their students do and do not understand. For them, the test results will almost always be largely redundant. If the teachers do not know what their students have learned, they need professional development – but the test won't help much there, either. It could of course be one indicator of problems. But if the point of the test is simply to be an indicator of possible problems, then it should not carry the impossible expectations of directing instruction.
While the report seems to assume the instructional utility of large-scale assessment, it then acknowledges that, well, the test won't really be very helpful:
"This information is likely to be less than reliable, and may be a less than accurate measure of a student's true knowledge and skills. Teachers, especially, must bring additional sources of classroom-based information to their evaluation and intervention decisions for individual students."
This puts the issue upside down and backwards: it is not classroom-based information that is "additional"; for good teaching, it is the tests that are "additional" - and not often useful. The real question that educators (among others) need to address is how to improve classroom assessment in ways that improve curriculum and instruction and that can provide information for public reporting. The CARE proposal offers an initial answer to this question.
Requirement 4: A state must provide educators with optional classroom assessment procedures that can measure students' progress in attaining content standards not assessed by state tests.
Requirement 5: A state must monitor the breadth of the curriculum to ensure that instructional attention is given to all content standards and subject areas, including those that are not assessed by state tests.
We address these two together as they are presented in the report primarily as a means of assessing the standards that would not be assessed by the large-scale exam. As we explained earlier, many important standards cannot feasibly, if at all, be assessed with state tests. They can be assessed with classroom-based assessments.
It is significant that the Commission emphasizes the need for high quality classroom assessment. There is good reason for a state – or for many other possible entities – to develop banks of high quality assessments that teachers and schools can use. These can include tasks and projects, records such as the Learning Record or the Work Sampling System, portfolio procedures, exhibitions, and more. Teachers cannot reinvent every wheel in every school, though they should be regularly engaged in discussions about assessment as part of their staff work in the school.
FairTest also agrees that local, classroom-based assessments should be part of accountability. But here we face another danger: tailoring instructionally useful, formative assessments to the needs of state accountability programs is likely to undermine the usefulness of the formative assessments. For example, formative assessment necessarily must be particularized: if several students do not understand a science concept, it is not necessarily true that each has the same misunderstanding. The language in the Illustrative RFP presents the pre-test/post-test model, but that is far from the only useful type of formative assessment.
The assessment model presented by the Commission is essentially that of a few discrete events, far more summative in nature than formative. Good assessment often has characteristics more akin to flow. Not that there are no discrete events, but that the events are often small, slowly accruing, so that the "particles" often may seem more like a "wave." Of course, good "particles" are necessary, and it would be no small step forward to have rich banks of such "particles." However, the most critical factor is not the specific assessments that might be available, but that teachers know how to use many assessment methodologies, not only those made available by the state. This, too, is acknowledged by the Commission – yet the Commission privileges state-made assessments over strengthening the overall assessment capacity of the teachers.
The danger then is that a model derived from technical measurement, the large-scale assessment, becomes the model for classroom assessments, so that classroom assessments would end up with similar characteristics. For example, all writing in a classroom, or all portfolios, might be subject to a rubric for scoring in order to produce numbers to feed into a numerical state accountability program. Yet as Mabry (1999) eloquently pointed out, once strong importance is given to a rubric in writing, the danger is that the teachers teach the rubric, not writing or the child. Standardization supplants standards. By starting from state tests and accountability programs, not from teaching, the Commission runs the risk of undermining its larger goals of ensuring that assessment supports instruction.
It is reasonable for states to monitor breadth of curriculum, quality of instruction and of outcomes, and the overall health of our schools. The model that centralizes state tests makes this a more difficult process than it ought to be. The CARE model would solve this problem.
Requirement 6: A state must ensure that all students have the opportunity to demonstrate their achievement of state standards; consequently, it must provide well-designed assessments appropriate for a broad range of students, with accommodations and alternate methods of assessment available for students who need them.
Designing assessments with a full range of students in mind is a very good idea. In addition to students with special needs or disabilities and English Language Learners, the assessments need to respond to the variety of ways in which people learn and demonstrate their knowledge, and to the cultural variations that exist within our society (National Forum on Assessment, 1995). The Commission makes note of using "alternative" assessments such as portfolios with students with special needs. Again, this is backwards: locally-shaped portfolios – selections of student work - should be the heart of an assessment system, as it is studying actual student work that can best tell us what the students have learned and what the curriculum and expectations and teaching modes are within the school. Tests should be a supplemental source of information, not the central source.
On multicultural sensitivity, one story about the late California Learning Assessment System is instructive (Oakes; Epstein). In responding to some of the reading passages on the CLAS, low-income minority students performed unusually well. Those passages were strong stories that connected to the real, often difficult circumstances of these children's lives; but those same stories were often objectionable to white suburbanites. In standardizing what is to be read and responded to across cultural lines, there is simply no way to ensure equal fairness to all. Things that appear neutral, bland enough to be inoffensive, may in fact disadvantage students who find it hard to respond to such denatured material.
Requirement 8: A state must ensure that educators receive professional development focused on how to optimize children's learning based on the results of instructionally supportive assessments.
As discussed above, such professional development needs to ensure that teachers are good assessors "in the flow," not just that they can use ready-made state assessments. The classroom assessments described in the Commission report are inadequate as classroom assessments, so professional development geared toward them also would be inadequate.
Requirement 9: A state should secure evidence that supports the ongoing improvement of its state assessments to ensure those assessments are (a) appropriate for the accountability purposes for which they are used, (b) appropriate for determining whether students have attained state standards, (c) appropriate for enhancing instruction, and (d) not the cause of negative consequences.
As already discussed, state assessments can at best play a secondary role in enhancing instruction and determining whether students have met state standards (if the standards are any good). Assessments must be not only appropriate, they must be adequate and sufficient, which large-scale tests are not. Thus, large-scale assessments should play a secondary role in accountability, or they will ensure that accountability undermines the more important goals of instruction and will indeed be the cause of many negative consequences. While tests meeting the Commission's requirements may be a "new generation of state tests" they will retain the more significant, dangerous limitations of standardized tests.
Continuing study to determine the consequences of state tests is warranted. But there is no reason to conclude that state tests ever have promoted or ever will promote a rich education toward high quality standards for all children. Placed within a more powerful context of classroom-based assessment and data, they might play a useful backup role in a process of checking up on school-based assessment processes, information and uses. To ask them to do more than that is to ask bicycles to fly: it probably will not happen; and if it did, no one would want to be on the bicycle.
In conclusion: while the Commission report is useful for pointing out just how bad a job state tests do and for suggesting some ways in which state tests could improve, the Commission has made assumptions about large-scale tests that are unsupportable. If the recommendations were implemented, the resulting state tests still would suffer from fundamental limitations of standardized tests that render them unfit as tools for adequately supporting high quality instruction. If the tests remain central to accountability and the main model for local, classroom assessments, the results will be the continued narrowing of curriculum and instruction.
The CARE Plan for Authentic Accountability
We believe there is a better way, as exemplified in the CARE (1999) Call for an Authentic Statewide Assessment System. That call is appended, and here we will simply highlight the key elements of the CARE plan.
The CARE plan is based on four key points: 1) if you want to know how students are learning, look at the work they do and the teacher assignments; 2) teachers together reviewing student work and using that information to think about school improvement is essential for staff development and school improvement; 3) local schools know their students far better than the state possibly can; and 4) the state's job is not to make decisions about individuals but to ensure that schools are educating all children well and to provide the necessary resources to enable schools to do so.
The CARE plan builds on the state's Common Core of Learning, a very brief statement of essential learning goals for all children. CARE calls for expanding the Core to define "core competencies" that are leaner than the long, detailed, and complicated curriculum frameworks.
The key elements of the CARE proposal are:
1) Local authentic assessments. They will be based on the new "competencies" and a school's own goals. Each school will have an assessment and accountability plan, approved by the local school council, the state and the district, which explains how it will assess students, how it will make decisions such as graduation and grade promotion, how it will use information about student work to improve teaching, and how it will report accountability information to parents, students, teachers, the community and the state. Graduation will be decided by the school, not by the state.
2) Limited standardized testing, in literacy and numeracy only. These tests will not be used to make decisions about students but will be an additional source of data about schools and students.
3) Annual school reporting. Each school will report on the progress or lack of progress toward its goals and the state standards, and how it is using evaluation of teacher assignments and student work to improve the school. The report will be based on the local assessments and include standardized test results. Reports also will include outcomes by race and ethnicity, gender, low-income status, special needs, and limited English proficiency. The reports will include other information about the school, such as attendance, promotion, graduation and dropout data; survey results (such as school climate surveys); teacher qualifications; and adequacy of resources. The reports will be reviewed by the local school council, parents and other community members, the district, and the state. When needed, the state or district can send in teams to verify the accuracy of a school's report.
4) School Quality Reviews (SQR). Every 4-5 years, each school will do a detailed self-study. Then an expert team will conduct a several-day visit to the school, interviewing students, educators, and parents, sitting in on classes, looking at examples of student work, etc. The team will present a detailed report to help guide the school in its annual planning and reporting. The teams might be organized by the Dept. of Education or be developed by the regional accreditation association.
In this plan, much more information will be available than is provided by state testing programs. No one test will determine the fate of a student or a school. The plan builds in a process of continuous improvement. The state will have sufficient information to intervene in a school or district which has adequate resources but does not perform well and does not improve.
For this plan to function, strong classroom assessment is fundamental. Research has indicated that formative assessment can have very powerful effects on student learning (Black and Wiliam, 1998). To succeed, not only must strong professional development be available, but time must be reorganized to allow teachers to work collaboratively as part of their normal, paid work. Partnerships with parents and communities must also be strong.
If we are correct, the most important role for assessment is to support student learning (National Forum, 1995), and such assessment must be first and foremost a matter for teachers in classrooms. Systems for improving schools and reporting to the public should be based on this fundamental understanding, with other assessments - tests and school quality reviews – playing a secondary, supportive role.
Next Steps
We hope that leading education organizations, such as those who sponsored the Commission, would join with researchers and others, such as the members of the Commission, to recast the practice of assessment and accountability away from the centrality of large-scale, standardized tests and toward making classroom-based assessment truly central.
We welcome comments and encourage discussion on the Commission report and on the FairTest analysis. I can be reached at monty@fairtest.org or by phone at (857) 350-8207.
References
Black, Paul, and Dylan Wiliam. 1998. Inside the Black Box: Raising Standards Through Classroom Assessment, Phi Delta Kappan, Oct., p. 139; http://www.pdkintl.org/kappan/kbla9810.htm. (The authors' full study is available in Assessment in Education, Vol. 5, No. 1, http://www.carfax.co.uk/.)
Coalition for Authentic Reform in Education (CARE). 1999. Call for an Authentic Statewide Assessment System. Cambridge: FairTest. 
Commission on Instructionally Supportive Assessment. 2001. Building Tests To Support Instruction and Accountability: A Guide for Policymakers and "Illustrative Language for an RFP To Build Tests to Support Instruction and Accountability." American Association of School Administrators, National Association of Elementary School Principals, National Association of Secondary School Principals, National Education Association, National Middle School Association. I was able to access the main report at http://www.aasa.org/ but not the RFP, and the RFP but not the main report at www.nea.org. Try also www.nmsa.org; www.principals.org; or www.naesp.org.
Epstein, Kitty. Personal communication.
Kohn, Alfie. 1999. The Schools Our Children Deserve. Boston: Houghton Mifflin.
Mabry, Linda. 1999. Writing to the Rubric: Lingering Effects of Traditional Standardized Testing on Direct Writing Assessment, Phi Delta Kappan, May, p. 673; http://www.pdkintl.org/kappan/kmab9905.htm
National Forum on Assessment. 1995. Principles and Indicators for Student Assessment Systems. Cambridge, MA: FairTest. See esp. Principles 1, 2 and 3. Among the signers to this report are the NAESP, the NASSP, and the NEA. Staff from all three organizations participated actively in developing the Principles.
Neill, Monty. 1995. Some Prerequisites for the Establishment of Equitable, Inclusive, Multicultural Assessment Systems. In M.T. Nettles & A.L. Nettles, Equity and Excellence in Educational Testing and Assessment. Boston: Kluwer Academic.
Oakes, Jeannie. Personal communication.
Ohanian, Susan. 1999. One Size Fits Few. Portsmouth, NH: Heinemann.