ADFL Bulletin
22, no. 3 (Spring 1991): 33-38

Foreign Language Testing, Part 1: Its Breadth


John W. Oller, Jr.


IN THIS first installment of a two-part essay I advocate a broad view of foreign language testing and a comprehensive philosophy for meaningful, pragmatic tests. The next part describes the theoretical underpinnings of the approach suggested here. Part 1 looks to the breadth of foreign language testing while part 2 focuses on its depth. Together they express an outlook that has been developing for more than a quarter of a century—longer if we trace historical roots. Both parts of the essay are practical—pragmatic, in the common and classic sense of the term. They are concerned with practice and with the reasons for it. My approach is not uncontroversial and I do not aim for a middling consensus, still less an indiscriminate eclecticism.

A Road Less Traveled

While our path will sometimes coincide with popular thinking, the goal is a more radical coherence and comprehensiveness than are usual in applications of the language sciences. What languages are, how they are acquired, and how human beings communicate, I believe, outrank the more commonly discussed issues of what a language test is, how to standardize procedures, gauge the difficulty of material, assess curricular relevance, evaluate reliability, score the test, interpret the scores, and so on. While the topic here is foreign language testing, this is somewhat incidental. It is an accident of history that I have been asked by the ADFL Bulletin editors to address this particular topic, and what I have to say could as easily have been developed with respect to any one of a dozen other topics that will inevitably come up along the way. As soon as we begin to appreciate the richness of the topic at hand it will become apparent how intricately it is related to the whole question of foreign language education. The road traveled here, however, is not the common route.

Some practitioners and researchers who see phonic correspondences, word forms, syntactic patterns, relative clauses, pseudoclefts, lists of ways of apologizing, requesting, taking leave, refusing an invitation, saying thanks, not to mention those much feared dangling modifiers, comma splices, strings of prepositional phrases, and so forth as instructional ends in themselves may see no special need to integrate the theories comprehended in the phrase “language sciences.” Readers apt to appreciate the discussion will be those who correctly view such elements of surface form as subordinate to deeper questions concerning meaning—the originator's intentions, the consumer's comprehension, and the coherence of text or discourse.

Toward a Richer View of Testing

For most of us, historically, language testing has been linked with grading. The function of testing therefore sometimes reduces to the formula:

TEST -> GRADE

Often, as Jim Cummins points out (also see Hamayan and Damico), we see testing and grading in terms of a seductive medical analogy: the test is a diagnostic tool and the student is the patient. This metaphor—whether consciously or not—probably appeals to educators because it makes us out to be the doctors. From the students' view, however, tests are like scalpels. Nobody wants to feel the knife—nor is it a sufficient comfort to know that a correct diagnosis is a step toward a cure. However, when we look at the metaphor closely, it turns out to be misleading. Unlike a urine test for diabetes or a blood test for AIDS, the foreign language test (or any educational test) is as much an evaluation of the doctor (i.e., teacher) and the hospital (school system) as it is of the patient (student).

Educational testing, it seems, ought to aim for more than merely evaluating the student. It should verify the effectiveness of teaching as well. Beyond these purposes, testing is also a means of evaluating the curriculum itself. If large numbers of students are failing to achieve the defined goals of the curriculum, something is wrong beyond the student's desk. If nearly all students master the content of the curriculum as determined by fair, valid, comprehensive, and challenging tests, it follows that nearly everyone in the system can share the joy.

But this expansion of the role of testing will meet resistance from those who deny the possibility of valid testing. If they were merely consistent, they would also have to say that success in foreign language teaching either cannot be achieved or cannot be validly assessed. But they are wrong. Valid testing is to some degree a real possibility (for relevant research see Bachman; also Oller, Language Tests). However, to see how more or less valid testing is possible, we need a better idea of what language tests are.

Just for the sake of discussion, suppose we agree that a language test is any authentic use of language that teachers can reliably grade or evaluate. On the basis of this definition, prefabricated, standardized, published tests can hardly be recommended. Though there may be elements of some standardized tests that qualify for some uses, usually these will be outside the classroom. In the classroom, tests need to be adapted to specific activities, to the curriculum, and even to the strengths and weaknesses of particular teachers and students. Therefore, given our definition for what follows, a great variety of procedures will qualify as sources for more or less valid language tests. Moreover, these valid tests will be, in principle, limitless in number, so no teacher will ever suffer for the lack of a sufficient number of tests from which to choose.

In order of increasing difficulty, procedures qualifying by our definition might include: translation from the nonprimary language (L2) to the primary language (L1), imitation of L2 material (repeating elements of a text or discourse), copying a text or discourse, writing from a spoken model as in a traditional dictation, filling in blanks as in a cloze exercise, reading aloud, summarizing, retelling, paraphrasing, answering questions, conversing, taking part in drama, writing an essay, participating in an oral interview, translating from L1 to L2, and so on. All these procedures suggest limitless numbers of actual language tests geared to particular classroom activities and specific curricular objectives.

Tests based on coherent materials and meaningful activities assure us a certain degree of a priori validity. In fact, coherence always involves implicit evaluation. In the normal monitoring of text production or interpretation, the fundamental question is coherence. Am I making sense? This question holds equally for the producer and interpreter, regardless of whose production or interpretation is on the line. The originator asks, Does this bit of text or speech say what I mean? Will I be understood? And the interpreter asks, Is my understanding of what was heard, read, said, or thought correct? Did I understand the intention? The evaluative questions of the producer or interpreter always implicitly include a look at the text or discourse act itself. Does this particular use of language effectively achieve its purpose? Does the text express the intended meaning? Does the representation (or interpretation) coincide with the intended meaning? Any authentic use of language involves just this sort of implicit evaluation of meaningfulness. These questions are implicitly asked in every expressive or communicative act just to the extent that the act has any claim to being meaningful. Therefore, any meaningful use of language already is a kind of language test. At the same time, less meaningful uses of language are bound to result in less valid bases for language tests. The extent of their failure will be exactly proportionate to their lack of authenticity as uses of language. Therefore, meaning is the key to any sort of valid language testing.

When the case for meaning is put in this way, few, if any, language teachers will disagree. At least they will not disagree in principle. However, when it comes to practice, as surprising as it may seem, there are two kinds of approaches to teaching and testing: ones that are relatively full of meaning and ones that are relatively empty. The less meaningful approaches, it seems, are due to the alluring tendency to emphasize surface forms of language—for example, phonemes, morphemes, lexical items, syntactic structures, ways of apologizing and requesting—above the deeper and more important pragmatically motivated intentions and interpretations. Comprehension-oriented approaches, however, direct attention toward the meaningful intentions that are revealed in language use.

Surface-Oriented Tests

The word test tends to make most foreign language teachers think of the sort of discrete, surface-oriented test items they experienced as students. Or they may think of the similar tests they require their own students to take. To most foreign language teachers, a test consists of something with “items.” One expert editorial consultant who read an earlier draft of this essay, for instance, complained that it didn't tell teachers how to write “items” and that too few “items” were exemplified. The reason was twofold: first, I don't want teachers to “write items,” and second, the few “items” exemplified were intended exclusively as examples of what not to do.

However, to highlight what is not recommended, here are some commonly used item types. Though the examples here are in English, foreign language teachers can readily imagine appropriate parallel items in Spanish, French, German, or whatever language they may teach: “George_____to the store yesterday,” where the student must supply the correct past-tense form of a verb (e.g., went, drove, walked, or the like). Variations on this theme focus on some surface aspect of syntactic, morphological, or phonological form. Other common items include vocabulary lists (e.g., hand, watch, run) to be matched with a scrambled list in the target language (e.g., correr, mano, reloj). Variations require the student to choose synonyms or antonyms for given words in the target language, select the best of several definitions, provide L1 equivalents, and so on. A description of such surface-oriented tests could never be complete without the ever-popular minimal-pair contrasts. In this kind of item, the teacher (or tape) might say, “It's a ship,” and then ask students whether the word was ship or sheep. The target item will vary: “It's a pool” (not a pull); “It's a peck” (not a pick). The possibilities are limitless in principle and yet no one is able to say when the number of such items is sufficient. There are as many different kinds of surface-oriented test items as there are different ways of analyzing surface forms, and yet they all, in principle, examine distinct aspects of language proficiency. An interesting popular surface-oriented item is the blue-sky speech act where the student is told to do something like “Ask me a question using if” or, in slightly improved versions, “Tell your friend he is annoying you”; “Ask the bank clerk, who is about your age, to open the window”; “Apologize to your boss for being late to work”; “Ask permission of the school principal to go to the post office”; and so on ad infinitum.

For surface-oriented items of the foregoing types, there is no principled way of determining a stopping point. There is no way to say how many items are enough or even which kinds ought to be included in a given test. While a fairly high level of reliability may be achieved with a sufficient number of any type of discrete items we may choose, the fundamental validity questions remain unanswered: What types of items should be included? In what proportions? For these questions there are no principled answers. Someone might suggest looking to the curriculum, but this will not help at all. The curriculum is subject to the same questions. To look there is merely to go around in a circle. We need a principled basis for deciding on what kinds of tasks to teach and test. Here, in part 1 of my essay, I must beg the reader's indulgence. For now, let's just suppose that meaningfulness is the key and that a progression from simple to more complex, but always meaningful and richly organized, texts or discourses in the target language will assure us of the sort of validity we seek both in the curriculum per se and in the tests. In part 2 we will return to the deeper question of validity and try to justify the solution that is merely recommended here for consideration.

The surface-oriented items of language tests owe their existence to the structural linguistics of the 1930s through the 1950s. In the 1960s such items were transformed, so to speak, into a host of dazzling new syntactic possibilities. By the 1970s and 1980s, with the advent of the still-sought-for notional-functional syllabus—a hypothetical offspring of speech-act theory—“sociopragmatic” competence (see Harlow and references therein) came into view, and now in the 1990s we see test items aimed at hypothetical acts of apologizing, requesting, thanking, and so on but in such artificial contexts that none of the performers can be said to have any of the requisite motives.

All discrete-point items that are dropped from the blue sky into whatever contexts may obtain have a fatal flaw traceable in part to the narrowness of traditional approaches to linguistics—speech-act theory included. The discrete-point items tend to reduce the object of study to mere surface forms or categories of them. It follows from this thinking that language acquisition and knowledge are merely about ever so many independent, unrelated, separate, and singular bits of information. This proposition is false. Experience is anything but disconnected in the manner of discrete-item tests. Even analyzed bits of knowledge are, in real-life settings, always linked to time, place, motives, plans, and an extraordinarily rich context of relationships.

Indeed, as John Dewey (1859–1952) argued and as his mentor, Charles S. Peirce (1839–1914), had brilliantly demonstrated, every single bit of knowledge absolutely and inevitably implies in the most rigorous logical sense possible the whole continuum of experience. For this reason, the real, analyzable elements of experience and knowledge are never legitimately separated from the contexts in which they have been, are being, or will be observed, noticed, discovered, and so on. To pretend that they can be separated in the manner of discrete-point test items (and a lot of linguistic analysis) is just that, pretense. (For the detailed argument, see selected writings by Peirce, Dewey, Einstein, and others in Oller, Language and Experience.)

Discrete-item tests suggest a categorical orderliness and an episodic randomness that are exactly the reverse of normal experience. Actual, ongoing experience is always episodically structured: it is nonrandom and strictly conformed to—though perhaps not determined by—principles of logic, physics, physiology, and so on. The categories of experience almost never appear in numerical, alphabetical, or any other sort of neat and already analyzed order. Discrete-item tests disregard or overthrow everything we know of ordinary experience.

Meaning-Oriented Language Tests

Competing with the prevailing prestige of structural linguistics in the definition of language tests, there has been a long-standing concern for meaning. As a result, many language teachers—even during the heyday of structural linguistics—were never quite convinced that surface-oriented discrete-item tests could really do the job. They could see—and experience proved them correct—that their students might do well on the surface-oriented items and still be lousy communicators in the target language.

These teachers knew that communication skills come from successfully performing communicative tasks—ones that require comprehension. They understood intuitively, almost without theoretical guidance, that normal uses of language are meaningful. People speak, listen, read, write, think, compose, edit, exaggerate, describe, promise, extol, lie, pray, curse—all in order to achieve deeper, life-oriented goals. We talk to establish relationships, to get needs met, to meet the needs of those we love, to fight against things we hate, to complain about injustice, to praise excellence, to worship, to condemn, to stay alive, to escape pain. We do not, however, ordinarily use language just to reveal subtle phonological contrasts or to perform syntactic manipulations, to try out lexical items, to list ways to apologize, and so on.

Therefore, instructional procedures (including tests) that focus on forms (phonemes, lexical items, syntactic patterns, speech-act functions) without attention to actual connections with the world of experience are doomed to fall short of the mark. They fail to reflect the common purposes of language. They disregard the real world of experience. While no one can deny that linguistic analysis is one possible purpose for using language and that just acting silly may be another, the analysis of surface forms will generally rate as esoteric relative to the vast range of possible language uses. Foreign language testing, curricula, and so on ought to address the more common uses.

The Purpose of Tests

At a very low level, as noted above, foreign language tests may be viewed as the basis for determining grades in foreign language classes. For many, who see grading as a necessary evil, any test is apt to be seen as an evil as well. This elementary view may be summed up in the phrase “guilt by association.” It is capsulized in the title by John Upshur: “Test Is a Four-Letter Word.” A mature view of foreign language testing sees it as an integral part of the whole foreign language curriculum and its ongoing management. Tests, by this higher view, are as essential to instruction as accounting is to business. But, of course, language testing can be much more than a mere method of accounting. Language tests in the classroom should serve many purposes that are ultimately indistinguishable from instruction itself.

Within a practical, comprehensive philosophy of language instruction and testing, every test becomes a natural rung in the ladder toward the instructional goal—that is, toward some desired degree of proficiency in the target language—and every instructional activity in which students participate becomes a language-testing activity. In such a comprehensive theory, tests express the essence of the instructional process as well as, or perhaps better than, any other activity. In other words, teaching itself is a testing procedure as much as it is an instructional one.

Tests in any classroom setting have a variety of functions that we must understand before we choose and administer any test. Language tests in the classroom may serve instructional, managerial, motivational, diagnostic, and curricular purposes, among others.

Inappropriate language testing may result in, reflect, or even constitute ineffective language teaching. From such a pragmatic perspective, it might be argued that language testing is language instruction and, conversely, that language instruction is language testing.

Teaching or Studying the Tests

The very idea of studying any test itself—that is, of regarding a test as part of the curriculum per se—is frequently preached against in educational circles, but what can be wrong with studying or teaching the material that appears in any good test? The traditional objection is valid only if the testing itself is weak or invalid. To the extent that the testing is valid, it will fulfill all the purposes mentioned above—instructional, managerial, motivational, diagnostic, curricular—and no doubt others besides.

If surface-oriented tests are used, however, the traditional objection to “teaching to the tests” is valid. The reason is that it is possible to do well on surface-oriented tests and simultaneously to do poorly as a language student. Such weak and incomplete testing, based on inadequate theories and teaching methods, sometimes gives the illusion of progress where there hasn't been any. In those cases, teaching to the tests is undesirable for all the reasons that the tests themselves are inadequate. However, if the testing is valid, there is nothing wrong with teaching to the tests. Even studying the tests themselves will present no special dangers.

The Episodic Basis. Tests drawn from any episodically organized text of sufficient richness can help in structuring a whole curriculum. Of course, there is no reason that only one textual basis must be selected. Several might be used. But for the sake of argument, suppose we use a series of episodes such as those found in a full-length feature film, a novel, a soap opera, or a TV serial.

Such texts or discourses are episodically organized in two ways. For one, they are logically structured. The characters progress from some crisis near the beginning to a resolution near the end. Throughout the sequence, the causal series consists of transforming states of affairs that lead from somewhere to somewhere else. If the story is a good one, there is another aspect to its episodic character: it is motivated by a conflict that captures our interest and carries us along, sustaining interest throughout.

Suppose a reasonable selection of textual material can be made. How this can be done is a story already told in substantial detail elsewhere (see Oller and Richard-Amato; also Richard-Amato). At a minimum, we begin with some idea of what our students will need or want to do with the target language—to understand it in some range of contexts, to read and write it perhaps at a somewhat higher level of comprehension, and to converse with some defined level of proficiency.

Rough-Tune the Teaching and Testing. It is always possible to condense or expand, summarize or elaborate a text or discourse. Therefore, nearly any sufficiently rich narrative may be simplified so as to become intelligible to the rank beginner or complicated to become challenging to the near native. The task of making the adjustments may be difficult, but no one said curricular design or modification was necessarily easy; nor is language acquisition something that occurs overnight.

Bearing in mind the sorts of language uses our students will be expected to manage after they complete the course of study (our curriculum), we may begin to define tasks ranging from simple ones for beginners to more difficult ones for advanced students. From the beginning we spiral outward from a small nucleus that expands over time. This approach is the one recommended by competent theorists at least from the time of Rousseau. For example, for beginners, we might select (or create) a portion of discourse that involves a simple exchange of greetings. The students' first task might be to realize that the interlocutors are greeting each other. Next they might try to catch the names of the participants, to guess their relationship: are they friends, lovers, family, acquaintances, or what? From the initial understanding, we may lead them to ever broader and also deeper understandings of the text or discourse and eventually of the target language as a whole. Normal language acquisition, however, always begins with a particular, real context that is rich in pragmatic possibilities and constraints. It does not begin in the blue sky.

Establish the Facts. For the language student, the first task in following a story is to know who and what it is about. This is not as easy as it sounds. If the story is written, the teacher may need to dramatize or visually illustrate the facts to establish who's who and what's going on. Presenting the story in an audiovisual format—say, on film—simplifies the teacher's task. The students will get some of the basic facts just by watching the film—for instance, the exchange of greetings will probably be transparent to them just by seeing it. For the beginning language student, establishing the facts is the necessary first step in the pragmatic mapping process (Oller, Language and Experience). Without it, target-language acquisition will have no adequate ground on which to develop.

A point that many theorists neglect altogether is that there must be some facts. It is impossible to fix the meaning of any text if there are no facts or if they are unknown or nonexistent. If there is to be any episodic organization, there must be facts that can be agreed upon—either Jack is a boy or he isn't, but let's not pretend that it can be any which way. Either Jack and Joan did embrace or they did not. It is not good to pretend, as many materials writers and testers seem to think, that we can change the facts however we like as we go along. When this is done, no one has any confidence about what is meant no matter what may be said. And if we don't know with any certainty what we are talking about, how will our students find out what we mean? Our attitudes toward the facts may change and our interpretations of them may change, but the facts must be fixed if we are to have any hope of comprehending them. Some wag, I expect, will suppose that this means the facts must be static. Of course, it doesn't mean that. The facts may be as complex and dynamic as life, death, and hell itself. They just have to be determinable to some degree. This is what is meant by “fixing the facts.”

Teachers can test comprehension of the established facts by asking questions: perhaps the simplest form is the yes-no format. “Is the woman named Joan?” “Her friend's name is Sam, right?” Wrong, it's Jack. Or even easier ways of putting these test questions can be conceived. For instance, we may point to a picture of Joan and say “Joan?” to which a satisfactory response might be a nod of the head.

Spiraling Outward and Upward. From the simplest yes-no questions about the facts, we may progress to questions that require students to supply more information. We may use elicited imitation and require them to say what the characters in the story say in the exchange of greetings. Each of these steps may be viewed either as a test or as a teaching strategy. Each step tests comprehension and provides the basis for higher and better comprehension as we proceed. At a higher remove, we might ask a couple of students to dramatize the exchange of greetings or to greet one another. Or, skipping over some of the intermediate steps, we might present them with a written version of the same dialogue that they have now learned to comprehend and to utter, and we might ask them to read it aloud. Or we might ask them to copy what they have read or to write from dictation what they have studied in its written form. Or we might have them fill in blanks (cloze format) in the dialogue.
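The cloze step mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name make_cloze, the fixed deletion interval, and the sample dialogue are all hypothetical, and a teacher would in practice choose which words to delete according to the focus of the lesson rather than by a mechanical count.

```python
def make_cloze(text, every=5, start=2):
    """Blank out every `every`-th word of `text`, beginning at word index `start`.

    Returns the gapped text and the list of deleted words (the answer key).
    Assumption: words are separated by whitespace, so punctuation attached
    to a deleted word is deleted along with it.
    """
    words = text.split()
    answers = []
    for i in range(start, len(words), every):
        answers.append(words[i])   # keep the original word for scoring
        words[i] = "_____"         # replace it with a blank
    return " ".join(words), answers


if __name__ == "__main__":
    # Hypothetical greeting dialogue of the kind students have already studied.
    dialogue = "Hello Jack how are you today I am fine thank you Joan"
    gapped, key = make_cloze(dialogue, every=4, start=1)
    print(gapped)
    print(key)
```

Because the passage is one the students have already heard, read, and dramatized, each blank asks them to recover a known fact of the story and not an isolated surface form.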

Proceeding a bit farther down the road, we might modify the dialogue and convert it to indirect narrative form: “Joan says hello to Jack. Jack also says hello to Joan. He hugs her.” This indirect form expands upon the direct greetings and by doing so offers additional structural complexity while remaining within the reach of the student's limited range of comprehension (near i + 1, to use a term from Krashen—i.e., the learner's next natural step or what Vygotsky calls the “zone of proximal development”). From there it is a relatively easy step to a higher level of complexity in the greater narrative itself and also to a wider range of language testing and teaching activities. For example, we can proceed with reading and writing activities based on the indirect quotation of dialogue forms or, later (again skipping over some of the intermediate steps), we may go on to full-fledged narrative: “Jack and Joan are lovers who meet unexpectedly on a dark street at night. Each shows surprise at seeing the other. They greet each other and embrace. …”

Each step in the instructional process assesses progress to that point and ensures further movement toward the ultimate goal—whether it is near-native competence or some lesser objective. By employing a rich and varied range of activities (tests and teaching activities), each of which in its turn is based on a rich textual source, the teacher may be confident of consistent progress toward native proficiency. A well-rounded range of activities should probably include listening, speaking, reading, and writing, and each of these should always, I believe, be based on facts contained in some established narrative, activity, or experience. The teaching and testing activities—all based on the established facts at hand—might well include yes-no questions, content questions, elicited imitation, reading, copying, writing from dictation, cloze tasks with various focal points (e.g., elements of surface grammar, lexical items, idioms, quirks of convention, and cultural differences), oral and written question-answer exercises, dramatization, narration (including summarization and expansion), improvisation (at more advanced stages), essay writing, and so forth.

Given a richly organized episodic basis and a variety of testing procedures reflecting the wide-ranging uses of language, teachers can bring language testing into the heart and soul of language teaching. Testing can become such an integral part of the instructional process that teaching and testing will no longer be easily distinguishable. Perhaps the essential insight of a quarter of a century of language testing (both research and practice) is that good teaching and good testing are, or ought to be, nearly indistinguishable.


The author is Professor of Linguistics at the University of New Mexico. This paper was prepared by the author in conjunction with a course he taught at the 1989 Summer Linguistic Institute cosponsored by the Modern Language Association and the Linguistic Society of America at the University of Arizona.


Works Cited


Bachman, Lyle F. Fundamental Considerations in Language Testing. New York: Oxford UP, 1990.

Cummins, Jim. Bilingualism and Special Education: Issues in Assessment and Pedagogy. Clevedon, Eng.: Multilingual Matters, 1984.

Hamayan, Else, and Jack S. Damico, eds. Limiting Bias in Bilingual Assessment. Houston: Pro-Ed, 1991.

Harlow, Linda L. “Do They Mean What They Say? Sociopragmatic Competence and Second Language Learners.” Modern Language Journal 74 (1990): 328–51.

Krashen, Stephen D. The Input Hypothesis: Issues and Implications. London: Longman, 1985.

Oller, John W., Jr., ed. Language and Experience: Classic Pragmatism. Lanham: UP of America, 1989.

———. Language Tests at School: A Pragmatic Approach. London: Longman, 1979.

Oller, John W., Jr., and Patricia Richard-Amato, eds. Methods That Work: A Smorgasbord of Ideas for Language Teachers. New York: Newbury, 1983.

Richard-Amato, Patricia. Making It Happen: Interaction in the Second Language Classroom. New York: Longman, 1988.

Upshur, John A. “Test Is a Four-Letter Word.” Meeting of the EPDA Inst., Univ. of Illinois, Urbana, 1969.


© 1991 by the Association of Departments of Foreign Languages. All Rights Reserved.
