
Using a Computer in Foreign Language Pronunciation Training: What Advantages?

 


Maxine Eskenazi
Carnegie Mellon University
Abstract:
This paper looks at how speech-interactive CALL can help the classroom teacher carry out recommendations from immersion-based approaches to language instruction. Emerging methods for pronunciation tutoring are demonstrated from Carnegie Mellon University's FLUENCY project, addressing not only phone articulation but also speech prosody, responsible for the intonation and rhythm of utterances. New techniques are suggested for eliciting freely constructed yet specifically targeted utterances in speech-interactive CALL. In addition, pilot experiments are reported that demonstrate new methods for detecting and correcting errors by mining the speech signal for information about learners' deviations from native speakers' pronunciation.
INTRODUCTION
The ever-growing speed and memory of commercially available computers, coupled with decreasing prices, are making speech-interactive computer-assisted language learning (CALL) feasible. Even though the hardware conditions for an ideal automatic training system exist, can the same be said of state-of-the-art automatic speech recognition (ASR) and of our knowledge of the variability of the speech signal, the main stumbling block to higher quality speech recognition? Has the technology come far enough for systems to be able to teach pronunciation effectively? To answer these questions, we will first specify what is believed to contribute to successful language learning under a direct approach, drawing largely from principles described by Celce Murcia and Goodwin (1991). We will then list pedagogical recommendations following from this approach, such as providing language samples from many different speakers. Next, we will look at what speech-interactive CALL can do to help the classroom teacher carry out these recommendations. We illustrate with emerging methods for pronunciation tutoring from the FLUENCY project at Carnegie Mellon University (CMU) (Eskenazi, 1996), methods that support both articulation of phonemes and use of prosody—the intonation and rhythm of speech. The emphasis here is on pronunciation in the context of overall language learning. Proficient pronunciation is essential because below a certain threshold of pronunciation accuracy, communication cannot take place even when grammar and vocabulary have been mastered.
WHAT CONTRIBUTES TO SUCCESS IN TARGET LANGUAGE PRONUNCIATION?
Conditions for Success and Pedagogical Recommendations Based on Immersion
Many foreign language instructors agree (Celce Murcia & Goodwin, 1991) that living in a country where the target language is spoken is the best way to become fluent—a total immersion situation. They also generally agree (Kenworthy, 1987; Laroy, 1995; Richards & Rodgers, 1986) on which conditions of living abroad are critical to effective language learning:
• Learners hear large quantities of speech.
• Learners hear many different native speakers.
• Learners produce large quantities of utterances on their own.
• Learners receive pertinent feedback.
• The context in which the language is practiced has significance.
These conditions cover the external environment of language learning. From each we can extract recommendations for how to learn language under less than total immersion conditions.1 These recommendations cannot always be carried out in classroom contexts, thus presenting opportunity and motivation for ASR technology to complement teaching.
Recommendation 1
Learners hear large quantities of speech. For language learners who are not living in the country of the target language, immersion
courses consisting of six to eight hours daily are often the best alternative for exposing learners to the language. An ideal ratio of one student-one teacher would provide maximum speaking and feedback time. This situation is not always feasible. On the one hand, most students have other daily activities and, on the other, employing human teachers for eight hours a day is expensive (Bernstein, 1994). Moreover, immersion classes usually have five to ten students, and attending to individual needs reduces the amount of time the teacher speaks to the class.
Recommendation 2
Learners hear many different native speakers. This recommendation implies employing many native teachers with a diversity of voice types and dialects. However, the variety of native speakers available locally is limited, as is the number of people that a school can afford to hire. Traditional educational materials that promote wider exposure, such as audio and video cassettes, tend to be non-interactive, and their audio quality can degrade over time.
Recommendation 3
Learners produce large quantities of utterances on their own. Ideally, the student is in a one-on-one setting where the teacher encourages short conversations, constantly eliciting the student's speech. In reality, students in the classroom share the teacher's attention. The amount of time they spend individually producing speech and participating in conversation is thus reduced.
Recommendation 4

Learners receive pertinent feedback. In immersion contexts feedback that leads to correction of form or content may occur in two ways. Implicit feedback comes when speaker and listener realize that the message did not get across. A clarification dialogue usually takes place ("I beg your pardon?" "What did you say?"), ending with a corrected message that is understood. Less often, when culture and interpersonal context permit, the listener offers explicit correction, such as pointing out the error or repeating what the speaker said but with correction. In the ideal classroom, teachers offer implicit and explicit feedback at just the right times, keeping a balance between not intervening too often, to avoid discouraging the student, and intervening often enough to keep an error from becoming a hard-to-break habit. Expert teachers adapt the pace of correction—how often they intervene—to fit the student's personality. In reality, however, not all teachers use the same techniques and, in the classroom, are not always able to adapt these techniques to individuals. When class size increases, the amount of feedback to the individual student decreases.
Recommendation 5
The context in which the language is practiced has significance. Living in the country where the target language is spoken gives learners the practical need to speak. Their utterances have immediate significance. To accommodate this recommendation, the ideal language classroom includes fast-paced games and everyday conversations that create meaningful contexts (Bowen, 1975; Brumfit, 1984; Crookall & Carpenter, 1990). The student has to respond rapidly and utter new terms in these contexts. In reality, classroom size again reduces the individual learner's time for participating in such activities.
Conditions for Success and Pedagogical Recommendations Based on Structured Intervention
There are two additional conditions that appear critical for learning pronunciation but that do not follow from immersion—indeed, they follow from an assumption of structured intervention that departs from pure immersion: 1) Learners feel at ease in the language learning situation. Whereas the very young language learner perceives and tries out new sounds easily, older learners lose this ability. Embarrassment or fear may inhibit the learner from trying new sounds or even from speaking, whether in a total immersion or a classroom environment (Laroy, 1995). 2) There is ongoing assessment of learners' progress. Language learning appears most efficient when the teacher constantly monitors progress to guide appropriate remediation or advancement.
These conditions lead to pedagogical recommendations that may be particularly hard to carry out in the classroom.
Recommendation 6
Learners feel at ease. A key dimension of the learner's "internal" environment is self-confidence and motivation. Although there are techniques to boost student confidence in the classroom (Laroy, 1995; Krashen, 1982)—such as correcting only when necessary, reinforcing good pronunciation, and avoiding negative feedback—these may not overcome learners' inhibitions. Laroy (1995) finds that when students are asked in front of peers to make sounds that do not exist in their native language, these students tend to feel ill at ease. As a result, they may stop trying completely or may only make sounds from their native language. One-on-one teaching is important at this point, allowing students to "perform" in front of the teacher alone, not in front of a whole class, until they are comfortable with the newly learned sounds.
Recommendation 7
There is ongoing assessment. To adapt training to individual needs, the teacher ideally monitors each student's moment-by-moment progress, assessing strong and weak points, and judges where to focus effort next. The effective teacher takes into account what the student feels is useful, thus keeping students involved in their own progress (Celce Murcia & Goodwin, 1991; Laroy, 1995). In reality, classroom teachers cannot maintain steady monitoring of each student at this level of detail.
WHERE CAN SPEECH-INTERACTIVE CALL MAKE A CONTRIBUTION?
It is not feasible to carry out these seven recommendations fully in the traditional language classroom, given constraints on teaching time and materials. The ideal CALL system could help toward realizing these recommendations by providing individualized practice and feedback in a safe environment and sending back regular progress reports to the teacher (Wyatt, 1988). The human teacher must still do the high-level, subtle work of creating a positive atmosphere for the production of new sounds and stress patterns, explaining fine conceptual differences between a student's native language and the target language, and exploring cultural differences (Bernstein, 1994).
For each of our recommendations we will consider where automatic functions, in the form of both ASR and CD-ROM, can support the classroom. We draw examples from the FLUENCY project and from other systems featured in this volume.
CALL Can Help Learners Hear Large Quantities of Speech
With the decreasing cost and increasing capacity of computer memory and storage, CALL can offer users a choice of many prerecorded utterances. CD-ROMs afford high-quality sound and video clips of speakers, giving learners a chance to see articulatory movements used in producing new sounds (e.g., LaRocca, 1994). The teacher no longer has to find or record native speakers, although tools can be provided for teachers to add new speakers to the data set. The highly available digitized speech supplements the teacher's speech without incurring additional cost at each use. It also allows individualized access to particular samples of speech.
ASR-Based CALL Can Help Learners Produce Large Quantities of Utterances on Their Own
Limitations of Traditional ASR-based CALL
A major problem in speech-interactive CALL, in commercial products especially, is that learners remain relatively passive (Wachowicz & Scott, this issue). Although learners may be asked to voice an answer to a question, this by design involves either parroting an utterance just presented or reading one of a small set of written choices (Bernstein, 1994; Bernstein & Franco, 1995). Learners get no practice in constructing their own utterances (i.e., choosing vocabulary and assembling syntax). The commercially available AuraLang package (Auralog, 1995), for example, is an appealing language teaching system that feeds to ASR the user's pronunciation of one of three written sentences. Each choice leads the dialogue along a different path. A certain degree of realism is attained, but students do not actively construct utterances.
Techniques for Overcoming These Limitations: Sentence Elicitation
The FLUENCY project has developed a technique that enables users of speech-interactive CALL to participate more actively in constructing utterances (Eskenazi, 1996). In traditional speech-interactive CALL, ASR works well because the system "knows" what a speaker will say: it matches exemplars of the phones it expects (pre-stored in memory) against the incoming signal (what was actually said). The technique developed in FLUENCY, by contrast, makes it possible to predict enough of what the speaker will say to satisfy the needs of the recognizer while giving speakers apparent freedom to construct utterances on their own. The technique is based on sentence elicitation, modeled on the drills used in the once prevalent Audio-Lingual Method (Modern Language Materials, 1964) and the British Broadcasting Company tutorials (Allen, 1968).
Several studies have addressed whether specifically targeted speech data can be collected using sentence elicitation (Hansen, Novick, & Sutton, 1996; Isard & Eskenazi, 1991; Pean, Williams, & Eskenazi, 1993). Results confirm that a given prompt sentence in a carefully constructed exercise elicits only one to three distinct response sentences from normal speakers. Students can practice constructing answers to the same elicitation sentences as often as they wish, at no additional cost in teacher time and materials. Availability and patience are other qualities that enable the system to support our recommendation of having learners produce large quantities of utterances on their own.
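A sketch of how such constrained matching might work: because a well-constructed prompt elicits only one to three responses, the recognizer's hypothesis need only be matched against that small inventory. The prompt, the response inventory, the word-overlap score, and the 0.6 acceptance threshold below are all illustrative assumptions, not details of the FLUENCY system.

```python
# Illustrative sketch (not FLUENCY's implementation): match a recognizer
# hypothesis against the small set of responses an elicitation prompt
# is known to produce. The inventory and threshold are invented.

EXPECTED_RESPONSES = {
    "Did you want the red book?": [
        "no i wanted the blue book",
        "i wanted the blue book",
        "no the blue one",
    ],
}

def match_response(prompt, hypothesis):
    """Return the expected response closest to the hypothesis, or None
    if nothing is similar enough to accept."""
    hyp_words = set(hypothesis.lower().split())
    best, best_score = None, 0.0
    for cand in EXPECTED_RESPONSES.get(prompt, []):
        cand_words = set(cand.split())
        # Simple word-overlap (Jaccard) score; a real system would
        # constrain the recognizer's grammar or language model instead.
        score = len(hyp_words & cand_words) / len(hyp_words | cand_words)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= 0.6 else None
```

Because the inventory stays small, even a crude similarity measure suffices; in practice the expected sentences would be compiled directly into the recognizer's grammar.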
ASR-Based CALL Can Provide Learners With Pertinent Corrective Feedback
Teachers often ask what type of corrective feedback speech recognition can furnish. This section will address two aspects of the question: whether and what types of errors can be detected successfully, and what methods are effective in telling students about errors and showing them how to make corrections.
Can Errors Be Detected? Phone Errors Versus Prosody Errors
Error detection procedures differ as follows. Phone-based errors are identified in forced alignment mode. Given an expected utterance, the recognizer takes the actual utterance and returns the placement in time of phones and words on the speech signal. By this method the learner's recognition scores can be compared to the mean recognition scores for native speakers—all uttering the same sentence in the same speaking style—and the learner's errors can thereby be identified and located (Bernstein & Franco, 1995). For prosodic errors, however, only duration can be obtained from the output of the recognizer. That is, when the recognizer returns the phones and their scores, it can also return the duration of the phones. Frequency and intensity, on the other hand, are measured on the speech signal before it is sent to the recognizer but after it is preprocessed. Intensity is usually obtained by using a technique known as cepstral analysis. Fundamental frequency is obtained from an algorithm that detects peaks in the signal and measures the distance between them. Speakers as individuals vary greatly on the three components of prosody. For example, some people speak louder or faster in general than do others. Thus, it is important that measures of the three be expressed in relative terms, such as the duration of one syllable compared to the next.
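The phone-level procedure described above (comparing a learner's recognition scores against native means for the same sentence) can be sketched as follows. The score values, the z-score formulation, and the two-standard-deviation threshold are assumptions for illustration, not the method of Bernstein and Franco (1995).

```python
# Illustrative sketch: flag phones whose alignment score falls well below
# the range observed for native speakers of the same sentence. The scores,
# z-score formulation, and threshold are invented for illustration.
from statistics import mean, stdev

def flag_phone_errors(learner_scores, native_scores, threshold=2.0):
    """learner_scores: {phone: score} from forced alignment of one utterance.
    native_scores: {phone: [scores from several native speakers]}.
    Returns the phones scoring more than `threshold` standard deviations
    below the native mean."""
    flagged = []
    for phone, score in learner_scores.items():
        mu, sigma = mean(native_scores[phone]), stdev(native_scores[phone])
        if sigma > 0 and (mu - score) / sigma > threshold:
            flagged.append(phone)
    return flagged

native = {"IY": [0.90, 0.85, 0.88, 0.92], "T": [0.80, 0.82, 0.78, 0.80]}
learner = {"IY": 0.50, "T": 0.79}
```

On these invented scores the learner's /IY/ is flagged while /T/, which sits inside the native range, is not.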
Phone Error Detection: A Pilot Study of ASR-based Comparisons of Native and Nonnative Speakers
Although researchers have been cautious about using ASR to pinpoint phone errors, recent work in the FLUENCY project shows that the recognizer can be used in this task if the context is well chosen (Eskenazi, 1996). Demonstrating this is a pilot study of native and nonnative speakers uttering responses in elicitation exercises.
Method
Ten native speakers of American English were recorded (5 male and 5 female) and 20 speakers of other languages (one male and one female from each of the following L1s: French, German, Hebrew, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, Spanish).2 Expert language teachers were asked to listen to the sentences recorded by each speaker and to judge where there was an error, what it was, and how (and when) they would intervene to correct it. Teachers marked these judgments on phonemically labeled copies of the target sentences. The agreement between human teachers and ASR detection was used as a preliminary indication of the validity of automatic error detection.

Prosody has also been addressed in speech-interactive CALL. The SPELL foreign language teaching system (Rooney, Hiller, Laver, & Jack, 1992) addresses both fundamental frequency, or pitch, and duration. Pitch detection, like speech recognition, is by no means a perfected technique. But Bagshaw, Hiller, & Jack's (1993) work on better pitch detectors for SPELL shows that algorithms can be made more precise within a specific application. This work compared the student's pitch contours to those of native speakers to demonstrate the informativeness of pitch detection. Pitch detection was incorporated into SPELL and the output interpreted in visual and auditory feedback for the student. SPELL assumes that suprasegmental (prosodic) aspects of speech should be tied to segmental (phonemic) information—for example, by showing pitch trajectories (contours over segments) and pitch anchor points (centers of stressed vowels). SPELL also addresses speech rhythm, showing segmental duration and acoustic features of vowel quality (predicting strong vs. weak vowels).
Tajima, Port, and Dalby (1994) and Tajima, Dalby, and Port (1996) have addressed duration. They studied how timing changes in speech affect the intelligibility of nonnative speakers and created remedial training supported by ASR. By using speech that is practically devoid of segmental content (ma ma ma …), they separate the segmental and suprasegmental aspects of the speech signal to focus on one aspect—temporal pattern training.
The FLUENCY project has looked at how to detect changes in duration, pitch, and intensity to find where a nonnative speaker deviates from acceptable native values. Prosody training in FLUENCY is linked to segmental aspects, with students producing meaningful phones. We aim to detect deviations independently of L1 and L2 so that if a learning system is ported to a new target language, its prosody detection does not have to be changed fundamentally. We have promising results from a pilot study, reported below, using hand-labeled features of the spectrogram.
Prosody Error Detection: A Pilot Study of ASR-based Comparisons of Native and Nonnative Speakers
Method
For the English sentence data recorded in the pilot study on phones, we additionally asked human teachers to mark the location and type of prosodic errors of each speaker on transcriptions of the sentences. We first examined the speech signal to determine whether the information used by teachers to detect errors could be characterized in the spectrogram. After examining phone-, syllable-, and word-sized segments, we developed three measures, one for each component of prosody. We compared these with human teachers' judgments of places where prosody needed improvement in each sentence and refined the measures until they showed close agreement with human judgments. These measures then define the features we want to extract automatically from the speech signal to diagnose where students need improvement.
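The relative formulation of the three measures can be sketched as follows, assuming each vocalic segment has already been labeled with a duration, a pitch-peak count, and a mean intensity. The segment labels and values below are invented examples, not data from the pilot study.

```python
# Sketch of the segment-to-segment ratios described above; the labels and
# values are invented examples, not measurements from the pilot study.

def neighbor_ratios(segments, key):
    """Ratio of each segment's value to the preceding segment's value,
    making the measure independent of a speaker's overall rate, pitch
    range, or loudness."""
    values = [seg[key] for seg in segments]
    return [round(b / a, 2) for a, b in zip(values, values[1:])]

utterance = [
    {"label": "EHK",  "duration": 0.18, "pitch_peaks": 36, "intensity": 62.0},
    {"label": "STRA", "duration": 0.12, "pitch_peaks": 20, "intensity": 58.0},
]
```

A learner's ratios could then be compared against the spread of ratios from native speakers, just as with the phone scores.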
Duration Results
The first measure was duration of the speech signal, measured on the waveform. The results of the duration comparisons are given in Figure 2. The duration of one voiced segment was compared to the duration of the preceding one ("ratio of seg1/seg2" on the vertical axis) to make the observations independent of individual variations in speaking rate.
Pitch Results
The second measure we developed was the total number of pitch peaks present in the speech signal, calculated for each segment.4 Again, results were compared between neighboring segments. We were able to detect pitch deviations related to duration as well as independent of it. For example, mfrc raised pitch much higher on /EHK/ in "extra" than on the following vocalic segment /STRA/, probably because /EHK/ is also longer (see Figure 2). However, the speaker mpeg varied pitch independently of duration.
Intensity Results
The third measure developed, for intensity, was the average of all the cepstral values over a given vocalic segment. To address relative rather than absolute intensity, we compared these values segment-to-segment with those of neighboring vocalic segments, as with duration and pitch. The resulting curves and spread of speaker space, shown in Figure 3, differ in general aspect from the results in Figures 1 and 2. Outliers were indicated that matched teachers' judgments about relative stress centers in utterances. For example, msjh shows stress displaced within the "I/did/want" region, mbob displaced stress within "did/want/to," and msjh, among others, within the "ex/tra/in" region. The speakers' changes in amplitude appeared to be independent of duration and pitch.
Two-by-Two Comparison of Average Intensity (Amplitude) on Voiced Segments
Implications
Our pilot study suggests that the spectrogram can be mined for measures of speech prosody that have diagnostic value and are consistent with what expert teachers say they would detect and correct. We are now rendering these measures automatically detectable. Being separate from one another, the three measures of prosody, once analyzed in an utterance, could be expressed in visual displays for the learner that show pitch, duration, or amplitude. A learner's utterance could then be compared with a native speaker's utterance on each dimension to illustrate differences. Our results suggest that the components of prosody are not totally independent of each other. We saw this particularly in the dependency of pitch on duration. We suggest that correction first address the three components separately, then address their combined effect. Instruction could begin by exercising pitch and duration changes independently, then give practice on changing pitch and duration together.
An Argument for Early Prosody Instruction
Early prosody instruction, starting the first year of language study, could be a boon to learning both syntax and phone articulation. Because speakers prepare the syntax of a sentence they want to say at about the same point as they prepare prosody, incorrect word order will not fit the "song" that it is to be sung to. Self-correction then comes into play as students rearrange syntax to give a better fit to prosody. (Because the "song" is considered as a whole and the syntax as a concatenation of elements, the student should tend to rearrange syntax and not prosody.) Phones may benefit from early prosody training, for example, in the case of stressed and unstressed vowels in English. If a target vowel is unstressed and the Spanish speaker uses a tense (stressed) vowel that is close to the target in articulatory space, self-correction should follow because the speaker's longer tense vowel will not "fit the song" well. For example, the unstressed "this" in the sentence "I want this present" is shorter and softer than the surrounding vowels. Practice of correct prosody in this sentence should aid pronunciation of "this" by lessening emphasis on and shortening the /IH/ sound. Follow-up exercises could put "this" into new contexts, such as "This is yours," where the word is not so short and the speaker must make more effort to retain the shortened form just learned.
Effective Correction in Speech-interactive CALL
Learners' difficulties with phones and prosody, which our pilot studies suggest can be readily detected in the speech waveform, become targets for focused correction in CALL. The system that only detects pronunciation errors (e.g., parts of TriplePlayPlus by Syracuse Language Systems, 1994) is of limited aid. Learners will make random, trial-and-error attempts to correct the reported error. There may be little true amelioration and even negative effects if learners make a series of poor attempts at a sound. Such unsupervised repetitions could reinforce poor pronunciation to the point of becoming a hard-to-correct habit (Morley, 1994).
Effective correction requires that recognizer results be interpreted, as by putting them into a visually comprehensible form and comparing them to native speech. Our work in FLUENCY suggests that how recognizer results are best interpreted for instruction differs between phone correction and prosody correction. This suggestion stems from the fact that phones are different from one language to another while prosody is produced in the same way across languages. Whereas students must be guided as to tongue and teeth placement for a new phone, they don't need instruction on how to increase pitch if they have normal hearing: They only need to be shown when to increase and decrease it, and by how much.
Correcting Phone Errors
There has been some success in using minimal pairs—contrasting sounds in context in the target language, such as "I want a beet"/"I want a bit" (see Dalby & Kewley-Port, this issue; Wachowicz & Scott, this issue). Effective teachers often go further, with instructions on how to change articulator position and duration. This kind of instruction is important because if a sound does not already belong to a learner's phonetic repertory, the learner will associate it with a close speech sound that is in the repertory. For example, anglophones beginning to speak French typically hear and pronounce the French sound /y/ (in tu) as the English sound /u/ (in "too"); but they can be taught to use liprounding to approximate French /y/.
Automatic systems can teach articulator placement for new sounds, adding graphical views, for example, of the inside of the mouth (LaRocca, 1994). This instruction can be likened to gymnastics; the learner "feels" when the articulators are correctly in place and practices with the recognizer to confirm this. Learners can train their ears to recognize the new sounds and relate them to what they feel their muscles doing. Akahane-Yamada et al. (1996) suggest that learning to perceive sound distinctions helps in their production.
Phone articulation training can be L1-independent. A target vowel, for example, can be taught by starting with a close cardinal vowel (e.g., /a/, /i/, and /u/ have a high probability of existing in most L1s). A better solution, requiring more computer memory and linguistic knowledge, is to start with a close vowel in the learner's particular L1. Taking into account the learner's L1 can help anticipate errors and point to pertinent articulatory hints (Kenworthy, 1987). Thus, knowing that French has no lax vowels lets teachers of English to French speakers focus on how to go from a tense vowel to a close lax vowel ("peat" to "pit").
Correcting Prosody Errors
Based on work in FLUENCY, we propose that the visual display, more than oral instructions, will be critical to prosody correction. The key is for learners to see where the curve representing their production differs from the native speaker's curve. Prosody displays can benefit from the wealth of work on automated systems that teach the deaf to speak. For example, Video Voice (Micro Video, 1989) uses histograms to represent intensity (over time) and xy curves for pitch (over time). Duration is implicit in the time axis of the intensity histogram. Video Voice compares what the student says to a native speaker's prerecorded exemplar. For pitch, the student sees the two frequency curves and, guided by hints, tries to increase or decrease pitch at relevant points to come closer to the exemplar. Trials within the FLUENCY project confirm the importance of visual details to help learners understand the display, for example, using a continuous line as opposed to a divided contour for pitch.
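The comparison underlying such a display can be sketched as follows: time-normalize the learner's pitch contour to the exemplar's length, then locate where the two curves diverge most. The linear resampling and the single "worst mismatch" summary are illustrative assumptions, not the actual method of Video Voice or FLUENCY.

```python
# Sketch of a contour comparison for a pitch display; the resampling and
# mismatch summary are illustrative, not from Video Voice or FLUENCY.

def resample(contour, n):
    """Linearly interpolate a contour to n evenly spaced points (n >= 2)."""
    if len(contour) == n:
        return list(contour)
    step = (len(contour) - 1) / (n - 1)
    out = []
    for i in range(n):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def worst_mismatch(learner, native):
    """Index (in exemplar time) where the learner's pitch deviates most."""
    aligned = resample(learner, len(native))
    diffs = [abs(a - b) for a, b in zip(aligned, native)]
    return diffs.index(max(diffs))
```

The display would then highlight that region of the curve and hint whether to raise or lower pitch there.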
ASR-based CALL Can Provide Significant Contexts for Language Practice
CALL can simulate authentic contexts using multimedia and multimodal displays in ways discussed elsewhere in this volume (e.g., Rypa & Price; Wachowicz & Scott). Learners can participate in one-to-one conversations with one or more simulated or videotaped interlocutors. The cue for the student to speak can be realistic, such as having a character on the screen turn head and eyes toward the user (or the camera).
ASR-Based CALL Can Put Learners at Ease
The computer can prove the ideal partner for putting a language learner at ease in speaking. Whereas the human teacher judges the student's production, the computer can be viewed as neutral. It can support continual practice of unusual sounds until students have enough confidence to go before others. The system becomes what Wyatt (1988) calls a collaborative tool rather than a facilitative one, with students assuming the role of judges of their own productions. This role not only has pedagogical backing (Celce Murcia & Goodwin, 1991) but can also benefit system performance. For example, if an exercise requires making a fine phonetic distinction that the recognizer detects poorly, the system can mislead and frustrate the student by giving errant pronunciation scores and, on that basis, deciding what to present next. However, if the system simply displays recognition results without pronunciation scores and allows students to decide whether they did well or need further practice, then ASR-derived error is less problematic. The student gains a sense of control over the chain of events but the teacher can still intervene to insist on more practice.
ASR-Based CALL Can Provide Ongoing Assessment
CALL today can enable rapid, constant assessment of the learner. The system can provide more details more rapidly than a teacher grading tests (Bernstein & Franco, 1995). The feedback given to the teacher can go beyond pronunciation scoring. In traditional computer-aided instruction, learners are scored right or wrong on a given question and the scores tallied at the end of the session. But for a system that gives visual data to help learners decide where to correct themselves, feedback to the teacher can include learners' own decisions as to their strong and weak points. For example, in a lesson on how to emphasize content words in utterances, if the learner decides to work on duration rather than pitch or amplitude, we can assume either that duration presented more of a problem or that the learner did not have time for the other two aspects. In any case, the teacher who receives the system's report can immediately test progress in the aspect the learner worked on and recommend what to work on in the next session.
Latency of response can also be measured (Bernstein & Franco, 1995) to obtain an even clearer view of where learners are having difficulties. Responses that took more time to formulate can be noted, as can progress in decreasing latencies over a session.
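A sketch of the sort of bookkeeping this implies; the window size and the difference statistic are arbitrary choices for illustration, not from Bernstein and Franco (1995).

```python
# Illustrative sketch: summarize whether response latencies fell over a
# session. Window size and the difference statistic are arbitrary choices.

def mean_latency_trend(latencies, window=3):
    """Mean latency (seconds) of the last `window` responses minus that of
    the first `window`; a negative value means responses got faster."""
    first = sum(latencies[:window]) / window
    last = sum(latencies[-window:]) / window
    return round(last - first, 2)
```

For a session whose latencies fall steadily, such as [2.0, 1.8, 1.6, 1.2, 1.0, 0.8], the trend is negative, suggesting the learner formulated responses more quickly by the end.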
CONCLUSION
Speech-interactive CALL brings to pronunciation instruction a wealth of new, sometimes unforeseen, techniques. Increases in computer memory and storage for expanded exposure to many speakers and for multimedia corrective feedback can reproduce some of the advantages of total immersion learning. There is still much to be done. Teachers and computer scientists need to collaborate more closely to refine ASR-based tools and to invent and validate new teaching methods to build on the advantages of the new medium.

