Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles

Christian Hardmeier
Fondazione Bruno Kessler, Human Language Technologies
Via Sommarive 18, I-38050 Povo (Trento)
hardmeier@fbk.eu

Martin Volk
Universität Zürich, Institut für Computerlinguistik
Binzmühlestrasse 14, CH-8050 Zürich
volk@cl.uzh.ch

Abstract

Statistical Machine Translation (SMT) has been successfully employed to support translation of film subtitles. We explore the integration of Constraint Grammar corpus annotations into a Swedish–Danish subtitle SMT system in the framework of factored SMT. While the usefulness of the annotations is limited with large amounts of parallel data, we show that linguistic annotations can increase the gains in translation quality when monolingual data in the target language is added to an SMT system based on a small parallel corpus.

1 Introduction

In countries where foreign-language films and series on television are routinely subtitled rather than dubbed, there is a considerable demand for efficiently produced subtitle translations. Although superficially it may seem that subtitles are not appropriate for automatic processing as a result of their literary character, it turns out that their typical text structure, characterised by brevity and syntactic simplicity, and the immense text volumes processed daily by specialised subtitling companies make it possible to produce raw translations of film subtitles with statistical methods quite effectively. If these raw translations are subsequently post-edited by skilled staff, production-quality translations can be obtained with considerably less effort than if the subtitles were translated by human translators with no computer assistance.

A successful subtitle Machine Translation system for the language pair Swedish–Danish, which has now entered into productive use, has been presented by Volk and Harder (2007). The goal of the present study is to explore whether and how the quality of a Statistical Machine Translation (SMT) system for film subtitles can be improved by using linguistic annotations. To this end, a subset of 1 million subtitles of the training corpus used by Volk and Harder was morphologically annotated with the DanGram parser (Bick, 2001). We integrated the annotations into the translation process using the methods of factored Statistical Machine Translation (Koehn and Hoang, 2007) implemented in the widely used Moses software. After describing the corpus data and giving a short overview of the methods used, we present a number of experiments comparing different factored SMT setups. The experiments are then replicated with reduced training corpora which contain only part of the available training data. These series of experiments provide insights into the impact of corpus size on the effectiveness of using linguistic abstractions for SMT.

2 Machine translation of subtitles

As a text genre, subtitles play a curious role in a complex environment of different media and modalities. They depend on the medium film, which combines a visual channel with an auditive component composed of spoken language and non-linguistic elements such as noise or music. Within this framework, they render the spoken dialogue into written text; they are blended in with the visual channel and displayed while the original sound track is played back, which redundantly contains the same information in a form that may or may not be accessible to the viewer.

In their linguistic form, subtitles should be faithful, both in content and in style, to the film dialogue which they represent. This means in particular that they usually try to convey an impression of orality.
Kristiina Jokinen and Eckhard Bick (Eds.), NODALIDA 2009 Conference Proceedings, pp. 57–64

On the other hand, they are constrained by the mode of their presentation: short, written captions superimposed on the picture frame.

According to Becquemont (1996), the characteristics of subtitles are governed by the interplay of two conflicting principles: unobtrusiveness (discrétion) and readability (lisibilité). In order to provide a satisfactory experience to the viewers, it is paramount that the subtitles help them quickly understand the meaning of the dialogue without distracting them from enjoying the film. The amount of text that can be displayed at one time is limited by the area of the screen that may be covered by subtitles (usually no more than two lines) and by the minimum time the subtitle must remain on screen to ensure that it can actually be read. As a result, the subtitle text must be shortened with respect to the full dialogue text in the actors' script. The extent of the reduction depends on the script and on the exact limitations imposed for a specific subtitling task, but may amount to as much as 30 % and reach 50 % in extreme cases (Tomaszkiewicz, 1993, 6).

As a result of this processing and the considerations underlying it, subtitles have a number of properties that make them especially well suited for Statistical Machine Translation. Owing to their presentational constraints, they mainly consist of comparatively short and simple phrases. Current SMT systems, when trained on a sufficient amount of data, have reliable ways of handling word translation and local structure. By contrast, they are still fairly weak at modelling long-range dependencies and reordering. Compared to other text genres, this weakness is less of an issue in the Statistical Machine Translation of subtitles thanks to their brevity and simple structure. Indeed, half of the subtitles in the Swedish part of our parallel training corpus are no more than 11 tokens long, including two tokens to mark the beginning and the end of the segment and counting every punctuation mark as a separate token. A considerable number of subtitles contain only one or two words besides punctuation, often consisting entirely of a few words of affirmation, negation or abuse. Such subtitles can easily be translated by an SMT system that has seen similar examples before.

The orientation of the genre towards spoken language also has some disadvantages for Machine Translation systems. It is possible that the language of the subtitles, influenced by characteristics of speech, contains unexpected features such as stutterings, word repetitions or renderings of non-standard pronunciations that confuse the system. Such features are occasionally employed by subtitlers to lend additional colour to the text, but as they are in stark conflict with the ideals of unobtrusiveness and readability, they are not very frequent.

It is worth noting that, unlike rule-based Machine Translation systems, a statistical system does not in general have any difficulties translating ungrammatical or fragmentary input: phrase-based SMT, operating entirely on the level of words and word sequences, does not require the input to be amenable to any particular kind of linguistic analysis such as parsing. Whilst this approach makes it difficult to handle some linguistic challenges such as long-distance dependencies, it has the advantage of making the system more robust to unexpected input, which is especially important for subtitles.

We have only been able to sketch the characteristics of the subtitle text genre in this paper. Díaz-Cintas and Remael (2007) provide a detailed introduction, including the linguistics of subtitling and translation issues, and Pedersen (2007) discusses the peculiarities of subtitling in Scandinavia.

3 Constraint Grammar annotations

To explore the potential of linguistically annotated data, our complete subtitle corpus, both in Danish and in Swedish, was linguistically analysed with the DanGram Constraint Grammar (CG) parser (Bick, 2001), a system originally developed for the analysis of Danish for which there is also a Swedish grammar. Constraint Grammar (Karlsson, 1990) is a formalism for natural language parsing. Conceptually, a CG parser first produces possible analyses for each word by considering its morphological features and then applies constraining rules to filter out analyses that do not fit into the context. Thus, the word forms are gradually disambiguated until only one analysis remains; multiple analyses may be retained if the sentence is ambiguous.

The annotations produced by the DanGram parser were output as tags attached to individual words, as in the following example:

$-
Vad [vad] <interr> INDP NEU S NOM @ACC>
vet [veta] <mv> V PR AKT @FS-QUE
du [du] PERS 2S UTR S NOM @<SUBJ
om [om] PRP @<PIV
det [den] <dem> PERS NEU 3S ACC @P<
$?

In addition to the word forms and the accompanying lemmas (in square brackets), the annotations contained part-of-speech (POS) tags such as INDP for "independent pronoun" or V for "verb", a morphological analysis for each word (such as NEU S NOM for "neuter singular nominative") and a tag specifying the syntactic function of the word in the sentence (such as @ACC>, indicating that the sentence-initial pronoun is an accusative object of the following verb). For some words, more fine-grained part-of-speech information was specified in angle brackets, such as <interr> for "interrogative pronoun" or <mv> for "main verb". In our experiments, we used word forms, lemmas, POS tags and morphological analyses. The fine-grained POS tags and the syntax tags were not used.
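The mapping from this one-word-per-line CG output to the factored token representation used later in the paper can be sketched as follows. This is a minimal illustration, not part of the original toolchain: the exact tag layout is assumed from the example above, and the pipe-separated output format mirrors the factored input convention used by Moses.

```python
import re

# One DanGram-style output line per token:  form [lemma] <finetags> POS MORPH... @SYNTAX
# (layout assumed from the example in the text)
LINE = re.compile(r'^(?P<form>\S+)\s+\[(?P<lemma>[^\]]+)\]\s*(?P<rest>.*)$')

def to_factors(cg_line):
    """Convert one CG output line into a pipe-separated factored token
    form|lemma|pos|morph.  Fine-grained <...> tags and @... syntax tags are
    discarded, as in the experiments described in the text."""
    m = LINE.match(cg_line)
    if m is None:                      # punctuation markers like $- and $?
        return cg_line.strip()
    tags = [t for t in m.group('rest').split()
            if not t.startswith('<') and not t.startswith('@')]
    pos = tags[0] if tags else 'UNK'
    morph = '+'.join(tags[1:]) if len(tags) > 1 else '-'
    return '|'.join((m.group('form'), m.group('lemma'), pos, morph))

print(to_factors('Vad [vad] <interr> INDP NEU S NOM @ACC>'))
# → Vad|vad|INDP|NEU+S+NOM
```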
4 Factored Statistical Machine Translation

Statistical Machine Translation formalises the translation process by modelling the probability p(T|S) of a target language (TL) output string T given a source language (SL) input string S, and conducting a search for the output string T with the highest probability. In the Moses decoder (Koehn et al., 2007), which we used in our experiments, this probability is decomposed into a log-linear combination of a number of feature functions h_i(S, T), which map a pair of a source and a target language element to a score based on different submodels such as translation models or language models. Each feature function is associated with a weight λ_i that specifies its contribution to the overall score:

    T* = argmax_T log p(T|S) = argmax_T Σ_i λ_i h_i(S, T)

The translation models employed in factored SMT are phrase-based. The phrases included in a translation model are extracted from a word-aligned parallel corpus with the techniques described by Koehn et al. (2003). The associated probabilities are estimated by the relative frequencies of the extracted phrase pairs in the same corpus. For language modelling, we used the SRILM toolkit (Stolcke, 2002); unless otherwise specified, 6-gram language models with modified Kneser-Ney smoothing were used.

The SMT decoder tries to translate the words and phrases of the source language sentence in the order in which they occur in the input. If the target language requires a different word order, reordering is possible at the cost of a score penalty. The translation model has no notion of sequence, so it cannot control reordering. The language model can, but it has no access to the source language text, so it considers word order only from the point of view of TL grammaticality and cannot model systematic differences in word order between two languages. Lexical reordering models (Koehn et al., 2005) address this issue in a more explicit way by modelling the probability of certain changes in word order, such as swapping words, conditioned on the source and target language phrase pair that is being processed.

In its basic form, Statistical Machine Translation treats word tokens as atomic and does not permit further decomposition or access to single features of the words. Factored SMT (Koehn and Hoang, 2007) extends this model by representing words as vectors composed of a number of features and makes it possible to integrate word-level annotations such as those produced by a Constraint Grammar parser into the translation process. The individual components of the feature vectors are called factors. In order to map between different factors on the target language side, the Moses decoder works with generation models, which are implemented as dictionaries and extracted from the target-language side of the training corpus. They can be used, e.g., to generate word forms from lemmas and morphology tags, or to transform word forms into part-of-speech tags, which can then be checked with a language model.

5 Experiments with the full corpus

We ran three series of experiments to study the effects of different SMT system setups on translation quality with three different configurations of training corpus sizes. For each condition, several Statistical Machine Translation systems were trained and evaluated.

In the full data condition, the complete system was trained on a parallel corpus of some 900,000 subtitles with source language Swedish and target language Danish, corresponding to around 10 million tokens in each language. The feature weights were optimised using minimum error rate training (Och, 2003) on a development set of 1,000 subtitles that had not been used for training; the system was then evaluated on a 10,000-subtitle test set that had been held out during the whole development phase. The translations were evaluated with the widely used BLEU and NIST scores (Papineni et al., 2002; Doddington, 2002). The outcomes of different experiments were compared with a randomisation-based hypothesis test (Cohen, 1995, 165–177). The test was two-sided, and the confidence level was fixed at 95 %.

The results of the experiments can be found in table 1. The baseline system used only a translation model operating on word forms and a 6-gram language model on word forms. This is a standard setup for an unfactored SMT system. Two systems additionally included a 6-gram language model operating on part-of-speech tags and a 5-gram language model operating on morphology tags, respectively. The annotation factors required by these language models were produced from the word forms by suitable generation models.

In the full data condition, both the part-of-speech and the morphology language model brought a slight, but statistically significant gain in terms of BLEU scores, which indicates that abstract information about grammar can in some cases help the SMT system choose the right words. The improvement is small; indeed, it is not reflected in the NIST scores, but some beneficial effects of the additional language models can be observed in the individual output sentences.

One thing that can be achieved by taking word class information into account is the disambiguation of ambiguous word forms. Consider the following example:

Input: Ingen vill bo mitt emot en ismaskin.
Reference: Ingen vil bo lige over for en ismaskine.
Baseline: Ingen vil bo mit imod en ismaskin.
POS/Morphology: Ingen vil bo over for en ismaskin.

Since the word ismaskin 'ice machine' does not occur in the Swedish part of the training corpus, none of the SMT systems was able to translate it. All of them copied the Swedish input word literally to the output, which is a mistake that cannot be fixed by a language model. However, there is a clear difference in the translation of the phrase mitt emot 'opposite'. For some reason, the baseline system chose to translate the two words separately and mistakenly interpreted the adverb mitt, which is part of the Swedish expression, as the homonymous first person neuter possessive pronoun 'my', translating the Swedish phrase as ungrammatical Danish mit imod 'my against'. Both of the additional language models helped to rule out this error and correctly translate mitt emot as over for, yielding a much better translation. Neither of them output the adverb lige 'just' found in the reference translation, for which there is no explicit equivalent in the input sentence.

In the next example, the POS and the morphology language model produced different output:

Input: Dåliga kontrakt, dålig ledning, dåliga agenter.
Reference: Dårlige kontrakter, dårlig styring, dårlige agenter.
Baseline: Dårlige kontrakt, dårlig forbindelse, dårlige agenter.
POS: Dårlige kontrakt, dårlig ledelse, dårlige agenter.
Morphology: Dårlige kontrakter, dårlig forbindelse, dårlige agenter.

In Swedish, the indefinite singular and plural forms of the word kontrakt 'contract(s)' are homonymous. The two SMT systems without support for morphological analysis incorrectly produced the singular form of the noun in Danish. The morphology language model recognised that the plural adjective dårlige 'bad' is more likely to be followed by a plural noun and preferred the correct Danish plural form kontrakter 'contracts'. The different translations of the word ledning as 'management' or 'connection' can be pinned down to a subtle influence of the generation model probability estimates. They illustrate how sensitive the system output is in the face of true ambiguity. None of the systems presented here is capable of reliably choosing the right word based on the context in this case.

In three experiments, the baseline configuration was extended by adding lexical reordering models conditioned on word forms, lemmas and part-of-speech tags, respectively. As in the language model experiments, the required annotation factors on the TL side were produced by generation models. The lexical reordering models turn out to be useful in the full data experiments only when conditioned on word forms. When conditioned on lemmas, the score is not significantly different from the baseline score, and when conditioned on part-of-speech tags, it is significantly lower. In this case, the most valuable information for lexical reordering lies in the word form itself. Lemma and part of speech are obviously not the right abstractions to model the reordering processes when sufficient data is available.

Table 1: Experimental results

                             full data         symmetric         asymmetric
                           BLEU     NIST     BLEU     NIST     BLEU     NIST
Baseline                  53.67 %   8.18    42.12 %   6.83    44.85 %   7.10
Language models
  parts of speech        *53.90 %   8.17   *42.59 %   6.87    44.71 %   7.08
  morphology             *54.07 %   8.18   *42.86 %   6.92   *44.95 %   7.09
Lexical reordering
  word forms             *53.99 %   8.21    42.13 %   6.83    44.72 %   7.05
  lemmas                  53.59 %   8.15   *42.30 %   6.86    44.71 %   7.06
  parts of speech        †53.36 %   8.13   *42.33 %   6.86    44.63 %   7.05
Analytical translation    53.73 %   8.18   *42.28 %   6.90   *46.73 %   7.34

* BLEU score significantly above baseline (p < .05)
† BLEU score significantly below baseline (p < .05)

Another system, which we call the analytical translation system, was modelled on suggestions by Koehn and Hoang (2007) and Bojar (2007). It used the lemmas and the output of the morphological analysis to decompose the translation process and use separate components to handle the transfer of lexical and grammatical information. To achieve this, the baseline system was extended with additional translation tables mapping SL lemmas to TL lemmas and SL morphology tags to TL morphology tags, respectively. In the target language, a generation model was used to transform lemmas and morphology tags into word forms. The results reported by Koehn and Hoang (2007) strongly indicate that this translation approach is not sufficient on its own; instead, the decomposed translation approach should be combined with a standard word form translation model so that one can be used in those cases where the other fails. This configuration was therefore adopted for our experiments.

The analytical translation approach fails to achieve any significant score improvement with the full parallel corpus. Closer examination of the MT output reveals that the strategy of using lemmas and morphological information to translate unknown word forms works in principle, as shown by the following example:

Input: Molly har visat mig bröllopsfotona.
Reference: Molly har vist mig fotoene fra brylluppet.
Baseline: Molly har vist mig bröllopsfotona.
Analytical: Molly har vist mig bryllupsbillederne.

In this sentence, there can be no doubt that the output produced by the analytical system is superior to that of the baseline system. Where the baseline system copied the Swedish word bröllopsfotona 'wedding photos' literally into the Danish text, the translation found by the analytical model, bryllupsbillederne 'wedding pictures', is both semantically and syntactically flawless. Unfortunately, the reference translation uses different words, so the evaluation scores will not reflect this improvement.
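The decomposition used by the analytical system, translating lemma and morphology separately and recombining them with a generation model, can be illustrated with a minimal sketch. The toy dictionaries below stand in for the tables Moses extracts from the training corpus; all entries, tag strings and helper names are illustrative assumptions, not data from the actual system.

```python
# Minimal sketch of analytical (decomposed) translation for a single token.
# Toy dictionaries stand in for the lemma translation table, the morphology
# translation table and the target-side generation model.
lemma_table = {'bröllopsfoto': 'bryllupsbillede'}   # SL lemma -> TL lemma
morph_table = {'NEU P DEF': 'NEU P DEF'}            # SL morph tag -> TL morph tag
generation = {                                       # (TL lemma, TL morph) -> word form
    ('bryllupsbillede', 'NEU P DEF'): 'bryllupsbillederne',
}

def translate_analytically(sl_lemma, sl_morph):
    """Translate lemma and morphology tag separately, then generate the
    target surface form.  Returns None when any component table has a gap,
    in which case a combined system would fall back to the standard
    word-form translation model."""
    tl_lemma = lemma_table.get(sl_lemma)
    tl_morph = morph_table.get(sl_morph)
    if tl_lemma is None or tl_morph is None:
        return None
    return generation.get((tl_lemma, tl_morph))

print(translate_analytically('bröllopsfoto', 'NEU P DEF'))  # bryllupsbillederne
```

The final `generation.get` also returns None when the (lemma, morph) pair is missing from the generation table, which mirrors the limitation discussed below: a generation table trained on the same corpus as the translation tables tends to share their gaps.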
The lack of success of analytical translation in terms of evaluation scores can be ascribed to at least three factors. Firstly, there are relatively few vocabulary gaps in our data, owing to the size of the training corpus: only 1.19 % (1,311 of 109,823) of the input tokens are tagged as unknown by the decoder in the baseline system. As a result, there is not much room for improvement with an approach specifically designed to handle vocabulary coverage, especially if this approach itself fails in some of the cases missed by the baseline system: analytical translation brings this figure down to 0.88 % (970 tokens), but no further. Secondly, employing generation tables trained on the same corpus as the translation tables used by the system limits the attainable gains from the outset, since a required word form that is not found in the translation table is likely to be missing from the generation table, too. Thirdly, in case of vocabulary gaps in the translation tables, chances are that the system will not be able to produce the optimal translation for the input sentence. Instead, an approach like analytical translation aims