


[Mechanical Translation, vol. 8, No. 1, August 1964]

Preliminary Report on the Insertion of English Articles in Russian-English MT Output*

by G. R. Martins, Technical Staff, Bunker-Ramo Corporation

* This research is being carried out under the sponsorship of the National Science Foundation.

Research on a non-statistical scheme for the insertion of English articles in machine-translated Russian is described. Ideal article insertion as a goal is challenged as unreasonable. Classification of English nouns, simple syntactic criteria, and multiple printout are the scheme's main features.

One of the most discussed problems in the automatic translation of Russian documents into English is the insertion of English articles in the output. Approaches to the solution of this problem, where it has been considered at all, are as varied as the basic MT programs in use by the different teams engaged in this work. Most projects, however, either use statistical criteria in the determination of English articles to the exclusion of all other considerations, or use a combined syntactico-statistical method; the aim of all such routines is the selection of one and only one of the four articles (a, an, the, Ø). None of the solutions presented to date in the literature is entirely satisfactory.

Two kinds of ambiguity present themselves as obstacles to the successful determination of English articles in automatically translated Russian. The first derives from the structure of the Russian language, in that it does not employ any simple elements isomorphic with English articles as adjuncts to nominal phrases; there are no elements in Russian text which may be correlated strongly with the English articles. This kind of ambiguity is not always formally resolvable since it often raises the particular question: "What did the author mean in this instance?" In such instances, even with his immense reservoir of repertorial and contextual clues, the human translator can only make an educated guess, and the machine, with its drastically limited set of potential determiners, cannot do better.

But another kind of ambiguity arises from the side of the English output itself. Situations are frequently encountered in which various articles may be inserted without doing violence to the text, and occasionally without altering in any simply statable way the intuitive meaning of the passage. In: "He is working on ____ analysis of English verbs." we may read an, the, or Ø, with appropriate intonations, and get reasonable English sentences which differ in meaning, if at all in any systematic way, very slightly indeed. The question: "What is the preferred English article?" in these situations is not easily answered, and it does not seem a reasonable hope to look for a single arbitrary choice which will work in every case.

Here we are faced with two kinds of overlapping ambiguity, neither of which is easily resolved even by the human translator, and which appear to be well beyond the reach of MT machines as presently programmed.

These considerations have led me to the conclusion, surprising perhaps to some, that it is both impossible and undesirable to attempt the automatic determination of a single English article appropriate to the occurrence of every nominal encountered in the output text. Which is to say that we should be prepared to do without articles altogether, or to accept alternative articles in the final printed translation. The former solution, presently in use by some teams, is not quite so harmless as it appears, for the reason that Ø is as legitimate an English article as are the, a, and an, to my way of thinking. The decision (or pseudo-decision) to do without articles altogether, then, amounts to a decision to select everywhere the article Ø, and this is scarcely more defensible than to select everywhere the (which is statistically much more common).

The decision to print out alternative articles in some instances is tantamount to passing on a portion of the translation function to the reader, of course. While this hardly fulfills the idealists' goal for MT, it is not an indefensible solution; the same default of function can be imputed to every MT program which permits multiple printout as a solution to very complex problems of polysemy, and this includes every existing program. And, so long as (a) we do not simply print out all four possible articles in every case, and (b) we do not fail to include among the output alternatives a/the "correct" article, we have made a net gain in quality of translation. What is more, the task of final article selection might, in most cases, better be assigned to the reader, knowledgeable of the field of discourse and possibly even familiar with the stylistic peculiarities of the author, than to the machine.

This point of view not only enables us to proceed in spite of the ambiguities mentioned above, it gives us at the same time one of the distinctive characteristics (multiple printout) of the system we have been looking for as a solution to the article problem.

It may legitimately be asked at this point whether the net translation quality gain obtained even from the best of multiple-article-printout schemes justifies the research and programming effort required for its implementation. From the point of view of a production MT organization, this question is meaningful only in terms of the incrementing of consumer appeal of the product, and it would be difficult to answer without research in that very area. From the point of view of an MT research group, the implementation of such an article insertion program as that discussed here is justified as a test of the program's inherent merits and also as a means of facilitating research into the question of consumer reaction to it.

With these thoughts in mind, a close examination of several texts, in English, was undertaken to determine something about the patterns of occurrence of the articles. Some simple contextual criteria were sought which would enable us accurately to predict the human translator's selection of an article; at this point, our attention focused on English texts translated from the Russian, and the matching Russian texts, rather than on random English texts. Decision criteria were sought in both languages in the hope that this would improve the odds on our success.

Early in the study one criterion of great promise came to light. For each English noun token in the text we asked the question: "Is its Russian equivalent, in the matching Russian text, followed by a syntactically linked genitive block?" More obvious, of course, but of great importance, was another criterion: "Is the English noun token singular or plural?" To test the significance and power of these two criteria, and to gauge the strength of additional criteria that might be necessary, the following test was devised.

A machine-translated corpus, taken from Pravda, was treated in the following way: (a) the corpus was divided roughly into two halves, (b) all English noun tokens in the final half were marked to indicate whether or not the Russian equivalent was followed by a linked genitive block, (c) all articles already present in the English were deleted, (d) appropriate article tokens were then inserted in the English by hand, with multiple entries being made where no clear decision could be made on the basis of individual sentence content alone, (e) each noun from the text was then listed along with indications of the article patterns occurring with it (note that here two separate entries in the tabulation were made for a noun if it had occurred in the text both with and again without a following genitive block behind its Russian equivalent), and (f) the tabulation was examined for possible clues to additional criteria.

Encouragingly, it turned out that the English nouns could be grouped into five classes according to the pattern of article occurrence indicated for them in the tabulation. This was regarded as encouraging because, first of all, three of the classes were quite small compared to the others, and secondly, each class seemed to have its own intuitive internal homogeneity.

The first half of the corpus then had its articles deleted throughout, and, for each noun in the tabulation, articles were inserted with reference only to the criteria just developed. In no case was an unacceptable result obtained from this brief test.

After this, the nouns occurring in the first half of the corpus but not in the second (and therefore not tabulated) were listed and each was classified intuitively as a member of one of the five article-pattern classes. Once again the first half of the corpus was tested, and again no unacceptable results were obtained. It is worth noting here that noun tokens occurring in special word combinations or idiomatic expressions were not taken into consideration; no particular problems are presented by such occurrences since our present MT program takes such constructions into account already for other purposes.

Other syntactic criteria, of the most obvious kind, were taken into account during these tests; these do not seem to be of such great interest as to warrant discussion at length. Typical of these criteria is: Ø with all nouns preceded by a possessive pronoun, or by a demonstrative, or by the interrogative "WHICH" or "WHAT", or by "EACH" or "EVERY" or "ANY" or "SOME". Another example is: THE before a superlative modifier (and before a preceding adverbial, if such is present).*

* The obvious exceptions to a rule of this kind for mathematics texts are now under study.

I am pleased with the results of these early tests of the article determination procedure for several reasons. First of all, it seems reasonable to think that a successful article determination program would be based upon a classification of English nouns and upon certain rather simple syntactic criteria; this is the approach hinted at by the Milan MT team, although their report is distressingly vague and little more can be got from it than the fact that they are thinking in terms of eight noun classes, not five [1].

The intuitively satisfying homogeneity of the contents of each noun class leads me to suspect that such classification as we are undertaking could have some relevance outside the restricted domain of MT. A related consideration is the apparent success of attempts to classify nouns intuitively; this not only raises certain mildly interesting questions about the grammar of English, but it greatly enhances the feasibility of carrying out such classification in extenso.

To make clearer some details of the scheme, I will give here a set of noun-classification rules put together earlier in our study to serve as a research tool. The following rules are suggestive rather than strictly prescriptive in nature. It is hoped that rules of this kind will enable linguistically unsophisticated personnel to carry out successful classification operations on the membership of large noun lists without time-consuming context consultation and/or revisions based upon hindsight. A small burden is deliberately placed upon the worker's imagination, and it is presumed that the worker is a native speaker of English. These restrictions are felt to be justifiable for two reasons: (a) we thus avoid the premature elaboration of very complex rules, and (b) the worker's imaginative burden diminishes rapidly with experience in this kind of coding operation.

The rules take the form of simple questions, answerable with either "YES" or "NO". Coding indications depend upon these answers.

1. Can the noun, in the singular, begin a sentence of the type: "____ is necessary." etc.?
   YES: See rule 2
   NO: See rule 3

2. Can the noun, in the singular, ever take the article "A/AN"?
   YES: Class 3
   NO: Class 2a

3. Does this noun, in the singular, always require "THE"?
   YES: Class 1a
   NO: See rule 4

4. Is the meaning of this noun intuitively more abstract than concrete, or is its meaning vague?
   YES: Class 2, tentatively
   NO: Class 1

The diagram in the next column, with an accompanying explanation, shows the relationships between the noun classes thus established and the article selection routines.

Explanation:

English nouns are classed by membership in one of the five classes listed in the leftmost vertical column of the diagram; a very small number of special nouns are not so classified, but are covered by individual rules (e.g., "mankind"; NO ARTICLE). The categories "Singular" and "Plural" refer to the noun token itself. The indication "gen. block" means "noun token is followed (in the Russian) by a linked genitive block"; "no gen. block" is the negation of "gen. block". The listing of two forms in a section of the diagram means that both are to be printed out as alternative readings. Where Ø occurs alone, nothing is to be printed; where it occurs as an alternative reading, an indication of the alternative article-less reading is to be printed along with the given article.

Unquestionably, the simplicity of the single major syntactic criterion (relating to following genitive blocks) will have to be weakened in favor of more sophisticated criteria; but it is interesting how much of the problem can be managed with no more than this. A program is now in preparation which will permit large-scale testing of these proposals on a variety of corpora automatically; we are looking forward eagerly to the results of those tests.

Reference

1. J. Barton. The Application of the Article in English. Proceedings of the 1961 International Conference on Machine Translation of Languages and Applied Language Analysis (Teddington), Vol. I, Her Majesty's Stationery Office, London, 1962, pp. 111-121.
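As an illustration only, the four YES/NO classification questions and the two context criteria quoted in the text (Ø after a possessive, demonstrative, WHICH/WHAT, EACH/EVERY/ANY/SOME; THE before a superlative modifier) might be sketched in Python as follows. This is not the program mentioned in the report. The names `NounAnswers`, `classify_noun`, and `context_article` are hypothetical, the possessive and demonstrative word lists are assumed, and so is the choice to let the Ø criterion take precedence over the superlative criterion. The mapping from noun class, number, and genitive block to printed articles is given only in the diagram, which is not available here, so the sketch stops at classification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NounAnswers:
    """A coder's YES/NO answers to the four questions (hypothetical container)."""
    can_begin_is_necessary: bool        # rule 1: '____ is necessary.'
    can_take_a_an: bool = False         # rule 2: consulted only if rule 1 is YES
    always_requires_the: bool = False   # rule 3: consulted only if rule 1 is NO
    abstract_or_vague: bool = False     # rule 4: consulted only if rule 3 is NO


def classify_noun(ans: NounAnswers) -> str:
    """Follow rules 1 to 4 and return one of the class labels 1, 1a, 2, 2a, 3."""
    if ans.can_begin_is_necessary:                  # rule 1 YES -> rule 2
        return "3" if ans.can_take_a_an else "2a"
    if ans.always_requires_the:                     # rule 1 NO -> rule 3
        return "1a"
    return "2" if ans.abstract_or_vague else "1"    # rule 4 (class 2 is tentative)


# Assumed word lists; the report names the categories but does not list members.
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}
DEMONSTRATIVES = {"this", "that", "these", "those"}
OTHER_ZERO_TRIGGERS = {"which", "what", "each", "every", "any", "some"}


def context_article(determiner: Optional[str],
                    has_superlative: bool = False) -> Optional[str]:
    """Apply the two quoted context criteria to one noun phrase.

    Returns 'Ø' or 'THE' when a criterion fires, otherwise None, meaning the
    decision is left to the class-based routine (not modelled here).
    """
    if determiner is not None:
        word = determiner.lower()
        if word in POSSESSIVES | DEMONSTRATIVES | OTHER_ZERO_TRIGGERS:
            return "Ø"
    if has_superlative:
        return "THE"
    return None
```

For example, a coder who answers YES to rule 1 and YES to rule 2 for some noun gets `classify_noun(NounAnswers(True, can_take_a_an=True)) == "3"`, and `context_article("every")` yields `"Ø"`. Treating the answers as plain booleans keeps the coder's judgment outside the code, which matches the report's point that the questions are meant for human intuition rather than automatic decision.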


