
Scientific report: "The Use of Statistics in Language Research"




[Mechanical Translation, vol. 5, no. 2, November 1958; pp. 67-73]

The Use of Statistics in Language Research

A. F. Parker-Rhodes, Cambridge Language Research Unit, Cambridge, England

The literature concerning the application of statistics to linguistic problems and in particular to mechanical translation is reviewed. The conclusion is that much of the work done is of little direct use for mechanical translation, and that some of it is based on a misapprehension of what statistical techniques can in fact do. Statistical methods can play a useful part in the development of mechanical translation procedures once these have been well established, but have little to contribute at the present stage of the work.

THERE ARE many ways in which statistical techniques might be pressed into the service of language research, and in particular the theory of mechanical translation and information retrieval. Most of these have had their advocates. The purpose of this paper is to review briefly the literature of the subject, and to draw conclusions as to how much of this work can be regarded as a legitimate use of statistics, and as to how relevant it is to the progress of language-processing technology.

There appear to be five main topics covered. First, I shall enumerate these, and then I shall refer seriatim to the works available in the C.L.R.U. library upon each of them.

1) Lexicography: this includes the methods and techniques of compiling lexical information, whether this takes the form of a dictionary of a more or less conventional character, or a thesaurus.
2) Approximative Methods: these are methods of machine translation which aim to rely on keeping errors below a preconceived threshold of tolerance; they use statistics mainly to predict how little work need be done to achieve this.
3) Economics: included here are applications of statistics to ascertain the size of computers needed, the time taken to operate programs, etc.
4) Coding: the problems of coding of information have a statistical aspect whenever code-compression is employed.
5) Cryptography: a peripheral subject, but perhaps worth inclusion.

Applications to Lexicography

A good deal of theoretical work has been done on statistical techniques of a kind which could or might be applied to the study of word frequency. The general problems are of a kind of frequent occurrence in biology, and so have received some attention from that quarter. Of this general kind is the work of Good [1]. More specifically concerned with language problems are the contributions of Mandelbrot [2, 3] on word-frequencies. This author points out that a knowledge of word-frequency distributions could be useful to the lexicographer, but he is not himself concerned to make this application. In fact, no one seems to have done so, except Koutsoudas [4], who in fact concludes that the so-called Zipf and Joos laws are insufficient to give reliable predictions of the size of dictionaries needed in machine translation, and consequently recommends the accumulation of further empirical material with this end specifically in view.

1. I. J. Good and G. H. Toulmin, "The number of new species and the population coverage, when a sample is increased," Biometrika, 43, pp. 45-63 (1956).
2. B. Mandelbrot, "Linguistique statistique macroscopique: Theorie mathematique de la loi de Zipf," Institut Henri Poincare, Seminaire de Calcul des Probabilites, (June 13, 1957).
3. B. Mandelbrot, "Structure formelle des textes et communication," Word, 10, pp. 1-27 (1954).
4. A. M. Koutsoudas and R. E. Machol, "Frequency of occurrence of words; a study of Zipf's law with application to mechanical translation," University of Michigan, Engineering Research Institute, Publication 2144-147-T (1957).
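As a rough illustration of the kind of dictionary-size estimate at issue here, the sketch below assumes an idealized Zipf distribution, in which the r-th most frequent word has relative frequency proportional to 1/r, over a hypothetical vocabulary of fixed size. Both the exponent of 1 and the vocabulary size are assumptions made for the example, not figures from the works cited; Koutsoudas's conclusion, noted above, is precisely that such laws are not reliable enough for this purpose. The function names are likewise invented for the illustration.

```python
from itertools import accumulate


def zipf_coverage(vocab_size: int) -> list[float]:
    """Cumulative share of running text covered by the k most frequent words,
    for k = 1..vocab_size, under an idealized Zipf law f(r) ~ 1/r (assumed)."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [partial / total for partial in accumulate(weights)]


def entries_needed(vocab_size: int, target: float) -> int:
    """Smallest number of top-ranked entries whose coverage reaches `target`."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must lie in (0, 1]")
    for k, covered in enumerate(zipf_coverage(vocab_size), start=1):
        if covered >= target:
            return k
    # Guards against a floating-point shortfall when target == 1.0.
    return vocab_size


if __name__ == "__main__":
    vocab = 50_000  # hypothetical vocabulary size, chosen only for illustration
    for target in (0.5, 0.9, 0.99):
        print(f"{target:.0%} of running words: {entries_needed(vocab, target)} entries")
```

Under this idealization the number of entries required rises very steeply as the target coverage approaches 1, which is one reason an estimate of this sort is sensitive to the exact form of the frequency law assumed.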
Koutsoudas' statistical techniques are apparently adequate for his purpose, and he has compiled the required data and analyzed them. No one else has apparently taken statistical methods as seriously as this, and most references to the subject merely suggest that an application of statistics to dictionary making should be made [5], or even in one case that no dictionary could be made without previous statistical analysis [6].

The use which most of these authors have in mind is to find out how large a dictionary must be in order to contain, with a given fiducial probability, all the words of particular kinds of text. A secondary application is in finding some way of arranging the entries of a dictionary which will reduce searching time by making the most frequent words come up before the less frequent ones. Much more sophisticated is the idea behind compiling a thesaurus. In a thesaurus we have not merely a list of words with coded information upon them, but a mathematical system whose elements represent sets of words, so arranged that, ideally, every word in the system can be defined by listing the sets in which it occurs. If this were done properly, it should be possible to find a word, or at least most words, by specifying not all the sets in which it occurs, but only some of them; thus, it might be possible to specify a set of sets by considering the context of a given word, as well as itself, which would be enough to identify the given word as exactly as we might wish, provided our thesaurus contained enough information suitably organized.

Obviously, the success of such a scheme is a matter which could be statistically assessed, and in some measure no doubt statistically predicted. Thus, those who have considered the use of a thesaurus in MT have not been slow to appeal to statisticians for help in the very considerable labor of compilation involved. However, in fact, they have not progressed very far. As Luhn [7] puts it, "the formation of notational families (his name for thesaurus heads) is a major intellectual effort, to be undertaken by experts familiar with ..........the special field of the subject-literature." This major effort has to be done before one can begin to apply one's statistical methods; Luhn himself makes no pretence of actually doing any statistics. On the other hand Gould [8], who also considers thesaurus methods, presents the appearance of statistical computation. His problem is the translation of Russian mathematical texts into English, and he is concerned to assess the magnitude of the problem of 'multiple meaning' by statistical means. He defines an 'index of multiplicity' in algebraic formulae, and evaluates it for various word-classes (according to the system of Fries [9]), and presents numerical tables of the result. Actually the figures are not statistical in the strict sense, since no significance tests are done (nor is it shown that his index is a sufficient statistic), and the tables only show such facts as, for example, that prepositions are particularly liable to have multiple meanings. It cannot therefore be said that Gould's use of figures has added to what a discursive argument could have more lucidly put across.

One must conclude, from the few attempts which have been made actually to use statistics for lexicographic purposes, that in this field, a valid application exists only after the lexicographic data have been compiled. The same is true, whether the compilation takes the form of a dictionary or a thesaurus. Given these data, one can assess its adequacy, and even propose specific improvements of a major or minor kind, as a result of statistical analysis of its performance. But before the lexicographer has done his work, the statistician has nothing to use as data.

Approximative Methods

One answer to the difficulties raised by the attempt to reduce translation to a mathematically definite procedure is to base one's procedure on the opposite conception, namely that instead of mathematical definiteness one should aim at acceptable approximation to the best that a human translator can do. In that case, it becomes important to know how much work must be directed to removing the errors present in too crude a procedure, in order to reduce the remaining errors to a point below some given threshold of tolerance. This is a statistical problem familiar in industry and in military applications. There seems good reason to expect that, if the approximative approach to MT is accepted as a useful one, it will rest largely on a statistical foundation.

A good example of the kind of work which is relevant to this viewpoint is that of Yngve [10] on 'gap analysis'; even though this is not oriented directly to MT application. This aims to supplement syntactic analysis of a text by a statistical procedure designed to reveal discontinuities between pattern-groups (of words) previously established by analysis of a sufficiently large corpus of texts. Insofar as the results of such analysis can be regarded as an acceptable model of actual linguistic analysis, the procedure is perfectly sound and, it must be admitted, highly ingenious. It is not like the deceptive figuring which we sometimes meet under the guise of statistics in language research. Most often, however, approximative methods are directed to eliminating errors of a lexicographic kind. For example, Glazer [11] has tried to work out the statistics necessary to permit the insertion of English articles into a translation from the Russian. He makes no great claims for the result but it is at least apparent from his work that the amount and detail of the statistical information required to 'solve' this problem, even within the framework of the approximationist philosophy, would be very considerable. In fact, it is unclear why it should be supposed any 'easier' than using real linguistics to do the job.

A better case is made out by King and Wieselman [12], who have made some useful estimates of the work involved in progressively improving a crude translation by replacing more probable (and thus sooner tried) renderings of a given word or phrase by successively less probable ones. Once again, the conclusion seems to be that an acceptable amount of computation work leads to a still unacceptably erroneous result, though this no doubt depends on the purpose governing our choice of method.

The nature of approximative methods of translation is seen at its clearest when the attempt is made to get at the true meaning of a word by comparing it with successively wider areas of 'context.' The idea is that if the word itself is not sufficiently determinate to be translated by one-one equivalence, it may be that comparing it with the next word, or the last word, will suffice to reduce its possible equivalents to one; failing that, we try two neighboring words, and so on till the desired result is achieved. This of course is a very crude model of what context really is, and, as I have stated it, depends on the untenable view that each word has a definite number of 'meanings', one of which has to be selected as its translation in the given context.

These are just the assumptions made by Kaplan [13], who made a statistical study of the problem; he collected his data by asking human informants to write down how many 'meanings' of selected words occurred to them, when the said words were presented in company with varying numbers of neighboring words. His conclusions were not very detailed, largely because his informants were too few to provide a really adequate sample, but they showed clearly enough that indeterminacy of meaning was a decreasing function of size of context. There would be scope for a similar study, on a larger scale and with more powerful statistical methods, using a realistic model of what constitutes context and a realistic measure of the indeterminacy of semantic content; this would however be difficult to do. Like most applications of statistics to MT it would only really give useful results when applied to an already mechanized translation procedure. It would be far too slow and laborious to constitute an aid to constructing a mechanized procedure.

5. N. Chomsky, Syntactic Structures, Mouton and Company, The Hague (1957).
6. V. A. Oswald and S. L. Fletcher, "Proposals for the mechanical resolution of German syntax patterns," Modern Language Forum, vol. 36, no. 3-4.
7. H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information," IBM Journal of Research and Development, vol. 1, no. 4, pp. 309-317 (Oct. 1957).
8. R. Gould, "Multiple correspondence," MT, vol. 4, no. 1/2, pp. 14-27 (Nov. 1957).
9. C. C. Fries, The Structure of English, Harcourt, Brace and Company, New York (1952).
10. V. H. Yngve, "Gap analysis and syntax," Transactions IRE, vol. IT-2, no. 3, pp. 106-112.
11. S. Glazer, "Article requirements of plural nouns in Russian chemistry texts," Georgetown University, Institute of Languages and Linguistics, Seminar Work Paper MT. 42 (1957).
12. G. W. King and I. L. Wieselman, "Stochastic methods of mechanical translation," MT, vol. 3, no. 2, pp. 38-39 (Nov. 1956).
13. A. Kaplan, "An experimental study of ambiguity and context," MT, vol. 2, no. 2, pp. 39-46 (Nov. 1955).
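The context-widening procedure discussed above in connection with Kaplan's study, in which a word is compared with successively more neighbors until its possible equivalents reduce to one, can be sketched as follows. The tiny lexicon, the cue words, and the function names are all hypothetical, and the sketch deliberately adopts the crude model criticized in the text (a fixed list of 'meanings' per word, with neighbors acting only to support or fail to support a candidate). It is meant only to show how the number of surviving candidates might be tabulated against size of context, not to reproduce any author's method.

```python
from typing import Optional

# Hypothetical toy lexicon: each ambiguous word maps candidate renderings
# to the neighboring cue words taken to support that rendering.
LEXICON = {
    "bank": {
        "river-bank": {"river", "water", "shore"},
        "money-bank": {"money", "loan", "account"},
    },
}


def candidates(tokens: list[str], index: int, window: int) -> set[str]:
    """Renderings of tokens[index] still possible given `window` neighbors
    on each side. If no cue word is in range, every rendering survives."""
    senses = LEXICON.get(tokens[index])
    if senses is None:
        return {tokens[index]}  # unambiguous word: one-one equivalence
    lo, hi = max(0, index - window), index + window + 1
    context = set(tokens[lo:index]) | set(tokens[index + 1:hi])
    supported = {sense for sense, cues in senses.items() if cues & context}
    return supported or set(senses)


def smallest_window(tokens: list[str], index: int, max_window: int) -> Optional[int]:
    """Smallest window that leaves exactly one rendering, or None if the
    word is still indeterminate at max_window."""
    for window in range(max_window + 1):
        if len(candidates(tokens, index, window)) == 1:
            return window
    return None


if __name__ == "__main__":
    sentence = "he opened an account at the bank".split()
    target = sentence.index("bank")
    for w in range(5):
        print(w, sorted(candidates(sentence, target, w)))
```

Averaging the number of surviving candidates over many words at each window size would give a curve of indeterminacy against context of the general kind the text describes, subject to all the objections raised there about what counts as a 'meaning' and as 'context'.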
Application to the Economics of Language Processing

It may be objected that it is still much too early to embark on a serious study of the economic aspects of MT. It is necessary, however, from time to time to reassure those concerned that the scale of the enterprise is not wholly disproportionate to the sums which its ultimate users will be prepared to devote to the necessary equipment. It can hardly be said that adequate data yet exist on which to base an informed answer to the question, "How big a computer must one have to do mechanical translation properly?" The question is of course a statistical one and in this sense is relevant to the present enquiry but it need not detain us long. Several workers have referred to the problem, but only Yngve [14] has given any detailed estimates. Their worth is somewhat dependent on accepting a particular view of the nature of the MT procedure but may be accepted to an order of magnitude, at least until more substantial data are available.

Coding and Code Compression

In large measure the coding problems arising in MT and in library work are the same as those occurring in other branches of communication engineering. The need for code compression perhaps arises more urgently in MT, because of the great bulk of the material to be stored, but the mathematical problems it presents are the same as in other fields, except where, as in the use of thesaurus methods, the mathematical structure of the information to be coded imposes special restrictions.

I do not intend to refer to the already considerable literature on code compression. Specific applications to MT have been discussed by Mooers [15]. This work however depends on using a tree-type semantic classification, as has hitherto been done in most information retrieval systems. The statistics of the process would be appreciably different in a lattice system.

Less specific to our immediate subject are the methods, many of them well known, for compressing alphabetic codes. Quite powerful methods are possible here because of the very great redundancy in alphabetic writing. They are discussed, in general terms and without statistical analysis, by Mukhin [16] and Panov [17]. In general it may be said that none of this work is either controversial or novel; but the statistics of code compression in thesaurus systems is still (as far as published work goes) an unexplored field.

Cryptography

As for coding problems, there is a large literature on cryptography and code design which I do not intend to explore. There are however some special points of contact between cryptography and language research in which statistics could play a part. Yngve [18] has written an interesting paper in which he treats of the translation problem (especially translation out of unknown languages) as a special case of the problem of decoding a message without the advantage of a complete code-book to do so. The approach potentially involves the use of statistics, and, while Yngve does not carry the analysis far enough to make actual calculations, it is clear that this could be done. The difficulty is that the analogy between translation and the decipherment of a coded message is really more metaphorical than strictly formal. It is therefore unclear how far the results of such investigations will really be relevant.

General Commentary

Of the two main ways in which statistics can be applied to scientific enquiry, the observational and the predictive, only the first has really been explored in our field. Observational statistics requires that there be a population of entities of which we cannot hope to acquire a complete knowledge, although we can obtain such knowledge of small samples of the population. These samples have to be taken subject to certain rather rigid precautions and in most statistical work are either created by carefully designed experiments or obtained by properly planned observations on the population as it exists in nature.

In the lexicographic applications these prerequisites are not very well met. When the population is the words in a dictionary, it is not a population of which our knowledge is fragmentary in the sense required. On the contrary, we already know (or someone must know) everything about them that we shall ever discover by our analysis, else the dictionary could not have been written. When the population is composed of words in a text, we are in no better position, for although here a real population exists, we either sample the whole population, in which case what we do is not really statistics but census-taking, or we postulate the existence of a population of which our text is a sample. This is in fact what most of the workers along this line appear to do, but it embodies a statistical fallacy, namely, that of creating a sample by definition. It is legitimate to define a population, ostensively or otherwise, and then set about obtaining samples from it, for then the legitimacy of the sampling procedure is open to test and discussion; it is not legitimate to ostend a sample and say "let there be a population of which this is a sample," for then there is no sampling procedure, and the assumptions of probability theory, on which the analysis of the results must be based, will not be correct.

The same objection does not apply to the application of statistics to the study of approximative methods of translation. Here the criticism which suggests itself, against all the work in this field, is the very artificial character of the systems studied. One feels it would hardly be worth while to do very much calculation on such systems. In fact, hardly any has been done. Many have said that they recognize the problem as statistical, but even those who, like Kaplan [13], actually set out figures do not actually subject them to real statistical analysis. The application of statistics to these approximative methods is still more a potentiality than a fact.

This indeed is largely true of the whole field. There has been far more written about statistical work in translation and information retrieval than actual work done. Apparently no one has yet clearly stated the very limited nature of the applications possible, but many have borne witness to it by inaction. Broadly speaking, the populations which it would be valuable to have information upon are those provided by mechanically translated texts themselves, and the reason that we want to have the information is so as to be able to spot what is wrong with the translation procedure used. Human texts are not suitable material for the statistician because the information we can hope to get from them is either already available or is more efficiently extracted by the methods of the linguist than by those of the statistician.

The indeterminacy which does exist in language is the indeterminacy which arises from the mapping of a continuous territory onto a chart with a finite resolving power; it is not the result of an intrinsically indeterminate use of a discrete set of symbols however complicated. This being so, language can certainly be described in statistical terms. But there is no point in describing it, because the object of the translator (human or mechanical) is instead to use it, in the same sense that one uses a mathematical system to calculate with. Since we shall never do this 'perfectly,' it will always be worth while to estimate the gravity of our failures and this will be a large enough field for the statistician for a long time. But this activity will only begin when the output of failures becomes copious enough to provide the statistician with large populations and the opportunity of applying proper sampling methods to them. This has not yet happened.

Many of those who have written on this subject seem to have the unexpressed belief that there is in language, or our use of it, something essentially indefinite which can be dealt with mathematically only in statistical terms. If this were so, the conveyance of precise information by talking would be impossible. To some extent the area of possible meanings of a remark can be regarded as a probability distribution, but it is of the kind that is almost everywhere zero and has a finite value only within a restricted region. If we deal in 'areas of meaning' instead of in point-like 'right' and 'wrong' meanings, there are indeed definite rules which tell us what remarks do not mean. Deliberately ambiguous statements can be made in all languages, but even these can be recognized as such by the rules. The problem for the translator is to find out the rules of the languages concerned and to apply them. It is conceivable that this is too difficult for a machine to do; in that case, perhaps a statistical approximation to the desired translation would be a next-best. But it is a substitute, not the real thing.

This paper was written with the support of the National Science Foundation, Washington, D. C.

14. V. H. Yngve, "The technical feasibility of translating languages by machine," Transactions AIEE, Paper 56-928 (1956).
15. C. N. Mooers, "Zatocoding and developments in information retrieval," Aslib Proceedings, vol. 8, pp. 3-19 (1956).
16. I. S. Mukhin, An Experiment in Machine Translation Carried out on the BESM, Academy of Sciences of the USSR, Moscow (1956).
17. D. Panov, Concerning the Problem of Machine Translation of Languages, Academy of Sciences of the USSR, Moscow (1956).
18. V. H. Yngve, "The translation of languages by machine," Information Theory, (Third London Symposium), Butterworth's Scientific Publications (London), pp. 195-205.

The following comments were received from people whose work is mentioned in the preceding article. These comments are published with the permission of those concerned.

I agree with the point of view expressed in this paper by Parker-Rhodes, but I fail to see the relevance that he notes of my work on gap analysis to the approximative approach to MT. The gap analysis procedures were intended as a tool for the linguist who wants to discover non-approximative methods in MT.

I would like to see a clear distinction made between analysis of a language for the purpose of deducing its rules or structure, and analysis of a sentence to obtain its structure for possible use when translating it by machine. We may not be able to mechanize the former as easily as the latter. These two kinds of analysis are as different as the science of chemistry, aiming to discover the general laws of chemical composition and reaction, and the analysis of an unknown compound or mixture for its ingredients and their mode of combination.

V. H. Yngve

Footnote 5, and the accompanying sentence in the text (page 2, second paragraph) should be deleted, as factually inaccurate. No such statement is made in Syntactic Structures. Statistics is discussed only on pp. 16, 17; lexicography is not mentioned at all.

Noam Chomsky

I am sorry to say that the wide range of items covered by Parker-Rhodes and the (to me) excessive economy of words made it difficult to follow him in several places, including the section where he deals with my own piece on "Article Requirements of Plural Nouns in Russian Chemistry Texts." Frankly, I'm not sure that I understand what he is objecting to.

He did not challenge the accuracy or usefulness of the principle of article insertion I proposed or even fault the statistical methodology, as far as I could make out. May I add, for what it may be worth, that I submitted my paper in advance of delivery to a professor of statistics from Stanford, who found my approach wholly acceptable. In the semi-public demonstration of the Lukjanow code-matching technique held in Washington on August 20th, the percentage of correct article placement (in some 300 sentences, including those in the random text) tallied perfectly with the percentage mentioned in my paper. Parker-Rhodes's statement "It is unclear why it should be supposed any 'easier' than using real linguistics to do the job" (p. 6) is particularly baffling. Since the article study originated with and was based wholly on an analysis primarily of English usage and possible Russian morphologico-syntactic decision points, and various counts made afterwards only to ascertain whether the formulation provided "useful" predictability, the implication that the tail wagged the dog is certainly unwarranted.

It was not my intention to use statistics to "solve" the problem; rather to indicate that the formulations suggested permit mechanical insertion or omission of articles with a fairly high degree of accuracy. I can't see how statistics as such are useful in MT except as indicators of the validity of a proposed solution.

In my view there is no single solution of a foreign text. Some 15 years' experience as a translation editor, translator (both of scientific and purely literary works), and student of the art of translation have led me to believe that there are likely to be as many versions or solutions of a text (with varying quality, of course) as there are translators. The acceptability of a given translation rests with the individual reader whose reactions are dictated by his background knowledge of the subject, sensitivity to the nuances of his native language, and the use to which he intends to put the translation. That is why I am a proponent of "approximationism" in language which I think reflects the reality of the human potential, however weak, rather than the ideal, however desirable.

What is needed now as far as the articles are concerned is not more statistical information per se but greater insight into the way they are behaving today. As you know, English article usage has been evolving over a long period of time and the process is far from complete. Under the present influence of the radio and, particularly, the press, with its emphasis on conciseness, there seems to be a trend away from the article in certain types of constructions, e.g. with abstract nouns in possessive phrases. Elsewhere speakers not infrequently have a choice between "a" and "the", etc., with faint semantic or even idiomatic difference between either. How much precision can we (or should we try to) build into a/the translation machine?

Sidney Glazer

Dr. Gould's untimely and tragic death in the Alps last summer precludes a personal comment on his part. I feel sure, however, that he would wish simply to let his published work speak for itself.

Anthony G. Oettinger