Scientific report: "THE DISTRIBUTION OF WORD LENGTH IN TECHNICAL RUSSIAN"
[Mechanical Translation, vol.1, no.3, December 1954; pp. 38-40]

THE DISTRIBUTION OF WORD LENGTH IN TECHNICAL RUSSIAN

Anthony G. Oettinger
Computation Laboratory, Harvard University

In the course of an analysis of several samples of technical Russian undertaken as part of a study in mechanical translation, a number of statistical data reflecting the structure of these samples were compiled. One of these, the distribution of word length, is presented here as Fig. 1.

The theoretical interest of this distribution arises from the possibility of using it as a basis for an operational definition of words in printed texts. If texts are considered purely as sequences of symbols including the letters, punctuation marks, and space, the resulting sequences are of a length which no practicable machine can manage. A study of the distribution of the number of symbols between pairs of successive symbols of certain classes would be one way to reveal structural characteristics of the text sequences potentially useful toward the definition of manageable and significant subsequences. The subsequences included between successive occurrences of letter pairs have not been investigated. Those included between successive pairs of periods, exclamation points or question marks can be identified with the classical sentence, and finally, those included between successive pairs of punctuation marks or spaces can be identified with words. The length distribution of the latter subsequences has the desirable property, not shared by the others, of being concentrated at relatively low values of length, and of having no elements exceeding a certain length (Fig. 1). Words, defined in this fashion, can readily be identified by a machine and they are of limited variety, so that their listing in a dictionary is practicable.

From the practical point of view, the distribution is useful in planning input and storage facilities in experimental translating equipment.

The samples used were relatively small, and Fig. 1 should therefore be interpreted with great caution. The bar graph represents the distribution of a sample totalling 6,486 words. Points are used to indicate the distributions obtained from smaller constituents of the total. The scattering is such as to indicate that samples 1, 2, and 3 differ significantly from one another in details of their distributions. An examination of the texts indicates that these differences can safely be attributed to differing subject matter and styles. However, all distributions are bimodal, perhaps trimodal, and cut off at k = 18. The mode about k = 7 is attributable to the large number of different words used to define the particular subject of each text. The peaks at k = 1 and at k = 3 are due to a small number of very frequent "grammatical words," that is, prepositions, conjunctions, etc. The five most frequent words of length 1, 2, and 3 in the total sample are listed in Table 1. This table shows that the most frequent two-letter words are consistently less frequent than three-letter words of similar rank. One- and two-letter words are exclusively grammatical; 90% of the three-letter words are also grammatical, leaving 10% dependent on the subject matter. The words of length 4 are nearly all inflected. The fact that only very few Russian words have stems of three or fewer letters probably accounts for the valley at k = 4. Indications thus are that the modal and cut-off structure of the distributions are functions of the structure of the Russian language, while variations within these structures are characteristic of individual authors. For those who might wish to draw their own conclusions, the raw data is given in Table 2, and the sources of the samples are listed in Table 3. Letter, digram and suffix distributions compiled from the same samples may be found in the reference.

TABLE 1

Length 1          Length 2          Length 3
Word  Freq.       Word  Freq.       Word  Freq.
v     210         na     86         pri    93
i     165         iz     57         dlja   72
s      91         po     46         chto   50
k      43         ot     28         kak    29
a      21         ne     26         ili    22
Figure 1. Distribution of word length (bar graph; horizontal axis: k, length in letters).
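The operational definition of a word given in the text, a subsequence lying between successive punctuation marks or spaces, can be sketched in Python as follows. This is only an illustration: the separator set, the function name `word_lengths`, and the sample sentence are assumptions made here, since the paper does not list the punctuation marks it used and its counts were not produced by this code.

```python
import re
from collections import Counter

# Assumed separator set: whitespace plus a few common punctuation marks.
# The paper does not enumerate the marks it treated as word boundaries.
SEPARATORS = r'[\s.,;:!?()\[\]"«»]+'

def word_lengths(text: str) -> Counter:
    """Count words by length k, where a word is a run of symbols lying
    between successive separators (punctuation marks or spaces)."""
    words = [w for w in re.split(SEPARATORS, text) if w]
    return Counter(len(w) for w in words)

# Made-up sentence for illustration; it is not taken from the samples.
sample = "Прибор состоит из фильтра и усилителя, как и в схеме."
dist = word_lengths(sample)

for k in sorted(dist):
    print(k, dist[k])
```

Splitting on separators rather than matching letters keeps the sketch close to the wording of the definition: anything between two boundary symbols counts as a word, whatever it contains.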
TABLE 2

Word                       Frequency
length   Sample 1   Sample 2   Sample 3a   Sample 3b   Total
   1        67        204        178          88        537
   2        36        147        114          54        351
   3        40        170        148          80        438
   4        43        130        107          45        325
   5        74        203        183         117        577
   6        61        258        161          99        579
   7        89        332        245         129        795
   8        49        209        212         121        591
   9        49        209        211          88        557
  10        31        281        138          67        517
  11        17        208        118          66        409
  12        25        127         98          47        297
  13        18         94         72          41        225
  14        20         50         29          10        109
  15         5         54         28          13        100
  16         4         28         16           5         53
  17         2          5          9           4         20
  18         0          0          5           1          6

TABLE 3

1. A. G. Lunts, 1950, "Prilozhenie Matrichnoj Bulevskoj Algebry k Analizu i Sintezu Relejno-Kontaktnyx Sxem," Doklady Akademii Nauk SSSR, 70, pp. 421-23.
2. K. V. Vladimirskij, 1951, "O Sinxronnom Fil'tre," Zhurnal Eksperimental'noj i Teoreticheskoj Fiziki, 21, pp. 2-10.
3. B. P. Aseev, 1947, Osnovy Radiotexniki (Moskva: Svjaz'izdat)
   (a) pp. 10, 18, 20, 21, 23, 33, 37, 42, 45, 49, 55 (part);
   (b) pp. 55 (part), 59, 64, 65, 71, 122

REFERENCE

Oettinger, A. G., "A Study for the Design of an Automatic Dictionary," Doctoral Thesis, Harvard University (1954).
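For readers who wish to work with the raw data, the following Python sketch takes the Total column of Table 2 and locates the peaks and valleys discussed in the text. It also shows one way the distribution might inform storage planning. The function names and the 95% coverage threshold are assumptions chosen here for illustration; the paper does not specify a method or a threshold.

```python
# Total column of Table 2, indexed by word length k = 1..18.
totals = [537, 351, 438, 325, 577, 579, 795, 591, 557,
          517, 409, 297, 225, 109, 100, 53, 20, 6]

n_words = sum(totals)  # size of the total sample

def local_extrema(counts):
    """Return (peaks, valleys) as lists of lengths k whose count is
    strictly above (or below) both neighbours. The counts are padded
    with zeros, since no words of length 0 or beyond the cut-off occur."""
    padded = [0] + list(counts) + [0]
    peaks, valleys = [], []
    for k in range(1, len(counts) + 1):
        left, here, right = padded[k - 1], padded[k], padded[k + 1]
        if here > left and here > right:
            peaks.append(k)
        elif here < left and here < right:
            valleys.append(k)
    return peaks, valleys

def min_length_covering(counts, fraction):
    """Smallest k such that words of length <= k make up at least the
    given fraction of the sample (hypothetical storage-planning aid)."""
    total = sum(counts)
    running = 0
    for k, c in enumerate(counts, start=1):
        running += c
        if running >= fraction * total:
            return k
    return len(counts)

peaks, valleys = local_extrema(totals)
print("words:", n_words)
print("peaks at k =", peaks, "valleys at k =", valleys)
print("length covering 95% of words:", min_length_covering(totals, 0.95))
```

On these totals the sketch reports peaks at k = 1, 3, and 7, in line with the text. Besides the valley at k = 4, the dip at k = 2 between the two short-word peaks also registers as a local minimum under this strict-neighbour rule.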