Industrial University of Ho Chi Minh City (Trường Đại học Công nghiệp Tp. HCM), Faculty of Information Technology

N.L.P. NATURAL LANGUAGE PROCESSING

• Teacher: Lê Ngọc Tấn
• Email: letan.dhcn@gmail.com
• Blog: http://lengoctan.wordpress.com

Chapter 4 Computational Linguistics

NLP. p.2

What is computational linguistics?

• It is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective

• Corpus, Corpora

• Pre-processing: normalization, tokenization, …

• Alignment Methods

• Programming

Corpus Definitions

• What is a corpus?

– It contains a large number of texts
– Corpora: the plural of corpus, i.e. a set of corpora

• Golden (gold-standard) corpus

– Brown Corpus
– Susanne Corpus
– EUROPARL Corpus

• A corpus can be annotated, for example with part-of-speech (POS) tags
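
As a minimal sketch of what an annotated corpus can look like in code, the Python snippet below stores a POS-tagged corpus as lists of (token, tag) pairs and counts how often each tag occurs. The two sentences and the tag names (DET, NOUN, VERB, PUNCT) are illustrative assumptions; they are not taken from the Brown, Susanne or EUROPARL corpora.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (token, POS tag) pairs.
# The sentences and the tag set are made up for illustration.
tagged_corpus = [
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")],
    [("A", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", "PUNCT")],
]

def tag_frequencies(corpus):
    """Count how often each POS tag occurs in the corpus."""
    return Counter(tag for sentence in corpus for _, tag in sentence)

def strip_annotation(corpus):
    """Recover the raw (unannotated) sentences from the tagged corpus."""
    return [[token for token, _ in sentence] for sentence in corpus]

freqs = tag_frequencies(tagged_corpus)
raw = strip_annotation(tagged_corpus)
```

Keeping the annotation next to each token means the raw text can always be recovered, while the tags remain available for later processing steps.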

Corpus Categories (1)

• Schema of corpus evolution

Corpus Categories (2)

• What is a comparable corpus?

– A corpus whose texts are not parallel (they are not translations of each other) but are still closely related because they convey the same information

Corpus Categories (2)

• What is a noisy parallel corpus?

– A corpus whose sentences are stored in pairs across two languages, but where the alignment is not strictly 1-1

Corpus Categories (3)

• What is a parallel corpus?

– A corpus in which sentences are stored in pairs, one sentence in each of two languages

Corpus Categories (4)

• What is a bilingual parallel corpus?

– Among multilingual corpora, a corpus in which the text in one language is a translation of the text in another language, with exactly two languages involved
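
A bilingual parallel corpus can be sketched in Python as a list of sentence pairs. The snippet below is only an illustration under the assumption that both texts are already split into sentences and aligned 1-1; the names `SentencePair` and `build_parallel_corpus` and the English-French sentences are hypothetical.

```python
from typing import List, NamedTuple

class SentencePair(NamedTuple):
    source: str  # sentence in language 1
    target: str  # its translation in language 2

def build_parallel_corpus(source_lines: List[str],
                          target_lines: List[str]) -> List[SentencePair]:
    """Pair line i of the source text with line i of the target text.

    Assumes the two texts are already sentence-aligned 1-1. A length
    mismatch suggests a noisy parallel corpus and is rejected here.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("texts differ in length: not a clean 1-1 parallel corpus")
    return [SentencePair(s.strip(), t.strip())
            for s, t in zip(source_lines, target_lines)]

# Hypothetical English-French example sentences.
english = ["Good morning.", "Thank you very much."]
french = ["Bonjour.", "Merci beaucoup."]
corpus = build_parallel_corpus(english, french)
```

A noisy parallel corpus would not pass the length check in general, which is why such corpora need an explicit alignment step before they can be used this way.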

Parallel corpora application

• Teaching second languages
• Translation didactics
• Terminology studies
• Multilingual publishing
• Product internationalization
• Automatic (machine) translation
• Multilingual information retrieval

Alignment Methods (1)

• Approaches

– The alignment techniques make the corpora useful and exploitable

• Methods

– Text alignment
– Sentence alignment
– Word alignment

• How to evaluate the alignment methods?

Alignment Methods (2)

• The alignment types between language 1 and language 2:

– 1 – 0: omission
– 0 – 1: addition
– 1 – 1: exact correspondence
– m – n: fusion, with m > 1 and n > 1
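
The alignment types listed on this slide can be sketched as a small Python function that names the type of an m – n alignment. The function name `alignment_type` is hypothetical, and the label "other" for cases the slide does not name (for example 1 – 2 or 2 – 1) is an assumption of this sketch.

```python
def alignment_type(m: int, n: int) -> str:
    """Name the alignment type for m units in language 1 and n in language 2."""
    if m < 0 or n < 0 or (m == 0 and n == 0):
        raise ValueError("m and n must be non-negative and not both zero")
    if (m, n) == (1, 0):
        return "omission"              # present in language 1 only
    if (m, n) == (0, 1):
        return "addition"              # present in language 2 only
    if (m, n) == (1, 1):
        return "exact correspondence"
    if m > 1 and n > 1:
        return "fusion"
    # Cases such as 1-2 or 2-1 are not named on the slide.
    return "other"

# Hypothetical sequence of alignments produced for one document pair.
alignments = [(1, 1), (1, 0), (2, 2), (0, 1)]
labels = [alignment_type(m, n) for m, n in alignments]
```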

Alignment Methods (3)

• Difficulties of alignment methods:

– Syntactic analysis
– Structural differences between languages

How to evaluate alignment methods?

• How to evaluate the alignment methods?

– Calculate the precision, the recall and the F-measure
– Calculate the error rates
– Calculate metrics such as BLEU, NIST, TER, PER, PER*, …

• The F-measure, or F1 score, is:

– A measure (score) of a test's accuracy
– A weighted (harmonic) average of the precision and the recall
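
As a minimal sketch of this evaluation, the Python snippet below computes precision, recall and the F1 score for a set of predicted alignment links against a gold-standard set. It assumes the usual definitions: precision = correct / predicted, recall = correct / gold, and F1 = 2PR / (P + R). The function name and the example links are hypothetical; error rates and metrics such as BLEU are not covered.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted alignment links with gold-standard links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical links: (sentence index in language 1, sentence index in language 2).
gold_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted_links = {(0, 0), (1, 1), (2, 3)}
p, r, f1 = precision_recall_f1(predicted_links, gold_links)
```

In this example two of the three predicted links are correct and two of the four gold links are found, so precision is 2/3, recall is 1/2 and F1 is 4/7.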

Normalization, Lemmatization and Tokenization

• Normalization

– To process a corpus into one standard format

• Lemmatization

– To determine the lemma for a given word
– To group together the different inflected forms of a word so they can be analyzed as a single item

• Tokenization

– To break a text into words, symbols, phrases or other meaningful elements
– Token: each resulting element
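
The three pre-processing steps on this slide can be sketched with the Python standard library alone. This is only an illustration under simplifying assumptions: normalization here means Unicode NFC, lowercasing and whitespace cleanup; tokenization is a simple regular expression; and `LEMMAS` is a tiny hypothetical dictionary standing in for a real lemmatizer.

```python
import re
import unicodedata

# Hypothetical, tiny lemma dictionary; a real lemmatizer would rely on a
# full lexicon or on morphological analysis.
LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "was": "be", "were": "be"}

def normalize(text: str) -> str:
    """Bring text into one standard form: NFC Unicode, lowercase, single spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def lemmatize(tokens: list) -> list:
    """Map each token to its lemma; unknown tokens are returned unchanged."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

tokens = tokenize(normalize("The  CATS were\trunning."))
lemmas = lemmatize(tokens)
```

Running normalization first keeps the later steps simple: the tokenizer and the lemma lookup only ever see lowercase text with regular spacing.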