Industrial University of Ho Chi Minh City, Faculty of Information Technology
N.L.P. NATURAL LANGUAGE PROCESSING
Teacher: Lê Ngọc Tấn Email: letan.dhcn@gmail.com Blog: http://lengoctan.wordpress.com
Chapter 4 Computational Linguistics
NLP. p.2
What is computational linguistics?
It is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective
Corpus, Corpora
Pre-processing: normalization, tokenization, …
Alignment Methods
Programming
Corpus Definitions
What is a corpus?
– It contains a large number of texts
– Corpora: the plural of corpus
Golden corpus
– Brown Corpus
– Susanne Corpus
– EUROPARL Corpus
A corpus can be annotated, for example POS-tagged (part-of-speech tagged)
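A minimal sketch of what a POS-tagged corpus can look like in code. The sentences, the tag names (DET, NOUN, VERB) and the helper `tag_counts` are illustrative assumptions, not taken from any of the corpora listed above.

```python
from collections import Counter

# A tiny hand-made corpus: each sentence is a list of (word, POS tag) pairs.
# Both the sentences and the tag set are assumptions made for illustration.
tagged_corpus = [
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("A", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
]


def tag_counts(corpus):
    """Count how often each POS tag occurs in the whole corpus."""
    return Counter(tag for sentence in corpus for _, tag in sentence)


counts = tag_counts(tagged_corpus)
print(counts["NOUN"])  # number of tokens tagged NOUN
```

Storing the annotation next to each word keeps the raw text recoverable while making tag statistics easy to compute.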
Corpus Categories (1)
Schema of corpus evolution
Corpus Categories (2)
What is a comparable corpus?
A corpus whose data are not parallel but are still closely related because they convey the same information
Corpus Categories (2)
What is a noisy parallel corpus?
A corpus whose sentences are stored in pairs in two languages, but in which the alignment is not strictly 1-1
Corpus Categories (3)
What is a parallel corpus?
A corpus whose sentences are stored in pairs, one sentence in each of two languages
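A minimal sketch of a parallel corpus held as a list of sentence pairs. The example sentences and the helper name `make_parallel_corpus` are assumptions for illustration; real corpora are usually read from aligned files.

```python
# Hypothetical data: one sentence per entry, in the same order in both languages.
english = ["Hello.", "How are you?", "Thank you."]
french = ["Bonjour.", "Comment allez-vous ?", "Merci."]


def make_parallel_corpus(source_sentences, target_sentences):
    """Pair the i-th source sentence with the i-th target sentence."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("both sides must contain the same number of sentences")
    return list(zip(source_sentences, target_sentences))


parallel_corpus = make_parallel_corpus(english, french)
for source, target in parallel_corpus:
    print(source, "|||", target)
```

The length check reflects the assumption that every sentence has exactly one counterpart; a noisy parallel corpus would not satisfy it.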
Corpus Categories (4)
What is a bilingual parallel corpus?
In multilingual corpora, the corpus in one language is a translation of the corpus in another language; in a bilingual parallel corpus there are only two languages
Parallel corpora applications
Teaching second languages
Translation didactics
Terminology studies
Multilingual edition
Product internationalization
Automatic translation
Multilingual information retrieval
Alignment Methods (1)
– Alignment techniques make the corpora useful and exploitable
Approaches
– Text alignment
– Sentence alignment
– Word alignment
Methods
How to evaluate the alignment methods?
Alignment Methods (2)
The alignment types between language 1 and language 2:
– 1-0: omission
– 0-1: addition
– 1-1: exact correspondence
– m-n: fusion, with m > 1 and n > 1
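The four alignment types can be sketched as a small lookup on the pair (m, n), where m is the number of sentences on the language 1 side and n the number on the language 2 side. The function name `alignment_type` is hypothetical, and labelling any pair outside the four listed types (for example 2-1) as "other" is an assumption of this sketch.

```python
def alignment_type(m, n):
    """Name the alignment type for m source units and n target units."""
    if m < 0 or n < 0 or (m == 0 and n == 0):
        raise ValueError("m and n must be non-negative and not both zero")
    if (m, n) == (1, 0):
        return "omission"
    if (m, n) == (0, 1):
        return "addition"
    if (m, n) == (1, 1):
        return "exact correspondence"
    if m > 1 and n > 1:
        return "fusion"
    # Assumption: pairs such as 2-1 are not covered by the four listed types.
    return "other"


for pair in [(1, 0), (0, 1), (1, 1), (2, 2)]:
    print(pair, alignment_type(*pair))
```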
Alignment Methods (3)
Difficulties with alignment methods:
– Syntactic analysis
– Differences in structure between languages
How to evaluate alignment methods?
– Calculate the precision, the recall and the F-measure
– Calculate the error rates
– Calculate metrics such as BLEU, NIST, TER, PER, PER*, …
How to evaluate the alignment methods?
F-measure or F1 score is:
– A measure or score of a test's accuracy
– A weighted average (the harmonic mean) of the precision and the recall
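A minimal sketch of how precision, recall and the F1 score could be computed for an alignment, assuming the alignment is represented as a set of (source index, target index) links and compared with a reference alignment. The function name and the example links are hypothetical.

```python
def precision_recall_f1(predicted, reference):
    """Compare predicted alignment links with reference links."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)  # links found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical links: (sentence index in language 1, sentence index in language 2)
reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}

p, r, f = precision_recall_f1(predicted, reference)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Here two of the three predicted links are correct and two of the four reference links are found, so precision is 2/3, recall is 1/2 and F1 is 4/7.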
Normalization, Lemmatization and Tokenization
Normalization
– To process a corpus into one standard format
Lemmatization
– To determine the lemma for a given word
– To group together the different inflected forms of a word so they can be analyzed as a single item
Tokenization
– To break a text into words, symbols, phrases, or other meaningful elements
– Each such element is a token
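A minimal sketch of the three pre-processing steps using only the Python standard library. The normalization rules (Unicode NFC, lowercasing, collapsing whitespace), the regular expression used for tokens, and the toy `LEMMAS` dictionary are all assumptions for illustration; real systems rely on much larger resources.

```python
import re
import unicodedata

# Toy lemma dictionary (assumption): inflected form -> lemma.
LEMMAS = {"cats": "cat", "were": "be", "running": "run", "ran": "run"}


def normalize(text):
    """Bring text into one standard format."""
    text = unicodedata.normalize("NFC", text)  # one Unicode representation
    text = text.lower()                        # one letter case
    return re.sub(r"\s+", " ", text).strip()   # single spaces only


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatize(tokens):
    """Replace each token by its lemma when the dictionary knows it."""
    return [LEMMAS.get(token, token) for token in tokens]


text = "The  cats were RUNNING."
tokens = tokenize(normalize(text))
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

Normalizing before tokenizing means the lemma lookup only has to handle lowercase forms.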