Industrial University of Ho Chi Minh City<br />
Faculty of Information Technology<br />
<br />
N.L.P.<br />
NATURAL LANGUAGE PROCESSING<br />
Teacher: Lê Ngọc Tấn<br />
Email: letan.dhcn@gmail.com<br />
Blog: http://lengoctan.wordpress.com<br />
<br />
<br />
Chapter 4<br />
Computational Linguistics<br />
<br />
<br />
What is computational linguistics?<br />
<br />
<br />
Computational linguistics is an interdisciplinary field that deals with<br />
the statistical or rule-based modeling of natural language from a<br />
computational perspective.<br />
<br />
<br />
<br />
Corpus, Corpora<br />
<br />
<br />
<br />
Pre-processing: normalization, tokenization, …<br />
<br />
<br />
<br />
Alignment Methods<br />
<br />
<br />
<br />
Programming<br />
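<br />
As a first illustration of the pre-processing item in the outline above, normalization and tokenization might be sketched as follows. The function names `normalize` and `tokenize`, the choice of lowercasing, and the token regex are assumptions made for this example, not something prescribed by the course material.<br />

```python
import re
import unicodedata
from typing import List


def normalize(text: str) -> str:
    """Illustrative normalization: Unicode NFC, lowercase, collapse whitespace."""
    # NFC composes a base letter and its combining marks into one code point
    # where possible, which matters for accented text such as Vietnamese.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# A token is either a run of word characters or a single non-space symbol.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Illustrative tokenizer: split into words and punctuation marks."""
    return _TOKEN_RE.findall(text)


tokens = tokenize(normalize("  Hello,   World! "))
print(tokens)
```

Real pipelines often need language-specific rules (for example, word segmentation for Vietnamese), which this simple regex does not attempt.<br />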
<br />
Corpus Definitions<br />
<br />
<br />
What is a corpus?<br />
– A corpus is a large collection of texts<br />
– Corpora: the plural of corpus, i.e. a set of such collections<br />
<br />
<br />
<br />
Gold-standard corpora<br />
– Brown Corpus<br />
– Susanne Corpus<br />
– EUROPARL Corpus<br />
<br />
<br />
<br />
A corpus can be annotated, for example with part-of-speech (POS) tags<br />
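<br />
To make the idea of a POS-tagged corpus concrete, here is a minimal sketch that reads one line written in a `word/TAG` plain-text convention. The sample sentence, the tag names, and the helper `parse_tagged` are illustrative assumptions, not taken from any of the corpora listed above.<br />

```python
from collections import Counter
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words containing '/' stay intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: treat the token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


# Hypothetical tagged sentence, written by hand for this example.
sample = "The/DET cat/NOUN sat/VERB on/ADP the/DET mat/NOUN ./PUNCT"
tagged = parse_tagged(sample)

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)
print(tagged)
print(tag_counts)
```

Once the annotations are available as (word, tag) pairs, simple statistics such as tag frequencies can be computed directly from the corpus.<br />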
<br />
<br />
Corpus Categories (1)<br />
<br />
<br />
Schema of corpus evolution<br />
<br />
<br />