Industrial University of Ho Chi Minh City (Trường Đại học Công nghiệp Tp. HCM), Faculty of Information Technology

N.L.P. NATURAL LANGUAGE PROCESSING

Teacher: Lê Ngọc Tấn

Email: letan.dhcn@gmail.com

Blog: http://lengoctan.wordpress.com

Chapter 3: Basic principles for NLP

NLP. p.2

POS – part of speech tagging (1)

▪ The idea of parts of speech goes back, in the West, perhaps as far as Aristotle (384-322 BCE).

▪ From Dionysius Thrax of Alexandria (c. 100 BCE) comes the idea, still with us, that there are 8 traditional parts of speech:

– Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun

– School grammar: noun, verb, adjective, adverb, preposition, conjunction, interjection, pronoun


POS – part of speech examples for English

▪ N (noun): chair, bandwidth, pacing

▪ V (verb): study, debate, munch

▪ ADJ (adjective): purple, tall, ridiculous

▪ ADV (adverb): unfortunately, slowly

▪ P (preposition): of, by, to

▪ PRO (pronoun): I, me, mine

▪ DET (determiner): the, a, that, those

▪ CONJ (conjunction): and, or


Open vs. Closed Classes

▪ Open vs. Closed classes

– Closed:

• determiners: a, an, the

• pronouns: she, he, I

• prepositions: on, under, over, near, by, …

• Why “closed”? Because new members are rarely added to these classes.

– Open:

• nouns, verbs, adjectives, adverbs (new members are coined freely).


POS – part of speech tagging (2)

▪ Words often have more than one POS. Ex:

– the back door: JJ (adjective)

– on my back: NN (noun)

– promised to back the bill: VB (verb, base form)

▪ But within a given sentence, each occurrence of a word has exactly one POS tag.

▪ To do POS tagging, we need to choose a standard set of tags to work with, e.g. Penn TreeBank, Brown, …


POS tagsets

▪ There are many POS tagsets:

▪ MINIPAR

▪ Penn TreeBank: 36 tags, not counting punctuation tags

▪ VCL – Vietnamese Computational Linguistics (A.3.1, page 373 & A.4, page 378 of the textbook)


Penn TreeBank POS tagset


POS – part of speech tagging (3)

▪ Definition: the process of assigning a part-of-speech or lexical class marker to each word in a corpus.

▪ The POS tagging problem is to decide on the correct POS tag for a particular instance of a word.

▪ Example:

Input: I can can a can

Ambiguity: pronoun / auxiliary | verb | noun / auxiliary | verb | noun / determiner / auxiliary | verb | noun

Output: I/pronoun can/auxiliary can/verb a/determiner can/noun
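The ambiguity in the example above can be made concrete with a small sketch. It assumes a hypothetical hand-written lexicon (`LEXICON`) that holds only the candidate tags listed in the example, and it enumerates every tag sequence a tagger would have to choose from.

```python
from itertools import product

# Hypothetical toy lexicon: candidate tags per word, taken from the example above.
LEXICON = {
    "i": ["pronoun"],
    "can": ["auxiliary", "verb", "noun"],
    "a": ["determiner"],
}

def candidate_taggings(sentence):
    """Return every possible tag sequence for the sentence under LEXICON."""
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]

taggings = candidate_taggings("I can can a can")
print(len(taggings))  # 1 * 3 * 3 * 1 * 3 = 27 candidate sequences
```

Only one of these 27 sequences is the intended output; choosing it is the job of the tagging methods listed next.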


POS – Methods of tagging

▪ HMM: Hidden Markov Model

▪ Rule-based

▪ Maximum entropy

▪ Neural network

▪ Decision tree

▪ Transformation Based Learning (the most efficient and the most used method)

– fast-TBL method

– fTBL-toolkit


POS tagging – Evaluation

▪ Once you have your POS tagger running, how do you evaluate it?

– Overall error rate with respect to a gold-standard test set

– Error rates on particular tags

– Error rates on particular words

– Tag confusions...
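The four measures above can be sketched in a few lines. This is a minimal illustration, assuming the gold standard and the tagger output are aligned lists of (word, tag) pairs; the function name `evaluate` and the toy data are hypothetical.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Compare predicted (word, tag) pairs against gold-standard pairs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    tag_errors = Counter()   # errors counted per gold tag
    word_errors = Counter()  # errors counted per word
    confusions = Counter()   # (gold tag, predicted tag) pairs
    for (word, gold_tag), (_, pred_tag) in zip(gold, predicted):
        if gold_tag != pred_tag:
            tag_errors[gold_tag] += 1
            word_errors[word] += 1
            confusions[(gold_tag, pred_tag)] += 1
    error_rate = sum(tag_errors.values()) / len(gold)
    return error_rate, tag_errors, word_errors, confusions

# Toy data: the tagger mislabels the second "can" as an auxiliary.
gold = [("I", "pronoun"), ("can", "auxiliary"), ("can", "verb"),
        ("a", "determiner"), ("can", "noun")]
pred = [("I", "pronoun"), ("can", "auxiliary"), ("can", "auxiliary"),
        ("a", "determiner"), ("can", "noun")]
error_rate, tag_errors, word_errors, confusions = evaluate(gold, pred)
print(error_rate)  # 0.2 (one wrong tag out of five)
```

The `confusions` counter shows which tag was mistaken for which, here verb for auxiliary.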


POS tagging – Evaluation

▪ The result is compared with a manually coded “Gold Standard”

– Typically accuracy reaches 96-97%

– This may be compared with the result for a baseline tagger (one that uses no context).

▪ Important: 100% is impossible even for human annotators.
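A baseline tagger that uses no context can be sketched as "always pick the word's most frequent tag". The version below is an assumption about what such a baseline looks like, not a specific published tagger; the training data is a hypothetical toy corpus, and unknown words receive the overall most frequent tag.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Learn the most frequent tag per word, plus a default tag for unknown words."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_corpus:
        per_word[word.lower()][tag] += 1
        overall[tag] += 1
    best = {word: counts.most_common(1)[0][0] for word, counts in per_word.items()}
    default = overall.most_common(1)[0][0]
    return best, default

def tag_baseline(words, best, default):
    """Tag each word independently of its neighbours."""
    return [(word, best.get(word.lower(), default)) for word in words]

# Hypothetical toy training corpus (Penn TreeBank style tags).
train = [("the", "DT"), ("back", "NN"), ("door", "NN"), ("my", "PRP$"),
         ("back", "NN"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")]
best, default = train_baseline(train)
tagged = tag_baseline("promised to back the bill".split(), best, default)
print(tagged)
```

With this toy corpus "back" is tagged NN even where it is a verb, which is exactly the kind of error a context-free baseline makes and a context-aware tagger is expected to fix.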


Parsing

▪ Definition: parsing is the process of analyzing a text, made of a sequence of tokens, to determine its grammatical structure with respect to a given formal grammar.

▪ Parsing is also called syntactic analysis.

▪ There are two techniques in parsing:

– Top-Down

– Bottom-Up

▪ Parsing is always carried out with respect to a grammar, for example:

– Context Free Grammar (CFG)

– Probabilistic Context Free Grammar (PCFG)

– Lexical Functional Grammar (LFG)


Grammar


Simple Grammar


Sentence types

▪ Declaratives: A plane left.

S → NP VP

▪ Imperatives: Leave!

S → VP

▪ Yes-No Questions: Did the plane leave?

S → Aux NP VP

▪ WH Questions: When did the plane leave?

S → WH-NP Aux NP VP


Derivations

▪ A derivation is a sequence of rules applied to a string that accounts for that string

– Covers all the elements in the string

– Covers only the elements in the string


Top-Down Search

▪ Since we’re trying to find trees rooted in an S (sentence), why not start with the rules that give us an S?

▪ Then we can work our way down from there to the words.
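The top-down idea can be sketched as a small recursive-descent parser: start from S, expand rules, and keep only the expansions that end up matching the words. The grammar and lexicon below are hypothetical toy fragments chosen for illustration, not the course's sample grammar, and the grammar deliberately avoids left recursion, which this naive strategy cannot handle.

```python
# Hypothetical toy grammar: each non-terminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}
# Hypothetical toy lexicon: each POS maps to the words it can cover.
LEXICON = {
    "Det": {"that", "the", "a"},
    "Noun": {"flight", "plane", "book"},
    "Verb": {"book", "leave", "left"},
    "Aux": {"did"},
}

def expand(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover words starting at pos."""
    if symbol in LEXICON:
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            yield (symbol, words[pos]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand_sequence(symbol, rhs, words, pos)

def expand_sequence(symbol, rhs, words, pos):
    """Expand the right-hand side symbols left to right, backtracking on failure."""
    def helper(index, position, children):
        if index == len(rhs):
            yield (symbol, *children), position
            return
        for child, next_position in expand(rhs[index], words, position):
            yield from helper(index + 1, next_position, children + [child])
    yield from helper(0, pos, [])

def parse_top_down(sentence):
    """Return all S-rooted trees that cover the whole sentence."""
    words = sentence.lower().split()
    return [tree for tree, end in expand("S", words, 0) if end == len(words)]

print(parse_top_down("Book that flight"))
```

The parser first tries S → NP VP and S → Aux NP VP, which fail on the first word, and only then succeeds with S → VP; this wasted work on rules that cannot match the input is the typical cost of a purely top-down search.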


Sample grammar


Parsing “Book that flight”


Bottom-Up Parsing

▪ Of course, we also want trees that cover the input words. So we might also start with trees that link up with the words in the right way.

▪ Then work your way up from there to larger and larger trees.
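One simple way to illustrate the bottom-up direction is a shift-reduce search with backtracking: words are shifted onto a stack with one of their possible tags, and whenever the top of the stack matches the right-hand side of a rule it may be reduced to the left-hand side. The rules and lexicon below are the same kind of hypothetical toy fragment as before; this is a sketch of the idea, not an efficient parser.

```python
# Hypothetical toy rules: (left-hand side, right-hand side).
RULES = [
    ("S", ("NP", "VP")),
    ("S", ("Aux", "NP", "VP")),
    ("S", ("VP",)),
    ("NP", ("Det", "Noun")),
    ("VP", ("Verb", "NP")),
    ("VP", ("Verb",)),
]
# Hypothetical toy lexicon: possible tags per word.
LEXICON = {"book": ["Noun", "Verb"], "that": ["Det"], "flight": ["Noun"]}

def shift_reduce(stack, remaining, steps=()):
    """Return the list of actions that reduces the input to S, or None if there is none."""
    if not remaining and stack == ("S",):
        return list(steps)
    # Reduce: replace a rule's right-hand side on top of the stack by its left-hand side.
    for lhs, rhs in RULES:
        if stack[-len(rhs):] == rhs:
            action = f"reduce {lhs} -> {' '.join(rhs)}"
            result = shift_reduce(stack[:-len(rhs)] + (lhs,), remaining, steps + (action,))
            if result is not None:
                return result
    # Shift: move the next word onto the stack under each of its possible tags.
    if remaining:
        word = remaining[0]
        for tag in LEXICON.get(word, []):
            result = shift_reduce(stack + (tag,), remaining[1:], steps + (f"shift {word}/{tag}",))
            if result is not None:
                return result
    return None

actions = shift_reduce((), tuple("book that flight".split()))
for action in actions:
    print(action)
```

Here the search first tries "book" as a noun, builds an NP for "that flight", and only backtracks to the verb reading when no S can be formed; a bottom-up parser can build such locally plausible subtrees that never lead to a sentence.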


Parsing “Book that flight”


Ambiguity


Morphology, morphemes, stems and lemma

▪ Morphology is the study of the relationship between a language unit (a word) and its forms. Ex: book, books, house, house-hold

▪ A morpheme is the smallest meaningful unit of a word (it may coincide with a syllable, but need not). Ex:

information → inform-ation

reading → read-ing


Morphology, morphemes, stems and lemma

▪ Stemming: reducing a word to its stem by stripping affixes, so that words sharing the stem are grouped together. Ex:

pretties → pretty

useful → use

▪ Stems

▪ Stemma
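A stemmer can be approximated by a handful of suffix-stripping rules. The rule list below is a hypothetical toy set chosen to reproduce the examples above; it is not the Porter algorithm or any other published stemmer.

```python
# Hypothetical suffix rules, tried in order (longer suffixes first).
SUFFIX_RULES = [("ies", "y"), ("ful", ""), ("ing", ""), ("s", "")]

def stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["pretties", "useful", "reading", "books"]:
    print(w, "->", stem(w))
```

Because the rules look only at the surface form, such a stemmer also makes mistakes: with these rules "bus" is cut down to "bu".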


Morphology, morphemes, stems and lemma

▪ Lemmatization: to group together the different inflected forms of a word so they can be analyzed as a single item. Ex:

going → go

reading → read

▪ Lemma
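Unlike suffix stripping, lemmatization normally relies on a dictionary and on the word's part of speech. The sketch below assumes a hypothetical hand-written table `LEMMAS` keyed by (word form, POS); words that are not in the table are returned unchanged.

```python
# Hypothetical lemma table: (inflected form, POS) -> lemma.
LEMMAS = {
    ("going", "VERB"): "go",
    ("went", "VERB"): "go",
    ("reading", "VERB"): "read",
    ("mice", "NOUN"): "mouse",
}

def lemmatize(word, pos):
    """Look up the lemma for a word form with the given POS; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

print(lemmatize("going", "VERB"))    # go
print(lemmatize("went", "VERB"))     # go
print(lemmatize("reading", "NOUN"))  # reading (no entry for the noun reading)
```

The table handles irregular forms such as "went" that no suffix rule could recover, and the POS key keeps the noun "reading" apart from the verb form.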


Lexicology, syntax and semantics

▪ Lexicography is a branch of linguistics concerned with compiling lexical resources:

– Thesaurus

– Lexicon

– Dictionaries

– Encyclopedia

▪ Syntax

▪ Semantics


Etymology

▪ Study of words and of the structure of words

▪ Example:

anti (morpheme) + poison (morpheme) = antipoison (word)

▪ Classification of word structure:

– Simple word. Ex: book, boy, sister

– Complex/derived word. Ex: babysitter

– Compound word. Ex: tall-boy, swimming-pool
