Natural Language Processing Lecture: Chapter 3 - Hoàng Anh Việt

File type: PDF | Pages: 78

Chapter 3 - Part-of-Speech Tagging presents the concepts, role, and methods of part-of-speech tagging in natural language processing. Contents: the definition and goals of part-of-speech tagging; common part-of-speech tags (noun, verb, adjective, adverb, ...); tagging methods: rule-based, probabilistic models (HMM), and machine learning (CRF, BiLSTM, etc.); accuracy evaluation and applications in other NLP tasks such as parsing, machine translation, and sentiment analysis.

Lecturer: Hoàng Anh Việt (hoanganhviet@gmail.com)

Understanding – the Big Picture
Morphology -> POS Tagging -> Syntax -> Semantics -> Discourse Integration
Generation goes backwards. For this reason, we generally want declarative representations of the facts. POS tagging is an exception to this.

Overview
• The tagging problem
  - Linguistics: classifying words into syntactic categories
  - Mathematics: assigning labels to a sequence of symbols
• Part-of-speech tagging
  - Statistical approach: Hidden Markov Model and the Viterbi algorithm
  - Transformation rules (Brill's tagger)

Tagging problems
• Input: a sequence of symbols
  - x1 x2 ... xn
• Output: a label assigned to each symbol
  - y1 y2 ... yn
  Example: a b c c a d e -> a/C b/D c/E c/E a/D d/C e/C
• Instances of the problem:
  - Part-of-speech (POS) tagging
  - Named Entity Recognition
  - ...

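The input/output convention above can be made concrete with a small sketch. The helper names `format_tagged` and `parse_tagged` are hypothetical and chosen only for illustration; the sketch merely shows that a tagging is one label per symbol, written in the slide's `x/y` notation.

```python
def format_tagged(symbols, labels):
    """Render a labeled sequence in the x/y notation used on the slide."""
    if len(symbols) != len(labels):
        raise ValueError("a tagging assigns exactly one label per symbol")
    return " ".join(f"{x}/{y}" for x, y in zip(symbols, labels))


def parse_tagged(text):
    """Invert format_tagged: split each token on its last slash."""
    pairs = [token.rsplit("/", 1) for token in text.split()]
    return [x for x, _ in pairs], [y for _, y in pairs]


symbols = "a b c c a d e".split()
labels = "C D E E D C C".split()
tagged = format_tagged(symbols, labels)
print(tagged)  # a/C b/D c/E c/E a/D d/C e/C
```

Splitting on the last slash (rather than the first) keeps tokens such as `./.` or `,/,` intact, which matters for the treebank-style examples later in the chapter.
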
Part-of-Speech tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
Legend: N = Noun, V = Verb, P = Preposition, ADV = Adverb, ADJ = Adjective

Why is POS tagging hard?
• Ambiguities
  - He will race/VB the car.
  - When will the race/NOUN end?
  - The boat floated/VBN down the river sank. (Here "floated" is a past participle, although it looks like a past-tense VBD.)
  -> Average of ~2 parts of speech for each word
• Unknown words (new words)
• The number of tags used by different systems varies a lot
  - Penn Treebank: 45
  - Brown corpus: 87
  - C7 tagset: 146
(22/8/2007)

POS Ambiguity (in the Brown Corpus)
Unambiguous (1 tag):    35,340
Ambiguous (2-7 tags):    4,100
  2 tags                 3,760
  3 tags                   264
  4 tags                    61
  5 tags                    12
  6 tags                     2
  7 tags                     1
(DeRose, 1988)

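As a quick check of the figures above, the following sketch sums the per-degree counts and computes the share of ambiguous entries. It assumes the counts refer to distinct word types, which the slide does not state explicitly; the variable names are illustrative.

```python
# Counts copied from the slide (DeRose, 1988).
unambiguous = 35_340
ambiguous_by_tag_count = {2: 3_760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

# The per-degree rows should add up to the "Ambiguous (2-7 tags)" row.
ambiguous = sum(ambiguous_by_tag_count.values())
total = unambiguous + ambiguous
share = ambiguous / total

print(ambiguous)        # 4100
print(total)            # 39440
print(f"{share:.1%}")   # 10.4%
```

Only about one entry in ten is ambiguous by this count, yet ambiguous words tend to be very frequent ones, which is why tagging running text remains difficult.
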
Tagsets for English
Information Extraction: Named Entity Recognition
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

Named entity extraction as tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA
Legend: NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location, SP = Start Person, CP = Continue Person

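To see that the Start/Continue tags carry the same information as the bracketed output, here is a minimal sketch that decodes a tagged sentence back into entities. The function name `decode_entities` and the decision to treat a stray Continue tag as the start of a new entity are assumptions made for this illustration, not part of the lecture.

```python
ENTITY_NAMES = {"C": "Company", "L": "Location", "P": "Person"}


def decode_entities(tagged):
    """Turn (word, tag) pairs with Start/Continue tags into (type, text) pairs."""
    entities = []   # finished entities
    current = None  # [type_letter, words] for the entity being built
    for word, tag in tagged:
        kind, letter = tag[0], tag[1]
        # A Continue tag of the same type extends the open entity.
        if tag != "NA" and kind == "C" and current and current[0] == letter:
            current[1].append(word)
            continue
        # Anything else closes the open entity, if there is one.
        if current:
            entities.append((ENTITY_NAMES[current[0]], " ".join(current[1])))
            current = None
        # A Start tag (or, by assumption, a stray Continue tag) opens a new one.
        if tag != "NA":
            current = [letter, [word]]
    if current:
        entities.append((ENTITY_NAMES[current[0]], " ".join(current[1])))
    return entities


OUTPUT = (
    "Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA "
    "forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP "
    "Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA"
)
tagged = [tuple(token.rsplit("/", 1)) for token in OUTPUT.split()]
entities = decode_entities(tagged)
print(entities)
# [('Company', 'Boeing Co.'), ('Location', 'Wall Street'), ('Person', 'Alan Mulally')]
```

Because every word receives exactly one tag, any sequence tagger developed for POS tagging can be reused for entity extraction by swapping the tagset.
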
POS tagging: Rule-based
• Example after-morphology data (using the Penn tagset), with the candidate tags listed under each word:
    I     watch   a     fly   .
    NN    NN      DT    NN    .
    PRP   VB      NN    VB
          VBP           VBP
• Rules using
  - word forms, from context and current position
  - tags, from context and current position
  - tag sets, from context and current position
  - combinations thereof

Rule-Based POS Tagging
Step 1: Using a dictionary, assign to each word a list of possible tags.
Step 2: Figure out what to do about words that are unknown or ambiguous. Two approaches:
• Rules that specify what to do.
• Rules that specify what not to do.

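The two steps can be sketched as follows for the sentence "I watch a fly .". The lexicon reproduces the candidate tags of that example; the three rules, the default tag for unknown words, and all function names are hypothetical choices made for illustration and are not the rules of any particular tagger.

```python
# Step 1 resource: a dictionary from word form to its possible tags.
LEXICON = {
    "I": {"NN", "PRP"},
    "watch": {"NN", "VB", "VBP"},
    "a": {"DT", "NN"},
    "fly": {"NN", "VB", "VBP"},
    ".": {"."},
}


def candidates(words):
    """Step 1: look up every word; unknown words default to NN (assumption)."""
    return [set(LEXICON.get(word, {"NN"})) for word in words]


def drop(tags, banned):
    """Remove banned readings, but never delete the last remaining one."""
    remaining = tags - banned
    return remaining if remaining else tags


def apply_rules(cands):
    """Step 2: negative rules ("what not to do") applied left to right."""
    cands = [set(c) for c in cands]
    for i, tags in enumerate(cands):
        prev = cands[i - 1] if i > 0 else set()
        # Rule A: a word with a pronoun or determiner reading is not a noun.
        if tags & {"PRP", "DT"}:
            tags = drop(tags, {"NN"})
        # Rule B: right after a pronoun, rule out noun and base-form verb.
        if prev == {"PRP"}:
            tags = drop(tags, {"NN", "VB"})
        # Rule C: right after a determiner, rule out verb readings.
        if prev == {"DT"}:
            tags = drop(tags, {"VB", "VBP"})
        cands[i] = tags
    return cands


words = "I watch a fly .".split()
result = apply_rules(candidates(words))
print(" ".join(f"{w}/{'|'.join(sorted(t))}" for w, t in zip(words, result)))
# I/PRP watch/VBP a/DT fly/NN ./.
```

Writing the rules as eliminations means a word stays ambiguous when no rule applies, instead of being forced into a possibly wrong tag.
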
Tagging by Parsing
• Build a parse tree from the multiple input:
  [Figure: parse tree with nodes S, NP, VP built over the candidate tags of "I watch a fly ."]
• Track down rules: e.g., NP → DT NN: extract (a/DT fly/NN)
• More difficult than tagging itself; results mixed

Statistical approach
Training set:
1 Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
2 Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./.
...
38,219 It/PRP is/VBZ also/RB pulling/VBG 20/CD people/NNS out/IN of/IN Puerto/NNP Rico/NNP ,/, who/WP were/VBD helping/VBG Hurricane/NNP Hugo/NNP victims/NNS ,/, and/CC sending/VBG them/PRP to/TO San/NNP Francisco/NNP instead/RB ./.
• From the training set, induce a function or "program" that maps new sentences to their tag sequences.

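One simple way to "induce" such a function is to count. The sketch below estimates emission and trigram transition probabilities by relative frequency from the first two training sentences. It is only an illustration under stated assumptions: there is no smoothing, the START/END padding symbols and the names `read_sentence` and `estimate` are introduced here, and a real tagger would be trained on the full set of sentences.

```python
from collections import Counter

TRAIN = [
    "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB "
    "the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.",
    "Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, "
    "the/DT Dutch/NNP publishing/VBG group/NN ./.",
]


def read_sentence(line):
    """Split 'word/TAG' tokens on the last slash into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]


def estimate(sentences):
    """Relative-frequency estimates of P(word | tag) and P(tag | two previous tags)."""
    emit, tag_count = Counter(), Counter()
    trigram, context = Counter(), Counter()
    for sent in sentences:
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
        # Pad so that the first tag and the END event also have a context.
        tags = ["START", "START"] + [tag for _, tag in sent] + ["END"]
        for i in range(2, len(tags)):
            trigram[(tags[i - 2], tags[i - 1], tags[i])] += 1
            context[(tags[i - 2], tags[i - 1])] += 1
    p_emit = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}
    p_trans = {k: c / context[k[:2]] for k, c in trigram.items()}
    return p_emit, p_trans


p_emit, p_trans = estimate([read_sentence(line) for line in TRAIN])
print(p_emit[("NNP", "Vinken")])            # 0.25
print(p_trans[("START", "START", "NNP")])   # 1.0
```

With only two sentences most events are never observed and receive probability zero, which is why practical systems combine these counts with smoothing.
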
Methods (overview)
• Probabilistic
  - HMM
  - Maximum Entropy
• Rule-based
  - Transformation rules, error-driven learning
• Classifier combination

Hidden Markov Models
• We have an input sentence: S = w1, w2, ..., wn
• We have a tag sequence: T = t1, t2, ..., tn
• Use an HMM to define P(t1, t2, ..., tn, w1, w2, ..., wn) for any S and T of the same length
• The best tag sequence for S is then: T* = argmax_T P(T, S)

\[
\begin{aligned}
T^* &= \arg\max_T P(T \mid S) \\
    &= \arg\max_T \frac{P(S \mid T)\,P(T)}{P(S)} \\
    &= \arg\max_T P(S \mid T)\,P(T) \\
    &= \arg\max_T P(w_1, \ldots, w_n \mid t_1, \ldots, t_n) \times P(t_1, \ldots, t_n) \\
    &= \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i) \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})
\end{aligned}
\]

Noisy channel model
• Noisy channel setting:
  Input (tags)      -> The channel (adds "noise") -> Output (words)
  NNP VBZ DT ...    ->                            -> John drinks the ...
• Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
  - p(T|W) = p(W|T) p(T) / p(W)
  - p(W) is fixed (W is given), so it does not affect the argmax
  - argmax_T p(T|W) = argmax_T p(W|T) p(T)

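The claim that p(W) can be dropped is easy to check numerically. In the sketch below all probabilities are made-up values for two hypothetical candidate tag sequences of a two-word input, and p(W) is computed under the assumption that these are the only candidates with nonzero probability.

```python
# Hypothetical values, chosen only to illustrate the argument.
prior = {("DT", "NN"): 0.08, ("DT", "VB"): 0.01}          # p(T)
likelihood = {("DT", "NN"): 0.002, ("DT", "VB"): 0.003}   # p(W | T)

# Numerator of Bayes' rule for each candidate tag sequence.
joint = {T: likelihood[T] * prior[T] for T in prior}

# p(W) is the same constant for every candidate T.
p_w = sum(joint.values())
posterior = {T: joint[T] / p_w for T in joint}

best_by_joint = max(joint, key=joint.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_joint, best_by_posterior)
```

Dividing every score by the same positive constant rescales the scores without changing their order, so both ways of ranking select the same tag sequence.
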
Example
• S = the boy laughed
• T = DT NN VBD
P(T,S) = P(END|NN, VBD) × P(DT|START, START) × P(NN|START, DT) × P(VBD|DT, NN) × P(the|DT) × P(boy|NN) × P(laughed|VBD)
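
The product in this example can be evaluated mechanically. The sketch below uses made-up probability values (the lecture gives none) and hypothetical names `TRANS`, `EMIT`, and `joint_probability`; any event missing from the tables is treated as having probability zero. It also finds the best tag sequence by brute-force enumeration, which is exponential in sentence length and is shown only to make the argmax concrete.

```python
from itertools import product

# Hypothetical trigram transition probabilities P(tag | two previous tags).
TRANS = {
    ("START", "START", "DT"): 0.5,
    ("START", "DT", "NN"): 0.8,
    ("DT", "NN", "VBD"): 0.4,
    ("NN", "VBD", "END"): 0.3,
}
# Hypothetical emission probabilities P(word | tag).
EMIT = {("DT", "the"): 0.6, ("NN", "boy"): 0.01, ("VBD", "laughed"): 0.005}


def joint_probability(words, tags):
    """P(T, S) for a trigram HMM with START START padding and a final END."""
    padded = ["START", "START"] + list(tags) + ["END"]
    p = 1.0
    for i in range(2, len(padded)):
        p *= TRANS.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    for word, tag in zip(words, tags):
        p *= EMIT.get((tag, word), 0.0)
    return p


words = "the boy laughed".split()
p = joint_probability(words, ["DT", "NN", "VBD"])

# Brute-force argmax over every tag sequence of the same length.
tagset = ["DT", "NN", "VBD"]
best = max(product(tagset, repeat=len(words)),
           key=lambda tags: joint_probability(words, tags))
print(p, best)
```

The Viterbi algorithm listed in the overview computes the same argmax with dynamic programming instead of enumerating all tag sequences.
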