Part of Speech Tagging
- The first thing we did with GATE was tokenize. All that
does is read in the document and break it into words. Even that
is hard. Don't be fooled: the machine doesn't understand the document.
- The next step is part of speech (POS) tagging.
- The POS of a word is its grammatical type, e.g. noun or verb.
- Most (or at least a lot of) words are lexically ambiguous. For
example, run can be a noun or a verb.
- There is no broadly agreed tag set for English (or really any
natural language).
- What a tagger does is go through a document and tag each word
with its POS.
- The Brill tagger is an example of this. It gets about 98% of the
words right.
- It's trained on a marked-up document set, i.e. one where each word
is paired with its POS.
- You get a pre-trained version, but it can be retrained on other data.
- Viviane Orengo made one for Portuguese.
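- Strictly speaking it isn't a Markov model. It first gives each word
its most likely tag, then applies learned rules that change a tag
based on a couple of neighbouring words and tags.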
- It runs as a really fast finite state automaton (FSA).
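
To make the tagging idea concrete, here is a minimal sketch in Python of a tagger trained on word/POS pairs. It is not the Brill tagger and not anything from GATE: the training sentences, the tag names, the single context rule, and the function names train_lexicon and tag are all made up for illustration. It gives each word the tag it carries most often in the training data, then uses the previous word's tag to fix one ambiguous case (run after a determiner).

```python
from collections import Counter, defaultdict

# Hypothetical marked-up training data: sentences of (word, POS) pairs.
# The tag names (PRON, VERB, ...) are an assumed toy tag set.
TRAINING = [
    [("I", "PRON"), ("run", "VERB"), ("daily", "ADV")],
    [("they", "PRON"), ("run", "VERB"), ("fast", "ADV")],
    [("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

# One hand-written context rule (an assumption, not a learned rule):
# change from_tag to to_tag when the previous word has prev_tag.
RULES = [("VERB", "NOUN", "DET")]


def train_lexicon(tagged_sentences):
    """Map each lowercased word to the tag it most often has in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, lexicon, rules, default="NOUN"):
    """Tag each word, then correct tags using the previous word's tag."""
    # Pass 1: most frequent tag per word; unknown words get the default.
    tags = [lexicon.get(word.lower(), default) for word in words]
    # Pass 2: apply each context rule left to right.
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


lexicon = train_lexicon(TRAINING)
print(tag("they run fast".split(), lexicon, RULES))
print(tag("the run was long".split(), lexicon, RULES))
```

In the first sentence run keeps its most frequent tag (VERB). In the second, the rule sees the determiner in front of it and retags it as NOUN. A real tagger learns many such rules from the training set instead of having one written by hand.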