Tagging

The first thing we did with GATE was tokenize. All that does is read in the document and break it into words. That's hard. Don't be fooled, the machine doesn't understand a document.
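As a rough sketch of what "break it into words" means, here is a regular-expression tokenizer. It is not GATE's tokenizer; the pattern and the names TOKEN_PATTERN and tokenize are made up for illustration, and it ignores plenty of real cases (hyphens, numbers with decimal points, URLs).

```python
import re

# Hypothetical pattern, for illustration only: a run of word characters
# (allowing internal apostrophes, as in "don't"), or any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't be fooled, the machine doesn't understand."))
```

Notice that nothing here knows what a word means. It only looks at characters.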
The next bit is part-of-speech (POS) tagging.
The POS of a word is its grammatical type: noun, verb, adjective and so on.
Most (or at least a lot of) words are lexically ambiguous. For example, run can be a noun (a long run) or a verb (I run).
There is no broadly agreed tag set for English (or really any natural language).
What a tagger does is go through a document and tag each word with its POS.
The Brill tagger is an example of this. It gets about 98% of the words right.
It's trained on a marked-up document set, that is, one where each word is paired with its POS.
You get the trained version, but it can be retrained on other data.
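To make "trained on word/POS pairs" concrete, here is a minimal sketch of the simplest possible trained tagger: it remembers the most frequent tag for each word. This is a baseline, not the Brill tagger and not GATE's API. The toy corpus, the tag names (NOUN, VERB and so on) and the functions train, tag and accuracy are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

def train(tagged_words):
    """Learn each word's most frequent tag from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, pos in tagged_words:
        counts[word.lower()][pos] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word with its learned tag; unseen words get the default."""
    return [(w, model.get(w.lower(), default)) for w in words]

def accuracy(model, gold):
    """Fraction of words in a marked-up sample that the model tags correctly."""
    predicted = tag([w for w, _ in gold], model)
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Toy marked-up corpus (made up): "run" appears twice as a verb, once as a noun.
TRAINING = [
    ("I", "PRON"), ("run", "VERB"), ("every", "DET"), ("day", "NOUN"),
    ("they", "PRON"), ("run", "VERB"), ("a", "DET"), ("shop", "NOUN"),
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
]

model = train(TRAINING)
print(tag(["the", "run"], model))
print(accuracy(model, [("the", "DET"), ("run", "NOUN")]))
```

The baseline tags run as a verb even after the, because it never looks at the neighbouring words. That is the gap a real tagger closes by using context.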
Viviane Orengo made one for Portuguese.
It really just gives each word its most likely tag, then applies learned rules that look at a couple of words of context to fix the mistakes.
It's a really fast finite state automaton (FSA).

Part of Speech Tagging