Text Extraction
- Text extraction means automatically processing text documents to
generate database entries that capture their meaning.
- This is not summarization, though that's related.
- It is almost exclusively domain dependent.
- You are looking for a database entry that contains the important
information (for the selected domain) in that document.
- In MUC (the Message Understanding Conferences), documents were
typically newspaper articles, but I expect now they are websites.
- Example domains are Central American terrorist events,
corporate executive job changes, and rocket launch events.
- One lesson from MUC was that you need to do some linguistic
processing, but you typically can't do full processing.
- So, they used cascaded finite-state automata (Jerry Hobbs)
instead of full-fledged parsing.
- That's really fast.
- You build the highest-level FSA (and maybe a couple just below that)
semi-automatically for the domain.
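
To make the cascade idea concrete, here is a minimal sketch for the executive job change domain. It is not the actual MUC-era system: the three stages, the tiny title lexicon, the bracket tag format, and every name in the code (tag_names, tag_posts, extract, and so on) are assumptions made up for illustration. Each stage is a plain regular expression (no backreferences), so each could in principle be compiled to a finite-state automaton. The lower stages mark up generic phrases, and only the top stage knows about the domain and fills in the database entry.

```python
import re

# Stage 1 (hypothetical): mark runs of two or more capitalized words as names.
NAME_RE = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+)\b")

# Stage 2 (hypothetical): mark job titles from a tiny hand-made lexicon.
POST_RE = re.compile(r"\b(chief executive|president|chairman)\b")

# Stage 3 (hypothetical, domain-specific): a job-change pattern that is
# written over the tags produced by stages 1 and 2, not over raw words.
EVENT_RE = re.compile(
    r"\[NAME ([^\]]+)\] (?:was named|became) "
    r"\[POST ([^\]]+)\] of \[NAME ([^\]]+)\]"
)


def tag_names(text):
    """Stage 1: wrap each multi-word capitalized run in a [NAME ...] tag."""
    return NAME_RE.sub(r"[NAME \1]", text)


def tag_posts(text):
    """Stage 2: wrap each known job title in a [POST ...] tag."""
    return POST_RE.sub(r"[POST \1]", text)


def extract(text):
    """Run the cascade and return one database entry (dict) per event found."""
    tagged = tag_posts(tag_names(text))
    return [
        {
            "person": m.group(1),
            "position": m.group(2),
            "organization": m.group(3),
        }
        for m in EVENT_RE.finditer(tagged)
    ]


if __name__ == "__main__":
    print(extract("Jane Smith was named chief executive of Acme Corp on Monday."))
```

Note that the sketch never parses the sentence. Text that the top-level pattern does not cover (here, "on Monday") is simply skipped, and a document with no matching event yields no entries. Moving to a new domain would mean replacing the stage 3 pattern and the entry fields while reusing the lower stages.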