Bag of Words
- There is a lot of structure to text. For example, "The dog
chased the cat." parses as det N past-v det N period.
- However, if you ignore the structure, you can get a lot out of
an article.
- Take the document (or paragraph) and make a bag of words: keep
which words occur and how often, and discard their order.
- You might want to stem the words and remove stop words first.
- If you've got a lot of documents, you can build a
word-by-document matrix of counts (one row per word, one column
per document).
- To a surprisingly large extent, the document can be
thought of as the words in it.
- Similarly, a word's meaning can be approximated by the documents
it occurs in.
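- You can also reduce this matrix (by e.g. LSA), and the
reduced vectors still work: a 4000x30000 matrix becomes a
4000x300 factor plus a 300x30000 factor, about 10.2 million
numbers instead of 120 million.
- The reduction helps with synonymy (words that occur in similar
documents end up with similar vectors) and so with information
retrieval.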
- It's also used for plagiarism detection.
- It helps if the documents are from a relatively small domain.
- We used this for cross-linguistic information retrieval, and these
techniques can also help with translation.
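
The steps above can be sketched in a few lines of Python. This is
a minimal illustration using only the standard library, not a
reference implementation: the stop-word list, the suffix-stripping
stem() function, and the three example documents are all made up
for the example, and a real system would use a proper stemmer and
a much larger stop list. It builds a bag of words per document,
assembles the word-by-document matrix, and compares documents and
words by cosine similarity. The LSA reduction is not shown.

```python
from collections import Counter
import math
import re

# Hypothetical, tiny stop-word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def stem(word):
    """Crude illustrative stemmer: strip one common English suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Count stemmed, non-stop-word tokens; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)


def word_by_document_matrix(docs):
    """Return (vocab, matrix) with one row per word, one column per doc."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags))
    matrix = [[bag[w] for bag in bags] for w in vocab]
    return vocab, matrix


def document_vector(matrix, j):
    """Column j of the matrix: the counts for document j."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up example documents.
docs = [
    "The dog chased the cat.",
    "The cat chased a mouse.",
    "Stocks fell in early trading.",
]
vocab, matrix = word_by_document_matrix(docs)

d0, d1, d2 = (document_vector(matrix, j) for j in range(3))
sim_related = cosine(d0, d1)    # documents that share words
sim_unrelated = cosine(d0, d2)  # documents with no words in common

# Rows are word vectors: words that occur in the same documents look alike.
sim_words = cosine(matrix[vocab.index("cat")], matrix[vocab.index("chas")])
```

Here the two animal sentences share two of their three content
words, so their similarity is well above that of the unrelated
finance sentence, which shares none. The rows work the same way
for words: "cat" and the stem "chas" occur in exactly the same
documents, so their row vectors are identical.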