Bag of Words
- Bag of Words techniques (like latent semantic analysis) ignore word
order, yet they are still a good approximation to semantics.
- You take a bunch of documents (typically paragraphs) and build
a term by document matrix.
- The idea is that the meaning of the document is the words that
are in it, and the meaning of the word is the documents it is in.
- You typically remove stop words (like "the", "of", and "and").
- You can also reduce the matrix into two smaller ones (using a
truncated Singular Value Decomposition (SVD) that keeps only the
strongest dimensions).
- So a 3000x5000 document-by-word matrix (3000 documents, 5000 words)
becomes a 3000x300 document matrix and a 5000x300 word matrix.
- Similarity is typically measured with the cosine of the angle
between two vectors.
- Terms that are similar will point in a similar direction (in
300-dimensional space).
- You can add word vectors together, and you can compare documents
with words or with each other.
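
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy corpus, the stop-word list, and every function name (tokenize, build_matrix, word_vector, text_vector, cosine) are made up for the example, and the SVD reduction is left out, so the vectors are raw counts rather than 300-dimensional ones.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks and bonds fell on the news",
]
STOP_WORDS = {"the", "of", "and", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_matrix(docs):
    """Return (vocab, rows); rows[i][j] counts vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[w] for w in vocab] for c in counts]
    return vocab, rows


def word_vector(word, vocab, rows):
    """A word's meaning is its column: one count per document."""
    j = vocab.index(word)
    return [row[j] for row in rows]


def add(u, v):
    """Add two vectors of equal length element by element."""
    return [a + b for a, b in zip(u, v)]


def text_vector(text, vocab, rows):
    """Represent any text as the sum of its known words' vectors."""
    total = [0] * len(rows)
    for w in tokenize(text):
        if w in vocab:
            total = add(total, word_vector(w, vocab, rows))
    return total


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab, rows = build_matrix(DOCS)
cat = word_vector("cat", vocab, rows)

# Word vs. word, document vs. document, and word vs. document.
print(cosine(cat, word_vector("mat", vocab, rows)))
print(cosine(rows[0], rows[1]))
print(cosine(cat, text_vector(DOCS[1], vocab, rows)))
```

Here a document is compared with a word by first summing the vectors of the document's words, so both sides live in the same space. With an SVD step added, the same cosine function would be applied to the reduced document and word vectors instead of the raw counts.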