Data Mining
- This is joint work with Viviane Orengo, and we have submitted it to
the journal Neural Computing and Applications.
- We did two tasks: the congressional voting task and information
retrieval.
- Congressional Voting
- From the UC Irvine (UCI) Machine Learning Repository.
- Given US congresspeople's voting records on 16 bills, categorise
each one as Republican or Democrat.
- It's a supervised task, and members can abstain, so each vote is yes, no, or abstain.
- We used 5-fold cross-validation.
- We got about 89% accuracy when training on 80% of the data and testing on
the remaining 20%, and about 86% in the reverse condition (training on
20%, testing on 80%); see the sketch after this list.
- Prior work indicates that the best achievable result is between
90% and 95% (Schlimmer's PhD thesis).
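- A minimal sketch of this evaluation (not the model from the paper, which the notes don't spell out here): votes are coded +1/-1/0 for yes/no/abstain, a simple nearest-centroid baseline stands in for the real classifier, and each of the 5 folds serves once as the 20% test set and once, for the reverse condition, as the 20% training set. The file name house-votes-84.data and the vote coding are assumptions for illustration only.

      import numpy as np

      CODE = {"y": 1.0, "n": -1.0, "?": 0.0}   # yes / no / abstain

      def load(path="house-votes-84.data"):    # assumed UCI file: party label first, then 16 votes
          labels, votes = [], []
          for line in open(path):
              fields = line.strip().split(",")
              labels.append(1 if fields[0] == "republican" else 0)
              votes.append([CODE[f] for f in fields[1:]])
          return np.array(votes), np.array(labels)

      def accuracy(train_X, train_y, test_X, test_y):
          # Nearest-centroid baseline: assign each record to the party whose
          # mean training vote vector is closest.
          cents = np.stack([train_X[train_y == c].mean(axis=0) for c in (0, 1)])
          pred = np.argmin(((test_X[:, None, :] - cents) ** 2).sum(axis=2), axis=1)
          return float((pred == test_y).mean())

      X, y = load()
      folds = np.array_split(np.random.default_rng(0).permutation(len(y)), 5)
      acc_80_20, acc_20_80 = [], []
      for i, fold in enumerate(folds):
          rest = np.concatenate([f for j, f in enumerate(folds) if j != i])
          acc_80_20.append(accuracy(X[rest], y[rest], X[fold], y[fold]))  # train 80%, test 20%
          acc_20_80.append(accuracy(X[fold], y[fold], X[rest], y[rest]))  # reverse condition
      print("train 80% / test 20%:", np.mean(acc_80_20))
      print("train 20% / test 80%:", np.mean(acc_20_80))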
- Information Retrieval
- Viviane Orengo has just finished her PhD in Information
Retrieval (using Latent Semantic Indexing, applied cross-linguistically).
- For this paper, she used two standard IR test collections: the Time
Magazine collection and the Cranfield collection.
- She stemmed the text; the collections (Time/Cranfield respectively)
have 425/1400 documents, 83/225 queries, and 7596/2629 terms.
- Each term that appeared in more than one document was assigned
a neuron, and each neuron had 40 outgoing synapses.
- We trained by presenting each document 20 times.
- We tested by activating words in the query and
letting activation spread for 5 cycles.
- We then did a Pearson correlation comparison against all of the
document networks (which is computationally expensive); a simplified
sketch of these steps follows.
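- A much-simplified sketch of this pipeline (not the paper's spiking-neuron model): terms appearing in more than one document each get a node, documents are presented 20 times to strengthen co-occurrence weights (a plain Hebbian-style update here, standing in for the compensatory rule, and without the 40-synapses-per-neuron limit or stemming), and a query is answered by activating its term nodes, spreading activation for 5 cycles, and ranking documents by Pearson correlation against each stored document pattern. All names and parameters below are illustrative assumptions.

      import numpy as np

      def build_vocab(docs):
          # Only terms that appear in more than one document get a neuron.
          df = {}
          for d in docs:
              for t in set(d):
                  df[t] = df.get(t, 0) + 1
          return {t: i for i, t in enumerate(sorted(t for t, n in df.items() if n > 1))}

      def train(docs, vocab, presentations=20, rate=0.01):
          # Hebbian-style co-occurrence strengthening, presenting each document 20 times.
          n = len(vocab)
          w = np.zeros((n, n))                 # term-to-term synapse weights
          patterns = np.zeros((len(docs), n))  # stored activation pattern per document
          for _ in range(presentations):
              for di, d in enumerate(docs):
                  idx = [vocab[t] for t in d if t in vocab]
                  patterns[di, idx] = 1.0
                  for i in idx:                # strengthen synapses between co-occurring terms
                      for j in idx:
                          if i != j:
                              w[i, j] += rate
          return np.clip(w, 0.0, 1.0), patterns

      def retrieve(query, vocab, w, patterns, cycles=5, decay=0.5):
          act = np.zeros(len(vocab))
          for t in query:
              if t in vocab:
                  act[vocab[t]] = 1.0          # activate the query's term neurons
          for _ in range(cycles):              # let activation spread for 5 cycles
              act = np.clip(decay * act + w.T @ act, 0.0, 1.0)
          # Pearson comparison of the final activation with every document pattern
          # (comparing against every document is the computationally expensive part).
          scores = [np.corrcoef(act, p)[0, 1] for p in patterns]
          return np.argsort(scores)[::-1]      # document indices, best match first

      docs = [["wing", "flow", "lift"], ["lift", "drag", "wing"], ["flow", "drag", "heat"]]
      vocab = build_vocab(docs)
      w, patterns = train(docs, vocab)
      print(retrieve(["flow", "heat"], vocab, w, patterns))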
- The results with a compensatory rule were 40%/28% on the
Cranfield test.
- LSI is a standard technique and we (somewhat surprisingly)
do better than it.
- Note that the compensatory rule has essentially the same
effect as the standard IR weighting of term frequency times
inverse document frequency (TF-IDF); the snippet below illustrates
plain TF-IDF.
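- A tiny illustration of that TF-IDF weighting, in its common tf * log(N/df) form (the toy documents are made up); the note's point is that the compensatory rule ends up weighting terms in a similar way, but this snippet shows only TF-IDF itself.

      import math
      from collections import Counter

      docs = [["wing", "flow", "flow"], ["lift", "drag", "wing"], ["drag", "drag"]]
      N = len(docs)
      df = Counter(t for d in docs for t in set(d))    # document frequency of each term

      def tfidf(doc):
          tf = Counter(doc)                            # term frequency within the document
          return {t: tf[t] * math.log(N / df[t]) for t in tf}

      for d in docs:
          print(tfidf(d))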