N-Fold Analysis
- The standard methodology in categorisation is an n-fold
analysis.
- If you take a dataset, run a learning algorithm on it,
and then test on the same dataset, it's pretty easy to
do well. Just memorise it. That's testing on the training
set.
- Instead, you want to split the data into, say, 2 datasets; let's
call them A and B.
- Now run the learning algorithm on A, then test on B. Just to
be fair, do it the other way too: train on B, test on A. That's
a 2-fold test.
- You can run a 5-fold or a 10-fold test. Note that with 10 folds
you can train on 9 and test on 1 (10 times), or train on 1 and
test on 9.
- Why is an n-fold test better?
- Which fold is right?
- What's the aim of the overall process? (Unseen data)
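- You can also fit parameters with n-fold analysis, which is a dodgy
methodology if you then report results on the same folds: the
parameters have effectively been tuned on the test data.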
- You can also break your training set up (for example, holding
part of it back to check the model while training), and this is
often done to reduce over-fitting.
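
The points above can be sketched in a few lines of Python. This is only an illustration, not a standard implementation: the names make_folds, n_fold_accuracy, train_memoriser and predict_memoriser are made up for this example, and so is the toy dataset (integers labelled odd or even). The "learner" simply memorises its training data, so it looks perfect when tested on the training set and does no better than chance under a 10-fold test.

```python
import random

def make_folds(items, n, seed=0):
    """Shuffle a copy of items and deal them into n roughly equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]

def n_fold_accuracy(items, n, train, predict, seed=0):
    """Train on n-1 folds, test on the held-out fold, repeat n times, average."""
    folds = make_folds(items, n, seed)
    scores = []
    for i, test_fold in enumerate(folds):
        # Everything that is not in the held-out fold is training data.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct = sum(predict(model, x) == y for x, y in test_fold)
        scores.append(correct / len(test_fold))
    return sum(scores) / n

def train_memoriser(examples):
    """Hypothetical learner: store every example, plus the majority label."""
    table = dict(examples)
    labels = [y for _, y in examples]
    default = max(set(labels), key=labels.count)
    return table, default

def predict_memoriser(model, x):
    """Look the input up; fall back to the majority label if it is unseen."""
    table, default = model
    return table.get(x, default)

# Toy data (an assumption for illustration): label is 1 for odd, 0 for even.
data = [(x, x % 2) for x in range(100)]

# Testing on the training set: the memoriser has seen every example.
model = train_memoriser(data)
train_set_accuracy = sum(predict_memoriser(model, x) == y for x, y in data) / len(data)

# 10-fold test: every test example is unseen by the model that scores it.
cv_accuracy = n_fold_accuracy(data, 10, train_memoriser, predict_memoriser)

print("tested on training set:", train_set_accuracy)
print("10-fold test:", cv_accuracy)
```

The gap between the two numbers is the point of the exercise: only the n-fold figure says anything about unseen data.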