Systems vs. People
Inter-Annotator Agreement
- The precision and recall measurements make use of a gold standard.
- The gold standard is developed by people, and it's not perfect.
- A good indicator of the gold standard's quality is
inter-annotator agreement.
- If the F-measure between annotators is roughly the same as your
system's F-measure, then you've done about as well as you could.
- In MUC-7, the best Named Entity system got a 93 F-score and
inter-annotator agreement was 97.
- For template elements the best score was 87.
- For the full template, the inter-annotator score was 97 and the
best score was 51 (by SRA).
- I got TE 47 and ST 1 in about a day's work on the new domain.
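
The comparison described in the bullets above can be sketched in a few lines of Python. This is only a minimal illustration, not the MUC scorer: it assumes entities are compared by exact match on hypothetical (start, end, type) tuples, treats one annotator as the gold standard, and uses made-up toy annotations. The name `f_measure` and all the data are assumptions introduced for the example.

```python
def f_measure(gold, response, beta=1.0):
    """F-measure of `response` against `gold`, using exact-match counting.

    Assumption: each item is a hashable annotation such as (start, end, type).
    The real MUC scoring rules are more detailed than this.
    """
    gold, response = set(gold), set(response)
    correct = len(gold & response)
    if correct == 0:
        return 0.0
    precision = correct / len(response)  # correct / everything proposed
    recall = correct / len(gold)         # correct / everything in the gold set
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy annotations (hypothetical): two people and one system mark the same text.
annotator_a = {(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "LOC"), (12, 14, "ORG")}
annotator_b = {(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "ORG"), (12, 14, "ORG")}
system = {(0, 2, "PERSON"), (5, 6, "ORG"), (12, 13, "ORG")}

# Inter-annotator agreement: score one annotator against the other.
agreement = f_measure(annotator_a, annotator_b)

# System score: score the system against the same gold annotator.
system_score = f_measure(annotator_a, system)

print(f"inter-annotator F: {agreement:.2f}")
print(f"system F:          {system_score:.2f}")
```

With beta = 1 the score is symmetric, so it does not matter which annotator plays the role of gold. The gap between the two printed numbers is the room the system still has before it reaches the level at which people agree with each other.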