I am "playing" with the task of automated text categorization and
have inevitably hit a few dilemmas. I have tried different combinations
of SVM and NaiveBayes; here are some results:
- algorithm::svm (single world, through AI::Categorizer) ~ 92%
accuracy (with the linear kernel; the radial one stays below 10% with
all sorts of values tried for gamma and C)
- algorithm::svmlight (as many worlds as categories, each trained
against the others) ~ 62% in ranking mode
- algorithm::naivebayes (single world, through AI::Categorizer) ~ 94%
- algorithm::naivebayes (each against all others) ~ 73%
These are all on the same corpus (which isn't perfect at all, but that
is a negligible detail for now :) ).
By accuracy I mean tested accuracy on a single category: if the first
category returned (highest score) is the expected one, it's a hit;
otherwise, it's a miss.
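
To make that definition concrete, here is a minimal sketch of the hit/miss count in Python. The category names and scores are made up for illustration, and top1_accuracy is a hypothetical helper, not part of AI::Categorizer.

```python
def top1_accuracy(scored_documents, expected_categories):
    """Fraction of documents whose highest-scoring category is the expected one."""
    hits = 0
    for scores, expected in zip(scored_documents, expected_categories):
        best = max(scores, key=scores.get)  # category with the highest score
        if best == expected:
            hits += 1                       # a hit; anything else is a miss
    return hits / len(expected_categories)


# Hypothetical classifier output: one {category: score} dict per test document.
scored_documents = [
    {"sports": 0.7, "politics": 0.2, "tech": 0.1},
    {"sports": 0.1, "politics": 0.3, "tech": 0.6},
    {"sports": 0.5, "politics": 0.4, "tech": 0.1},
    {"sports": 0.2, "politics": 0.7, "tech": 0.1},
]
expected_categories = ["sports", "tech", "politics", "politics"]

accuracy = top1_accuracy(scored_documents, expected_categories)
```

Only the first-ranked category counts here, so a document whose expected category comes second is scored the same as one where it comes last.
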
By single world I mean all categories build a single model, against
which tests are run. By multiple worlds (each against all others) I mean
each category builds its own model, in which the tokens from that
category are positive and the tokens from all other categories are
negative.
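
The "multiple worlds" setup can be sketched roughly as follows, in Python. This only shows how the per-category training sets are labelled; the documents and the build_worlds helper are hypothetical, and no actual learner is involved.

```python
def build_worlds(documents):
    """Build one binary training set ("world") per category.

    documents: list of (tokens, category) pairs.
    Returns {category: [(tokens, label), ...]} where label is +1 for
    documents of that category and -1 for documents of every other one.
    """
    categories = sorted({category for _, category in documents})
    worlds = {}
    for target in categories:
        worlds[target] = [
            (tokens, 1 if category == target else -1)
            for tokens, category in documents
        ]
    return worlds


# Hypothetical three-document corpus.
documents = [
    (["goal", "match"], "sports"),
    (["vote", "senate"], "politics"),
    (["kernel", "compiler"], "tech"),
]
worlds = build_worlds(documents)
sports_labels = [label for _, label in worlds["sports"]]
```

At test time each world scores the document and the categories are ranked by those scores. Note that in such a setup the ranking compares scores from separately trained models, which are not necessarily on the same scale.
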
So, back to my dilemmas. :) The results are puzzling, as many of the
research papers on the subject that I've consulted say that SVM is
supposedly the best algorithm for this task, and that the radial kernel
should give the best results for empirically found values of gamma and
C. Even ignoring the fact that SVM is much, much slower to train than
NB, it still has worse accuracy. What am I doing wrong?
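
For reference, the radial kernel is K(x, y) = exp(-gamma * ||x - y||^2), so its value depends heavily on the scale of the feature vectors. The sketch below, in Python with made-up term-count vectors, only illustrates that formula on the same pair of documents before and after length normalization. It is not a claim about what AI::Categorizer or its SVM backend does with features internally.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


# Hypothetical raw term counts for two fairly similar documents.
doc_a = [12, 0, 7, 3, 0]
doc_b = [10, 1, 9, 2, 0]

# With raw counts the squared distance is large, so the kernel is close to 0
# and the two documents look unrelated.
raw_similarity = rbf_kernel(doc_a, doc_b, gamma=1.0)

# After normalization the same documents are close, so the kernel is close to 1.
scaled_similarity = rbf_kernel(l2_normalize(doc_a), l2_normalize(doc_b), gamma=1.0)
```

Whether feature scaling explains the numbers above is an open question; the sketch just shows why gamma and the feature scale cannot be tuned independently of each other.
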
I would happily ignore all this and use NB, but it has one major flaw:
"the winner takes it all". The first result returned is way too far
(as in distance :)) from the others, which isn't exactly accurate if
one cares about a balanced pool of results. I don't know whether this
is an implementation problem (I poked around the rescale() function in
Util.pm with no real success) or a general problem with the algorithm.
My goal is to have an implementation that can say: this text is 60%
cat X, 20% cat Y, 18% cat Z and 2% other cats. Is this feasible? If so,
what approach would you recommend (which algorithm, which
implementation, or what path for implementing it)?
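
One way to picture the desired output is a rescaling of per-category log scores into percentages. The sketch below, in Python, assumes the classifier can hand back one log score per category (an assumption; I have not checked what AI::Categorizer exposes). The temperature parameter and the to_percentages helper are hypothetical, introduced only for illustration, and the result is a rescaled score, not a calibrated probability.

```python
import math


def to_percentages(log_scores, temperature=1.0):
    """Turn per-category log scores into percentages that sum to 100.

    temperature > 1 flattens the distribution; temperature = 1 is a plain
    normalization of exp(log score).
    """
    top = max(log_scores.values())  # subtract the max to avoid underflow
    weights = {
        category: math.exp((score - top) / temperature)
        for category, score in log_scores.items()
    }
    total = sum(weights.values())
    return {category: 100.0 * w / total for category, w in weights.items()}


# Hypothetical log scores for one document.
log_scores = {"X": -120.0, "Y": -131.0, "Z": -132.0}

sharp = to_percentages(log_scores)                     # winner takes nearly all
smooth = to_percentages(log_scores, temperature=10.0)  # flatter split
```

The ranking of the categories is the same in both cases; only the spread changes. Picking the temperature would have to be done empirically, for example on held-out documents.
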