I am "playing" with the task of automated text categorization and have
inevitably hit a few dilemmas. I have tried different combinations of
SVM and Naive Bayes; here are some results:

- Algorithm::SVM (single world, through AI::Categorizer) ~ 92%
  accuracy with the linear kernel; the radial kernel stays below 10%
  for all the values of gamma and C I tried
- Algorithm::SVMLight (one world per category, each trained against
  the others) ~ 62% in ranking mode
- Algorithm::NaiveBayes (single world, through AI::Categorizer) ~ 94%
- Algorithm::NaiveBayes (each against all others) ~ 73%

These are all on the same corpus (which isn't perfect at all, but that
is a negligible detail for now :) ).

By accuracy I mean tested accuracy on a single category: if the first
category returned (highest score) is the expected one, it's a hit;
otherwise, it's a miss.
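
To make the metric concrete, here is a minimal sketch in Python (used
for illustration only; the function name and the toy data are made up
and are not part of AI::Categorizer). It assumes each tested document
comes back as a mapping from category to score:

```python
def top1_accuracy(ranked_predictions, expected):
    """Fraction of documents whose highest-scoring category is the expected one.

    ranked_predictions: list of dicts mapping category -> score, one per document
    expected: list of expected category names, in the same order
    """
    if len(ranked_predictions) != len(expected):
        raise ValueError("need one expected category per document")
    hits = 0
    for scores, wanted in zip(ranked_predictions, expected):
        best = max(scores, key=scores.get)  # category with the highest score
        if best == wanted:
            hits += 1  # first category returned is the expected one: a hit
    return hits / len(expected) if expected else 0.0


# Toy data, invented for illustration only.
predictions = [
    {"sports": 0.9, "politics": 0.1},
    {"sports": 0.4, "politics": 0.6},
    {"sports": 0.7, "politics": 0.3},
]
expected = ["sports", "politics", "politics"]
print(top1_accuracy(predictions, expected))
```

Two of the three toy documents rank the expected category first, so
they count as hits and the third counts as a miss.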

By single world I mean that all categories build a single model,
against which the tests are run. By multiple worlds (each against all
others) I mean that each category builds its own model, in which the
tokens from that category are positive and the tokens from all other
categories are negative.
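
The "multiple worlds" setup can be sketched as follows (Python, for
illustration only; one_vs_rest_sets is a hypothetical helper and the
corpus is invented). It only shows how the positive and negative
training sets are assembled, not how any of the Perl modules do it
internally:

```python
def one_vs_rest_sets(documents):
    """Build one binary training set ("world") per category.

    documents: list of (category, tokens) pairs
    Returns {category: [(label, tokens), ...]} where label is +1 for
    documents of that category and -1 for documents of every other one.
    """
    categories = sorted({cat for cat, _ in documents})
    worlds = {}
    for target in categories:
        worlds[target] = [
            (1 if cat == target else -1, tokens) for cat, tokens in documents
        ]
    return worlds


# Toy corpus, invented for illustration only.
corpus = [
    ("sports", ["goal", "match"]),
    ("politics", ["vote", "election"]),
    ("tech", ["perl", "module"]),
]
worlds = one_vs_rest_sets(corpus)
for category, examples in worlds.items():
    print(category, [label for label, _ in examples])
```

Note that in this setup every world sees far more negative than
positive examples, and the scores of independently trained models are
not guaranteed to be on a comparable scale; either could plausibly
contribute to the lower numbers, though I have not verified that.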

So, back to my dilemmas. :) The results are puzzling, as many of the
research papers on the subject that I've consulted say that SVM is
supposedly the best algorithm for this task. The radial kernel should
give the best results, for empirically found values of gamma and C.
Even ignoring the fact that SVM is much, much slower to train than NB,
it still has worse accuracy. What am I doing wrong?
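
For reference, "empirically found" values of gamma and C are usually
searched on an exponential grid. A minimal sketch (Python, for
illustration only): evaluate is a hypothetical callback that would
train an RBF-kernel SVM and return held-out accuracy, and
fake_evaluate is a stand-in so the example runs on its own. The
exponent ranges are a commonly used starting point, not values taken
from my experiments:

```python
import itertools


def log_grid(low_exp, high_exp, step=2, base=2.0):
    """Values base**low_exp, base**(low_exp + step), ... up to base**high_exp."""
    return [base ** e for e in range(low_exp, high_exp + 1, step)]


def grid_search(evaluate, c_values, gamma_values):
    """Return (best_accuracy, C, gamma) over all (C, gamma) pairs.

    evaluate(C, gamma) is a hypothetical callback: train with those
    parameters, return accuracy on held-out data.
    """
    best = None
    for c, gamma in itertools.product(c_values, gamma_values):
        accuracy = evaluate(c, gamma)
        if best is None or accuracy > best[0]:
            best = (accuracy, c, gamma)
    return best


def fake_evaluate(c, gamma):
    # Stand-in for real training: peaks at C=8, gamma=0.125 by construction.
    return 1.0 / (1.0 + abs(c - 8.0) + abs(gamma - 0.125))


result = grid_search(fake_evaluate, log_grid(-5, 15), log_grid(-15, 3))
print(result)
```

Unscaled feature vectors are another commonly cited reason for an RBF
kernel performing near chance, so that may be worth ruling out before
widening the grid.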

I would happily ignore all this and use NB, but it has one major flaw:
"the winner takes it all". The first result returned is way too far
(as in distance :)) from the others, which isn't exactly accurate if
one cares about a balanced pool of results. I don't know whether this
is an implementation problem (I poked around the rescale() function in
Util.pm with no real success) or a general problem with the algorithm.
My goal is to have an implementation that can say: this text is 60%
cat X, 20% cat Y, 18% cat Z and 2% other cats. Is this feasible? If
so, what approach would you recommend (which algorithm, which
implementation, or what path for implementing it)?
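
To illustrate the kind of output I am after, here is a minimal sketch
(Python, for illustration only). It assumes the classifier exposes one
log score per category, which is an assumption and not a description
of what rescale() does. The temperature parameter is a hypothetical
smoothing knob, not part of any library, and the log scores are
invented:

```python
import math


def scores_to_percentages(log_scores, temperature=1.0):
    """Turn per-category log scores into percentages that sum to 100.

    temperature is a hypothetical knob: 1.0 is a plain softmax, larger
    values flatten the distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {cat: s / temperature for cat, s in log_scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {cat: math.exp(s - peak) for cat, s in scaled.items()}
    total = sum(weights.values())
    return {cat: 100.0 * w / total for cat, w in weights.items()}


# Invented log scores: sums of many per-token log-probabilities tend to
# differ by large amounts between categories.
log_scores = {"X": -1200.0, "Y": -1215.0, "Z": -1216.0}
raw = scores_to_percentages(log_scores)
flat = scores_to_percentages(log_scores, temperature=15.0)
for cat in log_scores:
    print(cat, round(raw[cat], 4), round(flat[cat], 1))
```

With the plain softmax, a gap of 15 in log score already gives X
nearly everything, which is the "winner takes it all" effect. The
flattened numbers look more balanced, but they are only a rescaling,
not calibrated probabilities.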

TIA

--
perl -MLWP::Simple -e'print$_[rand(split(q|%%\n|,
get(q=http://cpan.org/misc/japh=)))]'