Hello,
I am "playing" with the task of automated text categorization and
inevitably hit a few dilemmas. I have tried different combinations of
SVM and NaiveBayes, here are some results:
- algorithm::svm (single world, through AI::Categorizer) ~ 92%
accuracy (with the linear kernel, the radial one has bellow 10% with
all sorts of values tried for gamma and c)
- algorithm::svmlight ( nr. of categories worlds - each trained
against the others ) ~ 62% in ranking mode
- algorithm::naivebayes (one world, through AI::Categorizer) ~ 94%
- algorithm::naivebayes (each against all other) ~ 73%
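
For reference, here's roughly how I'm driving the single-model
("single world", defined below) runs. $k is an
AI::Categorizer::KnowledgeSet built from my corpus and $doc is an
AI::Categorizer::Document for a test text; the loading code is
omitted, and I'm not 100% sure svm_kernel is the right knob for the
SVM learner, so treat this as a sketch:

  use AI::Categorizer::Learner::NaiveBayes;
  use AI::Categorizer::Learner::SVM;

  # train both learners on the same knowledge set
  my $nb = AI::Categorizer::Learner::NaiveBayes->new;
  $nb->train(knowledge_set => $k);

  my $svm = AI::Categorizer::Learner::SVM->new(svm_kernel => 'linear');
  $svm->train(knowledge_set => $k);

  # categorize a test document and compare the top guesses
  for my $learner ($nb, $svm) {
      my $h = $learner->categorize($doc);
      print ref($learner), ": ", $h->best_category, "\n";
  }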

These are on the same corpus (which isn't perfect at all, but that's
negligible for now :) ).
By accuracy I mean tested accuracy on a single category: if the first
category returned (highest score) is the expected one, it's a hit;
otherwise, it's a miss.
By "single world" I mean all categories build a single model, against
which tests are run. By "multiple worlds" (each against all others) I
mean each category builds a model in which the tokens from that
category are positive and the tokens from all other categories are
negative.
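
The "each against all others" runs skip AI::Categorizer and drive
Algorithm::NaiveBayes directly, one binary model per category, roughly
like this (tokenizing and counting are stubbed out; %docs and
$test_attrs are placeholders for my own data structures):

  use Algorithm::NaiveBayes;

  # %docs maps category name => array ref of { token => count } hashrefs
  my %model;
  for my $cat (keys %docs) {
      my $nb = Algorithm::NaiveBayes->new;
      for my $other (keys %docs) {
          my $label = $other eq $cat ? 'pos' : 'neg';
          $nb->add_instance(attributes => $_, label => $label)
              for @{ $docs{$other} };
      }
      $nb->train;
      $model{$cat} = $nb;
  }

  # score one test document (a { token => count } hashref) per category
  my %score;
  $score{$_} = $model{$_}->predict(attributes => $test_attrs)->{pos}
      for keys %model;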

So, back to my dilemmas. :) The results are puzzling, as many of the
research papers on the subject I've consulted say that SVM is
supposedly the best algorithm for this task. The radial kernel should
give the best results, for empirically found values of gamma and C.
Ignoring the fact that SVM is much, much slower to train than NB, it
still has worse accuracy. What am I doing wrong?
I would happily ignore all this and use NB, but it has one major flaw:
"the winner takes it all". The first result returned is way too far
(as in distance :)) from the others, which isn't exactly accurate if
one cares about a balanced pool of results. I don't know whether this
is an implementation problem - I poked around the rescale() function
in Util.pm with no real success - or a general algorithm problem. My
goal is to have an implementation that can say: this text is 60% cat
X, 20% cat Y, 18% cat Z and 2% other cats. Is this feasible? If so,
what approach would you recommend (which algorithm, which
implementation, or what path for implementing it)?
TIA

--
perl -MLWP::Simple -e'print$_[rand(split(q|%%\n|,
get(q=http://cpan.org/misc/japh=)))]'


  • Ken Williams at Jan 8, 2007 at 5:23 am

    On Jan 5, 2007, at 7:10 AM, zgrim wrote:

    > So, back to my dilemmas. :) The results are puzzling, as many of
    > the research papers on the subject I've consulted say that SVM is
    > supposedly the best algorithm for this task. The radial kernel
    > should give the best results, for empirically found values of
    > gamma and C.

    This may be an issue with your corpus - I quite often find that when
    I don't have enough training data for the SVM to pick up on the
    "truth" patterns, or (somewhat equivalently) when there's a lot of
    noise in the data, a linear kernel will outperform a radial (RBF)
    one. I tend to think that's because the RBF is more expressive, and
    it's overfitting the noise in the training set.
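
    If you want to poke at that directly, it's easy enough to bypass
    AI::Categorizer, hold out part of the data, and sweep gamma/C with
    Algorithm::SVM itself. Something along these lines (feature-vector
    building omitted; do double-check the constructor parameter names
    against the Algorithm::SVM docs, I'm going from memory):

      use Algorithm::SVM;
      use Algorithm::SVM::DataSet;

      # @train and @heldout hold Algorithm::SVM::DataSet objects, e.g.
      # Algorithm::SVM::DataSet->new(Label => $cat_id, Data => \@features)
      for my $gamma (0.001, 0.01, 0.1, 1, 10) {
          for my $c (0.1, 1, 10, 100) {
              my $svm = Algorithm::SVM->new(Type   => 'C-SVC',
                                            Kernel => 'radial',
                                            Gamma  => $gamma,
                                            C      => $c);
              $svm->train(@train);
              my $hits = grep { $svm->predict($_) == $_->label } @heldout;
              printf "gamma=%-6g C=%-4g acc=%.3f\n",
                     $gamma, $c, $hits / @heldout;
          }
      }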

    > Ignoring the fact that SVM is much, much slower to train than NB,
    > it still has worse accuracy. What am I doing wrong?

    That may be an accident of your corpus too. Are you using
    cross-validation for these experiments? If so, you should be able
    to get some error bars to tell whether the difference is
    statistically significant or not. I'm guessing a 2% advantage may
    not be, in this case.
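
    Concretely: if you run the same k folds for both learners, the mean
    and standard error of the per-fold accuracies is enough to eyeball
    it - something like:

      # @nb_acc and @svm_acc are per-fold accuracies from the same folds
      sub mean_se {
          my @x = @_;
          my $m = 0;  $m += $_ for @x;            $m /= @x;
          my $v = 0;  $v += ($_ - $m)**2 for @x;  $v /= (@x - 1);
          return ($m, sqrt($v / @x));             # mean, standard error
      }
      my ($m_nb,  $se_nb)  = mean_se(@nb_acc);
      my ($m_svm, $se_svm) = mean_se(@svm_acc);
      printf "NB  %.3f +/- %.3f\n", $m_nb,  $se_nb;
      printf "SVM %.3f +/- %.3f\n", $m_svm, $se_svm;
      # heavily overlapping intervals => the 2% gap is probably noise
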
    > I would happily ignore all this and use NB, but it has one major
    > flaw: "the winner takes it all". The first result returned is way
    > too far (as in distance :)) from the others, which isn't exactly
    > accurate if one cares about a balanced pool of results. I don't
    > know whether this is an implementation problem - I poked around
    > the rescale() function in Util.pm with no real success - or a
    > general algorithm problem. My goal is to have an implementation
    > that can say: this text is 60% cat X, 20% cat Y, 18% cat Z and 2%
    > other cats. Is this feasible? If so, what approach would you
    > recommend (which algorithm, which implementation, or what path
    > for implementing it)?

    Unfortunately, neither NB nor SVMs can really tell you that. SVMs
    are purely discriminative, so all they can tell you is "I think this
    new example is more like class A than class B in my training data".
    There's no probability involved at all. That said, I believe there
    has been some research into how to translate SVM output scores into
    probabilities or confidence scores, but I'm not really familiar with
    it.

    NB on the surface would seem to be a better option since it's
    directly based on probabilities, but again the algorithm was designed
    only to discriminate, so all those denominators that are thrown away
    (the "P(words)" terms in the A::NB documentation) mean that the
    notion of probabilities is lost. The rescale() function is basically
    just a hack to return scores that are a little more convenient to
    work with than the raw output of the algorithm. As you've seen, it
    tends to be a little arrogant, greatly exaggerating the score for the
    first category and giving tiny scores to the rest. I'm sure there
    are better algorithms that could be used there, but in many cases
    either one doesn't really care about the actual scores, or one
    (*ahem*) does something ad hoc like taking the square root of all the
    scores, or the fifth root, or whatever, just to get some numbers that
    look better to end users.
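
    To be concrete, that ad hoc squashing is just a few lines applied to
    the hash of scores you get back (e.g. from Algorithm::NaiveBayes's
    predict()); the choice of root is pure taste:

      # $scores is a hashref of category => raw score; take a root,
      # then renormalize so the results sum to 1
      sub soften {
          my ($scores, $root) = @_;
          my %out = map { ($_, $scores->{$_} ** (1 / $root)) } keys %$scores;
          my $sum = 0;  $sum += $_ for values %out;
          $out{$_} /= $sum for keys %out;
          return \%out;   # same ranking, much less extreme gaps
      }
      my $softened = soften($scores, 5);   # e.g. the fifth root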

    As for a better alternative, I'm not familiar with any that will be
    as accessible from a perl world, but you might want to look at some
    language modeling papers - I really like the LDA papers from Michael
    Jordan (no, not that Michael Jordan, this one:
    http://citeseer.ist.psu.edu/541352.html), which are by no means
    straightforward, but they will indeed let you describe each document
    as generated by a mixture of categories.

    -Ken
  • Tom Fawcett at Jan 8, 2007 at 4:51 pm

    On Jan 7, 2007, at 9:23 PM, Ken Williams wrote:
    >> I would happily ignore all this and use NB, but it has one major
    >> flaw: "the winner takes it all". The first result returned is way
    >> too far (as in distance :)) from the others, which isn't exactly
    >> accurate if one cares about a balanced pool of results. I don't
    >> know whether this is an implementation problem - I poked around
    >> the rescale() function in Util.pm with no real success - or a
    >> general algorithm problem. My goal is to have an implementation
    >> that can say: this text is 60% cat X, 20% cat Y, 18% cat Z and 2%
    >> other cats. Is this feasible? If so, what approach would you
    >> recommend (which algorithm, which implementation, or what path
    >> for implementing it)?
    >
    > Unfortunately, neither NB nor SVMs can really tell you that. SVMs
    > are purely discriminative, so all they can tell you is "I think
    > this new example is more like class A than class B in my training
    > data". There's no probability involved at all. That said, I
    > believe there has been some research into how to translate SVM
    > output scores into probabilities or confidence scores, but I'm not
    > really familiar with it.
    >
    > NB on the surface would seem to be a better option since it's
    > directly based on probabilities, but again the algorithm was
    > designed only to discriminate, so all those denominators that are
    > thrown away (the "P(words)" terms in the A::NB documentation) mean
    > that the notion of probabilities is lost. The rescale() function
    > is basically just a hack to return scores that are a little more
    > convenient to work with than the raw output of the algorithm. As
    > you've seen, it tends to be a little arrogant, greatly exaggerating
    > the score for the first category and giving tiny scores to the
    > rest. I'm sure there are better algorithms that could be used
    > there, but in many cases either one doesn't really care about the
    > actual scores, or one (*ahem*) does something ad hoc like taking
    > the square root of all the scores, or the fifth root, or whatever,
    > just to get some numbers that look better to end users.

    Just to add a note here: Ken is correct -- both NB and SVMs are known
    to be rather poor at providing accurate probabilities. Their scores
    tend to be too extreme. Producing good probabilities from these
    scores is called calibrating the classifier, and it's more complex
    than just taking a root of the score. There are several methods for
    calibrating scores. The good news is that there's an effective one
    called isotonic regression (or Pool Adjacent Violators) which is
    pretty easy and fast. The bad news is that there's no plug-in
    (i.e., CPAN-ready) perl implementation of it (I've got a simple
    implementation which I should convert and contribute someday).
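
    Just to give a flavor of how small it is, the core of PAV is roughly
    this (an off-the-cuff sketch, not the implementation I mentioned):
    sort held-out examples by classifier score, then repeatedly pool
    adjacent bins whose average outcomes violate monotonicity:

      # @examples = ([score, outcome], ...) from a held-out set,
      # where outcome is 0 or 1
      sub pav_calibrate {
          my @examples = sort { $a->[0] <=> $b->[0] } @_;
          my @blocks;              # each block: [sum_outcomes, count, min_score]
          for my $e (@examples) {
              push @blocks, [ $e->[1], 1, $e->[0] ];
              # pool adjacent violators: bin means must be non-decreasing
              while (@blocks > 1
                     and $blocks[-1][0] / $blocks[-1][1]
                       < $blocks[-2][0] / $blocks[-2][1]) {
                  my $b = pop @blocks;
                  $blocks[-1][0] += $b->[0];
                  $blocks[-1][1] += $b->[1];
              }
          }
          # return [ [min_score, calibrated_probability], ... ], ascending
          return [ map { [ $_->[2], $_->[0] / $_->[1] ] } @blocks ];
      }

      # look up the calibrated probability for a new score
      sub calibrated {
          my ($map, $score) = @_;
          my $p = $map->[0][1];
          for my $bin (@$map) {
              $p = $bin->[1] if $score >= $bin->[0];
          }
          return $p;
      }

    You'd build the map once per classifier (or per category) from a
    held-out set, then run every new score through calibrated().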

    If you want to read about classifier calibration, google one of these
    titles:

    "Transforming classifier scores into accurate multiclass probability
    estimates"
    by Bianca Zadrozny and Charles Elkan

    "Predicting Good Probabilities With Supervised Learning"
    by A. Niculescu-Mizil and R. Caruana

    Regards,
    -Tom
  • Ken Williams at Jan 9, 2007 at 4:38 am

    On Jan 8, 2007, at 10:51 AM, Tom Fawcett wrote:

    > Just to add a note here: Ken is correct -- both NB and SVMs are
    > known to be rather poor at providing accurate probabilities. Their
    > scores tend to be too extreme. Producing good probabilities from
    > these scores is called calibrating the classifier, and it's more
    > complex than just taking a root of the score. There are several
    > methods for calibrating scores. The good news is that there's an
    > effective one called isotonic regression (or Pool Adjacent
    > Violators) which is pretty easy and fast. The bad news is that
    > there's no plug-in (i.e., CPAN-ready) perl implementation of it
    > (I've got a simple implementation which I should convert and
    > contribute someday).
    >
    > If you want to read about classifier calibration, google one of
    > these titles:
    >
    > "Transforming classifier scores into accurate multiclass
    > probability estimates"
    > by Bianca Zadrozny and Charles Elkan
    >
    > "Predicting Good Probabilities With Supervised Learning"
    > by A. Niculescu-Mizil and R. Caruana

    Cool, thanks for the references. It might be nice to add some such
    scheme to Algorithm::NaiveBayes (and friends), so that the user has
    a choice of several normalization schemes, including "none". If I
    get a surplus of tuits I'll add it, or if you feel like contributing
    your stuff, that would be great too.
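
    In the meantime a user can get the "choice of schemes" effect with a
    thin wrapper around predict() - pass a normalizer coderef, or
    nothing at all for the raw scores. This is just a hypothetical
    user-level shim, nothing that exists in A::NB itself; $nb is a
    trained Algorithm::NaiveBayes and %attrs a test document:

      sub predict_with {
          my ($nb, $attrs, $normalize) = @_;
          my $scores = $nb->predict(attributes => $attrs);
          return $normalize ? $normalize->($scores) : $scores;   # 'none'
      }

      # e.g. plain sum-to-one normalization as one pluggable scheme
      my $sum_to_one = sub {
          my %s   = %{ $_[0] };
          my $sum = 0;  $sum += $_ for values %s;
          $s{$_} /= $sum for keys %s;
          return \%s;
      };
      my $probs = predict_with($nb, \%attrs, $sum_to_one);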

    -Ken

Discussion Overview
group: ai
categories: perl
posted: Jan 5, '07 at 1:10p
active: Jan 9, '07 at 4:38a
posts: 4
users: 3
website: perl.org
