r-help, August 2012: Analyzing Poor Performance Using naiveBayes
My data is 50,000 instances of about 200 predictor variables, and for all 50,000
examples I have the actual class labels (binary). The data is quite
unbalanced, with about 10% or less of the examples having a positive outcome
and the remainder, of course, negative. Nothing suggests the data has any
order, and it doesn't appear to have any, so I've pulled the first 30,000
examples to use as training data, reserving the remainder for test data.

There are actually 3 distinct sets of class labels associated with the
predictor data, and I've built 3 distinct models. When each model is used in
predict() with the training data and true class labels, I get AUC values of
0.95, 0.98 and 0.98 for the 3 classifier problems.
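
For concreteness, a minimal sketch of this kind of workflow: it assumes naiveBayes comes from the e1071 package, uses ROCR for the AUC, and refers to a placeholder data frame dat holding the ~200 predictors plus a binary factor column label (none of these names appear in the original post):

    library(e1071)   # naiveBayes()
    library(ROCR)    # prediction(), performance() for AUC

    # first 30,000 rows as training data, the remaining 20,000 as test data
    train <- dat[1:30000, ]
    test  <- dat[30001:50000, ]

    fit <- naiveBayes(label ~ ., data = train)

    # posterior probability of the positive class
    # (assumes the positive class is the second factor level)
    p_train <- predict(fit, train, type = "raw")[, 2]

    auc <- function(scores, labels)
        performance(prediction(scores, labels), "auc")@y.values[[1]]

    auc(p_train, train$label)   # training-set AUC, as in the 0.95-0.98 figures above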

When I run these models against the 'unknown' inputs that I held out--the
20,000 instances--I get AUC values of about 0.55 or so for each of the three
problems, give or take. I reran the entire experiment, but instead using
40,000 instances for the model building, and the remaining 10,000 for
testing. The AUC values showed a modest improvement, but still under 0.60.

I've looked at a) the number of unique values that each predictor takes on,
and b) the number of values, for a given predictor, that appear in the test
data but not in the training data. I can eliminate variables that have very
few non-null values, and those that have very few unique values (the two
groups are largely the same), but I wouldn't expect this to have any
influence on the model.
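
A rough way to tabulate a) and b) in base R, reusing the placeholder train/test objects sketched above (the column name label is again an assumption):

    pred_cols <- setdiff(names(train), "label")

    # (a) number of distinct values each predictor takes in the training data
    n_unique <- sapply(train[pred_cols], function(x) length(unique(x)))

    # (b) per predictor, how many test-set values never occur in the training data
    n_unseen <- mapply(function(tr, te) sum(!(unique(te) %in% unique(tr))),
                       train[pred_cols], test[pred_cols])

    summary(n_unique)
    summary(n_unseen)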

I've already eliminated variables that are null in every instance, and
duplicate variables that have identical values for all instances. I have not
done anything further to check for dependent (correlated) predictors, and
don't know how to.
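
One simple check for strongly dependent (highly correlated) predictors, sketched here for numeric columns only and assuming the caret package is available; this is only one of several possible approaches:

    library(caret)   # findCorrelation()

    num_cols <- names(train)[sapply(train, is.numeric)]
    num_cols <- setdiff(num_cols, "label")          # keep only the predictors

    cm <- cor(train[num_cols], use = "pairwise.complete.obs")

    # predictors involved in pairwise correlations above 0.9 are candidates to drop
    high_cor <- findCorrelation(cm, cutoff = 0.9)
    num_cols[high_cor]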

Besides getting a clue, what might be my next best step?






  • C.H. at Aug 10, 2012 at 3:46 am
    I think you have been hit by the problem of high variance (overfitting).

    Maybe you should consider doing feature selection, perhaps using the
    chi-squared ranking from FSelector.

    Then train the naive Bayes model using the top n features (n = 1 to
    200) as ranked by chi-squared, and plot the AUC or F1 score on both
    the training set and a held-out (cross-validation) set against n.
    From that graph you can select the optimal n; a rough sketch follows
    at the end of this message.

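    Here is the kind of loop I have in mind, as a rough sketch only: it assumes
    the FSelector, e1071 and ROCR packages, the placeholder train/test objects
    from earlier in the thread, a factor column label whose second level is the
    positive class, and it uses the held-out set (labelled test here) for the
    tuning curve, though a separate validation slice of the training data would
    be cleaner.

        library(FSelector)   # chi.squared(), cutoff.k(), as.simple.formula()
        library(e1071)
        library(ROCR)

        w <- chi.squared(label ~ ., data = train)    # chi-squared weight per predictor

        auc <- function(scores, labels)
            performance(prediction(scores, labels), "auc")@y.values[[1]]

        ns  <- seq(5, 200, by = 5)
        res <- t(sapply(ns, function(n) {
            f   <- as.simple.formula(cutoff.k(w, n), "label")  # label ~ top-n predictors
            fit <- naiveBayes(f, data = train)
            c(train = auc(predict(fit, train, type = "raw")[, 2], train$label),
              test  = auc(predict(fit, test,  type = "raw")[, 2], test$label))
        }))

        matplot(ns, res, type = "l", lty = 1,
                xlab = "n (top-ranked features)", ylab = "AUC")
        legend("bottomright", legend = colnames(res), lty = 1, col = 1:2)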
  • Kirk Fleming at Aug 10, 2012 at 7:16 pm
    Per your suggestion I ran chi.squared() against my training data and, to my
    delight, found that just 50 predictors had non-zero chi-squared scores. I
    built the model through several iterations and found n = 12 to be the
    optimum for the training data.

    However, the results are still not so good for the test data. Here are the
    results for both, with the AUC values for n = 3 to 50: training data in the
    0.97 range, test data in the 0.55 area.

    http://r.789695.n4.nabble.com/file/n4639964/Feature_Selection_02.jpg

    I'd suspect something weird about the test data, but the training and test
    sets look indistinguishable by any descriptive, 'meta' statistics I've
    tried so far. Having double-checked for dumb errors and still obtained the
    same results, I toasted everything and started from scratch, and still got
    the same performance on the test data.

    Maybe I'll take a break and reflect for 30 minutes.



  • Kirk Fleming at Aug 10, 2012 at 10:05 pm
    As some additional information, I re-ran the model across the range n = 50
    to 150 (n being the 'top n' predictors returned by chi.squared), and this
    time used a completely different subset of the data for both training and
    test. Nearly identical results, with the typical training AUC about 0.98 and
    the typical test AUC about 0.56. The other change I made: 30k records
    (instances) for training this time and 20k for test.

    I'll check whether the set of class labels I'm using (I'm currently only
    running one of the 3 sets) is the least balanced and, if so, I'll switch to
    the most balanced one. However, I don't think any of the three sets is much
    better than 90/10. A sketch of the random split and balance check follows
    below.



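    For reference, a sketch of a random (rather than first-N) split and of the
    balance check, with placeholder names only (dat for the full data frame,
    label1/label2/label3 for the three label sets):

        set.seed(1)
        idx   <- sample(nrow(dat), 30000)   # random 30k rows for training
        train <- dat[idx, ]
        test  <- dat[-idx, ]

        # class balance of each of the three label sets
        sapply(dat[c("label1", "label2", "label3")],
               function(y) prop.table(table(y)))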
  • Patrick Connolly at Sep 15, 2012 at 1:51 am

    I don't know where you got naiveBayes from so I can't check it, but my
    experience with boosted regression trees might be useful. I had AUC
    values fairly similar to yours with only one tenth of the number of
    instances you have.


    If naiveBayes has the ability to use a validation set, I think you'll
    find it makes a huge difference. In my case, it brought the Training
    AUC down to something like 0.85 but the test AUC was only slightly
    less, say 0.81.


    Try reserving about 20-25% of your training data for a validation set,
    then calculate your AUC on the combined training and validation data.
    It will probably go down somewhat, but your test AUC will look much
    better. A rough sketch of such a split is at the end of this message.


    I'd be interested to know what you discover.




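    If it turns out that naiveBayes (for example, the one in e1071) has no
    validation-set argument of its own, the split can simply be done by hand.
    A rough sketch, reusing the placeholder objects from earlier in the thread:

        library(e1071); library(ROCR)

        auc <- function(scores, labels)
            performance(prediction(scores, labels), "auc")@y.values[[1]]

        # carve ~25% of the training rows off as a validation set
        set.seed(1)
        val_idx <- sample(nrow(train), round(0.25 * nrow(train)))
        tr      <- train[-val_idx, ]
        val     <- train[val_idx, ]

        fit   <- naiveBayes(label ~ ., data = tr)
        trval <- rbind(tr, val)               # combined training + validation rows

        # AUC on training + validation, and on the untouched test set
        auc(predict(fit, trval, type = "raw")[, 2], trval$label)
        auc(predict(fit, test,  type = "raw")[, 2], test$label)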
