For each example I have the actual class labels (binary). The data is quite unbalanced, with about 10% or less of the examples having a positive outcome and the remainder, of course, negative. Nothing suggests the data has any inherent order, and it doesn't appear to have any, so I've pulled the first 30,000 examples to use as training data, reserving the remaining ~20,000 for test data.
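
In rough outline, the split looks like this (dat and labels are simplified stand-ins for my actual predictor data frame and label vector):

    ## outline of the split (object names simplified)
    n_train <- 30000
    train_x <- dat[1:n_train, ]
    train_y <- labels[1:n_train]
    test_x  <- dat[-(1:n_train), ]   # remaining ~20,000 rows held out
    test_y  <- labels[-(1:n_train)]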

There are actually 3 distinct sets of class labels associated with the predictor data, and I've built 3 distinct models. When each model is run through predict() on the training data and scored against the true class labels, I get AUC values of 0.95, 0.98 and 0.98 for the 3 classifier problems.
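
For one of the three problems, the fit and the training-data AUC are computed roughly along these lines (e1071's naiveBayes; I happen to use ROCR for the AUC, and the positive-class column index depends on the factor levels):

    library(e1071)
    library(ROCR)   # one way to get AUC; other packages would do

    fit1 <- naiveBayes(train_x, factor(train_y))

    ## posterior probability of the positive class, on the training data
    p_train <- predict(fit1, train_x, type = "raw")[, 2]

    auc_train <- performance(prediction(p_train, train_y), "auc")@y.values[[1]]
    auc_train   # 0.95-0.98 across the three problems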

When I run these models against the 'unknown' inputs that I held out (the 20,000 test instances), I get AUC values of roughly 0.55 for each of the three problems. I reran the entire experiment using 40,000 instances for model building and the remaining 10,000 for testing; the AUC values showed a modest improvement, but stayed under 0.60.
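
The check on the held-out data is the same, just scored on the test rows:

    ## same model, scored on the held-out rows
    p_test   <- predict(fit1, test_x, type = "raw")[, 2]
    auc_test <- performance(prediction(p_test, test_y), "auc")@y.values[[1]]
    auc_test   # roughly 0.55-0.60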

I've looked at a) the number of unique values that each predictor takes on, and b) the number of values, for a given predictor, that appear in the test data but do not appear in the training data. I can eliminate variables that have very few non-null values, and those that have very few unique values (the two groups are largely the same), but I wouldn't expect this to have any influence on the model.
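
Those two counts were computed per column, roughly like this (same simplified object names as above):

    ## a) unique non-missing values per predictor (training data)
    n_unique <- sapply(train_x, function(v) length(unique(na.omit(v))))

    ## b) test-set values per predictor never seen in the training data
    n_unseen <- mapply(function(tr, te) sum(!(unique(na.omit(te)) %in% tr)),
                       train_x, test_x)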

I've already eliminated variables that are null in every instance, and duplicate variables having identical values for all instances. I have not done anything further to check for dependent variables, and don't know how to.
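
The cleanup so far amounts to something like this:

    ## drop predictors that are null/NA in every instance
    keep    <- !sapply(train_x, function(v) all(is.na(v)))
    train_x <- train_x[, keep]

    ## drop predictors that duplicate another column exactly
    dups    <- duplicated(as.list(train_x))
    train_x <- train_x[, !dups]

    ## keep the test columns aligned with the training columns
    test_x  <- test_x[, names(train_x)]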

Besides getting a clue, what might be my next best step?
