Grokbase Groups Perl ai February 2005
Hi Jason,

Most likely, the reason this isn't working is that the training data
isn't adequate for the task. You're feeding it a bunch of
examples where the input string exactly matches the output category,
with only one training example for each category, but then asking it at
run-time to switch gears and deal with noisy data. It simply
doesn't have enough information to extrapolate a profile for each
category.

What this means is that in order to use AI::Categorizer in the obvious
way for this project, you're going to have to get your hands on some
training data that has the same statistical properties as what you'll
see at run time. That means noisy data, with all the mistypings and
invalid information, and each noisy string mapped to its correction
(the "category").
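
To make that concrete, here is a minimal sketch of what such a training set might look like in plain Perl, before it goes anywhere near a learner. The sample strings and the helper name build_training_docs are made up for illustration; the name/content/categories keys simply mirror the make_document call in the quoted message below.

```perl
use strict;
use warnings;

# Hypothetical sample: each correct description (the "category") maps to
# noisy strings that were actually observed for it. Real data would come
# from your own import history, not from this list.
my %observed = (
    'VOLVO FH 12'           => [ 'VOLVO', 'VOLVO FH12', 'VOVLO FH 12' ],
    'WARRIOR 14-160 14-160' => [ 'WARRIOR', 'WARIOR 14-160', 'WARRIOR 14 160' ],
);

# Flatten the mapping into one training record per noisy string.
sub build_training_docs {
    my ($observed) = @_;
    my @docs;
    my $i = 0;
    for my $category (sort keys %$observed) {
        for my $noisy (@{ $observed->{$category} }) {
            push @docs, {
                name       => 'doc' . $i++,
                content    => $noisy,        # what the data capturer typed
                categories => [$category],   # what it should have been
            };
        }
    }
    return \@docs;
}

my $docs = build_training_docs(\%observed);
printf "%-16s => %s\n", $_->{content}, $_->{categories}[0] for @$docs;
```

The point is only the shape: several noisy examples per category, rather than one clean example that happens to equal the category name.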

If you're coming into this project to try to automate a process that's
been happening by hand for a while, perhaps you can get your hands on
the mistyped/erroneous strings and their corrections, and use that set
as training data. If not, you may have to spend some time (or hire
someone) hand-categorizing your input.

If you hang around AI stuff long enough, you'll realize that this issue
is often the *main* obstacle to doing machine learning, and you'll
understand why people often call their training set "gold data". =)

If coming up with a good set of training data isn't an option for this
project, you might try a different approach altogether. For instance,
re-cast the problem as a search-engine problem, where your "documents"
are your 8200 description strings, your "words" are all the
character-n-grams (substrings of length n) from those strings, and your
"queries" are the noisy strings you're trying to clean up. Sometimes
that works pretty well.
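
A rough sketch of that n-gram idea, using only core Perl, might look like the following. The choice of n = 3, the space padding, the Dice-style score, and the function names are all my assumptions for illustration; a real search engine would weight rare n-grams more heavily.

```perl
use strict;
use warnings;

my $N = 3;    # n-gram length; an assumption, tune it for your data

# Return the distinct character n-grams of a string as a hash ref.
sub ngrams {
    my ($s) = @_;
    $s = uc $s;
    $s =~ s/^\s+|\s+$//g;    # trim
    $s =~ s/\s+/ /g;         # collapse runs of whitespace
    $s = " $s ";             # pad so word edges form n-grams too
    my %grams;
    $grams{ substr($s, $_, $N) } = 1 for 0 .. length($s) - $N;
    return \%grams;
}

# Build an inverted index: n-gram => list of description ids.
sub build_index {
    my ($descriptions) = @_;
    my (%index, @sizes);
    for my $id (0 .. $#$descriptions) {
        my $g = ngrams($descriptions->[$id]);
        $sizes[$id] = scalar keys %$g;
        push @{ $index{$_} }, $id for keys %$g;
    }
    return { index => \%index, sizes => \@sizes, docs => $descriptions };
}

# Score each description sharing at least one n-gram with the query,
# using 2 * shared / (query size + description size).
# Returns (description, score), or an empty list if nothing overlaps.
sub best_match {
    my ($idx, $query) = @_;
    my $q = ngrams($query);
    my %shared;
    for my $gram (keys %$q) {
        $shared{$_}++ for @{ $idx->{index}{$gram} || [] };
    }
    my $qsize = scalar keys %$q;
    my ($best, $best_score) = (undef, 0);
    for my $id (sort { $a <=> $b } keys %shared) {
        my $score = 2 * $shared{$id} / ($qsize + $idx->{sizes}[$id]);
        ($best, $best_score) = ($id, $score) if $score > $best_score;
    }
    return defined $best ? ($idx->{docs}[$best], $best_score) : ();
}

my $idx = build_index([
    'VOLVO FH 12', 'WARRIOR 14-160 14-160', 'PORSCHE 911 CARRERA',
]);
my ($match, $score) = best_match($idx, 'WARIOR 14-160');
printf "%s (%.2f)\n", $match, $score if defined $match;
```

The inverted index means each query only touches descriptions that share at least one n-gram with it, which should keep lookups cheap even with all 8200 strings loaded.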

Or you could try the Levenshtein edit distance that Samy suggested. Or
you could try something else that you invent. =)
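
For the edit-distance route, here is a small pure-Perl sketch of the standard dynamic-programming algorithm plus a brute-force nearest-description lookup. The function names are mine, and the linear scan over every candidate is an assumption that 8200 strings is small enough to get away with; there are also Levenshtein modules on CPAN if you'd rather not roll your own.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance: the minimum number of single-character insertions,
# deletions, and substitutions needed to turn $s into $t.
# Only two rows of the DP table are kept at a time.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return (candidate, distance) for the candidate closest to $query,
# comparing case-insensitively. Ties go to the earliest candidate.
sub closest {
    my ($query, $candidates) = @_;
    my ($best, $best_dist);
    for my $cand (@$candidates) {
        my $d = levenshtein(uc $query, uc $cand);
        ($best, $best_dist) = ($cand, $d)
            if !defined $best_dist || $d < $best_dist;
    }
    return ($best, $best_dist);
}

my @valid = ('VOLVO FH 12', 'WARRIOR 14-160 14-160', 'PORSCHE 911 CARRERA');
my ($match, $dist) = closest('WARIOR 14-160 14-160', \@valid);
print "$match (distance $dist)\n";
```

One caveat: whole-string edit distance charges for every missing character, so a truncated input like a bare 'WARRIOR' can end up closer to a short unrelated description than to the long correct one. It tends to suit typos better than abbreviations.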


On Feb 4, 2005, at 4:18 AM, Jason Armstrong wrote:

Perhaps someone on this list has some good advice for me. I am working
on a project that imports vehicle descriptions. Very often, the data
capturers give invalid information, or mistyped data. I am looking for
a way to intelligently reformat the data, and add the mistyped entry for
future use.

I should add that I have very little experience in AI or machine
learning. I don't mind spending a day or two reading up, but my focus
is more on implementing something practical, either in perl or in C.

I've been looking at AI::Categorizer. I have a list of all valid
descriptions (about 8200). I create for each of these a knowledge set,
with the content the same as the category:


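    use AI::Categorizer;

    my $c = AI::Categorizer->new(
        knowledge_set => AI::Categorizer::KnowledgeSet->new(
            name => 'Vehicles',
        ),
        learner_class => 'AI::Categorizer::Learner::NaiveBayes',
    );

    my $l = $c->learner;

    # One document per valid description: content and category are identical.
    # @vehicle_descriptions stands for the list of valid descriptions.
    my %docs;
    my $i = 0;
    foreach my $content (@vehicle_descriptions) {
        $docs{$i}->{content}      = $content;
        $docs{$i++}->{categories} = [$content];
    }

    foreach (keys %docs) {
        $c->knowledge_set->make_document(name => $_, %{ $docs{$_} });
    }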


Sometimes it works well:

input: VOLVO
output: VOLVO FH 12

Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):

input: WARRIOR

In fact, the 'PORSCHE 911 CARRERA' category gets returned most often
(are you sure this is artificial intelligence ?-))

Even when I add the content directly:

$content = 'WARRIOR';
$category = 'WARRIOR 14-160 14-160';
name => $i++, categories => [$category], content =>

I still get the above result.

I also have some problems with saving the training set. After the above
example, I do: $l->save_state(directory), and then exit the program.
When I start it up again:

if (-d directory) {

But then when I try to do anything:

Can't call method "predict" on an undefined value at
/usr/local/share/perl/5.8.4/AI/Categorizer/Learner/ line

Two other things:

1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to
access the data via a database, for example?

I'm ideally looking for something similar to DSpam, which can rate a
description and suggest the best category that it belongs in.

Are there any suggestions?

Thank-you in advance.

Jason Armstrong
