Grokbase Groups Perl ai February 2005
Another module you may want to use in conjunction is String::Approx. It
uses the Levenshtein edit distance to determine whether a string
approximately matches another or not. In both your volvo and warrior
cases, it would match correctly.
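
To give a rough feel for why approximate matching helps here, below is a minimal plain-Perl sketch of edit-distance substring matching. It is not String::Approx's implementation or API; the helper names `approx_substring_distance` and `best_matches`, the misspelled input 'WARIOR', and the one-edit threshold are assumptions made up for illustration. The three descriptions are the ones mentioned in this thread.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: smallest number of single-character edits
# (insert, delete, substitute) needed to turn $pattern into some
# substring of $text.
sub approx_substring_distance {
    my ($pattern, $text) = @_;
    my @p = split //, $pattern;
    my @t = split //, $text;
    my $m = scalar @p;
    my $n = scalar @t;

    # Row 0 is all zeros: a match may start at any position in $text.
    my @prev = (0) x ($n + 1);

    for my $i (1 .. $m) {
        my @curr = ($i);    # $i pattern characters against empty text
        for my $j (1 .. $n) {
            my $cost = $p[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # pattern character left unmatched
                $curr[$j - 1] + 1,        # extra character in the text
                $prev[$j - 1] + $cost,    # exact match or substitution
            );
        }
        @prev = @curr;
    }

    # A match may also end anywhere, so take the best end position.
    return min(@prev);
}

# Hypothetical helper: candidates within $max_edits of $input,
# closest first, compared case-insensitively.
sub best_matches {
    my ($input, $max_edits, @candidates) = @_;
    my %dist = map { $_ => approx_substring_distance(uc $input, uc $_) }
               @candidates;
    return sort { $dist{$a} <=> $dist{$b} || $a cmp $b }
           grep { $dist{$_} <= $max_edits } @candidates;
}

my @valid = ('VOLVO FH 12', 'WARRIOR 14-160 14-160', 'PORSCHE 911 CARRERA');
print join(', ', best_matches('WARIOR', 1, @valid)), "\n";
```

With a tolerance of one edit, the misspelled 'WARIOR' should select only 'WARRIOR 14-160 14-160', since the other two descriptions need several edits. For 8200 descriptions a tuned module is likely the better choice; this only shows the idea.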

Good luck!
On Feb 4, 2005, at 2:18 AM, Jason Armstrong wrote:

Perhaps someone on this list has some good advice for me. I am working
on a project that imports vehicle descriptions. Very often, the data
capturers give invalid information, or mistyped data. I am looking for
a way to intelligently reformat the data, and add the mistyped entry for
future use.

I should add that I have very little experience in AI or machine
learning. I don't mind spending a day or two reading up, but my focus
is more on implementing something practical, either in Perl or in C.

I've been looking at AI::Categorizer. I have a list of all valid
descriptions (about 8200). I create for each of these a knowledge set,
with the content the same as the category:


my $c = new AI::Categorizer(
    knowledge_set => AI::Categorizer::KnowledgeSet->new
        ( name => 'Vehicles', ),
    learner_class => 'AI::Categorizer::Learner::NaiveBayes');

my $l = $c->learner;

my %docs;
foreach (vehicle descriptions) {
    $docs{$i}->{content} = $content;
    $docs{$i++}->{category} = [$content];
}

foreach (keys %docs) {
    $c->knowledge_set->make_document(name => $_, %{$docs->{$_}});
}


Sometimes it works well:

input: VOLVO
output: VOLVO FH 12

Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):

input: WARRIOR

In fact, the 'PORSCHE 911 CARRERA' category gets returned most often
(are you sure this is artificial intelligence ?-))

Even when I add the content directly:

$content = 'WARRIOR';
$category = 'WARRIOR 14-160 14-160';
name => $i++, categories => [$category], content =>

I still get the above result.

I also have some problems with saving the training set. After the above
example, I do: $l->save_state(directory), and then exit the program.
When I start it up again:

if (-d directory) {

But then when I try to do anything:

Can't call method "predict" on an undefined value at
/usr/local/share/perl/5.8.4/AI/Categorizer/Learner/ line

Two other things:

1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to
access the data via a database, for example?

I'm ideally looking for something similar to DSpam, which can rate a
description and suggest the best category that it belongs in.

Are there any suggestions?

Thank you in advance.

Jason Armstrong
