Grokbase Groups Perl ai February 2005
Perhaps someone on this list has some good advice for me. I am working
on a project that imports vehicle descriptions. Very often, the data
capturers enter invalid or mistyped data. I am looking for a way to
intelligently reformat the data, and to record each mistyped entry for
future use.

I should add that I have very little experience in AI or machine
learning. I don't mind spending a day or two reading up, but my focus is
more on implementing something practical, either in perl or in C.

I've been looking at AI::Categorizer. I have a list of all valid vehicle
descriptions (about 8200). For each of these I create a document in the
knowledge set, with the content the same as the category:

Briefly:

use AI::Categorizer;
use AI::Categorizer::KnowledgeSet;

my $c = AI::Categorizer->new(
    knowledge_set => AI::Categorizer::KnowledgeSet->new(name => 'Vehicles'),
    learner_class => 'AI::Categorizer::Learner::NaiveBayes',
);

my $l = $c->learner;

# One document per valid description; content and category are the same string.
my %docs;
my $i = 0;
foreach my $content (@vehicle_descriptions) {
    $docs{$i}{content}    = $content;
    $docs{$i}{categories} = [$content];
    $i++;
}

foreach (keys %docs) {
    $c->knowledge_set->make_document(name => $_, %{ $docs{$_} });
}

$l->train;


Sometimes it works well:

input: VOLVO
output: VOLVO FH 12

Sometimes not (there is one category called 'WARRIOR 14-160 14-160'):

input: WARRIOR
output: PORSCHE 911 CARRERA

In fact, the 'PORSCHE 911 CARRERA' category gets returned most often
(are you sure this is artificial intelligence ?-))

Even when I add the content directly:

my $content  = 'WARRIOR';
my $category = 'WARRIOR 14-160 14-160';
$c->knowledge_set->make_document(
    name => $i++, categories => [$category], content => $content);

I still get the above result.

I also have some problems with saving the trained learner. After the above
example, I call $l->save_state($dir) and then exit the program. When I
start it up again:

if (-d $dir) {
    $l->restore_state($dir);
}

But then when I try to do anything:

Can't call method "predict" on an undefined value at
/usr/local/share/perl/5.8.4/AI/Categorizer/Learner/NaiveBayes.pm line 28
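
(Or do I need to capture a return value from restore_state()? I am only
guessing here, but if it is a class method that returns the restored
learner rather than filling in the existing object, then I would need
something like this instead:)

my $dir = 'learner-state';   # wherever save_state() wrote its files
if (-d $dir) {
    # guess: capture the returned learner instead of discarding it
    $l = AI::Categorizer::Learner::NaiveBayes->restore_state($dir);
}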


Two other things:

1. SVM takes forever, and then crashes after consuming all the memory.
2. Does everything need to be loaded into memory, or is there a way to
access the data via a database, for example?

I'm ideally looking for something similar to DSpam, which can rate a
description and suggest the best category that it belongs in.

Are there any suggestions?

Thank you in advance.

--
Jason Armstrong


  • Samy Kamkar at Feb 4, 2005 at 5:21 pm
    Another module you may want to use in conjunction is String::Approx. It
    uses the Levenshtein edit distance to determine whether a string
    approximately matches another or not. In both your volvo and warrior
    cases, it would match correctly.
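
For example, something along these lines (a rough sketch only:
@categories would hold your 8200 valid descriptions, and I'm ignoring
the sign adist() puts on its distances):

use String::Approx qw(adist);

sub closest_category {
    my ($input, @categories) = @_;
    # Edit distance from the input to each valid description.
    my @dist = adist($input, @categories);
    my ($best, $best_dist);
    for my $j (0 .. $#categories) {
        my $d = abs $dist[$j];
        if (!defined $best_dist || $d < $best_dist) {
            ($best, $best_dist) = ($categories[$j], $d);
        }
    }
    return $best;
}

print closest_category('WARRIOR', @categories), "\n";   # hopefully 'WARRIOR 14-160 14-160'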

    Good luck!
  • Tim Allwine at Feb 4, 2005 at 6:15 pm

I'm using AI::Categorizer to categorize books and have many of the same
questions as you. AI::Categorizer::Learner::KNN is working the best, and,
like you, whenever I try AI::Categorizer::Learner::SVM it blows up every
time. I even moved it off onto a 64-bit machine with 8 GB of memory and it
still won't run. We have over 10,000 trained books, using the text
supplied by Amazon as the training material.

    -Tim
  • Marco Baroni at Feb 4, 2005 at 6:21 pm
It is not in perl, but SVMlight (http://svmlight.joachims.org/) offers a
very efficient (in my experience) C implementation of support vector
machines, and, being a command-line tool, it is easy to interface with
from perl.
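
Roughly like this (a sketch only; the file names are placeholders, and
mapping your text onto SVMlight's sparse "label feature:value ..." input
format is up to you):

# One line per example: a numeric label followed by feature:value pairs,
# with features numbered from 1 upward.
open my $fh, '>', 'train.dat' or die "train.dat: $!";
for my $ex (@examples) {            # assumed shape: [ $label, { feature_no => value } ]
    my ($label, $feats) = @$ex;
    print $fh join(' ', $label,
        map { "$_:$feats->{$_}" } sort { $a <=> $b } keys %$feats), "\n";
}
close $fh;

system('svm_learn', 'train.dat', 'model.dat') == 0
    or die "svm_learn failed: $?";
system('svm_classify', 'test.dat', 'model.dat', 'predictions.dat') == 0
    or die "svm_classify failed: $?";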

    Regards,

    Marco




    --
    Marco Baroni
    SSLMIT, University of Bologna
    http://sslmit.unibo.it/~baroni
  • Ken Williams at Feb 5, 2005 at 2:36 am
    Hi Jason,

Most likely, the reason this isn't working is that the training data
isn't adequate for the task. Essentially you're feeding it a bunch of
examples where the input string exactly matches the output category,
with only one training example per category, but then asking it at
run-time to switch gears and deal with noisy data. As a result, it
doesn't have enough information to extrapolate a profile for each
category.

    What this means is that in order to use AI::Categorizer in the obvious
    way for this project, you're going to have to get your hands on some
    training data that has the same statistical properties as what you'll
    see at run time. That means noisy data, with all the mistypings and
    invalid information, and each noisy string mapped to its correction
    (the "category").

    If you're coming into this project to try to automate a process that's
    been happening by hand for a while, perhaps you can get your hands on
    the mistyped/erroneous strings and their corrections, and use that set
    as training data. If not, you may have to spend some time (or hire
    someone) hand-categorizing your input.

    If you hang around AI stuff long enough, you'll realize that this issue
    is often the *main* obstacle to doing machine learning, and you'll
    understand why people often call their training set "gold data". =)

    If coming up with a good set of training data isn't an option for this
    project, you might try a different approach altogether. For instance,
    re-cast the problem as a search-engine problem, where your "documents"
    are your 8200 description strings, your "words" are all the
    character-n-grams (substrings of length n) from those strings, and your
    "queries" are the noisy strings you're trying to clean up. Sometimes
    that works pretty well.
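
For instance, a rough sketch of that idea (trigrams and a plain overlap
count are arbitrary choices here; @descriptions is assumed to hold your
8200 strings):

# Break a string into overlapping character trigrams.
sub ngrams {
    my ($s, $n) = @_;
    $s = lc $s;
    return map { substr($s, $_, $n) } 0 .. length($s) - $n;
}

# Index every description by its trigrams.
my %index;                        # trigram => { description => count }
for my $desc (@descriptions) {
    $index{$_}{$desc}++ for ngrams($desc, 3);
}

# Score a noisy query by how many trigrams it shares with each description.
sub best_match {
    my ($query) = @_;
    my %score;
    for my $g (ngrams($query, 3)) {
        $score{$_} += $index{$g}{$_} for keys %{ $index{$g} || {} };
    }
    my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
    return $best;                 # undef if nothing overlaps at all
}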

    Or you could try the Levenshtein edit distance that Samy suggested. Or
    you could try something else that you invent. =)

    -Ken

  • Richard Jelinek at Feb 5, 2005 at 7:26 am
    Hi Ken,
    On Fri, Feb 04, 2005 at 08:36:10PM -0600, Ken Williams wrote:
    What this means is that in order to use AI::Categorizer in the obvious
    way for this project, you're going to have to get your hands on some
    training data that has the same statistical properties as what you'll
    see at run time. That means noisy data, with all the mistypings and
    invalid information, and each noisy string mapped to its correction
    (the "category").

True, true. And while that is the case, the reports about the
nonfunctional SVM learner are also true. At least I can confirm them,
and I mentioned them here some time ago already.

    What can/will "we" do about this?


    --
    best regards,

    Dipl.-Inf. Richard Jelinek

    - The PetaMem Group - Prague/Nuremberg - www.petamem.com -
    -= 3394928 Mind Units =-
  • Ken Williams at Feb 5, 2005 at 2:09 pm

    On Feb 5, 2005, at 1:26 AM, Richard Jelinek wrote:
    True true. And while this is true, the reports about nonfunctional SVM
    are also true. At least I can confirm them and have mentioned them
    here some time ago already.

    What can/will "we" do about this?
    Oh yes, sorry I forgot to address this in my message.

    I remember I looked into this a couple years ago, when Algorithm::SVM
    was first released, and couldn't figure out a solution. I wrote to the
    authors of libsvm, and got the impression from them that it's sort of
    just the way that code is on large data sets, and we'd be better off
    using bsvm instead.

    http://www.csie.ntu.edu.tw/~cjlin/

    I think there may be other alternatives by now too, but unfortunately I
    haven't had any time to look into them, and I think this cause is sort
    of "waiting for a champion."

    -Ken
  • Jason Armstrong at Feb 5, 2005 at 5:23 pm
    Thanks for all the good feedback, I'll certainly be following up on it.

I did find one reason why I wasn't getting good matches ... when I
looked more carefully at the perl data structure, I found that the
'features' hash contained only purely alphabetic tokens. So, for example,
in the string 'WARRIOR 14-160 14-160', only the 'warrior' part was being
used. Also, with 'BMW 318i' and 'BMW 525i', the numbers were being
ignored, and something like 'A/T' produced two separate features, 'a'
and 't'.

So my further question is how to get NaiveBayes to use whitespace-separated
words as features ('318i', 'a/t') rather than just the alphabetic pieces.
Is it a simple option when calling AI::Categorizer->new?

    --
    Jason Armstrong
  • Ken Williams at Feb 6, 2005 at 4:20 pm
    Aha, yes.

    AI::Categorizer lets you customize the tokenization behavior to be
    however you want, by subclassing the Document class and overriding the
    tokenize() method. You could do something like this:

{
    package My::Documents;
    use AI::Categorizer::Document::Text;
    our @ISA = qw(AI::Categorizer::Document::Text);

    # Split on whitespace only, so tokens like '318i' and 'a/t' stay intact.
    sub tokenize {
        return [ split ' ', $_[1] ];
    }
}

my $c = AI::Categorizer->new(
    document_class => 'My::Documents',
);

    ...

    -Ken
