Hi,

I've just uploaded a Decision Tree module to CPAN. It provides
a pretty bare-bones implementation of the classic ID3 decision
tree building algorithm.

I plan on speaking about this module as part of my presentation
at TPC in July, so I might add a few more features to it if I
decide it needs them in order to be shown in public. See the
"LIMITATIONS" section of the docs.

Since it's a new module, I thought I'd include the docs here.

-Ken


NAME
AI::DecisionTree - Automatically Learns Decision Trees

SYNOPSIS
    use AI::DecisionTree;
    my $dtree = new AI::DecisionTree;

    # A set of training data for deciding whether to play tennis
    $dtree->add_instance
      (attributes => {outlook => 'sunny',
                      temperature => 'hot',
                      humidity => 'high'},
       result => 'no');

    $dtree->add_instance
      (attributes => {outlook => 'overcast',
                      temperature => 'hot',
                      humidity => 'normal'},
       result => 'yes');

    ... repeat for several more instances, then:
    $dtree->train;

    # Find results for unseen instances
    my $result = $dtree->get_result
      (attributes => {outlook => 'sunny',
                      temperature => 'hot',
                      humidity => 'normal'});

DESCRIPTION
The "AI::DecisionTree" module automatically creates
so-called "decision
trees" to explain a set of training data. A decision tree is
a kind of
categorizer that use a flowchart-like process for categorizing new
instances. For instance, a learned decision tree might look like the
following, which classifies for the concept "play tennis":

                OUTLOOK
                /  |  \
               /   |   \
              /    |    \
        sunny/  overcast \rainy
            /      |      \
       HUMIDITY    |      WIND
        /  \     *no*     /  \
       /    \            /    \
  high/      \normal    /      \
     /        \  strong/        \weak
  *no*      *yes*     /          \
                   *no*         *yes*

(This example, and the inspiration for the "AI::DecisionTree" module,
come directly from Tom Mitchell's excellent book "Machine Learning",
available from McGraw Hill.)

A decision tree like this one can be learned from training data, and
then applied to previously unseen data to obtain results that are
consistent with the training data.

The usual goal of a decision tree is to somehow encapsulate the training
data in the smallest possible tree. This is motivated by an "Occam's
Razor" philosophy, in which the simplest possible explanation for a set
of phenomena should be preferred over other explanations. Also, small
trees will make decisions faster than large trees, and they are much
easier for a human to look at and understand. One of the biggest reasons
for using a decision tree instead of many other machine learning
techniques is that a decision tree is a much more scrutable decision
maker than, say, a neural network.

The current implementation of this module uses an extremely simple
method for creating the decision tree based on the training instances.
It uses an Information Gain metric (based on expected reduction in
entropy) to select the "most informative" attribute at each node in the
tree. This is essentially the ID3 algorithm, developed by J. R. Quinlan
in 1986. The idea is that the attribute with the highest Information
Gain will (probably) be the best attribute to split the tree on at each
point if we're interested in making small trees.
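
To make the metric concrete, here is a rough sketch of an entropy-based
Information Gain calculation in the spirit of ID3. It is an illustration
only (not the module's internal code), and it assumes instances shaped
like the add_instance() arguments:

    use strict;
    use warnings;

    # Entropy of a list of result labels: -sum( p * log2(p) )
    sub entropy {
        my @labels = @_;
        my %count;
        $count{$_}++ for @labels;
        my $h = 0;
        for my $n (values %count) {
            my $p = $n / @labels;
            $h -= $p * log($p) / log(2);
        }
        return $h;
    }

    # Information Gain of splitting a set of instances on one attribute.
    # Each instance is { attributes => {...}, result => '...' }.
    sub information_gain {
        my ($attr, @instances) = @_;
        my $base = entropy(map { $_->{result} } @instances);
        my %subset;
        push @{ $subset{ $_->{attributes}{$attr} } }, $_->{result}
            for @instances;
        my $remainder = 0;
        $remainder += @$_ / @instances * entropy(@$_) for values %subset;
        return $base - $remainder;   # higher gain = more informative split
    }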

METHODS
new()

Creates a new decision tree object.

add_instance()

Adds a training instance to the set of instances which will be used to
form the tree. An "attributes" parameter specifies a hash of
attribute-value pairs for the instance, and a "result" parameter
specifies the result.

train()

Builds the decision tree from the list of training instances.

get_result()

Returns the most likely result (from the set of all results given to
"add_instance()") for the set of attribute values given. An "attributes"
parameter specifies a hash of attribute-value pairs for the instance. If
the decision tree doesn't have enough information to find a result, it
will return "undef".
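
For example, a defined() check covers the "undef" case (the attribute
values here are made up for illustration):

    my $result = $dtree->get_result
      (attributes => {outlook => 'rain', wind => 'weak'});

    if (defined $result) {
        print "Predicted result: $result\n";
    } else {
        print "The tree has no prediction for this instance\n";
    }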

rule_statements()

Returns a list of strings that describe the tree in rule-form. For
instance, for the tree diagram above, the following list would be
returned (though not necessarily in this order - the order is
unpredictable):

    if outlook='rain' and wind='strong' -> 'no'
    if outlook='rain' and wind='weak' -> 'yes'
    if outlook='sunny' and humidity='normal' -> 'yes'
    if outlook='sunny' and humidity='high' -> 'no'
    if outlook='overcast' -> 'yes'

This can be helpful for scrutinizing the structure of a tree.

Note that while the order of the rules is unpredictable, the order of
criteria within each rule reflects the order in which the criteria will
be checked at decision-making time.
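
For example, to dump the rules one per line:

    print "$_\n" for $dtree->rule_statements;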

LIMITATIONS
A few limitations exist in the current version. All of them could be
removed in future versions - especially with your help. =)

No continuous attributes

In the current implementation, only discrete-valued attributes are
supported. This means that an attribute like "temperature" can have
values like "cool", "medium", and "hot", but using actual temperatures
like 87 or 62.3 is not going to work. This is because the values would
split the data too finely - the tree-building process would probably
think that it could make all its decisions based on the exact
temperature value alone, ignoring all other attributes, because each
temperature would have only been seen once.

The usual way to deal with this problem is for the tree-building process
to figure out how to place the continuous attribute values into a set of
bins (like "cool", "medium", and "hot") and then build the tree based on
these bin values. Future versions of "AI::DecisionTree" may provide
support for this. For now, you have to do it yourself.
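
For example, a simple do-it-yourself binning step might look like this
(the cutoffs are arbitrary and only for illustration):

    # Map a raw numeric temperature into one of three discrete bins.
    sub temperature_bin {
        my $degrees = shift;              # e.g. degrees Fahrenheit
        return 'cool'   if $degrees < 65;
        return 'medium' if $degrees < 80;
        return 'hot';
    }

    $dtree->add_instance
      (attributes => {outlook     => 'sunny',
                      temperature => temperature_bin(87),   # 'hot'
                      humidity    => 'high'},
       result => 'no');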

No support for noisy or inconsistent data

Currently, the tree-building technique cannot handle inconsistent data.
That is, if you have two contradictory training instances, the "train()"
method will throw an exception. Future versions may allow inconsistent
data, finding the decision tree that most closely matches the data
rather than precisely matching it.
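
If you're not sure whether your data is consistent, a quick pre-check
before calling train() can flag contradictions. A minimal sketch,
assuming your instances are kept as hashrefs in an @instances array
shaped like the add_instance() arguments:

    # Flag contradictory instances: identical attributes, different results.
    my %seen;
    for my $inst (@instances) {
        my $attrs = $inst->{attributes};
        my $key   = join ';', map { "$_=$attrs->{$_}" } sort keys %$attrs;
        if (exists $seen{$key} && $seen{$key} ne $inst->{result}) {
            warn "Contradiction for [$key]: '$seen{$key}' vs '$inst->{result}'\n";
        }
        $seen{$key} = $inst->{result};
    }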

No support for tree-trimming

Most decision tree building algorithms use a two-stage building process
- first a tree is built that completely fits the training data (or fits
it as closely as possible if noisy data is supported), and then the tree
is pruned so that it will generalize well to new instances. This might
be done either by maximizing performance on a set of held-out training
instances, or by pruning parts of the tree that don't seem like they'll
be very valuable.

Currently, we build a tree that completely fits the training data, and
we don't prune it. That means that the tree may overfit the training
data in many cases - i.e., you won't be able to see the forest for the
trees (or, more accurately, the tree for the leaves).

This is mainly a problem when you're using "real world" or noisy data.
If you're using data that you know to be a result of a rule-based
process and you just want to figure out what the rules are, the current
implementation should work fine for you.
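
If you want a rough sense of how well the unpruned tree generalizes, you
can hold out part of your data and score predictions against it. A
sketch, again assuming an @instances array of hashrefs shaped like the
add_instance() arguments:

    use List::Util qw(shuffle);

    # Train on roughly 80% of the data, evaluate on the remaining 20%.
    my @mixed = shuffle @instances;
    my $cut   = int(0.8 * @mixed);
    my @train = @mixed[0 .. $cut - 1];
    my @test  = @mixed[$cut .. $#mixed];
    die "Not enough instances to hold out a test set\n" unless @test;

    my $dtree = AI::DecisionTree->new;
    $dtree->add_instance(%$_) for @train;
    $dtree->train;

    my $correct = 0;
    for my $inst (@test) {
        my $guess = $dtree->get_result(attributes => $inst->{attributes});
        $correct++ if defined $guess && $guess eq $inst->{result};
    }
    printf "Held-out accuracy: %.1f%% (%d/%d)\n",
           100 * $correct / @test, $correct, scalar @test;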

AUTHOR
Ken Williams, ken@mathforum.org

SEE ALSO
Mitchell, Tom (1997). Machine Learning. McGraw-Hill. pp 52-80.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning,
1(1), pp 81-106.

the perl manpage.


  • Gidon at Jun 9, 2002 at 4:08 pm
    I ran the code from the Synopsis. And I printed the result
    and the result was Yes. But it seems to me that in the test
    case two out of three of the values are those of NO Tennis
    days, so I figured that it should have responded NO.



    I also had the following warnings.

    Scalar value @count{$_} better written as $count{$_} at
    /usr/local/lib/perl5/site_perl/5.6.1/AI/DecisionTree.pm line 95.
    Use of uninitialized value in numeric le (<=) at
    /usr/local/lib/perl5/site_perl/5.6.1/AI/DecisionTree.pm line 87.


    Looks like an interesting module. Thanks.
    On Sun, 9 Jun 2002, Ken Williams wrote:

    [...]
  • Ken Williams at Jun 10, 2002 at 2:49 am

    On Monday, June 10, 2002, at 01:04 AM, gidon wrote:
    I ran the code from the Synopsis. And I printed the result
    and the result was Yes. But it seems to me that in the test
    case two out of three of the values are those of NO Tennis
    days, so I figured that it should have responded NO.
    Well, two out of three are for YES days too. =)

    Actually, you can't really tell from the given training code
    what the result should be - it depends on what gets filled in
    for the "... repeat for several more instances" section. There
    might be several other instances that give you more information
    about how the tree should be constructed. With just two
    training instances, it's not going to be able to construct a
    very interesting or useful tree.

    For a better set of test data, have a peek inside the 'test.pl' file.
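
    For illustration, a fuller training set along the lines of Mitchell's
    tennis examples might look like this (a sketch only, not the actual
    contents of test.pl):

      my $dtree = AI::DecisionTree->new;

      my @data = (
          [ { outlook => 'sunny',    humidity => 'high',   wind => 'weak'   }, 'no'  ],
          [ { outlook => 'sunny',    humidity => 'normal', wind => 'strong' }, 'yes' ],
          [ { outlook => 'overcast', humidity => 'high',   wind => 'weak'   }, 'yes' ],
          [ { outlook => 'rain',     humidity => 'high',   wind => 'weak'   }, 'yes' ],
          [ { outlook => 'rain',     humidity => 'normal', wind => 'strong' }, 'no'  ],
      );

      $dtree->add_instance(attributes => $_->[0], result => $_->[1]) for @data;
      $dtree->train;

      # Since the tree completely fits its training data, an instance it
      # has already seen gets back its training result -- here 'no'.
      print $dtree->get_result(
          attributes => { outlook => 'sunny', humidity => 'high', wind => 'weak' }
      ), "\n";
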
    I also had the following warnings.

    Scalar value @count{$_} better written as $count{$_} at
    /usr/local/lib/perl5/site_perl/5.6.1/AI/DecisionTree.pm line 95.
    Use of uninitialized value in numeric le (<=) at
    /usr/local/lib/perl5/site_perl/5.6.1/AI/DecisionTree.pm line 87.
    Thanks, the first one is a brain-o that I've just fixed. The
    second one was only slightly less stupid, fixed as well.

    If I make any other more substantial changes in the next few
    days I'll upload these fixes with a new version, or maybe just
    release a new version anyway.

    -Ken
  • Elias Assmann at Jun 9, 2002 at 4:44 pm
    On Sun, 9 Jun 2002, Ken Williams wrote:

    Hi,

    I can't really say anything interesting about your module, but I do
    think I've spotted a mistake in your documentation. Observe:
                OUTLOOK
                /  |  \
               /   |   \
              /    |    \
        sunny/  overcast \rainy
            /      |      \
       HUMIDITY    |      WIND
        /  \     *no*     /  \
       /    \            /    \
  high/      \normal    /      \
     /        \  strong/        \weak
  *no*      *yes*     /          \
                   *no*         *yes*
    In this beautifully-drawn ASCII-tree, the result for outlook =
    overcast is no.
    if outlook='overcast' -> 'yes'
    But this rule states the opposite, as the training example also
    seems to do.

    HTH,

    Elias

    --
    Prison for Hans Mustermann for forgery and treason

    -- Die Einstürzenden Neubauten, Was ist ist
  • Ken Williams at Jun 10, 2002 at 2:33 am

    On Monday, June 10, 2002, at 02:44 AM, Elias Assmann wrote:

    Hi,

    I can't really say anything interesting about your module, but I do
    think I've spotted a mistake in your documentation. Observe:

    On Sun, 9 Jun 2002, Ken Williams wrote:


    [...]
    In this beautifully-drawn ASCII-tree, the result for outlook =
    overcast is no.
    Ah - I made a mistake in drawing the tree. It should have been
    'yes'. You've got eagle eyes. =)

    -Ken
  • Steffen Mueller at Jun 10, 2002 at 2:34 pm
    "Ken Williams" <kenw@ee.usyd.edu.au> schrieb im Newsbeitrag
    news:CBAB40C7-7B76-11D6-966F-0003936C1626@ee.usyd.edu.au...
    I've just uploaded a Decision Tree module to CPAN. It provides
    a pretty bare-bones implementation of the classic ID3 decision
    tree building algorithm.
    Very interesting module. I will play with it in the near future :)
    LIMITATIONS
    A few limitations exist in the current version. All of them could be
    removed in future versions - especially with your help. =)
    I hope you count feedback as such!
    No continuous attributes

    In the current implementation, only discrete-valued attributes are
    supported. This means that an attribute like "temperature" can have
    values like "cool", "medium", and "hot", but using actual
    temperatures
    like 87 or 62.3 is not going to work.
    [...]

    I guess you could have a look at Robert Rothenberg's Tie::RangeHash module.
    It should help a lot for numerics. Maybe you can just recommend its use in
    the docs of your module.

    Steffen
    --
    @n=(544290696690,305106661574,116357),$b=16,@c=' ,JPacehklnorstu'=~
    /./g;for$n(@n){map{$h=int$n/$b**$_;$n-=$b**$_*$h;$c[@c]=$h}c(0..9);
    push@p,map{$c[$_]}@c[c($b..$#c)];$#c=$b-1}print@p;sub'c{reverse @_}
