Reducing feature sets with cross-entropy
perl-ai @ perl.org, July 2001
Hi,

I talked with a couple of people at YAPC about a couple of the
AI::Categorize algorithms, and one suggestion was to use cross-entropy
measurements to reduce the number of features (words) considered. I'm
hoping that by now (now that I've got time to do some work) the right
people have subscribed to this list, and can point me either to a good
reference or discuss the ideas themselves. True?


------------------- -------------------
Ken Williams Last Bastion of Euclidity
ken@forum.swarthmore.edu The Math Forum

  • Lee Goddard at Jul 6, 2001 at 9:17 am
    What do you mean by 'cross-entropy measurements'? Could you be more explicit?

    Lee
    ---
    Obligatory perl schmutter .sig:
    perl -e "print chr(rand>.5?92:47) while 1"
    -----Original Message-----
    From: Ken Williams
    Sent: 06 July 2001 05:29
    To: perl-ai@perl.org
    Subject: Reducing feature sets with cross-entropy


    Hi,

    I talked with a couple of people at YAPC about a couple of the
    AI::Categorize algorithms, and one suggestion was to use cross-entropy
    measurements to reduce the number of features (words) considered. I'm
    hoping that by now (now that I've got time to do some work) the right
    people have subscribed to this list, and can point me either to a good
    reference or discuss the ideas themselves. True?


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
  • John Porter at Jul 6, 2001 at 4:04 pm

    Ken Williams wrote:
    one suggestion was to use cross-entropy
    measurements to reduce the number of features (words) considered.
    Um, have you tried a web search? Seems to me there's a fair
    amount of info out there...

    --
    John Porter
  • Ken Williams at Jul 6, 2001 at 4:18 pm

    John Porter wrote:
    Ken Williams wrote:
    one suggestion was to use cross-entropy
    measurements to reduce the number of features (words) considered.
    Um, have you tried a web search? Seems to me there's a fair
    amount of info out there...
    At YAPC, it was decided that this list would be the proper place to
    discuss collaboration on the AI::Categorize modules. We decided that
    because there would perhaps be interested people on this list who
    wouldn't otherwise know that the work is going on, and also because it
    seems like exactly the kind of thing that the list was created for. If
    discussion gets out of control (which would probably be more than the
    current traffic of 1-3 messages per week), we can fork it off to a new
    list.

    So with my original message, I'm mainly just trying to see whether
    anyone's interested in discussing the ideas. And I'm willing to do some
    reading and searching - Per Jambeck has pointed me to a comparative
    article by Yiming Yang that looks like a good place to start.

    You're right that there are a lot of resources to be found in a web
    search, but most of it is about very specific applications - perhaps
    introductory material is best found in a textbook.


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
  • Ken Williams at Jul 6, 2001 at 4:30 pm

    Ken Williams wrote:
    You're right that there are a lot of resources to be found in a web
    search, but most of it is about very specific applications - perhaps
    introductory material is best found in a textbook.
    ...speaking of which, is anyone familiar with Thomas M. Mitchell's book
    "Machine Learning"? It has only positive reviews on Amazon, but I'm not
    sure whether that's reliable.


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
  • Probonas Vasilis at Jul 6, 2001 at 4:46 pm
    I have personally used this book as an introduction to machine learning
    theory. It is a good book, and I think the author has set up a web site
    with material supplementing the book.

    Since it is a matter of personal 'taste', just take the opportunity to
    browse the book if your local library holds a copy.

    ______________________________________________________
    Vasilis Promponas
    Department of Cell Biology and Biophysics
    Faculty of Biology
    University of Athens
    GR-15701 Greece
    e-mail: vprobon@cc.uoa.gr
    tel: +30-1-7274611
    On Fri, 6 Jul 2001, Ken Williams wrote:

    ken@forum.swarthmore.edu (Ken Williams) wrote:
    You're right that there are a lot of resources to be found in a web
    search, but most of it is about very specific applications - perhaps
    introductory material is best found in a textbook.
    ...speaking of which, is anyone familiar with Thomas M. Mitchell's book
    "Machine Learning"? It has only positive reviews on Amazon, but I'm not
    sure whether that's reliable.


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
  • Lee Jones at Jul 6, 2001 at 4:52 pm
    I was in Tom Mitchell's graduate machine learning class at CMU when he was
    writing the book. We were working off of 'chapters in progress', but at
    the time I thought what was there was great. It hits most of the major
    topics in machine learning and gives algorithmic outlines in pseudocode.
    Each topic section was enough to understand the higher-level concepts,
    and the references section pointed to the specifics. Again,
    I thought it was a great survey book.

    One side note, though, is that Mitchell isn't that hot on genetic
    algorithms as a field of machine learning so those chapters weren't given
    quite the same attention as the others (at least during the course).

    One more side note: some of Mitchell's work at the time was on text
    processing and classification on usenet and the web, so if you want to
    check some of his papers, they may be of interest to what you are doing.

    -lee

    On Fri, 6 Jul 2001, Ken Williams bestowed the following wisdom:
    ken@forum.swarthmore.edu (Ken Williams) wrote:
    You're right that there are a lot of resources to be found in a web
    search, but most of it is about very specific applications - perhaps
    introductory material is best found in a textbook.
    ...speaking of which, is anyone familiar with Thomas M. Mitchell's book
    "Machine Learning"? It has only positive reviews on Amazon, but I'm not
    sure whether that's reliable.


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
  • Nathan Torkington at Jul 6, 2001 at 8:02 pm

    ...speaking of which, is anyone familiar with Thomas M. Mitchell's book
    "Machine Learning"? It has only positive reviews on Amazon, but I'm not
    sure whether that's reliable.
    I have the book, and really really like it. I found it comprehensible
    and useful.

    Nat
  • Lenzo at Jul 6, 2001 at 5:44 pm
    It's a very good book, and Tom is a good teacher, too. He's here
    at CMU.

    kevin
    On Fri, Jul 06, 2001 at 11:30:54AM -0500, Ken Williams wrote:
    ken@forum.swarthmore.edu (Ken Williams) wrote:
    You're right that there are a lot of resources to be found in a web
    search, but most of it is about very specific applications - perhaps
    introductory material is best found in a textbook.
    ...speaking of which, is anyone familiar with Thomas M. Mitchell's book
    "Machine Learning"? It has only positive reviews on Amazon, but I'm not
    sure whether that's reliable.


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
  • Ken Williams at Jul 11, 2001 at 8:22 pm
    I had a chance last week to read Yiming Yang's paper on feature set
    reduction:

    http://www.cs.cmu.edu/~yiming/papers.yy/icml97.ps.gz

    It contains the startling conclusion that the single biggest factor in
    getting good results by reducing feature sets is to keep frequently-used
    features (after getting rid of a stopword set) and throw away rare
    features. This algorithm was called "Document Frequency", because each
    term's "frequency" is defined as the number of corpus documents in which
    the term appears.

    The paper compared five different reduction algorithms:
    * Document Frequency (DF) - see above
    * Information Gain (IG) - an entropy-based method (paraphrased just below)
    * Chi-squared Measure (CHI) - statistical correlations
    * Term Strength (TS) - uses similar-document clustering
    * Mutual Information (MI) - a term-category correlation formula
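
    (To paraphrase the entropy-based IG criterion from my own reading, rather
    than quoting the paper: for a term t and categories c_1 .. c_m,

        IG(t) = - \sum_i P(c_i) \log P(c_i)
                + P(t)       \sum_i P(c_i | t)       \log P(c_i | t)
                + P(\bar{t}) \sum_i P(c_i | \bar{t}) \log P(c_i | \bar{t})

    i.e. how much knowing whether t occurs in a document reduces our
    uncertainty about the document's category. Reduction then keeps the
    highest-scoring terms.)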

    The findings of the paper were that DF, IG, and CHI were roughly
    equivalent when eliminating up to 95% of the features, TS dropped
    sharply when over 50% of the features were eliminated, and MI did quite
    poorly overall.

    Based on these findings, I decided to implement a simple DF scheme in
    AI::Categorize rather than working on an entropy solution. I've hacked
    out the DF code, and now I'm working on documenting it and making sure
    it works. I've also decided to put in some time making improvements to
    the AI::Categorize::Evaluate package, so that I can tell how the changes
    affect accuracy and speed. Theoretically, both should go up.
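
    In case a concrete example helps, here's roughly what the DF step boils
    down to in plain Perl (a toy sketch with made-up data, not the actual
    AI::Categorize code):

        # Toy sketch of DF-based feature reduction -- not the real module.
        use strict;

        # Words we never count as features.
        my %stopwords = map { $_ => 1 } qw(the a an and of to is);

        # Document Frequency: for each term, the number of documents it
        # appears in (each document adds at most 1 to a term's count).
        sub document_frequency {
            my @documents = @_;
            my %df;
            foreach my $doc (@documents) {
                my %seen;
                foreach my $word (map { lc } $doc =~ /(\w+)/g) {
                    next if $stopwords{$word} || $seen{$word}++;
                    $df{$word}++;
                }
            }
            return \%df;
        }

        # Throw away terms that appear in fewer than $min_df documents.
        sub reduce_features {
            my ($df, $min_df) = @_;
            return [ grep { $df->{$_} >= $min_df } keys %$df ];
        }

        my @corpus = (
            'The cat sat on the mat',
            'The dog sat on the log',
            'Cats and dogs',
        );
        my $df   = document_frequency(@corpus);
        my $keep = reduce_features($df, 2);
        print join(', ', sort @$keep), "\n";    # prints: on, sat

    The appeal, as the paper suggests, is that this is a single counting pass
    over the corpus plus a threshold - no category information is needed at
    all.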

    I hope to release an updated version of the modules soon.

    BTW, I'm still hoping someone wants to implement other AI::Categorize::
    modules!


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
  • Tom Fawcett at Jul 12, 2001 at 5:36 pm

    Ken Williams wrote:
    I had a chance last week to read Yiming Yang's paper on feature set
    reduction:

    http://www.cs.cmu.edu/~yiming/papers.yy/icml97.ps.gz

    It contains the startling conclusion that the single biggest factor in
    getting good results by reducing feature sets is to keep frequently-used
    features (after getting rid of a stopword set) and throw away rare
    features. This algorithm was called "Document Frequency", because each
    term's "frequency" is defined as the number of corpus documents in which
    the term appears.
    [etc.]

    Just a casual comment on this. There has been a fair amount of work on text
    classification in the past few years, comparing different representations and
    algorithms. I wouldn't take any individual study's conclusions as definitive,
    since various papers have conflicting conclusions. As one example, most
    people think stopword elimination and stemming are effective, but Riloff makes
    a case against doing them:

    http://citeseer.nj.nec.com/riloff97little.html

    I have no reason to question Yang's results; I'm just pointing out that text
    classification is a big ball of wax.

    Regards,
    -Tom
  • Ken Williams at Jul 12, 2001 at 5:49 pm

    Tom Fawcett wrote:
    Just a casual comment on this. There has been a fair amount of work on text
    classification in the past few years, comparing different representations and
    algorithms. I wouldn't take any individual study's conclusions as definitive,
    since various papers have conflicting conclusions. As one example, most
    people think stopword elimination and stemming are effective, but Riloff makes
    a case against doing them:

    http://citeseer.nj.nec.com/riloff97little.html

    I have no reason to question Yang's results; I'm just pointing out that text
    classification is a big ball of wax.
    Point taken. =) The other main reason I started with Document Frequency
    as the measure of feature quality is that it's easy to understand and
    easy to do. I still do want to evaluate the other methods, if for no
    other reason than to learn their particularities.


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
