Dear Ken et al.,
I have been saving the spam email sent to me so that one day
I could use a module like this to detect it.
I could make available to anyone a *.csv (comma separated values) file
that contains about 440 spam messages. It has a total of
about 27000 lines and is 1.1 MB (400 KB compressed with zip).

This file would need to be complemented by a file of the same size,
or ideally even larger, that contains non-spam. I could not
distribute a non-spam file as most messages are company confidential.
Anyway each person would want to have their own non-spam file,
as I get messages about databases, XML, etc. and other people
would get other messages on other topics.

I want to pursue how to incorporate your code into an actual
solution. I have some theoretical questions, and some practical
ones.

* If there are only two categories (spam vs. non-spam)
is there some special algorithm that is appropriate?

* Actually a high fraction of the spam messages are in Spanish.
I could manually separate these out very quickly.
Would it help improve performance (i.e. better F1 score)
to have these in a separate category?

* An easy way to detect these Spanish messages is to look for
the Perl pattern / esta/i
But I am concerned that the strength of this predictor would
be "diluted" due to the many word forms.

* Probably the best way to detect spam is to look for a number
in the subject line, e.g., FREE Life Insurance Quotes 10077
However I suspect your code would treat all these numbers as
different words and so not notice the pattern. It seems desirable
to first transform the input in certain ways. I might want to
transform strings such as 19\d\d and 20\d\d to e.g. the dummy word
_date and then transform all other numbers with 4 or more digits
to e.g. _number. Then a very strong predictor of spam is
_number in the subject.

* Similarly I want to map punctuation to pseudo-words, e.g. any string
of more than one consecutive ! character would become _bang.
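
The transformations proposed in the last two items can be sketched in a few lines of Perl. This is only an illustration: normalize_text is a hypothetical helper, not part of AI::Categorize, and it assumes the categorizer's tokenizer keeps a leading underscore as part of a word.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite years, long numbers, and runs of "!" as
# pseudo-words so that a word-based categorizer sees one feature for each.
sub normalize_text {
    my ($text) = @_;
    # Four-digit years from 1900 to 2099 become _date.
    $text =~ s/\b(?:19|20)\d\d\b/ _date /g;
    # Any remaining run of four or more digits becomes _number.
    $text =~ s/\b\d{4,}\b/ _number /g;
    # Two or more consecutive "!" characters become _bang.
    $text =~ s/!{2,}/ _bang /g;
    # Collapse the extra whitespace introduced above.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

print normalize_text("FREE Life Insurance Quotes 10077"), "\n";
# prints: FREE Life Insurance Quotes _number
```

Running every message through such a filter, both before training and before categorizing, would let the categorizer count _number as a single feature instead of many distinct numbers.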

Some additional background: My company uses Microsoft
Outlook, which has a "rules wizard" with some limited ability
to route mail to different folders, based on who it is from, words
in the subject etc. I am currently using Outlook 98, but plan
to go to Outlook 2000 soon.

* I would like to get a list of the words most likely to be associated
with a category. Can I get this from your code? How?
E.g. for the spam category I expect to find Britney, free, etc.
This is very important because Outlook rules can move mail based
on words. I would be willing to move any email containing
"Britney" to a spam_probably folder.

* Does Outlook 2000 add much in the way of what the rules
pay attention to?

* Ideally I could set up my mail system to make a call to
some external program, and it would return a category.
Is this possible to do in the Outlook client,
or in the Exchange server?

* Can Outlook filter messages based on the domain of the sender?
My first filter would accept anything sent from within the company,
or from certain known outside addresses. Then I would
categorize all the remaining messages into spam vs non-spam.
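
Outside of Outlook, the first filter described in the last item could look roughly like the sketch below. The domain and address lists are placeholders, is_trusted_sender is a hypothetical name, and the From header is assumed to be available as a plain string. Since From headers can be forged, this is a convenience filter and not a security check.

```perl
use strict;
use warnings;

# Placeholder whitelists; substitute the real company domain and contacts.
my %trusted_domain  = map { lc($_) => 1 } qw(example.com);
my %trusted_address = map { lc($_) => 1 } qw(friend@example.org);

# Return 1 if the From header names a whitelisted address or domain, else 0.
sub is_trusted_sender {
    my ($from) = @_;
    # Pull the address out of "Name <user@host>" or a bare "user@host".
    my ($address) = $from =~ /([^\s<>"]+\@[^\s<>"]+)/ or return 0;
    $address = lc $address;
    return 1 if $trusted_address{$address};
    # Exact domain match only; subdomains are not treated as trusted.
    my ($domain) = $address =~ /\@(.+)$/;
    return $trusted_domain{$domain} ? 1 : 0;
}

for my $from ('Ken <ken@example.com>', 'offers@bulk.example.net') {
    print "$from: ", (is_trusted_sender($from) ? "accept" : "categorize"), "\n";
}
```

Only the messages that fall through to "categorize" would then be handed to the spam vs. non-spam categorizer.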

Hopefully helpfully yours,
Steve
--
Steven Tolkin steve.tolkin@fmr.com 617-563-0516
Fidelity Investments 82 Devonshire St. V10D Boston MA 02109
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.
-----Original Message-----
From: Ken Williams
Sent: Tuesday, June 19, 2001 5:14 PM
To: perl-ai@perl.org
Subject: AI::Categorize slides from YAPC online


Hi perl-ai list,

The slides from my YAPC talk on AI::Categorize are online now, at:

http://mathforum.com/~ken/categorize/

Please take a look if you're interested. The same slides will be
available on www.yapc.org when Kevin has time to put them there.

Several people at the talk expressed interest in helping with
development of AI::Categorize:: modules. Here are my thoughts:

* If you want to implement a new algorithm (besides NaiveBayes and kNN,
which I've already done), just go ahead and do it and release to CPAN.
You don't need to discuss it with me unless you want to. The modules
should be in the AI::Categorize:: namespace, and subclasses of
AI::Categorize.

* Discussions & announcements should take place on this list, so that
people with more knowledge than me can chime in. If the traffic gets
too much, we can split off to a new list. But at least for a while, it
would be nice to get some meat into the perl-ai list archives. =) Let's
post often, as I'm sure there's a lot of knowledge people have to share,
as well as a lot of people who'd like to listen.

* If anyone has additions/changes/fixes to the existing modules, don't
hold them back. For example, there was discussion of adding stuff to
reduce the feature sets (number of words considered important) by
looking at their cross-entropy, and I'd like to get that in there.

As I mentioned at the talk, the main reason I created this namespace and
released the initial stuff was to jumpstart community efforts in this
area. It seemed strange that there wasn't anything on CPAN to do this
kind of NLP stuff, when Perl seems so well-known in the NLP community.
So I hope there will be interest from people on this list (and that the
interested people from YAPC are indeed subscribed!).


------------------- -------------------
Ken Williams Last Bastion of Euclidity
ken@forum.swarthmore.edu The Math Forum


  • Lee Goddard at Jun 20, 2001 at 3:13 pm

    > Dear Ken et al.,
    > I have been saving the spam email sent to me so that one day
    > I could use a module like this to detect it.
    > I could make available to anyone a *.csv (comma separated values) file
    > that contains about 440 spam messages. It has a total of
    > about 27000 lines and is 1.1 MB (400 KB compressed with zip).
    PLEASE don't send it to the list!
    > This file would need to be complemented by a file of the same size,
    > or ideally even larger, that contains non-spam. I could not
    > distribute a non-spam file as most messages are company confidential.
    Could you use a Majordomo or similar list archive? If not, I have a year
    or so worth of mail from various lists.

    > I want to pursue how to incorporate your code into an actual
    > solution. I have some theoretical questions, and some practical
    > ones.
    Hasn't it been done, in the AI::* space, using a different technique?
    That's at least the only reason I didn't post the same message as you ;)

    Sorry, nothing else useful to say.

    lee

    Obligatory perl schmutter:
    perl -e "while (1){rand>0.5 ? print'\\' : print'/'}"
  • Ken Williams at Jun 20, 2001 at 4:12 pm

    Tolkin, Steve wrote:
    > * If there are only two categories (spam vs. non-spam)
    > is there some special algorithm that is appropriate?
    That's only one category, actually, and each message is either in the
    category or not. There may be special things to do when there's only
    one category, but so far I don't know them.
    > * Actually a high fraction of the spam messages are in Spanish.
    > I could manually separate these out very quickly.
    > Would it help improve performance (i.e. better F1 score)
    > to have these in a separate category?
    It's possible. The best way is to try both ways and then see which is
    better. For that, you need to have a big enough corpus that you can
    train on one portion, and test on another.
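
A minimal sketch of that comparison in plain Perl: hold part of the corpus back, categorize it, and compute F1 for each setup. Both split_corpus and f1_score are hypothetical helpers, not part of AI::Categorize, and labels are assumed to be 1 for spam and 0 for non-spam.

```perl
use strict;
use warnings;

# Split a list of documents into a training part and a held-out test part.
# Shuffle first (e.g. with List::Util's shuffle) so both parts are mixed.
sub split_corpus {
    my ($docs, $train_fraction) = @_;
    my $cut = int(@$docs * $train_fraction);
    return ([ @$docs[0 .. $cut - 1] ], [ @$docs[$cut .. $#$docs] ]);
}

# F1 for one category, from parallel lists of actual and predicted labels.
sub f1_score {
    my ($actual, $predicted) = @_;
    my ($tp, $fp, $fn) = (0, 0, 0);
    for my $i (0 .. $#$actual) {
        if    ( $predicted->[$i] &&  $actual->[$i]) { $tp++ }   # true positive
        elsif ( $predicted->[$i] && !$actual->[$i]) { $fp++ }   # false positive
        elsif (!$predicted->[$i] &&  $actual->[$i]) { $fn++ }   # false negative
    }
    return 0 if $tp == 0;
    my $precision = $tp / ($tp + $fp);
    my $recall    = $tp / ($tp + $fn);
    return 2 * $precision * $recall / ($precision + $recall);
}

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
printf "%.3f\n", f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]);
# prints: 0.667
```

Training once with the Spanish messages in their own category and once without, then comparing the two F1 values on the same held-out portion, shows which setup does better on this particular corpus.
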
    > * An easy way to detect these Spanish messages is to look for
    > the Perl pattern / esta/i
    > But I am concerned that the strength of this predictor would
    > be "diluted" due to the many word forms.
    Not to mention the false positives, which you want to avoid in an
    application like this.
    > * Probably the best way to detect spam is to look for a number
    > in the subject line, e.g., FREE Life Insurance Quotes 10077
    This might be an effective way to recognize spam, but it's different
    from the two existing AI::Categorize:: classes in that it involves a
    hand-written rule.
    > * I would like to get a list of the words most likely to be associated
    > with a category. Can I get this from your code? How?
    > E.g. for the spam category I expect to find Britney, free, etc.
    > This is very important because Outlook rules can move mail based
    > on words. I would be willing to move any email containing
    > "Britney" to a spam_probably folder.
    So far there are no hooks for that. You could examine the
    AI::Categorize::NaiveBayes data structures and find the biggest log-prob
    numbers (they're all negative) for a crude measure. It's probably also
    worth looking at the cross-entropy, which is on the todo list but not
    implemented yet.
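
As a rough sketch of the crude measure, assume the log-probabilities for one category have been copied into a plain hash of word => log-prob. That layout is an assumption made for illustration; the module's internal data structures may be organized differently, top_words is a hypothetical helper, and the numbers below are made up.

```perl
use strict;
use warnings;

# Return the $n words with the largest (least negative) log-probabilities.
sub top_words {
    my ($logprob, $n) = @_;
    # Sort by descending log-prob, breaking ties alphabetically.
    my @sorted = sort { $logprob->{$b} <=> $logprob->{$a} || $a cmp $b }
                 keys %$logprob;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Made-up values, for illustration only.
my %spam_logprob = (free => -2.1, britney => -3.4, database => -9.7, xml => -10.2);
print join(", ", top_words(\%spam_logprob, 2)), "\n";
# prints: free, britney
```

The measure is crude because words that are common in every category also get large log-probabilities, so a word's rank within one category alone says little about how well it separates spam from non-spam.
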
    > * Ideally I could set up my mail system to make a call to
    > some external program, and it would return a category.
    > Is this possible to do in the Outlook client,
    > or in the Exchange server?
    No idea, but let us know if you find out it can.


    ------------------- -------------------
    Ken Williams Last Bastion of Euclidity
    ken@forum.swarthmore.edu The Math Forum
