I have been saving the spam email sent to me so that one day
I could use a module like this to detect it.
I could make available to anyone a *.csv (comma separated values) file
that contains about 440 spam messages. It has a total of
about 27000 lines and is 1.1 MB (400 KB compressed with zip).
This file would need to be complemented by a file of the same size,
or ideally even larger, that contains non-spam. I could not
distribute a non-spam file as most messages are company confidential.
Anyway each person would want to have their own non-spam file,
as I get messages about databases, XML, etc. and other people
would get other messages on other topics.
I want to pursue how to incorporate your code into an actual
solution. I have some theoretical questions, and some practical
ones.
* If there are only two categories (spam vs. non-spam)
is there some special algorithm that is appropriate?
* Actually a high fraction of the spam messages are in Spanish.
I could manually separate these out very quickly.
Would it help improve performance (i.e. better F1 score)
to have these in a separate category?
* An easy way to detect these Spanish messages is to look for
the Perl pattern / esta/i
But I am concerned that the strength of this predictor would
be "diluted" due to the many word forms.
* Probably the best way to detect spam is to look for a number
in the subject line, e.g., FREE Life Insurance Quotes 10077
However I suspect your code would treat all these numbers as
different words and so not notice the pattern. It seems desirable
to first transform the input in certain ways. I might want to
transform strings such as 19\d\d and 20\d\d to e.g. the dummy word
_date and then transform all other numbers with 4 or more digits
to e.g. _number. Then a very strong predictor of spam is
_number in the subject.
* Similarly I want to map punctuation to pseudo-words, e.g. any string
of more than one consecutive ! character would become _bang.
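The transformations described in the last two items could be sketched roughly as follows. This is only an illustration, not part of AI::Categorize: the function name normalize_text is hypothetical, and the choice to require a non-digit on either side of a number is my own assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorize): rewrite text so that
# years, long numbers, and runs of "!" become pseudo-words.
sub normalize_text {
    my ($text) = @_;

    # Four-digit years 19xx and 20xx become _date. The lookarounds keep
    # the pattern from matching inside a longer number such as 120015.
    $text =~ s/(?<!\d)(?:19|20)\d\d(?!\d)/ _date /g;

    # Any remaining number with 4 or more digits becomes _number.
    $text =~ s/(?<!\d)\d{4,}(?!\d)/ _number /g;

    # Two or more consecutive "!" characters become _bang.
    $text =~ s/!{2,}/ _bang /g;

    # Collapse the extra whitespace introduced by the substitutions.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;

    return $text;
}

print normalize_text('FREE Life Insurance Quotes 10077'), "\n";
```

Running the text through a step like this before training and before categorizing would let the categorizer see one _number token instead of many distinct numbers.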
Some additional background: My company uses Microsoft
Outlook, which has a "rules wizard" with some limited ability
to route mail to different folders, based on who it is from, words
in the subject etc. I am currently using Outlook 98, but plan
to go to Outlook 2000 soon.
* I would like to get a list of the words most likely to be associated
with a category. Can I get this from your code? How?
E.g. for the spam category I expect to find Britney, free, etc.
This is very important because Outlook rules can move mail based
on words. I would be willing to move any email containing
"Britney" to a spam_probably folder.
* Does Outlook 2000 add much in the way of what the rules
pay attention to?
* Ideally I could set up my mail system to make a call to
some external program, and it would return a category.
Is this possible to do in the Outlook client,
or in the Exchange server?
* Can Outlook filter messages based on the domain of the sender?
My first filter would accept anything sent from within the company,
or from certain known outside addresses. Then I would
categorize all the remaining messages into spam vs non-spam.
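To make concrete what I mean by a list of the words most associated with a category, here is a rough sketch that works independently of AI::Categorize. I do not know whether the module exposes anything like this. The helper names document_frequency and top_spam_words are hypothetical, and the scoring (log of a smoothed ratio of document fractions) is just one simple choice I am assuming for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count, for each word, how many documents contain it.
sub document_frequency {
    my (@docs) = @_;
    my %df;
    for my $doc (@docs) {
        my %seen;
        # Lowercase and split on anything that is not a word character.
        $seen{$_} = 1 for grep { length } split /\W+/, lc $doc;
        $df{$_}++ for keys %seen;
    }
    return \%df;
}

# Hypothetical helper: rank the words seen in spam by the log of the ratio
# between the smoothed fraction of spam documents and of non-spam documents
# that contain them. Add-one smoothing avoids division by zero.
sub top_spam_words {
    my ($spam, $ham, $n) = @_;
    my $spam_df = document_frequency(@$spam);
    my $ham_df  = document_frequency(@$ham);
    my %score;
    for my $word (keys %$spam_df) {
        my $p_spam = ($spam_df->{$word} + 1) / (@$spam + 2);
        my $p_ham  = (($ham_df->{$word} || 0) + 1) / (@$ham + 2);
        $score{$word} = log($p_spam / $p_ham);
    }
    # Highest score first; ties are broken alphabetically.
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    $n = @ranked if $n > @ranked;
    return @ranked[0 .. $n - 1];
}

# Made-up sample data, only to show the calling convention.
my @spam_docs = ('free Britney pictures', 'Britney free offer');
my @ham_docs  = ('free lunch Friday', 'database schema review');
print join(', ', top_spam_words(\@spam_docs, \@ham_docs, 3)), "\n";
```

The top few words from such a list are what I would paste into the Outlook rules.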
Hopefully helpfully yours,
Steve
--
Steven Tolkin [email protected] 617-563-0516
Fidelity Investments 82 Devonshire St. V10D Boston MA 02109
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.
-----Original Message-----
From: Ken Williams
Sent: Tuesday, June 19, 2001 5:14 PM
To: [email protected]
Subject: AI::Categorize slides from YAPC online
Hi perl-ai list,
The slides from my YAPC talk on AI::Categorize are online now, at:
http://mathforum.com/~ken/categorize/
Please take a look if you're interested. The same slides will be
available on www.yapc.org when Kevin has time to put them there.
Several people at the talk expressed interest in helping with
development of AI::Categorize:: modules. Here are my thoughts:
* If you want to implement a new algorithm (besides NaiveBayes and kNN,
which I've already done), just go ahead and do it and release to CPAN.
You don't need to discuss it with me unless you want to. The modules
should be in the AI::Categorize:: namespace, and subclasses of
AI::Categorize.
* Discussions & announcements should take place on this list, so that
people with more knowledge than me can chime in. If the traffic gets
too much, we can split off to a new list. But at least for a while, it
would be nice to get some meat into the perl-ai list archives. =) Let's
post often, as I'm sure there's a lot of knowledge people have to share,
as well as a lot of people who'd like to listen.
* If anyone has additions/changes/fixes to the existing modules, don't
hold them back. For example, there was discussion of adding stuff to
reduce the feature sets (number of words considered important) by
looking at their cross-entropy, and I'd like to get that in there.
As I mentioned at the talk, the main reason I created this namespace and
released the initial stuff was to jumpstart community efforts in this
area. It seemed strange that there wasn't anything on CPAN to do this
kind of NLP stuff, when Perl seems so well-known in the NLP community.
So I hope there will be interest from people on this list (and that the
interested people from YAPC are indeed subscribed!).
------------------- -------------------
Ken Williams Last Bastion of Euclidity
ke[email protected] The Math Forum