Hi,

thought you might be interested in http://nlp.petamem.com

The aim of this site is to provide a nice portal with various NLP
services. The backend is mostly pure Perl, the frontend is
mod_perl2. It's still in development, doesn't contain many features
yet and has some flaws - but we're working on it.

--
best regards,

Dipl.-Inf. Richard Jelinek

- PetaMem s.r.o. - Ocelarska 1 - Prague - www.petamem.com -
-= 2026049 Mind Units =-

  • Tolkin, Steve at Apr 2, 2003 at 3:00 pm
    Can you say more about this?

    Is the source code available?

    How do you decide which diacritics to add?
    For example, both Mueller and Muller get the umlaut added
    on the "u".

    Do you know of code that removes diacritics in a reasonable
    way, e.g. for systems that can only handle ASCII?
    Ideally your approach to adding diacritics would be fully reversible
    when processing the unaccented words,
    but that is perhaps too idealistic.

    Hopefully helpfully yours,
    Steve
    --
    Steven Tolkin steve.tolkin@fmr.com 617-563-0516
    Fidelity Investments 82 Devonshire St. V4D Boston MA 02109
    There is nothing so practical as a good theory. Comments are by me,
    not Fidelity Investments, its subsidiaries or affiliates.

  • Richard Jelinek at Apr 2, 2003 at 3:32 pm

    On Wed, Apr 02, 2003 at 09:59:45AM -0500, Tolkin, Steve wrote:
    > Can you say more about this?

    "immer gerne" (German: "always glad to") :-)

    > Is the source code available?

    No, and especially not the NLP algorithms. The page is based on Yawps
    (http://yawps.sourceforge.net/), which can be downloaded, but that is
    only the framework for the site.

    We have extended it and will contribute back to Yawps (if that hasn't
    happened already - should check that with the developers).

    > How do you decide which diacritics to add?
    > For example, both Mueller and Muller get the umlaut added
    > on the "u".

    Yes. :-) And as there is no "Müeller", the results are correct - right?
    Basically we use some kind of expansion-reduction algorithm where we
    generate n hypotheses for a given "diacritics-less" word and then
    compare them with the statistical data we got from the analysis of large
    (and I mean large) corpora. Irrelevant hypotheses are either pruned,
    or the user is offered a choice.

    We plan to use the feedback from the choices made by users to improve
    our statistical data. Alas, large corpora aren't always good
    corpora, so there is some polluted data.

    > Do you know of code that removes diacritics in a reasonable
    > way, e.g. for systems that can only handle ASCII?

    This is trivial - we didn't dare to put this on the web. Ask Roman (rv
    instead of rj in my email address) - he will provide you with the snippet.

    > Ideally your approach to adding diacritics would be fully reversible
    > when processing the unaccented words,
    > but that is perhaps too idealistic.

    Well - for Czech it is fully reversible, but for German, as there may
    be groups of chars (ss -> ß, ue -> ü etc.) that are folded to one char
    only, the path isn't reversible anymore: Mueller -> Müller -> Muller.


    --
    best regards,

    Dipl.-Inf. Richard Jelinek

    - PetaMem s.r.o. - Ocelarska 1 - Prague - www.petamem.com -
    -= 2026049 Mind Units =-
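
The "expansion-reduction" idea described in the reply above can be sketched roughly as follows. This is only an illustration in Python, not the PetaMem code (which is not public): the substitution table, the frequency numbers, the pruning threshold and the function names `expand`, `prune` and `diacritize` are all assumptions made up for the example.

```python
from itertools import product

# Assumed substitution table, built from the pairs mentioned in the thread.
# "üe" is included on purpose so that a nonsense hypothesis gets generated
# and then pruned.
EQUIVALENTS = {
    "ue": ["ue", "ü", "üe"],
    "ss": ["ss", "ß"],
    "u": ["u", "ü"],
}

# Toy stand-in for corpus statistics: word -> frequency (invented numbers).
CORPUS_FREQ = {"Müller": 120, "Muller": 3, "Masse": 40, "Maße": 30}


def expand(word):
    """Expansion: generate every hypothesis the substitution table allows."""
    keys = sorted(EQUIVALENTS, key=len, reverse=True)  # try "ue" before "u"
    segments = []  # one list of alternatives per chunk of the word
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                segments.append(EQUIVALENTS[key])
                i += len(key)
                break
        else:
            segments.append([word[i]])  # no rule applies, keep the character
            i += 1
    return {"".join(parts) for parts in product(*segments)}


def prune(hypotheses, min_share=0.1):
    """Reduction: keep hypotheses seen in the corpus often enough."""
    scored = {h: CORPUS_FREQ[h] for h in hypotheses if h in CORPUS_FREQ}
    total = sum(scored.values())
    keep = [h for h, n in scored.items() if n / total >= min_share]
    return sorted(keep, key=lambda h: -scored[h])  # most frequent first


def diacritize(word):
    """Return ranked candidates: none, exactly one, or a choice for the user."""
    return prune(expand(word))
```

With this toy data, `diacritize("Mueller")` and `diacritize("Muller")` both return only `Müller`, while `diacritize("Masse")` returns two candidates, which is the case where a user would be offered a choice.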
  • PerlDiscuss - Perl Newsgroups and at Oct 21, 2004 at 11:31 am
    Hi,

    Steve Tolkin wrote:
    > Can you say more about this?

    I hope so. However, I will refer to the new functionality
    of the relaunched portal at nlp.petamem.com.

    > Is the source code available?

    Partly. Some of the underlying modules, e.g. for numeral
    conversion, are available on CPAN; other code is proprietary.

    > How do you decide which diacritics to add?
    > For example, both Mueller and Muller get the umlaut added
    > on the "u".

    Yes. And for Muller it may sometimes be wrong. It's
    a plain statistical process where a wordlist, taken
    from a corpus of the respective language (German in this case),
    is compared with the words given for diacritization.

    The system knows some equivalents for a given
    language, e.g. u<=>ü, ue<=>ü, "u<=>ü etc. Without any
    consideration of context, this can of course be wrong.

    The system then may or may not find a list of alternatives
    with diacritics, and offers these for the user to choose from.

    > Do you know of code that removes diacritics in a reasonable
    > way, e.g. for systems that can only handle ASCII?

    Yes. Have a look at the new portal. It does exactly this
    in the diacritics operations section. In fact there are
    now three modes of operation: "Choose", "Fit1st" and "Remove".

    > Ideally your approach to adding diacritics would be fully reversible,

    Yes. But we have a long way to go to achieve this.
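
To make the reversibility problem concrete, here is a small Python sketch of the round trip Mueller -> Müller -> Muller mentioned earlier in the thread. The fold table and the function names are assumptions for illustration only; how the portal's "Remove" mode actually works is not described in the thread.

```python
# Assumed fold table for the removal direction (German, lowercase only).
FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def remove_diacritics(word):
    """Fold each accented character to a plain ASCII replacement."""
    return word.translate(FOLD)


def add_diacritics(word):
    """Toy 'add' direction: apply only the ue -> ü rule, no corpus lookup."""
    return word.replace("ue", "ü")


def round_trip(word):
    """Add diacritics, then remove them again."""
    return remove_diacritics(add_diacritics(word))
```

Here `round_trip("Mueller")` yields `Muller`, not `Mueller`: once `ue` has been folded into the single character `ü`, the removal step cannot tell which spelling it came from.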
  • Tolkin, Steve at Apr 2, 2003 at 4:26 pm
    Dear Roman,
    I am interested in the code that removes diacritics.

    Although Richard said that this is "trivial", I am not so sure.
    In theory this should be driven by the character data
    in the Unicode database.
    It should remove the diacritics from Western languages,
    but I do not want to remove all combining characters; e.g. these
    should be preserved for the alphabetic languages of India.
    I do not know the best approach for Cyrillic-based languages.
    So a sketch of an algorithm would be to convert to NFD
    (Unicode Normalization Form D, canonical decomposition)
    and then remove the combining characters that follow
    (are combined with) characters on certain code pages only.
    One key question is: which code pages?
    And are there some code pages where some combining
    characters should be removed, but not others?

    But doing all this still might be "straightforward"
    and so I am curious as to how you do it.

    Thanks,
    Steve
    --
    Steven Tolkin steve.tolkin@fmr.com 617-563-0516
    Fidelity Investments 82 Devonshire St. V4D Boston MA 02109
    There is nothing so practical as a good theory. Comments are by me,
    not Fidelity Investments, its subsidiaries or affiliates.
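
The algorithm sketched in this message (decompose to NFD, then drop combining marks only after certain base characters) might look like this in Python. It is a hedged sketch, not Roman's snippet: the function name is made up, and "certain code pages" is interpreted here as "base characters whose Unicode name starts with LATIN", which is a simplifying assumption, since Python's standard library has no script property.

```python
import unicodedata


def strip_latin_diacritics(text):
    """NFD-decompose, then drop marks that follow a Latin base letter."""
    out = []
    base_is_latin = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):  # Mn, Mc or Me: a mark
            if base_is_latin:
                continue  # drop the diacritic on a Latin letter
            out.append(ch)  # keep marks on all other scripts
        else:
            # Assumption: the character name is used as a proxy for the script.
            base_is_latin = unicodedata.name(ch, "").startswith("LATIN")
            out.append(ch)
    # Recompose whatever was kept (e.g. Cyrillic letters with a breve).
    return unicodedata.normalize("NFC", "".join(out))
```

Under these assumptions "Müller" becomes "Muller" and "Černý" becomes "Cerny", while Devanagari and Cyrillic text passes through unchanged. Letters without a canonical decomposition, such as "ß" or "ø", are not touched by this approach and would still need an explicit mapping table for ASCII-only systems.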

  • Simon Cozens at Apr 2, 2003 at 9:29 pm

    Richard Jelinek:
    > thought you might be interested in http://nlp.petamem.com

    While we're talking about NLP sites, http://www.fieldmethods.net/ is
    well worth a look.

    --
    "He was a modest, good-humored boy. It was Oxford that made him insufferable."

Discussion Overview
group: ai @
categories: perl
posted: Apr 2, '03 at 2:32p
active: Oct 21, '04 at 11:31a
posts: 6
users: 4
website: perl.org
