On Wed, Apr 02, 2003 at 09:59:45AM -0500, Tolkin, Steve wrote:
Can you say more about this?
"Always a pleasure" :-)
Is the source code available?
No. Especially not the NLP algorithms. But the page is based on Yawps
(http://yawps.sourceforge.net/) which can be downloaded. But this is
only the framework for the site.
We have extended it and will contribute back to Yawps (if that hasn't
happened already - should check that with the developers).
How do you decide which diacritics to add?
For example both Mueller and Muller get the umlaut added
on the "u".
Yes. :-) And as there is no Müeller, the results are correct - right?
Basically we use some kind of expansion-reduction algorithm where we
generate n hypotheses of a given "diacritics-less" word and then
compare it with the statistical data we got from the analysis of large
(and I mean large) corpora. Irrelevant hypotheses are either pruned
out, or the user is offered a choice.
We plan to use the feedback from the choices made by users to improve
our statistical data. Alas, large corpora aren't always good corpora,
so some of the data is polluted.
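
To illustrate the expansion-reduction idea described above, here is a
minimal sketch. It is not our production code: the substitution table
VARIANTS, the frequency table FREQ, the threshold min_share and the
function names are all made up for the example, and the counts are
invented rather than taken from any corpus.

```python
# Hypothetical substitution table: an ASCII unit and the diacritic
# forms it may stand for. Matching is case-sensitive in this sketch.
VARIANTS = {
    "ue": ["ü"], "ae": ["ä"], "oe": ["ö"], "ss": ["ß"],
    "u": ["ü"], "a": ["ä"], "o": ["ö"],
}

# Invented word counts standing in for statistics from large corpora.
FREQ = {"Müller": 950, "Mueller": 40, "Muller": 10,
        "Masse": 500, "Maße": 400}


def expand(word):
    """Expansion: generate every hypothesis for a diacritics-less word."""
    if not word:
        return {""}
    # Always allow keeping the first character unchanged.
    options = [(1, word[0])]
    for src, targets in VARIANTS.items():
        if word.startswith(src):
            options.extend((len(src), t) for t in targets)
    results = set()
    for consumed, replacement in options:
        for tail in expand(word[consumed:]):
            results.add(replacement + tail)
    return results


def restore(word, min_share=0.05):
    """Reduction: prune hypotheses that the statistics do not support.

    Returns one candidate if the data is decisive, several if the user
    should be offered a choice, and the input itself if nothing is known.
    """
    scored = {h: FREQ.get(h, 0) for h in expand(word)}
    total = sum(scored.values())
    if total == 0:
        return [word]
    kept = [h for h, n in scored.items() if n / total >= min_share]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(kept, key=lambda h: (-scored[h], h))


print(restore("Mueller"))  # the hypothesis "Müeller" is generated, then pruned
print(restore("Masse"))    # two plausible candidates survive
```

With these invented counts, both "Mueller" and "Muller" reduce to the
single candidate "Müller", while "Masse" leaves two candidates and
would be a case for a user choice.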
Do you know of code that removes diacritics in a reasonable
way, e.g. for systems that can only handle ASCII?
This is trivial - we didn't dare to put this on the web. Ask Roman (rv
instead of rj at my email address) - he will provide you with the snippet.
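
For readers who only want the general idea: one common way to do this
(not Roman's snippet, just a sketch using Python's standard unicodedata
module) is to decompose each character and drop the combining marks.
The SPECIAL table is an assumption of this example and is not
exhaustive; it covers a few letters that have no Unicode decomposition.

```python
import unicodedata

# Letters without a canonical decomposition need an explicit mapping.
# Assumed, incomplete list for illustration only.
SPECIAL = {"ß": "ss", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D",
           "ł": "l", "Ł": "L"}


def strip_diacritics(text):
    """Return text with diacritical marks removed where possible."""
    out = []
    for ch in text:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
            continue
        # NFD splits e.g. "ü" into "u" plus a combining diaeresis.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)


print(strip_diacritics("Müller"))
print(strip_diacritics("Příliš žluťoučký kůň"))
```

Characters outside the table that have no decomposition pass through
unchanged, so the output is not guaranteed to be pure ASCII.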
Ideally your approach to adding diacritics would be fully reversible,
when processing the unaccented words,
but that is perhaps too idealistic.
Well - for Czech it is fully reversible, but for German, as there may
be groups of chars (ss -> ß, ue -> ü etc.) that are folded to one char
only, the path isn't reversible anymore. Mueller -> Müller -> Muller.
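
The round trip can be checked mechanically. In this small sketch the
RESTORED table is a hand-written stand-in for the output of the
restoration step (it is not real system output), and round_trips is a
hypothetical helper name.

```python
import unicodedata

# Hand-written stand-in for "ASCII input -> restored word".
RESTORED = {"Dvorak": "Dvořák", "Muller": "Müller", "Mueller": "Müller"}


def strip_marks(word):
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def round_trips(ascii_word):
    """True if stripping the restored word gives back the input."""
    return strip_marks(RESTORED[ascii_word]) == ascii_word


for w in RESTORED:
    print(w, "->", RESTORED[w], "->", strip_marks(RESTORED[w]))
```

The one-to-one cases ("Dvorak", "Muller") come back unchanged, while
"Mueller" does not, because the two-character group "ue" was folded
into a single "ü".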
--
best regards,
Dipl.-Inf. Richard Jelinek
- PetaMem s.r.o. - Ocelarska 1 - Prague - www.petamem.com -
-= 2026049 Mind Units =-