Hi,

I'm on PG 8.0.4 on FreeBSD; initdb and locale are set to de_DE.UTF-8.

My TSearch config is based on "Tsearch2 and Unicode/UTF-8" by Markus
Wollny (http://tinyurl.com/a6po4).

The following files are used:

http://hannes.imos.net/german.med [UTF-8]
http://hannes.imos.net/german.aff [ANSI]
http://hannes.imos.net/german.stop [UTF-8]
http://hannes.imos.net/german.stop.ispell [UTF-8]

german.med is from "ispell-german-compound.tar.gz", available on the
TSearch2 site, recoded to UTF-8.

The first problem is with German compound words and has nothing to do
with UTF-8:

In German, an "s" is often used to link two words into a compound
word. This is true for many German compound words. TSearch/ispell is not
able to break those words up; only exact matches work.

An example with "Produktionsintervall" (production interval):

fts=# SELECT ts_debug('Produktionsintervall');
ts_debug
--------------------------------------------------------------------------------------------------
(default_german,lword,"Latin word",Produktionsintervall,"{de_ispell,de}",'produktionsintervall')

Tsearch/ispell is not able to break this word into parts because of the
"s" in "Produktion/s/intervall". Misspelling the word as
"Produktionintervall" fixes it:

fts=# SELECT ts_debug('Produktionintervall');
ts_debug
---------------------------------------------------------------------------------------------------------------------
(default_german,lword,"Latin word",Produktionintervall,"{de_ispell,de}","'ion' 'produkt' 'intervall' 'produktion'")

How can I fix this / get TSearch to remove/stem the last "s" on a word
before (re-)searching the dict? Can I modify my dict or hack something
else? This is a bit of a show stopper :/


The second thing is with UTF-8:

I know there is no support yet, or no full support, but I need it to
work as well as possible /now/. Is there anything in CVS that I might be
able to backport to my version, or any other tips? My setup works as far
as the dict and the stop word files are concerned, but I fear the
stemming and mapping of umlauts and other special chars does not work as
it should. I tried recoding the german.aff to UTF-8 as well, but that
sometimes breaks it with a regex error:

fts=# SELECT ts_debug('dass');
ERROR: Regex error in '[^sß]$': brackets [] not balanced
CONTEXT: SQL function "ts_debug" statement 1

This seems to happen while it tries to map ss to ß, but anyway, I fear
I didn't do anything good with that.
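That bracket error is plausible if the affix parser handles the file byte by byte: in UTF-8, "ß" is two bytes, so a byte-oriented regex compiler no longer sees a single character inside [^sß]. A minimal Python illustration of the byte layout (just to show the encoding difference, not tsearch2's actual parser):

```python
# "ß" is one character, but its byte length depends on the encoding.
s = "ß"

utf8 = s.encode("utf-8")
latin1 = s.encode("latin-1")

print(len(s))       # 1 code point
print(utf8)         # b'\xc3\x9f' -- two bytes in UTF-8
print(latin1)       # b'\xdf'     -- one byte in ISO-8859-1

# A byte-oriented parser reading the UTF-8 file sees two "characters"
# where the Latin-1 file had one, which can derail bracket parsing.
print(len(utf8), len(latin1))
```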

As suggested in the "Tsearch2 and Unicode/UTF-8" article, I have a second
snowball dict. The first lines of the stem.h I used start with:
extern struct SN_env * german_ISO_8859_1_create_env(void);
So I guess this will not work well with UTF-8 ;p Is there any
other stem.h I could use? Google hasn't returned much for me :/


Thanks for reading and for all your time. I'll consider the donate button
after I get this working ;/

--
Regards,
Hannes Dorbath


  • Hannes Dorbath at Nov 23, 2005 at 10:20 am
    Another UTF-8 thing I forgot:

    fts=# SELECT * FROM stat('SELECT to_tsvector(''simple'', line) FROM fts;');
    ERROR: invalid byte sequence for encoding "UNICODE": 0xe2a7

    The query inside the stat() function alone works fine. I have not set
    any client encoding. What breaks it? It works as long as the inner query
    does not return UTF-8 in vectors.
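For what it's worth, 0xE2 is the lead byte of a three-byte UTF-8 sequence, so the two-byte pair 0xE2 0xA7 is invalid on its own -- consistent with something truncating a multibyte character at a byte boundary. A quick Python check (an illustration only, not a diagnosis of stat()):

```python
# 0xE2 announces a 3-byte UTF-8 sequence; with only one continuation
# byte (0xA7) the sequence is truncated, hence "invalid byte sequence".
data = b"\xe2\xa7"
try:
    data.decode("utf-8")
    print("decoded fine")
except UnicodeDecodeError as exc:
    print("invalid byte sequence:", exc.reason)
```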

    Thanks.

    --
    Regards,
    Hannes Dorbath
  • Teodor Sigaev at Nov 23, 2005 at 10:37 am

    > Tsearch/ispell is not able to break this word into parts because of the
    > "s" in "Produktion/s/intervall". Misspelling the word as
    > "Produktionintervall" fixes it:

    Such affixes should be marked as 'affix in middle of compound word'.
    The flag is '~'; for an example, look in the Norsk dictionary:

    flag ~\\:
    [^S] > S #~ advarsel > advarsels-

    BTW, we develop and debug compound word support on the Norsk (Norwegian)
    dictionary, so look for examples there. But we don't know Norwegian;
    Norwegians helped us :)
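A hypothetical German analogue of the Norwegian rule above might look like the following. This is only a sketch of the affix syntax, not a tested rule; the flag letter and character class would have to fit the rest of german.aff:

```
flag ~\\:
[^S] > S    #~ Produktion > Produktions-
```

With a rule along these lines the compound-word code could recognize the linking "s" (the German Fugen-s) and split "Produktionsintervall" into "Produktions-" + "Intervall".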



    > The second thing is with UTF-8:
    >
    > I know there is no support yet, or no full support, but I need it to
    > work as well as possible /now/. Is there anything in CVS that I might be
    > able to backport to my version, or any other tips? My setup works as far
    > as the dict and the stop word files are concerned, but I fear the
    > stemming and mapping of umlauts and other special chars does not work as
    > it should. I tried recoding the german.aff to UTF-8 as well, but that
    > sometimes breaks it with a regex error:

    What is in CVS now is a deep-alpha version; only the text parser is
    UTF-compliant so far. We continue development...

    > fts=# SELECT ts_debug('dass');
    > ERROR: Regex error in '[^sß]$': brackets [] not balanced
    > CONTEXT: SQL function "ts_debug" statement 1
    >
    > This seems to happen while it tries to map ss to ß, but anyway, I fear
    > I didn't do anything good with that.
    >
    > As suggested in the "Tsearch2 and Unicode/UTF-8" article, I have a second
    > snowball dict. The first lines of the stem.h I used start with:
    > extern struct SN_env * german_ISO_8859_1_create_env(void);

    Can you use ISO-8859-1?

    > So I guess this will not work well with UTF-8 ;p Is there any
    > other stem.h I could use? Google hasn't returned much for me :/

    http://snowball.tartarus.org/

    Snowball can generate a UTF parser; see
    http://snowball.tartarus.org/runtime/use.html:
    F1 [-o[utput] F2]
    [-s[yntax]]
    [-w[idechars]] [-u[tf8]] <-------- that's it!
    [-j[ava]] [-n[ame] C]
    [-ep[refix] S1] [-vp[refix] S2]
    [-i[nclude] D]
    [-r[untime] P]
    At least for Russian there are two parsers, for KOI8 and UTF (
    http://snowball.tartarus.org/algorithms/russian/stem.sbl
    http://snowball.tartarus.org/algorithms/russian/stem-Unicode.sbl
    ); diff shows that they differ only in the stringdef section. So you can
    make a UTF parser for German.
    BUT, I'm afraid that Snowball uses widechars, while Postgres uses multibyte
    for UTF internally.
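The widechar vs. multibyte concern boils down to representation: wide chars are fixed-size code points, while multibyte UTF-8 storage is a variable-length byte sequence, so lengths and offsets computed in one representation are wrong in the other. A small Python illustration of the mismatch:

```python
# Code-point length (what a widechar API sees) vs. byte length
# (what multibyte-encoded storage sees) for some German words.
for word in ("straße", "über", "intervall"):
    print(word, len(word), len(word.encode("utf-8")))

# "straße" -> 6 code points but 7 bytes ("ß" needs two bytes in UTF-8);
# "intervall" is pure ASCII, so both counts agree (9 and 9).
```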



    --
    Teodor Sigaev E-mail: teodor@sigaev.ru
    WWW: http://www.sigaev.ru/
  • Oleg Bartunov at Nov 23, 2005 at 11:07 am

    On Wed, 23 Nov 2005, Hannes Dorbath wrote:

    > Hi,
    >
    > I'm on PG 8.0.4 on FreeBSD; initdb and locale are set to de_DE.UTF-8.
    >
    > My TSearch config is based on "Tsearch2 and Unicode/UTF-8" by Markus Wollny
    > (http://tinyurl.com/a6po4).
    >
    > The following files are used:
    >
    > http://hannes.imos.net/german.med [UTF-8]
    > http://hannes.imos.net/german.aff [ANSI]
    > http://hannes.imos.net/german.stop [UTF-8]
    > http://hannes.imos.net/german.stop.ispell [UTF-8]
    >
    > german.med is from "ispell-german-compound.tar.gz", available on the TSearch2
    > site, recoded to UTF-8.
    >
    > The first problem is with German compound words and has nothing to do
    > with UTF-8:
    >
    > In German, an "s" is often used to link two words into a compound word.
    > This is true for many German compound words. TSearch/ispell is not able to
    > break those words up; only exact matches work.
    >
    > An example with "Produktionsintervall" (production interval):
    >
    > fts=# SELECT ts_debug('Produktionsintervall');
    > ts_debug
    > --------------------------------------------------------------------------------------------------
    > (default_german,lword,"Latin word",Produktionsintervall,"{de_ispell,de}",'produktionsintervall')
    >
    > Tsearch/ispell is not able to break this word into parts because of the "s"
    > in "Produktion/s/intervall". Misspelling the word as "Produktionintervall"
    > fixes it:
    >
    > fts=# SELECT ts_debug('Produktionintervall');
    > ts_debug
    > ---------------------------------------------------------------------------------------------------------------------
    > (default_german,lword,"Latin word",Produktionintervall,"{de_ispell,de}","'ion' 'produkt' 'intervall' 'produktion'")
    >
    > How can I fix this / get TSearch to remove/stem the last "s" on a word before
    > (re-)searching the dict? Can I modify my dict or hack something else? This is
    > a bit of a show stopper :/

    I think the right way is to fix the affix file, i.e. add an appropriate
    rule, but this is beyond our skill :) You should probably send your
    complaints/suggestions to "erstellt von transam", email: transam45@gmx.net
    (see german.aff).

    > The second thing is with UTF-8:
    >
    > I know there is no support yet, or no full support, but I need it to work
    > as well as possible /now/. Is there anything in CVS that I might be able to
    > backport to my version, or any other tips? My setup works as far as the
    > dict and the stop word files are concerned, but I fear the stemming and
    > mapping of umlauts and other special chars does not work as it should. I
    > tried recoding the german.aff to UTF-8 as well, but that sometimes breaks
    > it with a regex error:
    >
    > fts=# SELECT ts_debug('dass');
    > ERROR: Regex error in '[^sß]$': brackets [] not balanced
    > CONTEXT: SQL function "ts_debug" statement 1
    >
    > This seems to happen while it tries to map ss to ß, but anyway, I fear I
    > didn't do anything good with that.
    A similar problem was discussed at
    http://sourceforge.net/mailarchive/forum.php?thread_id=6271285&forum_id=7671

    > As suggested in the "Tsearch2 and Unicode/UTF-8" article, I have a second
    > snowball dict. The first lines of the stem.h I used start with:
    > extern struct SN_env * german_ISO_8859_1_create_env(void);
    > So I guess this will not work well with UTF-8 ;p Is there any other
    > stem.h I could use? Google hasn't returned much for me :/
    As we mentioned several times, tsearch2 doesn't support UTF-8 and
    works only by accident :) We've got a working parser with full UTF-8
    support, but we need to rewrite the interfaces to dictionaries, so there is
    nothing useful at the moment. All changes are available in CVS HEAD (8.2dev).

    A backpatch for 8.1 will be available from our site as soon as we complete
    UTF-8 support for CVS HEAD. We have no deadlines yet, but we have discussed
    support for this project with the OpenACS community (a grant from the
    University of Mannheim), so it's possible that we could complete it really
    soon (we have no answer yet).


    Regards,
    Oleg
    _____________________________________________________________
    Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
    Sternberg Astronomical Institute, Moscow University (Russia)
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(095)939-16-83, +007(095)939-23-83

Discussion Overview
group: pgsql-general @ postgresql
posted: Nov 23, 2005 at 9:56am
active: Nov 23, 2005 at 11:07am
posts: 4
users: 3
website: postgresql.org
irc: #postgresql
