Problem indexing Spanish Characters
Hi Hannah, Otis,
I cannot help, but I have exactly the same problem with special German
characters. I used the Snowball analyzer, but it does not help because the
problem (tokenizing) appears before the analyzer comes into action.
A few minutes ago I posted the question "Problem tokenizing UTF-8 with german
umlauts", which describes my problem; Hannah's seems to be similar.
Do you also have UTF-8 encoded pages?

Peter MH

-----Original Message-----
From: Otis Gospodnetic
Sent: Wednesday, May 19, 2004 5:42 PM
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters


It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish

If it does, take a look at the Lucene Sandbox. There is a project there that
allows you to use Snowball analyzers with Lucene.
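
In rough terms it plugs in wherever an Analyzer is expected. A minimal
sketch, assuming the sandbox snowball jar is on the classpath; the index
path here is illustrative:

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.IndexWriter;

// "Spanish" selects the Spanish Snowball stemmer. Note that stemming runs
// after tokenization, so this alone does not change how text is tokenized.
IndexWriter writer = new IndexWriter("/tmp/resorts-index",
        new SnowballAnalyzer("Spanish"), true);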

Otis


--- Hannah c wrote:
Hi,

I am indexing a number of English articles on Spanish resorts. As such,
there are a number of Spanish characters throughout the text; most of these
are in the place names, which are the type of words I would like to use as
queries. My problem is with the StandardTokenizer class, which cuts a word
in two when it comes across any of the Spanish characters. I had a look at
the source, but the code was generated by JavaCC and so is not very
readable. I was wondering if there is a way around this problem, or which
area of the code I would need to change to avoid it.

Thanks,
Hannah Cumming


  • Hannah c at May 19, 2004 at 4:35 pm
    Hi,

    I had a quick look at the sandbox, but my problem is that I don't need a
    Spanish stemmer. However, there must be a replacement tokenizer that
    supports foreign characters to go along with the foreign-language
    Snowball stemmers. Does anyone know where I could find one?

    In answer to Peter's question - yes, I'm also using UTF-8 encoded XML
    documents as the source.
    Below is an example of what happens when I tokenize the text using the
    StandardTokenizer.

    Thanks,
    Hannah



    ------------------text I'm trying to index

    century palace known as la “Fundación Hospital de Na. Señora del Pilar”

    -----------------tokens output by StandardTokenizer

    century
    palace
    known
    as
    la
    â
    Fundacià *
    n *
    Hospital
    de
    Na
    Seà *
    ora *
    del
    Pilar
    â
    -----------------------
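
    The "â" and "Ã" fragments are characteristic of UTF-8 bytes being decoded
    as Latin-1. A minimal sketch reproducing that failure mode (the sample
    string is illustrative; Martin Remy's reply below confirms the diagnosis):

        import java.io.UnsupportedEncodingException;

        public class MojibakeDemo {
            public static void main(String[] args) throws UnsupportedEncodingException {
                // "Fundación" encoded as UTF-8 but decoded as ISO-8859-1 turns
                // each accented character into the two-character garbage above.
                byte[] utf8 = "Fundación".getBytes("UTF-8");
                System.out.println(new String(utf8, "ISO-8859-1")); // FundaciÃ³n
            }
        }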


    From: "Peter M Cipollone" <lu1@bihvhar.com>
    To: <hannahc7@hotmail.com>
    Subject: Re: Problem indexing Spanish Characters
    Date: Wed, 19 May 2004 11:41:28 -0400

    Could you send some sample text that causes this to happen?





    --------------------------------
    Hannah Cumming
    hannahc7@hotmail.com



  • Martin Remy at May 19, 2004 at 6:09 pm
    The tokenizers deal with Unicode characters (CharStream, char), so the
    problem is not there. It must be solved at the point where the bytes from
    your source files are turned into CharSequences/Strings, i.e. by wrapping
    the input in an InputStreamReader (in place of a plain FileReader, or
    whatever you're using) and specifying "UTF-8" (or whatever encoding is
    appropriate) in the InputStreamReader constructor.
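
    A minimal sketch of that fix (the file name is illustrative):

        import java.io.*;

        // Decode the file's bytes explicitly as UTF-8 instead of relying on
        // FileReader, which silently uses the platform default encoding.
        Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("resorts.xml"), "UTF-8"));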

    You must either detect the encoding from HTTP headers or XML declarations
    or, if you know that it is the same for all of your source files, just
    hardcode UTF-8.
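
    If you do need to detect it, a rough sketch of pulling the encoding name
    out of an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
    (a real XML parser does this for you; the helper name is made up):

        import java.util.regex.*;

        static String sniffXmlEncoding(String declarationLine) {
            // Look for encoding="..." or encoding='...' in the declaration.
            Matcher m = Pattern.compile("encoding=[\"']([^\"']+)[\"']")
                    .matcher(declarationLine);
            return m.find() ? m.group(1) : "UTF-8"; // UTF-8 is the XML default
        }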

    Martin

  • Wallen at May 19, 2004 at 7:05 pm
    Here is an example method in org.apache.lucene.demo.html.HTMLParser that
    uses a different buffered reader for a different encoding.

    public Reader getReader() throws IOException
    {
        if (pipeIn == null)
        {
            pipeInStream = new MyPipedInputStream();
            pipeOutStream = new PipedOutputStream(pipeInStream);
            pipeIn = new InputStreamReader(pipeInStream);
            pipeOut = new OutputStreamWriter(pipeOutStream);
            // Check the first bytes for the FFFE marker; if it is there,
            // we know the input is UTF-16 encoded.
            if (useUTF16)
            {
                try
                {
                    pipeIn = new BufferedReader(
                            new InputStreamReader(pipeInStream, "UTF-16"));
                }
                catch (Exception e)
                {
                    // Fall back to the default-encoding reader created above.
                }
            }
            Thread thread = new ParserThread(this);
            thread.start(); // start parsing
        }
        return pipeIn;
    }
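
    The useUTF16 flag is set elsewhere in the demo. A rough sketch of what
    such a BOM check can look like on a markable stream (the helper name is
    illustrative, not the demo's actual code):

        import java.io.*;

        private static boolean startsWithUTF16BOM(InputStream in) throws IOException {
            in.mark(2);
            int b0 = in.read();
            int b1 = in.read();
            in.reset();
            // 0xFE 0xFF is the big-endian BOM; 0xFF 0xFE the little-endian one.
            return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
        }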

  • PEP AD Server Administrator at May 21, 2004 at 11:02 am
    Hi all,
    Martin was right. I just adapted the HTML demo as Wallen recommended, and
    it worked. Now I only have to deal with some crazy documents which are
    UTF-8 encoded but mixed with entities.
    Does anyone know a class which can translate entities into UTF-8 or any
    other encoding?
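
    One candidate, assuming Jakarta Commons Lang is an option, is its
    StringEscapeUtils class, which decodes HTML entity references into
    Unicode characters; run it over the text before it reaches the tokenizer:

        import org.apache.commons.lang.StringEscapeUtils;

        // Turn entity references into real Unicode characters, e.g. for input
        // that mixes UTF-8 text with "&ntilde;"-style escapes.
        String decoded = StringEscapeUtils.unescapeHtml("Se&ntilde;ora del Pilar");
        System.out.println(decoded); // Señora del Pilar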

    Peter MH


Discussion Overview
group: java-user
categories: lucene
posted: May 19, '04 at 4:09p
active: May 21, '04 at 11:02a
posts: 5
users: 4
website: lucene.apache.org
