Hi folks,
I have just upgraded the Hibernate Search library in my app, so I also had to
upgrade Lucene from version 2.2 to 2.4.
In Lucene 2.4 the ISOLatin1AccentFilter class has changed and I can't figure
out how it works.
I use a TwoWayFieldBridge to index the data, and this is my set method:

public void set(String s, Object o, Document document, Field.Store store,
        Field.Index index, Float aFloat) {

    // MyObject has a field "name"
    MyObject objectToIndex;

    // casting from Object to MyObject
    try {
        objectToIndex = MyObject.class.cast(o);
    } catch (ClassCastException cEx) {}

    if (objectToIndex.getName() != null) {

        ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
                new StandardTokenizer(new StringReader(objectToIndex.getName())));
        filter.removeAccents(objectToIndex.getName().toCharArray(),
                objectToIndex.getName().length());
        Field name = new Field("name",
                String.valueOf(objectToIndex.getName()).toLowerCase(),
                Field.Store.YES, Field.Index.UN_TOKENIZED);

        document.add(name);
    }
}


But it doesn't work: if I pass an accented word as the value of
objectToIndex.getName(), it keeps its accent :(
I think there is something wrong in how I create the new instance of
ISOLatin1AccentFilter, but I can't get it to work properly.
Could someone help me?
Thanks a lot.
--
View this message in context: http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24641618.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Simon Willnauer at Jul 24, 2009 at 10:14 am
    I do not really understand what you are trying to do. Do you just
    want to remove the accents from the string and index it without passing
    it through an analyzer? (Field.Index.UN_TOKENIZED will not pass the
    field value to an analyzer.)

    If you pass an array to ISOLatin1AccentFilter#removeAccents(), the
    processed chars are written to a private internal char array
    inside the ISOLatin1AccentFilter. So you cannot use the removeAccents
    method on its own just to remove the accents. What you could do as a
    dirty workaround is the following:
    String foo = "HÄllo HÄllo HÄllo HÄllo HÄllo";
    ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
        new Tokenizer(new StringReader(foo)) {
            private boolean isRead = false;

            // Emits the whole input as a single token, exactly once.
            public Token next(final Token reusableToken) throws IOException {
                if (isRead) {
                    return null;
                }
                BufferedReader reader = new BufferedReader(this.input);
                StringBuilder builder = new StringBuilder();

                char[] buffer = new char[1024];
                int read = -1;
                while ((read = reader.read(buffer)) > 0) {
                    builder.append(buffer, 0, read);
                }
                reusableToken.setTermText(builder.toString());
                isRead = true;
                return reusableToken;
            }
        });
    Token t = filter.next();
    String foo_without_accents = t.term();
    System.out.println(foo_without_accents);

    yields: HAllo HAllo HAllo HAllo HAllo


    simon
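
To illustrate why the call in the original set method has no visible effect: the field value is built from the unchanged String, and toCharArray() hands out a fresh copy that is then thrown away. The sketch below uses only the JDK. The class CopyDemo and the helper stripInPlace are hypothetical names invented for this illustration, and the helper is not how ISOLatin1AccentFilter works internally (per the reply above, the filter writes to its own private buffer rather than to the input array).

```java
final class CopyDemo {

    // Hypothetical helper: replaces 'Ä' (U+00C4) with 'A' directly in the array.
    static void stripInPlace(char[] chars, int length) {
        for (int i = 0; i < length; i++) {
            if (chars[i] == '\u00C4') {
                chars[i] = 'A';
            }
        }
    }

    public static void main(String[] args) {
        String name = "H\u00C4llo";

        // toCharArray() returns a copy; the String itself is immutable.
        char[] copy = name.toCharArray();
        stripInPlace(copy, copy.length);

        // The original String still contains the accented character.
        System.out.println(name.indexOf('\u00C4') >= 0);
        // Only the copy was changed, so the result must be read from it.
        System.out.println(new String(copy).equals("HAllo"));
    }
}
```

Whatever does the stripping, the processed text has to be read back and used as the field value instead of calling getName() again.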
  • Luther blisset at Jul 24, 2009 at 11:37 am
    I'm trying to index all the words without accents.
    I do the same when querying: I remove the accents and lowercase the
    search term.
    Why should I pass the string through the analyzer?
    What is wrong if I don't pass it through the analyzer,
    and what are the benefits?
    I'm just a newbie with Lucene.
    Thanks a lot for your reply :]
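
The approach described here (strip accents and lowercase both at index time and at query time) only works if both sides go through exactly the same normalization. A minimal sketch of that idea follows, assuming a single shared helper. It uses java.text.Normalizer from the JDK (Java 6 or later) as a stand-in and not ISOLatin1AccentFilter, so its output can differ from the filter for some characters (ligatures such as Æ, for instance, are not decomposed by NFD). The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

final class TermNormalizer {

    // Shared by indexing and querying so that both produce the same term.
    static String normalize(String text) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the combining marks, keeping only the base letters.
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Locale.ROOT avoids locale-specific lowercasing surprises.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexedTerm = normalize("H\u00C4llo"); // value stored in the index
        String queryTerm = normalize("h\u00E4LLO");   // what the user typed
        System.out.println(indexedTerm.equals(queryTerm)); // prints "true"
    }
}
```

With an untokenized field the whole normalized value is a single term, so a query matches only when its normalized form equals the entire stored value.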




  • AHMET ARSLAN at Jul 24, 2009 at 10:25 am
    Or alternatively:

    String test = "HÄllo HÄllo HÄllo HÄllo HÄllo";

    ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
        new KeywordTokenizer(new StringReader(test)));

    final Token reusableToken = new Token();
    Token nextToken;

    if ((nextToken = filter.next(reusableToken)) != null) {
        System.out.print(nextToken.term());
    }

    filter.close();




  • Luther blisset at Jul 24, 2009 at 11:39 am
    Yes, Ahmet Arslan, this works!
    I've just tested it and it works nicely.
    Many thanks.





Discussion Overview
group: java-user
categories: lucene
posted: Jul 24, '09 at 9:41a
active: Jul 24, '09 at 11:39a
posts: 5
users: 3
website: lucene.apache.org
