Hi,

I have seen the example SAX-based XML processing in the Lucene sandbox (thanks to the authors for contributing!) and have successfully adapted this approach for my application. The one thing that does not sit well with me is that I am using Field.Text(String, String) instead of the Field.Text(String, Reader) version, which means I am storing the contents in the index.

Some questions:

1. Should I care? What is the cost of storing the contents of these files versus using the Reader-based method? Presumably the index is going to be larger, but will it adversely affect search time? If so, by how much (relatively speaking)?

2. If storing the content is going to adversely affect searching, has anyone written an XMLReader that extends java.io.Reader? I guess it would need to take the name of the tag(s) that you want the reader to retrieve and then override the java.io.Reader methods to return values based only on the tags I am interested in. Has anyone taken this approach? If not, does it at least seem like a valid approach?

Thanks for your help!

-Grant Ingersoll



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Chong, Herb at Dec 5, 2003 at 2:53 pm
    You are storing the same information both ways. The string gets analyzed and discarded, just like with the Reader.

    Herb...

    -----Original Message-----
    From: Grant Ingersoll
    Sent: Friday, December 05, 2003 9:49 AM
    To: lucene-user@jakarta.apache.org
    Subject: Index and Field.Text


  • Erik Hatcher at Dec 5, 2003 at 3:23 pm

    On Friday, December 5, 2003, at 09:48 AM, Grant Ingersoll wrote:
    > I have seen the example SAX based XML processing in the Lucene sandbox
    > (thanks to the authors for contributing!) and have successfully
    > adapted this approach for my application. The one thing that does not
    > sit well with me is the fact that I am using the method
    > Field.Text(String, String) instead of the Field.Text(String, Reader)
    > version, which means I am storing the contents in the index.

    So use Field.UnStored(String, String) then. It is the same as
    Field.Text(String, Reader).

    The static "factory" methods on Field are merely for convenience. You
    can control all the flags yourself using the constructor:

        public Field(String name, String string,
                     boolean store, boolean index, boolean token)

    > 2. If storing the content is going to adversely affect searching, has
    > anyone written an XMLReader that extends java.io.Reader.

    You could always use a StringReader wrapper :))

    Erik
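
Erik's StringReader suggestion can be sketched roughly as follows, assuming a SAX handler that simply accumulates the character data found inside the tags of interest. `TagTextReaders`, `extractText` and `readerFor` are hypothetical names used for illustration; they are not part of Lucene or of the sandbox example, and only JDK classes are used. Note that this approach buffers the selected text in memory, so it does not provide the streaming benefit of a true Reader implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Lucene or its sandbox.
final class TagTextReaders {

    // Collects character data that appears inside any of the wanted elements.
    private static final class TagTextHandler extends DefaultHandler {
        private final Set<String> wanted;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // greater than 0 while inside a wanted element

        TagTextHandler(Set<String> wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Enter a wanted element, or descend into a child of one.
            if (depth > 0 || wanted.contains(qName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--;
                if (depth == 0) {
                    text.append(' '); // keep adjacent elements from running together
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }
    }

    // Parses the XML and returns the text found inside the given tags.
    static String extractText(Reader xml, Set<String> tags) {
        TagTextHandler handler = new TagTextHandler(tags);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(xml), handler);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.text.toString().trim();
    }

    // The "StringReader wrapper": the whole extracted text is held in memory.
    static Reader readerFor(Reader xml, Set<String> tags) {
        return new StringReader(extractText(xml, tags));
    }
}
```

The Reader returned by `readerFor` could then be passed to Field.Text(String, Reader), or the String from `extractText` to Field.UnStored(String, String); per Erik's reply, both index the text without storing it.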


  • Tatu Saloranta at Dec 5, 2003 at 4:31 pm

    On Friday 05 December 2003 08:22, Erik Hatcher wrote:
    > On Friday, December 5, 2003, at 09:48 AM, Grant Ingersoll wrote:
    > ...
    >> Field.Text(String, String) instead of the Field.Text(String, Reader)
    >> version, which means I am storing the contents in the index.
    >
    > So use Field.UnStored(String, String) then. It is the same as
    > Field.Text(String, Reader).
    >
    > The static "factory" methods on Field are merely for convenience. You
    > can control all the flags yourself using the constructor:

    I think it's almost a bug that they act differently despite having the
    same method name. I don't think a method should be called Text() if it
    behaves like UnStored(). Additionally, the implementation of the
    non-public constructor relies on default values for isIndexed, isStored
    and isTokenized; it should probably take those from the static method
    for clarity.

    Also, shouldn't there be at least 3 methods that take Readers: one for
    Text-like handling, another for UnStored, and a last one for UnIndexed?
    It's probably OK not to have one for keywords. For the other types,
    though, it's often more convenient to just pass in a Reader.
    (Internally, the difference between passing in a Reader or a String is
    not huge, as a String will be accessed via a StringReader.)

    -+ Tatu +-
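
To make the flag combinations under discussion easier to compare, here is a minimal sketch that models only the three booleans (store, index, token) from the constructor Erik quoted. `FieldFlags` is a hypothetical stand-in, not the real org.apache.lucene.document.Field. The Text and UnStored rows follow what is stated in this thread; the Keyword and UnIndexed rows are an assumption based on the Lucene 1.x Javadoc and should be checked against it.

```java
// Hypothetical model of the store/index/token flags; NOT the real Lucene Field class.
final class FieldFlags {
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldFlags(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // Text(String, String): stored, indexed and tokenized (per this thread).
    static FieldFlags textFromString() { return new FieldFlags(true, true, true); }

    // Text(String, Reader): indexed and tokenized, but not stored (per this thread).
    static FieldFlags textFromReader() { return new FieldFlags(false, true, true); }

    // UnStored(String, String): same flags as Text(String, Reader) (per this thread).
    static FieldFlags unStored() { return new FieldFlags(false, true, true); }

    // Keyword(String, String): assumed stored and indexed, not tokenized.
    static FieldFlags keyword() { return new FieldFlags(true, true, false); }

    // UnIndexed(String, String): assumed stored only.
    static FieldFlags unIndexed() { return new FieldFlags(true, false, false); }

    // True when both flag sets are identical.
    boolean sameAs(FieldFlags other) {
        return stored == other.stored
                && indexed == other.indexed
                && tokenized == other.tokenized;
    }

    @Override
    public String toString() {
        return "stored=" + stored + ", indexed=" + indexed + ", tokenized=" + tokenized;
    }
}
```

In this model `textFromReader().sameAs(unStored())` is true while `textFromString().sameAs(textFromReader())` is false, which is the inconsistency between the two Text() overloads that Tatu points out.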



  • Doug Cutting at Dec 5, 2003 at 5:45 pm

    Tatu Saloranta wrote:
    > Also, shouldn't there be at least 3 methods that take Readers; one for
    > Text-like handling, another for UnStored, and last for UnIndexed.

    How do you store the contents of a Reader? You'd have to double-buffer
    it, first reading it into a String to store, and then tokenizing the
    StringReader. A key feature of Reader values is that they're streamed:
    the entire value is never in RAM. Storing a Reader value would remove
    that advantage. The current API makes this explicit: when you want
    something streamed, you pass in a Reader, when you're willing to have
    the entire value in memory, pass in a String.

    Yes, it is a bit confusing that Text(String, String) stores its value,
    while Text(String, Reader) does not, but it is at least well documented.
    And we cannot change it: that would break too many applications. But
    we can put this on the list for Lucene 2.0 cleanups.

    When I first wrote these static methods I meant for them to be
    constructor-like. I wanted to have multiple Field(String, String)
    constructors, but that's not possible, so I used capitalized static
    methods instead. I've never seen anyone else do this (capitalize any
    method but a real constructor) so I guess I didn't start a fad! This
    should someday too be cleaned up. Lucene was the first Java program
    that I ever wrote, and thus its style is in places non-standard. Sorry.

    Doug
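
The difference Doug describes between streaming a Reader and double-buffering it can be illustrated with plain JDK classes. This is only a sketch: `ReaderBuffering` and its methods are hypothetical names, and whitespace token counting stands in for a real Lucene analyzer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical illustration; whitespace splitting stands in for real analysis.
final class ReaderBuffering {

    // Streaming: only a fixed-size buffer is held in memory at any time.
    static int countTokensStreaming(Reader in) {
        char[] buf = new char[4096];
        int tokens = 0;
        boolean inToken = false;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    boolean whitespace = Character.isWhitespace(buf[i]);
                    if (!whitespace && !inToken) {
                        tokens++; // first character of a new token
                    }
                    inToken = !whitespace;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    // Double-buffering, step one: drain the Reader into a String so the
    // value could be stored. The entire value is now in RAM.
    static String readFully(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

To both store and tokenize a Reader value, a caller would first call `readFully(reader)` and then tokenize `new StringReader(value)`, which is the double-buffering Doug refers to; calling `countTokensStreaming(reader)` directly never materializes the whole value.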


  • Tatu Saloranta at Dec 6, 2003 at 12:46 am

    On Friday 05 December 2003 10:45, Doug Cutting wrote:
    > Tatu Saloranta wrote:
    >> Also, shouldn't there be at least 3 methods that take Readers; one for
    >> Text-like handling, another for UnStored, and last for UnIndexed.
    >
    > How do you store the contents of a Reader? You'd have to double-buffer
    > it, first reading it into a String to store, and then tokenizing the
    > StringReader. A key feature of Reader values is that they're streamed:

    Not really, you can pass a Reader to the tokenizer, which then reads and
    tokenizes directly (I think that's the way the code works, too). This is
    because internally a String is read using a StringReader, so passing a
    String looks more like a convenience feature?

    > the entire value is never in RAM. Storing a Reader value would remove
    > that advantage. The current API makes this explicit: when you want
    > something streamed, you pass in a Reader, when you're willing to have
    > the entire value in memory, pass in a String.

    I guess for things that are both tokenized and stored, passing a Reader
    can't really help a lot; if one wants to reduce memory usage, the text
    needs to be read twice, or the analyzer needs to help in writing output;
    or the text needs to be read into memory, much like what happens now.
    It'd simplify application code a bit, but wouldn't do much more.

    So... I guess I need to downgrade my suggestion to require just 2
    Reader-taking factory methods? :-)
    I still think that index-only and store-only versions would both make
    sense. In the latter case, storing could be done in a fully streaming
    fashion; in the former, tokenization can be done?

    > Yes, it is a bit confusing that Text(String, String) stores its value,
    > while Text(String, Reader) does not, but it is at least well documented.
    > And we cannot change it: that would break too many applications. But
    > we can put this on the list for Lucene 2.0 cleanups.

    Yes, I understand that. It would not be reasonable to make such a change.
    But how about adding a more intuitive factory method
    (UnStored(String, Reader))?

    > When I first wrote these static methods I meant for them to be
    > constructor-like. I wanted to have multiple Field(String, String)
    > constructors, but that's not possible, so I used capitalized static
    > methods instead. I've never seen anyone else do this (capitalize any
    > method but a real constructor) so I guess I didn't start a fad!

    :-)

    > This should someday too be cleaned up. Lucene was the first Java program
    > that I ever wrote, and thus its style is in places non-standard. Sorry.

    The best standards are created by people doing things others use, follow
    or imitate... so it was worth a try! :-)

    -+ Tatu +-


  • Esmond Pitt at Dec 7, 2003 at 11:16 pm
    When creating an index, FSDirectory assumes that the directory has no
    subdirectories. If a non-empty subdirectory is present, FSDirectory.create
    fails to delete it and throws an IOException. As the subdirectory is not a
    Lucene index file (although in my case it is a Lucene sub-index), the method
    actually has no business attempting to delete it at all. Can this behaviour
    please be changed so that it doesn't attempt to delete subdirectories in an
    index location at all?

    Applies to 1.3RC3.

    EJP
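
The behaviour Esmond asks for, clearing the files in an index location while leaving subdirectories untouched, could look something like the sketch below. This is not Lucene's FSDirectory code; `IndexDirCleaner` and `deleteRegularFiles` are hypothetical names, and the sketch uses java.nio.file, which is far newer than the Java versions Lucene 1.3 targeted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the requested behaviour; not taken from Lucene.
final class IndexDirCleaner {

    // Deletes the regular files directly inside dir and returns how many
    // were removed. Subdirectories and their contents are left alone.
    static int deleteRegularFiles(Path dir) {
        int deleted = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    Files.delete(entry);
                    deleted++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return deleted;
    }
}
```

With this approach a nested sub-index directory survives re-creation of the parent index, because only entries for which `Files.isRegularFile` is true are ever deleted.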




Discussion Overview
group: java-user
categories: lucene
posted: Dec 5, '03 at 2:48p
active: Dec 7, '03 at 11:16p
posts: 7
users: 6
website: lucene.apache.org
