Hi all!

I'm currently running a big lucene index and one of my main concerns is
the integrity of the data entered. A few things come to mind, like
enforcing that certain fields be non-blank, forcing certain formats etc...

All these validations are easy to do with lucene, since I can validate
the document before it is indexed or when it is retrieved.

The thing I have a hard time with, however, is field uniqueness.

Let's say I have a field and I really want it to be unique. I can't seem
to find out how to do it during the indexing phase, since nothing
that is added to the index is readable by an index reader until the
index is closed.

Add to that the fact that items can be deleted from the index during
indexing, and the only way I have to check uniqueness is to walk every
value of the unique field using termEnums and check its docFreq.

This has the major disadvantage that I cannot inform people who are using
the library of a uniqueness conflict when it happens, only when the index
is closed.

Does anyone have an idea on how I could check an index that is in the
process of being indexed (things added, things deleted) for the uniqueness
of a given field *at the time I index a document*?

Daniel Shane


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Shai Erera at Aug 13, 2009 at 2:44 pm
    How many documents do you index between reader refreshes? If it's not
    too many, I'd keep a Set of those terms and check every incoming document
    against the set and then the reader.

    Note that the set keeps only the terms of those documents your reader
    doesn't see. You should clear() it after you've refreshed your reader.

    In 2.9, IndexWriter will expose a getReader(), so you might be able to use
    it, by checking both its reader and the on-disk reader.

    If it's possible, I think I'd prefer the first approach.

    Shai
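
A minimal sketch of the Set-based approach described above, assuming a single indexing thread. The class name `PendingKeys` and the `visibleToReader` predicate are hypothetical; the predicate stands in for whatever lookup is run against the open reader (for example a docFreq check on the unique field).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Tracks unique-field values added since the reader was last refreshed. */
class PendingKeys {
    // Values the open reader cannot see yet.
    private final Set<String> pending = new HashSet<>();
    // Hypothetical stand-in for a lookup against the open reader.
    private final Predicate<String> visibleToReader;

    PendingKeys(Predicate<String> visibleToReader) {
        this.visibleToReader = visibleToReader;
    }

    /** Records the key and returns true if it is new; returns false on a conflict. */
    boolean tryAdd(String key) {
        // Check the in-memory set first, then the reader.
        if (pending.contains(key) || visibleToReader.test(key)) {
            return false;
        }
        pending.add(key);
        return true;
    }

    /** Call right after the reader has been refreshed. */
    void readerRefreshed() {
        pending.clear();
    }
}
```

The set only ever holds the keys added since the last refresh, so its memory use is bounded by how often the reader is refreshed.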
  • Daniel Shane at Aug 13, 2009 at 2:53 pm
    Users can index really a lot of stuff, so I'd like not to keep things in
    memory for too long.

    Even if I keep a set of things added, how do I know if something has
    been deleted via a delete? It seems rather difficult to keep this set of
    documents added in sync with the index reader on the index (before it
    has been written to).

    What I'd like is to have access to the stuff the index writer has
    written but not yet committed. Is there something that can access that data?

    Daniel Shane

  • Shai Erera at Aug 13, 2009 at 3:01 pm
    In 2.9 there will be: IndexWriter#getReader().

    BTW, note that even if someone deletes, your reader may not see this delete.
    If you use IndexWriter to delete docs, the open reader won't see those
    deletes. So you may still have a problem.

    I don't know how much stuff users can index, and how often you plan to
    commit. But I'd like to think humans can't generate that much traffic in a
    short period of time (say every couple of minutes). If they do, you might
    want to commit anyway, so they'll be able to search on their newly added
    data?

    Anyway, I'll give it some more thought.

    Shai
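
One way to compensate for deletes the open reader cannot see is to remember them alongside the pending adds. This is only a sketch under stated assumptions: deletes are issued by unique key (a delete by arbitrary query cannot be tracked this way), there is a single indexing thread, and the refreshed reader reflects every buffered change. `UniqueKeyTracker` and `visibleToReader` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Tracks adds and key-based deletes that the open reader has not seen yet. */
class UniqueKeyTracker {
    private final Set<String> pendingAdds = new HashSet<>();
    private final Set<String> pendingDeletes = new HashSet<>();
    // Hypothetical stand-in for a lookup against the (possibly stale) reader.
    private final Predicate<String> visibleToReader;

    UniqueKeyTracker(Predicate<String> visibleToReader) {
        this.visibleToReader = visibleToReader;
    }

    /** Records the key and returns true if it is free; returns false on a conflict. */
    boolean tryAdd(String key) {
        if (pendingAdds.contains(key)) {
            return false;
        }
        // A key the reader still shows is free if its delete is pending.
        boolean liveInReader = visibleToReader.test(key) && !pendingDeletes.contains(key);
        if (liveInReader) {
            return false;
        }
        pendingAdds.add(key);
        return true;
    }

    /** Records a delete by unique key. */
    void delete(String key) {
        pendingAdds.remove(key);
        pendingDeletes.add(key);
    }

    /** Call right after the reader has been refreshed. */
    void readerRefreshed() {
        pendingAdds.clear();
        pendingDeletes.clear();
    }
}
```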
  • Grant Ingersoll at Aug 18, 2009 at 2:28 pm

    On Aug 13, 2009, at 10:33 AM, Daniel Shane wrote:
    Does anyone have an idea on how I could check an index that is in
    the process of being indexed (things added, things deleted) for the
    uniqueness of a given field *at the time I index a document*?

    Solr has de-duplication built-in at indexing time: http://wiki.apache.org/solr/Deduplication

    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com/

    Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
    using Solr/Lucene:
    http://www.lucidimagination.com/search


  • Daniel Shane at Aug 19, 2009 at 7:04 pm
    But in that case, I assume Solr does a commit per document added.

    Let's say I wanted to index a collection of 1 million pages. Would it
    take much longer if I committed at each insertion rather than committing
    at the end?

    Daniel Shane

  • Chris Hostetter at Aug 21, 2009 at 4:49 am
    : But in that case, I assume Solr does a commit per document added.

    not at all ... it computes a signature and then uses that as a unique key.
    IndexWriter.updateDocument does all the hard work.


    -Hoss
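
To illustrate the idea of a signature used as a unique key, here is a rough sketch in plain Java. It is not Solr's implementation: the `SignatureIndex` class, the choice of MD5, and the field selection are all assumptions, and a map stands in for the index. The `update` method mimics the delete-then-add behaviour of IndexWriter.updateDocument, so a duplicate replaces the earlier document rather than being rejected.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model: documents keyed by a signature computed from selected fields. */
class SignatureIndex {
    // Hypothetical stand-in for the index: signature -> document fields.
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    /** Hex MD5 over the chosen field values, in the order given. */
    static String signature(Map<String, String> doc, String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                md.update(doc.getOrDefault(field, "").getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator, so "ab"+"c" differs from "a"+"bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Replaces any document with the same signature, then stores this one. */
    void update(Map<String, String> doc, String... fields) {
        docs.put(signature(doc, fields), doc);
    }

    int size() {
        return docs.size();
    }
}
```

Because the later document silently wins, this gives de-duplication rather than a conflict report at add time.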


  • Yonik Seeley at Aug 21, 2009 at 4:52 am

    On Fri, Aug 21, 2009 at 12:49 AM, Chris Hostetter wrote:

    : But in that case, I assume Solr does a commit per document added.

    not at all ... it computes a signature and then uses that as a unique key.
    IndexWriter.updateDocument does all the hard work.

    Right - Solr used to do that hard work, but we handed that over to
    Lucene when that capability was added. It involves batching either
    way (but letting Lucene handle it at a lower level is "better" since
    it can prevent inconsistencies from crashes).

    -Yonik
    http://www.lucidimagination.com

  • Daniel Shane at Aug 26, 2009 at 4:48 pm
    Hmm... there is something I don't get.

    When you open up an index writer, you batch up adds and deletes. Now if
    you create a signature for the document, it works as long as you only add,
    but what happens if you also delete stuff from the index using a query
    while adding?

    Does Solr remember the deletions as well?

    Daniel Shane

  • Yonik Seeley at Aug 26, 2009 at 7:05 pm

    On Wed, Aug 26, 2009 at 12:47 PM, Daniel Shane wrote:

    Does Solr also remember the deletions as well?

    It used to - but now it delegates all that to IndexWriter as well (and
    Lucene buffers them instead).

    -Yonik
    http://www.lucidimagination.com

  • Jason Rutherglen at Aug 26, 2009 at 10:37 pm
    Daniel,

    You may want to look at SOLR-1375, which enables ID checking
    using a BloomFilter (with a specified error rate of false
    positives). Otherwise, for what you're trying to do, you'd need
    to create a hash map?

    -J
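
For readers unfamiliar with the technique, here is a small Bloom filter sketch in plain Java. It does not reproduce the SOLR-1375 patch; the class name `KeyBloomFilter`, the sizing parameters, and the hashing scheme are all assumptions. The useful property is that a "no" answer is definitive, while a "yes" may be a false positive and would still need to be confirmed against the index.

```java
import java.util.BitSet;

/** Minimal Bloom filter over string keys: no false negatives, some false positives. */
class KeyBloomFilter {
    private final BitSet bits;
    private final int size;   // number of bits
    private final int hashes; // number of hash probes per key

    KeyBloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Double hashing: the i-th probe is h1 + i * h2, reduced to [0, size).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** False means the key was never added; true means it probably was. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Unlike a hash map of all keys, the filter's memory use is fixed up front, at the cost of occasionally sending a new key through the slower exact check.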


Discussion Overview
group: java-user
categories: lucene
posted: Aug 13, '09 at 2:34p
active: Aug 26, '09 at 10:37p
posts: 11
users: 6
website: lucene.apache.org
