Tlog File not removed after hard commit
Hi all,

We import about 1.5 million documents on a nightly basis using DIH. During this time, we need to ensure that all documents make it into the index, or else roll back on any errors, which DIH takes care of for us. We also disable autoCommit in DIH but instruct it to commit at the very end of the import. This is all done through configuration of the DIH config XML file and the command issued to the request handler.
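For readers unfamiliar with DIH, a request of the kind described above might look roughly like the following. This is only a sketch in Python that builds the URL and does not send it. The host, port, core name (`localhost:8983`, `collection1`), the `/dataimport` handler path, and the helper name `build_full_import_url` are assumptions for illustration; `command`, `clean`, and `commit` are standard DIH request parameters, but the values used in the original setup are not stated in the thread.

```python
from urllib.parse import urlencode


def build_full_import_url(base_url, clean=True, commit=True):
    """Build a DIH full-import request URL (hypothetical helper; sends nothing)."""
    params = {
        "command": "full-import",      # run a complete import
        "clean": str(clean).lower(),   # whether to delete existing docs first
        "commit": str(commit).lower(), # hard commit once the import finishes
    }
    return f"{base_url}/dataimport?{urlencode(params)}"


# Assumed core URL; adjust to the actual Solr host and core name.
url = build_full_import_url("http://localhost:8983/solr/collection1")
print(url)
```

With `commit=true`, the single hard commit happens only after the last document is added, which is why the whole nightly run ends up in one transaction log.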

We have noticed that the tlog file appears to linger around even after DIH has issued the hard commit. My expectation would be that after the hard commit has occurred, the tlog file will be removed. I'm obviously misunderstanding how this all works.

Can someone please help me understand how this is meant to function? Thanks!

-Niran

  • Michael Della Bitta at Mar 25, 2013 at 12:40 pm
    My understanding is that logs stick around for a while just in case they
    can be used to catch up a shard that rejoins the cluster.
  • Erick Erickson at Mar 25, 2013 at 3:22 pm
    The tlogs stay there to provide "peer sync" on the last 100 docs. Say
    a node somehow gets out of sync. There are two options:
    1> replay from the log
    2> replicate the entire index.

    To avoid 2> if possible, the tlog is kept around. In your case, all your
    data is put in a single tlog file, so the "keep the last 100 docs available"
    rule means you'll keep the entire log for that run until the _next_
    run completes, at which point I'd expect the oldest one to be deleted.
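The retention rule described above can be illustrated with a small Python model. This is a simplified sketch, not Solr's actual code: the function name `retained_tlogs` and the parameter names are made up for illustration, and the limits (keep at least 100 records, at most 10 old logs) are assumed defaults.

```python
from collections import deque


def retained_tlogs(docs_per_commit, num_records_to_keep=100, max_logs_to_keep=10):
    """Model which tlogs remain on disk after a series of hard commits.

    docs_per_commit: number of documents written to each tlog, oldest first.
    Returns the sizes of the tlogs still retained, oldest first.
    """
    logs = deque()  # oldest log on the left
    for n_docs in docs_per_commit:
        logs.append(n_docs)  # the hard commit closes this tlog
        # Drop the oldest log while the newer logs alone still hold enough
        # records, or while there are more logs than the cap allows.
        while len(logs) > 1 and (
            len(logs) > max_logs_to_keep
            or sum(logs) - logs[0] >= num_records_to_keep
        ):
            logs.popleft()
    return list(logs)


# One hard commit per nightly run of 1.5 million documents.
print(retained_tlogs([1_500_000]))             # after the first run
print(retained_tlogs([1_500_000, 1_500_000]))  # after the second run
```

In this model the giant tlog from one night is only dropped once the next night's log has been closed by its own commit, which matches the behaviour reported in the question.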

    Best
    Erick

  • Niran Fajemisin at Mar 25, 2013 at 7:37 pm
    Thanks Erick and Michael for the prompt responses.

    Cheers,
    Niran


    ________________________________
    From: Erick Erickson <erickerickson@gmail.com>
    To: solr-user@lucene.apache.org
    Sent: Monday, March 25, 2013 10:21 AM
    Subject: Re: Tlog File not removed after hard commit

  • Shawn Heisey at Mar 26, 2013 at 7:51 pm

    On 3/24/2013 10:02 AM, Niran Fajemisin wrote:
    > We have noticed that the tlog file appears to linger around even after DIH has issued the hard commit. My expectation would be that after the hard commit has occurred, the tlog file will be removed.

    You've already gotten the reason for the giant tlog hanging around.

    The way to actually fix this problem is to turn on autoCommit with one
    of the values set relatively low. The key to enabling autoCommit
    without changing anything about how your import process works is this:
    make sure that openSearcher is set to false in the autoCommit:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>25000</maxDocs>
        <maxTime>300000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <updateLog />
    </updateHandler>

    I make maxDocs low rather than maxTime, but that's up to you. Each hard
    commit done by autoCommit will create a new tlog, and each tlog will be
    fairly small. Only a few of them will be kept around, so the disk space
    requirement will be small, and restarting Solr will be fast because
    there won't be a lot of data to replay.
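To put rough numbers on this, here is a small Python sketch of how the nightly import would be split across tlogs with the settings above. It models only the `maxDocs` trigger and ignores `maxTime`, so the real count could be higher if commits also fire on time; the function name is hypothetical.

```python
def tlog_sizes_with_autocommit(total_docs, max_docs):
    """Return the document count of each tlog when a hard commit
    (and therefore a new tlog) is triggered every max_docs documents."""
    full_logs, remainder = divmod(total_docs, max_docs)
    sizes = [max_docs] * full_logs
    if remainder:
        # The leftover documents are closed out by DIH's final explicit commit.
        sizes.append(remainder)
    return sizes


sizes = tlog_sizes_with_autocommit(1_500_000, 25_000)
print(len(sizes), max(sizes))
```

For 1.5 million documents and `maxDocs` of 25000, this gives 60 tlogs of at most 25000 documents each over the course of the import, rather than one tlog holding all 1.5 million.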

    With openSearcher set to false, there will be NO changes in document
    visibility. Searches will continue using the old searcher, so the old
    documents will still be there and the new documents will NOT be
    searchable until DIH does its explicit commit at the end.

    The one thing that I'm not sure about is what happens if Solr or the
    machine crashes in the middle of the import. Complete rollback might
    not be possible. Someone with better knowledge may have to comment there.

    Thanks,
    Shawn
  • Erick Erickson at Mar 27, 2013 at 12:45 am
    Shawn:

    If you do hard commits, no matter what the openSearcher value is, and the
    machine crashes, then when it comes back up you'll see those commits.

    How I'd approach it if I absolutely _had_ to do a complete rollback would
    be something like forcing a replication to a dedicated machine before the
    import; then I'd have a backup I could restore if things crashed.

    But most likely I'd say "we shouldn't worry about this because if our
    hardware is that flaky we have bigger problems"...

    Best
    Erick

