[Solr-user] Testing Solr4 - first impressions and problems

Shawn Heisey
Oct 14, 2012 at 8:11 pm
Please see my other thread called "Testing Solr4 - reference thread" for
general information about my config layout. If more specific information
is required, please let me know.

So far I cannot get a solr.war built without slf4j bindings to work
right. There does not seem to be any centrally configured directory I
can use for the slf4j and log4j jars. I am hesitant to use a lib entry
in solrconfig.xml, because I actually have three distinct solrconfig.xml
files and each server has 16 cores that symlink to those files. I can
have each instanceDir contain a symlink to a more central lib directory,
but I don't want each core to have its own copy of those jars loaded
into memory unless it's the only way to make it work. If anyone knows
how to make this work properly, let me know. If the instanceDir symlink
option is the only way, I will probably file an issue in Jira.
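To make the goal concrete, here is the kind of thing I'm hoping exists -- a
single shared jar directory declared once in solr.xml (sketch only; core
names and paths here are made up, and I haven't verified this attribute
solves the logging-jar case):

```xml
<!-- solr.xml sketch (unverified for this use case): sharedLib points
     every core at one jar directory, resolved relative to solr.home,
     so each jar is loaded once by a shared classloader instead of
     once per core. Core names below are illustrative. -->
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="s0build" instanceDir="cores/s0build"/>
    <core name="s0live" instanceDir="cores/s0live"/>
    <!-- ...remaining cores... -->
  </cores>
</solr>
```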

If the updateLog is turned on (I did add _version_ to my schema), doing
a full reindex (using DIH) leads to "out of memory" exceptions, and the
transaction log takes up the same amount of disk space (in a single log
file) as the partially built index. Based on the index progress before
it died, performance is terrible -- about one third the pace of Solr
3.5.0, perhaps less.

After I turned off updateLog, performance went way up and it was able to
complete without error. I think it is actually faster than it was under
3.5.0 with the exact same DIH config, as long as updateLog is turned
off. I haven't done enough testing to file an issue yet. Are there
ways to split the transaction log into multiple files and control how
much disk space the log uses? Can I do anything to increase performance?

For relative paths, instanceDir is relative to solr.home, dataDir is
relative to instanceDir, and if you are using symlinks for
solrconfig.xml, xinclude directives are relative to the symlink
location, not the real file location. These seem like reasonable
defaults to me. Is this what I should expect for the future, or should
I be filing an issue?
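A small example of the symlink/xinclude interaction I'm describing (all
paths and file names illustrative, not my actual layout):

```xml
<!-- Suppose conf/solrconfig.xml is a symlink pointing at a shared real
     file elsewhere on disk. This href is resolved relative to the
     symlink's directory (the core's conf/ dir), not relative to the
     directory containing the real file. -->
<config xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="shared/handlers.xml"/>
</config>
```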

Thanks,
Shawn

9 responses

  • Erick Erickson at Oct 14, 2012 at 11:45 pm
    About your second point. Try committing more often with openSearcher
    set to false.
    There's a bit here:
    http://wiki.apache.org/solr/SolrConfigXml

    <autoCommit>
      <maxDocs>10000</maxDocs> <!-- maximum uncommitted docs before
      an autocommit is triggered -->
      <maxTime>15000</maxTime> <!-- maximum time (in ms) after adding
      a doc before an autocommit is triggered -->
      <openSearcher>false</openSearcher> <!-- Solr 4.0. Optionally
      don't open a searcher on hard commit. This is useful to minimize the
      size of transaction logs that keep track of uncommitted updates. -->
    </autoCommit>


    That should keep the size of the transaction log down to reasonable levels...

    Best
    Erick
  • Alexandre Rafalovitch at Oct 15, 2012 at 3:43 am
    Do these settings apply to DIH? The example linked seems to refer to
    updateHandler, but I am not sure how/whether that affects DIH.

    Regards,
    Alex.
    P.S. I was also having OOMs on large DIH imports.

    Personal blog: http://blog.outerthoughts.com/
    LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
    - Time is the quality of nature that keeps events from happening all
    at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
    book)

  • Shawn Heisey at Oct 15, 2012 at 6:04 am

    On 10/14/2012 5:45 PM, Erick Erickson wrote:
    That should keep the size of the transaction log down to reasonable levels...
    I have autocommit turned completely off -- both values set to zero. The
    DIH import from MySQL, over 12 million rows per shard, is done in one go
    on all my build cores at once, then I swap cores. It takes a little
    over three hours and produces a 22GB index. I have batchSize set to -1
    so that JDBC streams the records.
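    The dataSource line in question looks roughly like this (connection
    details changed/illustrative):

    ```xml
    <!-- data-config.xml sketch: batchSize="-1" is the DIH convention that
         makes the MySQL JDBC driver stream rows instead of buffering the
         whole multi-million-row result set in memory. -->
    <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost:3306/mydb"
                user="solr" password="..."
                batchSize="-1"/>
    ```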

    When I first set this up back on 1.4.1, I had some kind of severe
    problem when autocommit was turned on. I can no longer remember what it
    caused, but it was a huge showstopper of some kind.

    Thanks,
    Shawn
  • Alan Woodward at Oct 15, 2012 at 8:40 am
    Hi Shawn,

    The transaction log is only being used to support near-real-time search at the moment, I think, so it sounds like it's surplus to requirements for your use-case. I'd just turn it off.
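    Concretely (a sketch against the stock solrconfig.xml, so check your own
    copy), turning it off just means leaving the updateLog element out of the
    updateHandler:

    ```xml
    <!-- updateHandler with no <updateLog> child: no transaction log is
         written. To re-enable later, add back something like:
           <updateLog>
             <str name="dir">${solr.ulog.dir:}</str>
           </updateLog>
    -->
    <updateHandler class="solr.DirectUpdateHandler2">
    </updateHandler>
    ```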

    Alan Woodward
    www.romseysoftware.co.uk
  • Chris Hostetter at Oct 15, 2012 at 8:52 pm
    : I have autocommit turned completely off -- both values set to zero. The DIH
    ...
    : When I first set this up back on 1.4.1, I had some kind of severe problem when
    : autocommit was turned on. I can no longer remember what it caused, but it was
    : a huge showstopper of some kind.

    The key question about using autocommit is whether or not you use
    "openSearcher" with it and whether you have the updateLog turned on.

    As I understand it: if you don't care about real time get, or transaction
    recovery of "uncommitted documents" on hard crash, or any of the Solr Cloud
    features, then you don't need the updateLog -- and you shouldn't add it to
    your existing configs when upgrading to Solr4. Any existing usage (or
    non-usage) you had of autocommit should continue to work fine.

    If you *do* care about things that require the updateLog, then you want to
    ensure that you are doing "hard commits" (i.e. persisting the index to
    disk) relatively frequently in order to keep the size of the updateLog
    from growing w/o bound -- but in Solr 4, doing a hard commit no longer
    requires that you open a new searcher. Opening a new searcher and
    dealing with the cache loading is one of the main reasons people typically
    avoided autoCommit in the past.

    So if you look at the Solr 4 example: it uses the updateLog combined with
    a 15 second autoCommit that has openSearcher=false -- meaning that the
    autocommit logic is ensuring that anytime the index has modifications they
    are written to disk every 15 seconds, but the new documents aren't exposed
    to search clients as a result of those autocommits, and if a client uses
    real time get, or if there is a hard crash, the uncommitted docs are
    still available in the updateLog.

    For your usecase and upgrade: don't add the updateLog to your configs, and
    don't add autocommit to your configs, and things should work fine. If you
    decide you want to start using something that requires the updateLog, you
    should probably add a short autoCommit with openSearcher=false.
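    If you do eventually need it, the pairing from the Solr 4 example config
    looks roughly like this (from memory -- check the shipped solrconfig.xml
    for the exact defaults):

    ```xml
    <!-- updateLog plus a hard autoCommit that does not open a searcher:
         modifications hit disk every 15 seconds, but nothing becomes
         visible to search clients because of these commits. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      <autoCommit>
        <maxTime>15000</maxTime>           <!-- hard commit every 15s -->
        <openSearcher>false</openSearcher> <!-- don't expose new docs -->
      </autoCommit>
    </updateHandler>
    ```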


    -Hoss
  • Shawn Heisey at Oct 15, 2012 at 9:38 pm

    On 10/15/2012 2:51 PM, Chris Hostetter wrote:
    For your usecase and upgrade: don't add the updateLog to your configs, and
    don't add autocommit to your configs, and things should work fine. If you
    decide you want to start using something that requires the updateLog, you
    should probably add a short autoCommit with openSearcher=false.
    Thank you for your answer. Using updateLog seems to have another
    downside -- a huge hit to performance. It wouldn't be terrible on
    incremental updates. These happen once a minute and normally complete
    extremely quickly - less than a second, followed by a commit that may
    take 2-3 seconds. If it took 5-10 seconds instead of 3, that's not too
    bad. But when you are expecting a process to take three hours and it
    actually takes 8-10 hours, it's another story.

    Shawn
  • Shawn Heisey at Oct 16, 2012 at 2:09 pm

    On 10/15/2012 3:37 PM, Shawn Heisey wrote:
    Using updateLog seems to have another downside -- a huge hit to
    performance. [...] But when you are expecting a process to take three
    hours and it actually takes 8-10 hours, it's another story.
    Could we create an option that would allow turning updateLog off for an
    update request? To be useful to me, it would have to be something that
    could also be specified in a dataimporthandler request. That way I
    could do a full import with no log (for performance), but then when
    SolrJ maintains the index, logging would be enabled.

    Thanks,
    Shawn
  • Tomás Fernández Löbbe at Oct 16, 2012 at 2:49 pm
    Shawn, you should create a Jira for that. Maybe it could be
    programmatically activated/deactivated.

    Alan, make sure you don't confuse "near real time" with "Realtime get". As
    Hoss said, you don't need the transaction log unless you need Realtime Get
    or recovery of uncommitted docs (or Solr Cloud, which uses those things).
    You CAN use NRT without a transaction log.
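    As a sketch (values illustrative), NRT via soft commits with no
    transaction log at all can look like:

    ```xml
    <!-- No <updateLog> element, so no transaction log is written; soft
         commits still make new documents searchable quickly, without the
         cost of a hard commit. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoSoftCommit>
        <maxTime>1000</maxTime> <!-- new docs visible within ~1 second -->
      </autoSoftCommit>
    </updateHandler>
    ```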

    Tomás

  • Shawn Heisey at Oct 16, 2012 at 4:20 pm

    On 10/16/2012 8:48 AM, Tomás Fernández Löbbe wrote:
    Shawn, you should create a Jira for that. Maybe it could be
    programmatically activated/deactivated.
    I filed SOLR-3954. It's been a busy day in Jira for me. :)

    Recovery of uncommitted docs is my primary motivation for updateLog. If
    it works completely as I hope it will, I won't have to worry about the
    difference between hard and soft commits in my SolrJ application; it can
    assume that anything that's successfully soft committed will be applied
    to the index the next time it gets restarted. If I'm wrong about this, I
    would like to know now, before I begin designing around soft commits.

    Thanks,
    Shawn
