Hi,

I'm using SvnQuery which is based on Lucene.Net to index and search my
SVN repositories. I've noticed that text doesn't get indexed in large
files. Actually the first 2000-2500 lines get indexed and the rest do
not. Is anyone aware of this problem? Is there a solution for it?

Thanks,
Tiberiu


  • Ben Martz at Oct 29, 2010 at 5:31 pm
    Is it possible that you haven't overridden the default max field length setting?

    http://wiki.apache.org/lucene-java/LuceneFAQ

    "Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors. See IndexWriter.setMaxFieldLength(int) <http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength%28int%29>."
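
To make the effect of that limit concrete, here is a minimal sketch in plain Python. It is not Lucene or Lucene.Net code: the function name `index_terms` and the regex tokenizer are assumptions for illustration only. It just shows how a cap on the number of terms makes everything after the cap unsearchable.

```python
import re

def index_terms(text, max_field_length=10_000):
    """Return the set of terms a toy index would hold if only the first
    max_field_length tokens were kept (None means no limit).
    Hypothetical helper; the tokenizer is a simple regex, not a Lucene analyzer."""
    tokens = re.findall(r"\w+", text.lower())
    if max_field_length is not None:
        tokens = tokens[:max_field_length]
    return set(tokens)

# A document of 12,000 filler tokens followed by one unique term.
doc = "filler " * 12_000 + "needle"

print("needle" in index_terms(doc))                         # False: past the cap
print("needle" in index_terms(doc, max_field_length=None))  # True: no cap
```

The term at position 12,001 is only found once the cap is raised or removed, which matches the symptom of text near the end of a large file not being searchable.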

  • Franklin Simmons at Oct 29, 2010 at 5:32 pm
    Tiberiu,

    Check your IndexWriter's MaxFieldLength. The default is 10000.

    -----Original Message-----
    From: Tiberiu Motoc
    Sent: Friday, October 29, 2010 1:23 PM
    To: [email protected]
    Subject: indexing doesn't seem to work in large files

  • Tiberiu Motoc at Oct 29, 2010 at 9:22 pm
    Thanks Ben and Franklin,

    I tried it and unfortunately it didn't work. SvnQuery has it set to
    50,000. I tried setting it to 60,000 and 500,000 and it still doesn't
    work: the text that I'm looking for is not indexed and not found.
    If I do a word-count in MS Word around the break-off point (where the
    indexing seems to stop) I get a count of 8,450 words and 75,000
    characters (with no spaces). It kinda makes me think that somehow the
    setting for the MaxFieldLength might not work.
    I also noticed that SvnQuery is using Lucene.NET v2.3.1.3. I see there
    are newer versions of Lucene.NET available. Do you know of any bugs
    in v2.3.1.3 that would cause this? Should I recompile SvnQuery with
    the latest version of Lucene.NET?
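
One caveat about the word count: a word processor counts whitespace-separated words, while an index counts the tokens its analyzer produces, and for source code the two can differ a lot. The sketch below is plain Python, not SvnQuery or Lucene.Net code. The regex tokenizer and the helper names (`count_words`, `count_tokens`, `tokens_before`) are assumptions for illustration; the analyzer SvnQuery actually uses may split text differently.

```python
import re

def count_words(text):
    """Whitespace-separated words, roughly what a word processor reports."""
    return len(text.split())

def count_tokens(text):
    """Tokens from a simple letters-and-digits tokenizer (an assumption,
    not the analyzer SvnQuery actually uses)."""
    return len(re.findall(r"[A-Za-z0-9]+", text))

def tokens_before(text, marker):
    """Number of tokens before the first occurrence of marker,
    or None if marker does not occur in text."""
    idx = text.find(marker)
    if idx < 0:
        return None
    return count_tokens(text[:idx])

line = "result = obj.getValue(index[i+1]) // see foo_bar.h"
print(count_words(line))   # 6
print(count_tokens(line))  # 10
```

Running `tokens_before` on the real file, with the first string that is no longer found as the marker, gives the break-off point in tokens. If that number sits close to a configured limit, a term cap is the likely cause; if not, something else is cutting the document off.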

    Thanks,
    Tiberiu

  • Aaron Powell at Oct 30, 2010 at 12:18 am
    I'd suggest upgrading it to work with 2.9.2 of Lucene.Net.

    What exactly are you indexing, code files or plain text documents?
    Aaron Powell
    Umbraco Ninja

    http://www.aaron-powell.com | http://twitter.com/slace | Skype:
    aaron.l.powell | MSN: [email protected]

  • Ben Martz at Oct 30, 2010 at 1:12 am
    I am using MaxFieldLength.UNLIMITED successfully in my own product running with Lucene.Net 2.9.2 and can definitely index huge documents without an issue (given enough RAM anyways).

    Regarding the possible SvnQuery-specific issues:

    1. Have you verified that any portion of the document is actually being indexed? I noticed that SvnIndex.Indexer.FetchJob selects repository items based on a default maximum file size of 1MB (MaxDocumentSize).

    2. Have you tried changing the MaxNumberOfTermsPerDocument constant in SvnIndex\Indexer.cs from 50000 to IndexWriter.MaxFieldLength.UNLIMITED? I noticed that MaxNumberOfTermsPerDocument and MaxDocumentSize were both added in r237, and that MaxNumberOfTermsPerDocument is used in two places in Indexer.

    3. Is the test query that you're using simple enough to not result in a search failure because of an analyzer issue?
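
The three checks above can be sketched as one small diagnostic. This is plain Python and does not call SvnQuery or Lucene.Net. The function name `diagnose`, the regex tokenizer, and reading "1MB" as 1024 * 1024 bytes are all assumptions; only the two default values (1MB and 50000) come from the points above.

```python
import re

MAX_DOCUMENT_SIZE = 1024 * 1024  # "1MB" from point 1; exact byte count is an assumption
MAX_TERMS = 50_000               # the 50000 term limit from point 2

def diagnose(content: bytes, term: str,
             max_size=MAX_DOCUMENT_SIZE, max_terms=MAX_TERMS):
    """Return a list of possible reasons why term would not be found.
    Hypothetical helper: the tokenizer is a plain regex, not a real analyzer."""
    findings = []
    # Check 1: a file over the size limit may never be fetched for indexing.
    if len(content) > max_size:
        findings.append("file exceeds the size limit and may be skipped entirely")
    tokens = re.findall(r"\w+", content.decode("utf-8", errors="replace").lower())
    wanted = term.lower()
    if wanted not in tokens:
        # Check 3: the tokenizer never emits the term as typed in the query.
        findings.append("term is not produced by this tokenizer; check the analyzer or the query")
    elif tokens.index(wanted) >= max_terms:
        # Check 2: the term first occurs after the per-document term limit.
        findings.append("term first appears past the term limit")
    return findings

print(diagnose(b"hello needle", "needle"))  # []
```

An empty list means none of the three causes applies under these assumptions, which would point toward the indexer or library version rather than the configured limits.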

    If you get stuck and if the document in question can be disclosed, even privately, I would be happy to throw it at my Lucene.Net implementation (just straight Lucene, not SvnQuery) and run some queries for you if that would help.

    Cheers,
    Ben

  • Tiberiu Motoc at Oct 31, 2010 at 5:36 pm
    Thanks so much for all the suggestions!

    I did try to change the MaxNumberOfTermsPerDocument to int.MaxValue
    and that did not work. I also tried compiling SvnQuery with
    Lucene.NET 2.9.2 but I ran into too much trouble. I started fixing
    some of the incompatibilities, but gave up after a while. I'll
    continue doing it on the side for a while, but unfortunately I'm
    running out of time with my eval of SvnQuery. I received a reply in
    the SvnQuery newsgroup saying this problem might be fixed in the next
    version, so that's good news.

    Thanks again for all the quick and helpful replies.
    Tiberiu

Discussion Overview
group: lucene-net-user @
categories: lucene
posted: Oct 29, '10 at 5:23p
active: Oct 31, '10 at 5:36p
posts: 7
users: 4
website: lucene.apache.org
