Hello all,

Is there a way to reduce the indexing time when the indexer is
indexing about 30,000+ files? It is taking roughly 6-7 hours to
do this. I am using the IndexHTML class to create the index from HTML files.

Another issue I see is that every once in a while I get the following
output on the screen.

adding ../31/1104852.html
Parse Aborted: Encountered "\"" at line 7, column 1.
Was expecting one of:
<ArgName> ...
"=" ...
<TagEnd> ...

Any suggestions on preventing this from happening?

Thanks in advance.
-H


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Stephane James Vaucher at Aug 25, 2004 at 10:01 pm
    I don't think that the demo parser is meant to be a production
    system component. You can look at Tidy or NekoHTML. They clean up your HTML
    and are probably optimised.

    sv
  • Hetan Shah at Aug 25, 2004 at 10:07 pm
    Do you have any pointers for sample code for them?
    Would highly appreciate it.
    Thanks.
    -H

  • Stephane James Vaucher at Aug 25, 2004 at 10:24 pm
    JGuru explanation:
    http://www.jguru.com/faq/view.jsp?EID=1074228

    I have no sample code for NekoHTML; I think Nutch uses it, though. For Tidy,
    you can look at the Ant contribution in the sandbox:

    http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup

    HTH,
    sv
  • Karthik N S at Aug 26, 2004 at 3:46 am
    Hi Hetan


    That is the major problem with non-standardized tags in the HTML documents
    you are indexing; they result in lag in the indexing process.

    If you can, tweak the HTMLParser.jj file under '/demo/html' in lucene.zip
    [you need some knowledge of JavaCC for this].



    Karthik

    -----Original Message-----
    From: Hetan Shah
    Sent: Thursday, August 26, 2004 3:01 AM
    To: Lucene Users List
    Subject: Time to index documents


  • Stephane James Vaucher at Aug 26, 2004 at 3:52 am
    Hetan,

    If you are using a corpus with multiple editors, I suggest that you
    use a cleaner like Tidy, as there might be weird stuff appearing in the
    HTML.

    sv
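
No sample code for a tolerant parser was posted in the thread. As a rough sketch of the general idea only, the example below uses the JDK's own lenient HTML parser (javax.swing.text.html.parser.ParserDelegator) to pull the text out of markup with unclosed tags. This is a stand-in, not Tidy, NekoHTML, or the Lucene demo parser. The names LenientHtmlText and extractText are made up for illustration, and whether this parser copes with the specific stray-quote error reported above is untested.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: class and method names are illustrative only.
public class LenientHtmlText {

    /** Collects the text chunks reported by the JDK's lenient HTML parser. */
    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text chunks with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // Third argument (ignoreCharSet = true): do not abort when the page
        // declares a character set different from the Reader's.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Deliberately sloppy markup: <p>, <b>, and <html> are never closed.
        String sloppy = "<html><body><p>Hello <b>world</body>";
        System.out.println(extractText(new StringReader(sloppy)));
    }
}
```

The returned string could then be handed to whatever code builds the document to be indexed. How much a different parser changes the overall indexing time is not something the thread establishes.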

Discussion Overview
group: java-user
categories: lucene
posted: Aug 25, '04 at 9:31p
active: Aug 26, '04 at 3:52a
posts: 6
users: 3
website: lucene.apache.org
