Hi all,

First of all, sorry for my poor English. It's not my native language.

I'm trying to use Lucene to index hierarchical information: I have
structured HTML and PDF/Word documents, and I want to index them so that I
can search in titles, text, paragraphs or tables only, or in any
combination of these. At the moment I see 3 possible
solutions:

- Create a field for every possible element type (contents, title, heading,
  table, etc.) and index the data into each of them accordingly. Possible impacts:
  - a large number of fields
  - data duplication (a search in paragraphs must also look inside all
    their inner elements, so every outer element indexed will contain the
    content of all its inner elements as well)
- Create a hierarchy of fields, like "title", "paragraph/title",
  "paragraph/title/subparagraph/table". Possible impacts:
  - the number of fields remains the same
  - a loose set of fields (not consistent)
  - I'm not sure how I would process the required information and
    perform searches.
  - Performance issues?
- Use one field for content and just add a location prefix to the content,
  for example "contents:paragraph/heading:token1 token2". Here
  "paragraph/heading:" is used as an additional information prefix, so I
  could (possibly?) reuse the PrefixQuery functionality or something similar. Impacts:
  - a fixed (small) set of index fields
  - additional information processing: all the queries I use will
    have to work as a PrefixQuery
  - Performance issues?
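
To make the third option more concrete, here is a minimal sketch of location-prefixed terms in a single field. It is plain Python with an in-memory dictionary standing in for the index, not Lucene code; the names prefixed_tokens and TinyIndex and the sample paths are assumptions made up for illustration.

```python
from collections import defaultdict

def prefixed_tokens(path, text):
    """Turn 'token1 token2' under 'paragraph/heading' into location-prefixed terms."""
    return [f"{path}:{tok.lower()}" for tok in text.split()]

class TinyIndex:
    """In-memory stand-in for a single 'contents' field (not Lucene itself)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, path, text):
        for term in prefixed_tokens(path, text):
            self.postings[term].add(doc_id)

    def search(self, path_prefix, word):
        """Match `word` in any element whose path starts with `path_prefix`."""
        suffix = ":" + word.lower()
        hits = set()
        # Every term is inspected: the path is a prefix, but the word is a suffix.
        for term, docs in self.postings.items():
            if term.startswith(path_prefix) and term.endswith(suffix):
                hits |= docs
        return hits

index = TinyIndex()
index.add("doc1", "paragraph/heading", "token1 token2")
index.add("doc1", "paragraph/table", "token3")
index.add("doc2", "title", "token1")

print(sorted(index.search("paragraph", "token1")))  # ['doc1']
print(sorted(index.search("", "token1")))           # ['doc1', 'doc2']
```

Note that a plain prefix match on "paragraph" alone would return every token inside paragraphs. Finding one particular word under a path prefix means checking both ends of each term, which is probably where the performance worry in the third option comes from.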


So, has anyone tried to make things work like that? Or am I trying to use a
wrench to hammer in nails? I assume Lucene wasn't designed to be used like
that, but it's worth trying (or at least asking).
Any results / suggestions are welcome!

--
Best regards,
Leonid Maslov!
Adrienne Gusoff - "Opportunity knocked. My doorman threw him out."


  • Tom at Sep 1, 2008 at 7:27 am
    AUTOMATIC REPLY

    Tom Roberts is out of the office till 2nd September 2008.

    LUX reopens on 1st September 2008



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Leonid Maslov at Sep 1, 2008 at 1:56 pm
    Any comments or suggestions? Maybe I should rephrase my original message or
    describe it in more detail?
    I would really like to get a response if possible.

    Thanks a lot in advance!
  • Karsten F. at Sep 2, 2008 at 8:47 am
    Hi Leonid,

    What kind of query does your use case need?

    Complex scenario:
    You need all the hierarchical structure information in one query. This means
    you want to search with XPath in a real XML database (like: all documents
    with a subtitle XY that is directly followed by a table with
    the same column like ...).

    Normal scenario:
    You want to search only one part of your hierarchical information, like
    'Documents with word xy in title' and 'Documents with word xy in table'.

    I am not familiar with using Lucene in XML databases, but I can give advice
    for the "normal scenario":

    Take a look at the XML-aware search in XTF (
    http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
    ).
    The idea is to use one Lucene document for each section, with only two
    fields: "text" and "sectionType",
    and then to collect all hits belonging to one hierarchical unit (e.g. one
    HTML file) and collapse them into one representative hit.
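
    A rough sketch of the "one small document per section" idea, in plain Python rather than Lucene. The "file" key used to group hits, the sample texts and the search function are assumptions made up for illustration; only "text" and "sectionType" come from the description above.

```python
from collections import defaultdict

# Each dict stands in for one small Lucene document describing one section.
# The "file" key is a hypothetical extra field used here to group hits.
sections = [
    {"file": "a.html", "sectionType": "title", "text": "lucene indexing basics"},
    {"file": "a.html", "sectionType": "table", "text": "field count and index size"},
    {"file": "b.html", "sectionType": "title", "text": "index size tuning"},
    {"file": "b.html", "sectionType": "paragraph", "text": "lucene field options"},
]

def search(word, section_types=None):
    """Return {file: number of matching sections}, one entry per source file."""
    hits_per_file = defaultdict(int)
    for section in sections:
        # Restrict to the requested section types, if any were given.
        if section_types and section["sectionType"] not in section_types:
            continue
        if word in section["text"].split():
            hits_per_file[section["file"]] += 1
    return dict(hits_per_file)

print(search("lucene", {"title"}))  # {'a.html': 1}
print(search("index"))              # {'a.html': 1, 'b.html': 1}
```

    Searching "titles only" or "tables only" then becomes a filter on sectionType, and several section hits from the same file show up as a single result.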

    Best regards
    Karsten


    --
    View this message in context: http://www.nabble.com/Newbie-question%3A-using-Lucene-to-index-hierarchical-information.-tp19250038p19266355.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


  • Leonid Maslov at Sep 4, 2008 at 7:31 pm
    Hi all,
    Thanks a lot for such a quick reply.

    Both scenarios sound very good to me. I would like to do my best and try to
    implement either of them (as a proof of concept) and then incrementally
    improve, retest, investigate and rewrite :)

    So, from the soap opera to the question part:

    - How can I implement those things (a and b) on the Lucene and Lucene
      contrib codebase?
    - I looked at
      http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
      and didn't like it (too big, too heavy, a ready-to-use solution instead of
      a toolkit). And I didn't understand how to implement the "Normal
      scenario" on top of it.
    - Any suggestions on how I could begin implementing these things, gently
      moving from the "Normal" scenario to the more advanced "Complex" one? What
      should I be afraid of, and what are the possible impacts, if any?

    Has anybody tried to use Lucene to analyse things like that? What would be
    possible solutions for storing the indexed data and performing queries on it? If
    Lucene isn't the right tool for this job, maybe some other toolkit would be
    more useful (possibly on top of Lucene)?

    Thanks in advance for any suggestions and comments. I would appreciate any
    ideas and directions to look into.


  • Karsten F. at Sep 8, 2008 at 9:26 pm
    Hi Leonid,

    Do you really need the "Complex scenario"?
    What kind of query does your use case need?

    If you really need XPath, please look at XML databases.

    Otherwise you can possibly use XTF out of the box, because "indexing of
    large structured documents" is exactly the use case for which XTF was
    developed (TEI documents, but HTML is less complex than TEI).
    Again, the main idea:
    1. Use XML elements (with their descendants) to divide the structured document
    into sections.
    2. Index each section as a Lucene document (field "text") with an extra field
    "sectionType".
    3. After all sections of one structured document, insert one (terminal)
    Lucene document with the other metadata of the structured document (e.g.
    creation date, title, ...).

    The document from point 3 is the representative of the structured document
    (and the representative is the unit of retrieval, because the user searches
    for a document, not for a section).
    If you search e.g. for
    sectionType:table text:words inside section
    you get hits for the Lucene documents belonging to the sections.

    For your use case it would possibly be enough to insert a stored Lucene
    field "document key".
    In XTF the Lucene document number of each hit is incremented until the
    representative is reached.
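
    The three steps and the "increment until the representative is reached" trick can be modelled roughly like this. It is a plain Python list standing in for the index, not XTF or Lucene code; the "isRepresentative" marker, the sample documents and the function names are assumptions made up for illustration.

```python
# Index order matters: the sections of a file come first, then one terminal
# document with its metadata. "isRepresentative" is a made-up marker field.
index = [
    {"sectionType": "title", "text": "annual report"},         # doc 0
    {"sectionType": "table", "text": "revenue by region"},     # doc 1
    {"isRepresentative": True, "title": "Annual report"},      # doc 2
    {"sectionType": "paragraph", "text": "revenue forecast"},  # doc 3
    {"sectionType": "table", "text": "headcount by region"},   # doc 4
    {"isRepresentative": True, "title": "Staff plan"},         # doc 5
]

def section_hits(section_type, word):
    """Document numbers of sections of the given type containing `word`."""
    return [
        n for n, doc in enumerate(index)
        if doc.get("sectionType") == section_type and word in doc["text"].split()
    ]

def representative_of(doc_number):
    """Step forward from a section hit until the terminal document is reached."""
    while not index[doc_number].get("isRepresentative"):
        doc_number += 1
    return doc_number

def search_documents(section_type, word):
    """Collapse section hits to the distinct structured documents they belong to."""
    reps = sorted({representative_of(n) for n in section_hits(section_type, word)})
    return [index[n]["title"] for n in reps]

print(search_documents("table", "region"))   # ['Annual report', 'Staff plan']
print(search_documents("table", "revenue"))  # ['Annual report']
```

    This only works while the sections and their representative stay next to each other in index order. A stored "document key" field on every section avoids that dependency, at the cost of one extra field per section.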

    This is a rough description, but the source code of XTF is very readable.

    best regards

    Karsten



  • Leonid M. at Sep 10, 2008 at 9:12 am
    Hi Karsten,
    Thanks a lot. I have finally got your idea.

    OK, I think it's time to do the real work now :) Thanks for the advice;
    I finally understand the directions I could take.

    > do you really need the "Complex scenario"?
    > what kind of query is your use case?

    My query use case is something like this: find documents whose paragraphs are
    similar to this document's paragraphs, or to one paragraph or part of it
    (using n-grams or similar/modified tokenizers and stemming/NLP-like similarity).
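
    As an illustration of what n-gram based paragraph similarity could look like, here is a minimal sketch in plain Python. Character trigrams and Jaccard overlap are assumptions chosen for brevity, not something Lucene or XTF does out of the box, and all names and sample texts are made up.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}

def jaccard(a, b):
    """Overlap of two n-gram sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(query, paragraphs, n=3):
    """Rank (score, paragraph) pairs by n-gram overlap with `query`, best first."""
    q = char_ngrams(query, n)
    scored = [(jaccard(q, char_ngrams(p, n)), p) for p in paragraphs]
    return sorted(scored, reverse=True)

paragraphs = [
    "Lucene indexes documents as sets of fields.",
    "The weather was pleasant all week.",
]
best_score, best = most_similar("Lucene indexes a document as a set of fields", paragraphs)[0]
print(best)
```

    With one index entry per paragraph (as in the section-per-document layout), each paragraph can be scored on its own and the best hits mapped back to their documents.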

    I finally understood the idea behind the XML-based approach. I think an
    XML-based approach isn't suitable for me anyway, for these reasons:

    - DB support (MSSQL and Oracle or some Java ad-hoc solutions)
    - Speed of XPath-like queries on big datasets.

    So I assume the variant you recommend suits me best.
    However, it's hard to understand what XTF does just by opening its source
    code while being a newbie in Lucene. But what has to be done has to be done;
    no one will do my job for me anyway. :))

    I'll try to make some time to dig into the XTF code. If something is unclear
    or questionable, I assume the XTF mailing list would be the right place to
    ask, rather than this one (java-user)?

    Thanks a lot for pointing out possible directions and solutions. I really
    appreciate your help and the time you spent providing such helpful
    descriptions. God bless the OSS community!

    --
    Best regards,
    Leonid Maslov!
    Personal blog: http://leonardinius.blogspot.com/

    Random thought:
    Marcel Marceau - "Never get a mime talking. He won't stop."
  • Marcelo Ochoa at Sep 10, 2008 at 11:22 am
    Hi Leonid,
    If you are not familiar with Oracle XMLDB schema mappings, here is an
    example of how to store Wikipedia XML dumps in an Oracle database, but
    using an XML-to-relational model:
    http://marceloochoa.blogspot.com/2007/12/uploading-wikipedia-dumps-to-oracle.html
    The structure of Wikipedia dumps seems to be similar to your data
    model, so if you are using Oracle you can use this example as a jump
    start for efficiently mapping XML inside Oracle.
    There is also an example of how to index it with Lucene running as
    a new Domain Index for Oracle databases, to get the best of
    both worlds :) Lucene for efficient free-text searching, and the
    relational DB to quickly sort and filter relational data.
    Best regards, Marcelo.


    --
    Marcelo F. Ochoa
    http://marceloochoa.blogspot.com/
    http://marcelo.ochoa.googlepages.com/home
    ______________
    Do you Know DBPrism? Look @ DB Prism's Web Site
    http://www.dbprism.com.ar/index.html
    More info?
    Chapter 17 of the book "Programming the Oracle Database using Java &
    Web Services"
    http://www.amazon.com/gp/product/1555583296/
    Chapter 21 of the book "Professional XML Databases" - Wrox Press
    http://www.amazon.com/gp/product/1861003587/
    Chapter 8 of the book "Oracle & Open Source" - O'Reilly
    http://www.oreilly.com/catalog/oracleopen/

Discussion Overview
group: java-user
categories: lucene
posted: Sep 1, '08 at 7:26a
active: Sep 10, '08 at 11:22a
posts: 8
users: 4
website: lucene.apache.org
