[Lucene.Net] MultiSearcher & duplicate IDs
Sorry to keep posting questions like this; with so many tasks on the go, I
still haven't had the time to sit down and research Lucene fully. One of
these days!

Development of a piece of software I'm working on has hit a small snag.

We currently have the following layout for our back-end data for our
projects.

Project Folder\database
Project Folder\Content index
Project Folder\Sub Folder\File Info index

The database is just for quick reads, file counts, and other information.
In the database we have a file table with the following layout

FileId
FileName
FilePath
<.... additional info and flags>

The content index contains

<FileId>
<contents of a file, indexed but not stored>

The File Info index contains
<FileId>
<FileName>
<FilePath>
<File MetaData>
<... other file related data>

The reason we have 2 indexes (is that the proper plural?) is that we
don't store the contents field, so if we want to change info regarding
a file in the index, we can't re-create the row from the existing
data. We'd have to re-extract the information from the file (which can
be very time consuming), but we can easily re-create the FileInfo index row.
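
For concreteness, documents in the two indexes are written roughly like this
(a sketch against a Lucene.Net 2.x-era API; the variable names, writer setup,
and Store/Index flags are illustrative, not our actual code):

// Contents index: FileId stored, file contents indexed but not stored.
Document contentDoc = new Document();
contentDoc.Add(new Field("FileId", fileId, Field.Store.YES, Field.Index.NOT_ANALYZED));
contentDoc.Add(new Field("contents", extractedText, Field.Store.NO, Field.Index.ANALYZED));
contentWriter.AddDocument(contentDoc);

// FileInfo index: everything stored, so a document can be rebuilt cheaply.
Document infoDoc = new Document();
infoDoc.Add(new Field("FileId", fileId, Field.Store.YES, Field.Index.NOT_ANALYZED));
infoDoc.Add(new Field("FileName", fileName, Field.Store.YES, Field.Index.NOT_ANALYZED));
infoDoc.Add(new Field("FilePath", filePath, Field.Store.YES, Field.Index.NOT_ANALYZED));
infoWriter.AddDocument(infoDoc);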

We thought that using a MultiSearcher was the best way to do a combined
search between the FileInfo index and the contents index. It worked
like a charm until we started searching across both indexes in a single query.

When we use a MultiSearcher and search for, for example,
"FileName:test.txt AND contents:eml", we end up with a Hits object
containing duplicate FileId entries. This is because the hit from the
FileInfo index and the hit from the Contents index are both returned.
So 1 file gives 2 entries, 1 for each index.
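
Roughly what the search side looks like (again a sketch with a Lucene.Net
2.x-era Hits API; paths and variable names are placeholders):

// One IndexSearcher per index, combined through a MultiSearcher.
IndexSearcher infoSearcher = new IndexSearcher(fileInfoIndexPath);     // FileInfo index
IndexSearcher contentSearcher = new IndexSearcher(contentIndexPath);   // Content index
MultiSearcher searcher = new MultiSearcher(new Searchable[] { infoSearcher, contentSearcher });

// "FileName:test.txt AND contents:eml" built up by hand.
BooleanQuery query = new BooleanQuery();
query.Add(new TermQuery(new Term("FileName", "test.txt")), BooleanClause.Occur.MUST);
query.Add(new TermQuery(new Term("contents", "eml")), BooleanClause.Occur.MUST);

// This is where we end up with one Hits entry per index for the same FileId.
Hits hits = searcher.Search(query);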

Is there a way around this without looping through the entire Hits
collection and making my own collection of IDs?

Thanks in advance.

Trevor Watson


  • Franklin Simmons at Jun 21, 2011 at 7:23 pm
    Since you say it is possible to weed out redundant hits as a post-process, you should be able to solve the problem with a custom Lucene.Net.Search.Filter.
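
    Something along these lines for the post-process, as a rough, untested
    sketch (it assumes FileId is stored as above, that searcher and query are
    the MultiSearcher and query from your message, and the usual
    System.Collections.Generic / Lucene.Net.Documents / Lucene.Net.Search usings):

    // Collapse hits that share a FileId; keep the first occurrence only.
    Hits hits = searcher.Search(query);
    HashSet<string> seen = new HashSet<string>();
    List<string> uniqueFileIds = new List<string>();

    for (int i = 0; i < hits.Length(); i++)
    {
        Document doc = hits.Doc(i);
        string fileId = doc.Get("FileId");
        if (seen.Add(fileId))    // Add() returns false if this FileId was already seen
            uniqueFileIds.Add(fileId);
    }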

  • Digy at Jun 21, 2011 at 8:17 pm

    On Jun 21, 2011, Trevor Watson wrote:
    The reason we have 2 indexes (is that the proper plural?) is that we don't
    store the contents field, so if we want to change info regarding a file in
    the index, we can't re-create the row from the existing data.



    If this is the real problem, you can construct a document roughly equal to
    the original one (assuming the field was indexed with
    Field.TermVector.WITH_POSITIONS).



    DIGY



    using System.Text;
    using Lucene.Net.Index;

    // Rebuilds an approximation of the original text from a term vector
    // that was stored with positions.
    class TVM : TermVectorMapper
    {
        string[] terms;
        string text = null;

        public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
        {
            terms = new string[numTerms];
        }

        public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
        {
            foreach (int i in positions)
            {
                // Grow the array when a position falls outside the current bounds.
                if (terms.Length < i + 1)
                {
                    string[] temp = new string[(int)(terms.Length * 1.2)];
                    terms.CopyTo(temp, 0);
                    terms = temp;
                }
                terms[i] = term;
            }
        }

        public override string ToString()
        {
            if (text == null)
            {
                StringBuilder sb = new StringBuilder();
                foreach (string s in terms) sb.Append(s + " ");
                text = sb.ToString();
            }
            return text;
        }
    }

    ........

    TVM tvm = new TVM();
    reader.GetTermFreqVector(0, "text", tvm);
    string doc = tvm.ToString();
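
    For this to work, the field has to be indexed with a positional term
    vector in the first place; in your case that would be the "contents"
    field rather than the "text" field used in the snippet above, e.g.
    something like this (the Store/Index flags are just an example):

    doc.Add(new Field("contents", extractedText,
                      Field.Store.NO, Field.Index.ANALYZED,
                      Field.TermVector.WITH_POSITIONS));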

  • Trevor Watson at Jun 21, 2011 at 9:32 pm
    Wow, this is awesome; there's so much for me to learn yet.

    The only problem I'm having with this is the line that reads

    string[] temp = new string[(int)(terms.Length * 1.2)];

    I'm not sure where the 1.2 comes from, and running the code as it stands
    frequently resulted in an index-out-of-range error.

    However, I changed it to read

    string[] temp = new string[(int)(i + 1)];


    That probably isn't correct, but it seems to work. I might have to look
    at modifying the software to go back to using a single index again.
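
    (Purely a guess on my part, not something from DIGY's post: keeping the
    20% growth but never letting the new array be shorter than i + 1 would
    also avoid the error.)

    if (terms.Length < i + 1)
    {
        // Grow by ~20%, but always far enough to hold position i.
        int newSize = System.Math.Max(i + 1, (int)(terms.Length * 1.2));
        string[] temp = new string[newSize];
        terms.CopyTo(temp, 0);
        terms = temp;
    }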

    Thanks again DIGY.

    Trevor

