Hi All,

I have an application that has to count how many times a specific
regular expression matches in a particular field, for each document in an
indexed directory.

For example, let's say I have 2 documents in the directory and each document
has 3 fields: "table", "column" and "data".

Example docs:
//***************************************************************
Document doc1 = new Document();
doc1.add(new Field("table", "EMPLOYEE_US", Field.Store.NO,
        Field.Index.ANALYZED));
doc1.add(new Field("column", "F_NAME", Field.Store.NO,
        Field.Index.ANALYZED));
doc1.add(new Field("data", "Chris Hank Tony Cody Tom Tina Crystal",
        Field.Store.NO, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));

Document doc2 = new Document();
doc2.add(new Field("table", "EMPLOYEE_CA", Field.Store.NO,
        Field.Index.ANALYZED));
doc2.add(new Field("column", "F_NAME", Field.Store.NO,
        Field.Index.ANALYZED));
doc2.add(new Field("data", "Bob Billy Tom Toby Charles Krista Madonna",
        Field.Store.NO, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));

// I know I can create a query to search for a regular expression, and that
// will return each document that contains a match.

IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
        true, IndexWriter.MaxFieldLength.LIMITED);
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.optimize();
writer.close();
IndexSearcher searcher = new IndexSearcher(directory);

RegexQuery query = new RegexQuery(new Term("data", "^T.*"));
// Grab the score docs and go through them to find the documents that
// contain a match.
ScoreDoc[] hits = searcher.search(query, null, maxNumOfHits).scoreDocs;

//*****************************************************


The code above will tell me that both doc1 and doc2 contain a match for the
constructed query.

However, I need to know how many times the regular expression matched in
each document, i.e.:

doc1 = 3
doc2 = 2

I hope I am being clear...and thanks in advance.


Cheers

--
View this message in context: http://old.nabble.com/Finding-frequency-of-regex-query-match-in-a-field-tp27175303p27175303.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Altimatic at Jan 15, 2010 at 1:48 pm
    I forgot to mention that I am using Lucene 3.0.0.
  • Simon Willnauer at Jan 15, 2010 at 1:57 pm
    One way to do it is to use RegexTermEnum and iterate through the
    matching terms manually, like the following pseudocode:

    te = RegexTermEnum(reader, Term("data", "^T.*"), regexCapabilities)
    do:
        t = te.term()    # the enum starts out positioned on the first match
        if t is null: break
        td = reader.termDocs(t)
        while td.next():
            freqOfTermInCurrentDoc = td.freq()
            doc = td.doc()
            # ...do something with it, e.g. add the freq to a per-doc total
    while te.next()

    Does that make sense to you?

    simon
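
To make the idea concrete without needing a Lucene index, here is a minimal plain-Java sketch of the same logic: walk every term the regular expression matches and add that term's per-document frequency to a running total for the document. Everything in it (the class RegexTermFrequency, the methods index and countMatches, the nested-map postings structure, document IDs 0 and 1) is a hypothetical stand-in for illustration and not Lucene API. It assumes whitespace tokenization without lowercasing, as WhitespaceAnalyzer does, and it requires the regex to match the whole term.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java model of the term-enumeration approach; not Lucene API.
class RegexTermFrequency {

    // Builds term -> (docId -> frequency) by splitting each document on
    // whitespace, keeping case as-is.
    static Map<String, Map<Integer, Integer>> index(List<String> docs) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }
        return postings;
    }

    // For every term the regex matches in full, adds that term's
    // per-document frequency to the document's running total.
    static Map<Integer, Integer> countMatches(
            Map<String, Map<Integer, Integer>> postings, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<Integer, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> entry : postings.entrySet()) {
            if (!pattern.matcher(entry.getKey()).matches()) {
                continue;
            }
            for (Map.Entry<Integer, Integer> posting : entry.getValue().entrySet()) {
                totals.merge(posting.getKey(), posting.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Chris Hank Tony Cody Tom Tina Crystal",
                "Bob Billy Tom Toby Charles Krista Madonna");
        // Document IDs 0 and 1 stand in for doc1 and doc2.
        System.out.println(countMatches(index(docs), "^T.*")); // prints {0=3, 1=2}
    }
}
```

With the two example documents from the question this gives 3 for the first document (Tony, Tom, Tina) and 2 for the second (Tom, Toby). Documents with no matching term simply do not appear in the result map.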

  • Altimatic at Jan 15, 2010 at 3:23 pm
    I think so. I'll try to write up a quick demo app to see if it will work for
    what I require.

    Thanks for the prompt reply.


