I have 3 XML files that I parse using SAX:
<?xml version="1.0" encoding="UTF-8"?>
<person>
<name>bob bob bob
</name>
<name>3m
</name>
<height>3m
</height>
<height>bob
</height>
</person>
<?xml version="1.0" encoding="UTF-8"?>
<person>
<name>bob
</name>
<name>bob
</name>
<name>bob bob
</name>
<height>3m
</height>
<height>bob
</height>
</person>
<?xml version="1.0" encoding="UTF-8"?>
<person>
<name>bob
</name>
<name>bob
</name>
<height>bob
</height>
</person>
I currently index each duplicate <name> tag under its own field, so in
total I have 3 /person/name fields: /person/name0, /person/name1, and
/person/name2.
I want to compute how many times a query appears in a given unique field
(/person/name). Let's say the query is "bob".
For the total number of occurrences, I want to see: 9
For the number of <name> elements it appears in, across all documents, I
want to see: 6
My current solution is to call TermDocs for the first question, and for the
second to iterate through the fields (/person/nameX), adding up the
docFreq() of each (so there are two loops).
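
To make the two numbers concrete, here is a minimal sketch in plain Java (it
does not use Lucene) that models the per-field loops on the three sample
files above. The class name PerFieldCounts, the method names, and the doc ids
0-2 are made up for illustration. The nested map is only a stand-in for the
per-field postings that TermDocs walks; the size of each inner map plays the
role of docFreq() for that field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of per-field postings for the term "bob"; not Lucene code. */
final class PerFieldCounts {

    // field -> (doc id -> number of times "bob" occurs in that field of that doc)
    static Map<String, Map<Integer, Integer>> samplePostings() {
        Map<String, Map<Integer, Integer>> postings = new LinkedHashMap<>();

        Map<Integer, Integer> name0 = new LinkedHashMap<>();
        name0.put(0, 3); // file 1, first <name>: "bob bob bob"
        name0.put(1, 1); // file 2, first <name>: "bob"
        name0.put(2, 1); // file 3, first <name>: "bob"
        postings.put("/person/name0", name0);

        Map<Integer, Integer> name1 = new LinkedHashMap<>();
        name1.put(1, 1); // file 2, second <name>: "bob" (file 1 has "3m" here)
        name1.put(2, 1); // file 3, second <name>: "bob"
        postings.put("/person/name1", name1);

        Map<Integer, Integer> name2 = new LinkedHashMap<>();
        name2.put(1, 2); // file 2, third <name>: "bob bob"
        postings.put("/person/name2", name2);

        return postings;
    }

    // First question: sum the per-document frequency over every field.
    static int totalOccurrences(Map<String, Map<Integer, Integer>> postings) {
        int total = 0;
        for (Map<Integer, Integer> perDoc : postings.values()) { // loop over fields
            for (int freq : perDoc.values()) {                   // loop over docs
                total += freq;
            }
        }
        return total;
    }

    // Second question: sum the document frequency of each field.
    static int summedDocFreq(Map<String, Map<Integer, Integer>> postings) {
        int sum = 0;
        for (Map<Integer, Integer> perDoc : postings.values()) {
            sum += perDoc.size(); // stand-in for docFreq() of this field
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = samplePostings();
        System.out.println("total occurrences: " + totalOccurrences(postings)); // 9
        System.out.println("summed docFreq:    " + summedDocFreq(postings));    // 6
    }
}
```

The work grows with (number of fields) x (documents per field), which is why
it hurts with many numbered fields.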
This gets very slow. Ideally, I would like to index them all under
/person/name, but I still really need these answers. Does anyone have any
ideas? I can offer more clarification and some source code. I need to index
~4 million files and compute these quantities, which is very slow when you
have 150 /person/actor/movie_acted_in fields and 4 million documents.
Thank you very much!
--
View this message in context: http://lucene.472066.n3.nabble.com/Computing-document-frequencies-for-specific-queries-in-Lucene-tp3101450p3101450.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.