Hi Boris,

Thank you!

One small note on "b) improve caching. Now it is implemented via Java
object serialization; make it via CSV files".
If you are going to use a library for CSV anyway, you might also consider
Google Protocol Buffers. They are pretty fast.
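
To make the CSV option concrete, here is a minimal sketch of a file-backed
cache using only the Java standard library. Everything in it is an
assumption for illustration: the class name CsvCacheSketch, the idea that
the cache is a map from a String key to a String value, and the constraint
that neither contains line breaks. It is not the Similarity component's
actual cache layout.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: stores a String-to-String cache
// as one "key","value" row per line. Assumes no line breaks in keys or values.
public class CsvCacheSketch {

  // Wrap a field in double quotes and double any embedded quotes.
  static String quote(String field) {
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  // Split one CSV line into fields, honouring quoted fields and "" escapes.
  static List<String> parseLine(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
            current.append('"'); // escaped quote inside a quoted field
            i++;
          } else {
            inQuotes = false; // closing quote
          }
        } else {
          current.append(c);
        }
      } else if (c == '"') {
        inQuotes = true;
      } else if (c == ',') {
        fields.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString());
    return fields;
  }

  // Write every cache entry as a two-column row.
  static void save(Map<String, String> cache, Path file) throws IOException {
    try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      for (Map.Entry<String, String> entry : cache.entrySet()) {
        out.write(quote(entry.getKey()) + "," + quote(entry.getValue()));
        out.newLine();
      }
    }
  }

  // Read the rows back; lines that do not have exactly two fields are skipped.
  static Map<String, String> load(Path file) throws IOException {
    Map<String, String> cache = new LinkedHashMap<>();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
      if (line.isEmpty()) {
        continue;
      }
      List<String> fields = parseLine(line);
      if (fields.size() == 2) {
        cache.put(fields.get(0), fields.get(1));
      }
    }
    return cache;
  }
}
```

Compared with object serialization, a file like this stays readable and
survives class changes. A CSV library or Protocol Buffers would replace the
hand-written quoting and parsing, at the cost of an extra dependency.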


On Wed, Mar 28, 2012 at 10:42 PM, Boris Galitsky wrote:

Hi guys,
per Aliaksandr's suggestion, below are the minutes of our conversation
with Jorn about the Similarity component and other related issues.

1) Prepare Similarity for release from sandbox:
   a) improve readme.txt, add 'The entry point to the Similarity
      component is

      SentencePairMatchResult matchRes =

      where matchRes includes the similarity score (weighted number of
      common terms) and the set of maximum common parse trees.'
   b) improve caching. Now it is implemented via Java object
      serialization; make it via CSV files
   c) proper location for cache files and resources:
      joernkottmann: src/test/resources
   d) verify porter stemmer (remove lucene dependencies, remove porter
      stemmer from /similarity)
   e) re-format code, use eclipse template for re-format
      joernkottmann: http://opennlp.apache.org/code-conventions.html
   f) package into separate jar/src using Maven

2) Next major feature of Similarity: taxonomy auto learning and using
   taxonomy to improve search relevance
   a) see how the Similarity component can help with search tasks
   b) integration with SOLR (compare/complement github.com/tamingtext of
      Grant Ingersoll with Similarity). There are some JIRA issues open
      for hooking some of the tamingtext stuff into the analyzers
      modules in Solr

3) More examples and docs for the Similarity component
   a) examples for finding similar news at allvoices.com; email the code
      which generates the search query for news articles
   b) email the link to the papers on
      joernkottmann: https://cwiki.apache.org/OPENNLP/nlp-papers.html

4) Other future features/improvements for Similarity
   a) how can we create a more accurate Parse object by running the
      chunker separately and then applying the alignment algorithm
   b) Coreference component
      joernkottmann: TreebankNameFinder
   c) apply machine learning to parse trees + coreferences.
      "parse forest": is it a good name?
      joernkottmann: CorefSample.
