I'm using Hadoop for distributed text mining of a large collection of
documents. As part of optimizing the job, I want to speed things up a bit,
and I'd like to know how I can do the following step with Hadoop.
Each Map process takes a group of documents and analyses each sentence; for
certain patterns it queries a database and some indexes to produce a proper
answer. This step can take a while, so I'm caching the results in a
LinkedHashMap, which works pretty well for standalone jobs and avoids
repeated queries for the same patterns within a document.
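For reference, the per-mapper cache I'm using is essentially a LinkedHashMap in
access order with removeEldestEntry for LRU eviction. This is just a sketch;
the class name, capacity, and String key/value types are illustrative, not the
actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-mapper LRU cache of pattern -> answer. The capacity and
// the String key/value types are placeholders for the real ones.
public class PatternCache {
    private final Map<String, String> cache;

    public PatternCache(final int maxEntries) {
        // accessOrder = true makes iteration order least- to most-recently used
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least-recently-used entry once the cap is exceeded
                return size() > maxEntries;
            }
        };
    }

    public String lookup(String pattern) {
        return cache.get(pattern); // null means "not cached, query the DB"
    }

    public void store(String pattern, String answer) {
        cache.put(pattern, answer);
    }
}
```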
I think it would be great to share this LinkedHashMap cache across all Map
instances, so that if Map #2 finds the same pattern that Map #1 previously
encountered in another document, it can reuse the cached result that Map #1
stored, saving some time.
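What I have in mind is roughly a JVM-wide map like the sketch below: a static,
synchronized cache that every Mapper instance in the same JVM could read and
write (all names and the capacity here are illustrative). As far as I can tell
this would only help when map tasks share a JVM, not across task JVMs or nodes,
which is exactly what I'm asking about:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a JVM-local shared LRU cache of pattern -> answer. A plain static
// field is only visible within one JVM; it does not share state across task
// JVMs or cluster nodes. Names, capacity, and String types are placeholders.
public class SharedPatternCache {
    private static final int MAX_ENTRIES = 10_000; // assumed capacity

    private static final Map<String, String> CACHE =
        Collections.synchronizedMap(new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least-recently-used entry once the cap is exceeded
                return size() > MAX_ENTRIES;
            }
        });

    public static String lookup(String pattern) {
        return CACHE.get(pattern); // null means "not cached, query the DB"
    }

    public static void store(String pattern, String answer) {
        CACHE.put(pattern, answer);
    }
}
```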
Right now, the DistributedCache only shares files, archives, and jar files.
Is there any way to share a Java object such as a LinkedHashMap,
synchronized or not?
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.