Hello,

I'm using Hadoop for distributed text mining of a large collection of
documents. While optimizing, I'd like to speed up one step, and I want to
know how to do it with Hadoop.

Each Map process takes a group of documents, analyses each sentence, and for
certain patterns queries a database and some indexes to provide a proper
answer. This step can take a while, so I'm caching the results in a
LinkedHashMap, which works pretty well for standalone jobs and avoids
repeated queries for the same patterns in a document.
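
The post does not say how the LinkedHashMap is bounded. A minimal sketch, assuming a least-recently-used cap is wanted; the class name PatternCache and the capacity are hypothetical, not taken from the poster's code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache for pattern lookups (name and capacity are illustrative).
class PatternCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    PatternCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after each insertion; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}
```

Each map task would hold its own instance, which is why the results are not shared between tasks.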

I think it would be great to share this LinkedHashMap cache object among all
Map instances, so that if Map object #2 finds a pattern that Map object #1
already saw in another document, it can use the cached result that Map #1
stored, saving some time.

Right now, the DistributedCache only shares files, archives and jar files.
Is there any way to share a Java object such as a LinkedHashMap,
synchronized or not?

--
View this message in context: http://www.nabble.com/DistributedCache-on-Java-Objects--tp16985074p16985074.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


  • Arun C Murthy at Apr 30, 2008 at 4:38 pm

    On Apr 30, 2008, at 8:14 AM, ncardoso wrote:
    Is there any way to share a Java object such as a LinkedHashMap,
    synchronized or not?

    Not directly. You could serialize your LinkedHashMap into a file via
    something like java.io.DataOutput, store it on HDFS, and then
    deserialize it via java.io.DataInput in your map's configure method.
    Does that work?

    Arun
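
A minimal sketch of the round trip Arun describes, assuming the cache maps String patterns to String answers. The class name CacheFileSketch and its methods are hypothetical, and in-memory byte streams stand in for the HDFS file so the example runs on its own; in a real job the streams would come from the Hadoop file system API, which is not shown here. Note that this shares a snapshot written before the job starts, not live updates between running map tasks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes String keys and values.
class CacheFileSketch {
    // Writes the entry count, then each key/value pair in the map's iteration order.
    // writeUTF limits each string to 65535 encoded bytes.
    static void write(Map<String, String> cache, DataOutput out) throws IOException {
        out.writeInt(cache.size());
        for (Map.Entry<String, String> e : cache.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Reads back what write() produced; this is the part a map task
    // would run once during setup, before processing records.
    static LinkedHashMap<String, String> read(DataInput in) throws IOException {
        int n = in.readInt();
        LinkedHashMap<String, String> cache = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            cache.put(key, value);
        }
        return cache;
    }

    public static void main(String[] args) throws IOException {
        LinkedHashMap<String, String> cache = new LinkedHashMap<>();
        cache.put("pattern-1", "answer-1");
        cache.put("pattern-2", "answer-2");

        // In-memory buffer standing in for the file stored on HDFS.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(cache, new DataOutputStream(buffer));

        Map<String, String> restored =
            read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(restored.equals(cache));
    }
}
```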
  • Richard K. Turner at Apr 30, 2008 at 7:45 pm
    Maybe memcached can help: http://www.danga.com/memcached/


  • Otis Gospodnetic at Apr 30, 2008 at 7:06 pm
    How about using an out-of-process cache, such as memcached?

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
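
A minimal sketch of how an out-of-process cache could sit between each map task's local LinkedHashMap and the slow database query. The SharedCache interface is hypothetical and is not a real memcached client API; an in-memory stand-in is used so the example runs without a server. TwoLevelLookup and its members are likewise illustrative names.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical interface standing in for a memcached client (not a real client API).
interface SharedCache {
    String get(String key);              // returns null on a miss
    void set(String key, String value);
}

// In-memory stand-in so the sketch runs without a cache server.
class InMemorySharedCache implements SharedCache {
    private final Map<String, String> store = new HashMap<>();
    public String get(String key) { return store.get(key); }
    public void set(String key, String value) { store.put(key, value); }
}

// One instance per map task: local map first, then the shared cache, then the slow query.
class TwoLevelLookup {
    private final Map<String, String> local = new LinkedHashMap<>();
    private final SharedCache shared;
    private final Function<String, String> slowQuery;

    TwoLevelLookup(SharedCache shared, Function<String, String> slowQuery) {
        this.shared = shared;
        this.slowQuery = slowQuery;
    }

    String lookup(String pattern) {
        String answer = local.get(pattern);
        if (answer != null) {
            return answer;
        }
        answer = shared.get(pattern);
        if (answer == null) {
            // Miss everywhere: run the expensive query and publish the result.
            answer = slowQuery.apply(pattern);
            shared.set(pattern, answer);
        }
        local.put(pattern, answer);
        return answer;
    }
}
```

With two TwoLevelLookup instances sharing one SharedCache, a pattern resolved by the first instance is served to the second without repeating the slow query.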
    ----- Original Message ----
    From: ncardoso <ncardoso@xldb.di.fc.ul.pt>
    To: hadoop-user@lucene.apache.org
    Sent: Wednesday, April 30, 2008 11:14:54 AM
    Subject: DistributedCache on Java Objects?


    Right now, the DistributedCache only shares files, archives and jar files.
    Is there any way to share a Java object such as a LinkedHashMap,
    synchronized or not?


Discussion Overview
group: common-user @
categories: hadoop
posted: Apr 30, '08 at 3:15p
active: Apr 30, '08 at 7:45p
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop
