FAQ
Good morning,

I would like to store some files in the distributed cache, so that they can be opened and read from the mappers.
The files are produced by another job and are sequence files.
I am not sure whether that format is suitable for the distributed cache, since the files in the distributed cache are stored and read locally. Should I change the format of the files in the previous job, make them text files, and read them from the distributed cache using the plain Java API?
Or can I still handle them the usual way we use sequence files, even though they reside in a local directory? Performance is extremely important for my project, so I don't know what the best solution would be.

Thank you in advance,
Sofia Georgiakaki


  • Dino Kečo at Aug 12, 2011 at 8:31 am
    Hi Sofia,

    I assume that the output of the first job is stored on HDFS. In that case I
    would read the file directly from the mappers without using the distributed
    cache. Putting the file into the distributed cache would add one more copy
    operation to your process.

    Thanks,
    dino


  • Sofia Georgiakaki at Aug 12, 2011 at 8:57 am
    Thank you for the reply!
    In each map() call I need to open, read, and close these files (more than two in the general case, and perhaps 20 or more) in order to make some checks. Given the huge amount of input data, performing all these file operations against HDFS would kill performance. So I think it would be better to store these files in the distributed cache, so that the whole process is more efficient; I guess that is the point of using the distributed cache in the first place!

    My question is whether I can store sequence files in the distributed cache and handle them with e.g. the SequenceFile.Reader class, or whether I should only keep regular text files in the distributed cache and handle them with the usual Java API.

    Thank you very much
    Sofia

    PS: The files are small, from a few KB up to a few MB.



  • Joey Echeverria at Aug 12, 2011 at 10:29 am
    You can use any kind of format for files in the distributed cache, so
    yes you can use sequence files. They should be faster to parse than
    most text formats.

    -Joey
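
    For example, reading the localized copies with SequenceFile.Reader might look like the sketch below (assuming the 0.20-era API current at the time of this thread; the LongWritable/Text key/value classes are placeholders for whatever types the first job actually wrote):

```java
// Sketch: open each file the framework localized via the distributed cache
// and iterate over its records with SequenceFile.Reader. LongWritable/Text
// are placeholder key/value types; substitute the types your first job wrote.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CachedSequenceFiles {
    public static void readAll(Configuration conf) throws IOException {
        // Local, on-disk paths of the cached files on this task's node.
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        FileSystem localFs = FileSystem.getLocal(conf);
        for (Path p : cached) {
            SequenceFile.Reader reader = new SequenceFile.Reader(localFs, p, conf);
            try {
                LongWritable key = new LongWritable();
                Text value = new Text();
                while (reader.next(key, value)) {
                    // ... perform the per-record checks here ...
                }
            } finally {
                reader.close();
            }
        }
    }
}
```

    The files themselves would have been registered at job-submission time with DistributedCache.addCacheFile(uri, conf).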



    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
  • Adam Shook at Aug 12, 2011 at 1:12 pm
    If you are looking for performance gains, then reading these files once during the setup() call in your Mapper and storing them in a data structure such as a Map or a List should help. Opening and closing the files during each map() call incurs a lot of unneeded I/O.

    You do have to be conscious of your Java heap size, though, since you are essentially storing the files in RAM. If your files are a few MB in size, as you said, it shouldn't be a problem. If the amount of data you need to store won't fit, consider using HBase to get access to the data you need.

    But as Joey said, you can put whatever you want in the distributed cache, as long as you have a reader for it. You should have no problems using SequenceFile.Reader.

    -- Adam
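
    In plain Java (leaving the Hadoop classes out so the pattern is easy to see), the load-once idea might look like this sketch; the tab-separated format and the SideData name are hypothetical:

```java
// Sketch: parse a small tab-separated side file into a HashMap once (the
// moral equivalent of doing it in Mapper.setup()), then do O(1) lookups
// per record instead of reopening the file in every map() call.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SideData {
    public static Map<String, String> load(Path localFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        for (String line : Files.readAllLines(localFile)) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("side", ".tsv");
        Files.write(tmp, Arrays.asList("athens\tGR", "paris\tFR"));
        Map<String, String> lookup = load(tmp);
        System.out.println(lookup.get("athens")); // prints GR
        Files.deleteIfExists(tmp);
    }
}
```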


  • Ian Michael Gumby at Aug 12, 2011 at 3:54 pm
    This whole thread doesn't make a lot of sense.

    If your first m/r job creates the sequence files, which you then use as input files to your second job, you don't need to use distributed cache since the output of the first m/r job is going to be in HDFS.
    (Dino is correct on that account.)

    Sofia replied saying that she needs to open and close the sequence file to access the data in each Mapper.map() call.
    Without knowing more about the specific app, Adam is correct that you could read the file in Mapper.setup() and then access it in memory.
    Joey is correct that you can put anything in the distributed cache, but you don't want to put an HDFS file into the distributed cache. The distributed cache is a tool for taking something from your job and distributing it to each task node as a local object, and it does have a bit of overhead.

    A better example is distributing binary objects that you want on each node, such as a C++ .so file that you want to call from within your Java m/r job.

    If you're not using all of the data in the sequence file, what about using HBase?

  • Jonathan Hwang at Aug 12, 2011 at 3:59 pm
    Hi All,

    I'm trying to decommission a data node from my cluster. I put the data node in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name nodes. The under-replicated blocks are starting to replicate, but the count is going down at a very slow pace: for 1 TB of data it takes over a day to complete. We changed the settings as below to try to increase the replication rate.

    Added this to hdfs-site.xml on all the nodes in the cluster and restarted the data node and name node processes.
    <property>
      <!-- 100Mbit/s -->
      <name>dfs.balance.bandwidthPerSec</name>
      <value>131072000</value>
    </property>

    Speed didn't seem to pick up. Do you know what may be happening?

    Thanks!
    Jonathan

    This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.
  • Charles Wimmer at Aug 12, 2011 at 4:10 pm
    The balancer bandwidth setting does not affect decommissioning nodes. Decommissioning nodes replicate as fast as the cluster is capable.

    The replication pace has many variables:
    - the number of nodes participating in the replication,
    - the amount of network bandwidth each has,
    - the amount of other HDFS activity at the time,
    - the total number of blocks being replicated,
    - the total amount of data being replicated,
    - and many others.


  • Joey Echeverria at Aug 12, 2011 at 4:14 pm
    You can configure the undocumented variable dfs.max-repl-streams to
    increase the number of replications a data-node is allowed to handle
    at one time. The default value is 2. [1]

    -Joey

    [1] https://issues.apache.org/jira/browse/HADOOP-2606?focusedCommentId=12578700&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12578700
  • Jonathan Hwang at Aug 12, 2011 at 4:44 pm
    I did have these settings in hdfs-site.xml on all the nodes:
    <property>
      <!-- 100Mbit/s -->
      <name>dfs.balance.bandwidthPerSec</name>
      <value>131072000</value>
    </property>
    <property>
      <name>dfs.max-repl-streams</name>
      <value>50</value>
    </property>

    It is still taking over 1 day or longer for 1TB of under replicated blocks to replicate.

    Thanks!
    Jonathan


  • Sridhar basam at Aug 12, 2011 at 4:12 pm

    Are you seeing any sort of resource starvation on your data nodes? I/O, network, or CPU?

    Sridhar


  • Harsh J at Aug 12, 2011 at 5:08 pm
    It could be that the process has hung because a particular resident
    block (file) requires a very large replication factor and your
    remaining number of nodes is less than that value. This is a genuine
    reason for a hang (but one that must be fixed). The decommission
    process usually waits until there are no under-replicated blocks, so
    I'd use fsck to check whether any such blocks are present and setrep
    them to a lower value.
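
    Concretely, the check might look something like this (a sketch; the path and the replication value are examples only):

```shell
# Find files whose blocks are still under-replicated.
hadoop fsck / -files -blocks | grep -i "under replicated"

# Lower the replication factor of an offending file (here to 3) and
# wait (-w) until the change has been applied.
hadoop fs -setrep -w 3 /path/to/offending/file
```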

    --
    Harsh J
  • Michael Segel at Aug 12, 2011 at 5:51 pm
    Just a thought...

    A really quick and dirty thing to do is to turn off the node.
    Within 10 minutes the node looks down to the JobTracker and NameNode, so it gets marked as dead.
    Run an fsck and it will show the files as under-replicated; the cluster will then re-replicate at the faster speed to rebalance itself.
    (100 MB/s should be OK on a 1 GbE link.)

    Then you can drop the next node... much faster than trying to decommission the node.

    It's not the best way to do it, but it works.

  • GOEKE, MATTHEW (AG/1000) at Aug 12, 2011 at 4:06 pm
    Sofia, correct me if I am wrong, but Mike, I think this thread was about using the output of a previous job, in this case already in sequence file format, as in-memory join data for another job.

    Side note: does anyone know the rule of thumb on file size when choosing the distributed cache versus just reading from HDFS (join data, not binary files)? I always thought that having a setup phase in a mapper read directly from HDFS was asking for trouble and that you should always distribute to each node, but I am hearing more and more people say to just read directly from HDFS for larger files to avoid the I/O cost of the distributed cache.

    Matt

    This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
    to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
    all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

    All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
    subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
    Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
    this e-mail or any attachment.


    The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
    including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
    Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all
    applicable U.S. export laws and regulations.
  • Sofia Georgiakaki at Aug 13, 2011 at 10:03 am
    Good morning,

    I am a little confused, I have to say.

    A summary of the project first: I want to examine how an R-tree on HDFS can speed up spatial queries such as point/range queries, which normally target a very small part of the original input.

    I have built my R-tree on HDFS, and now I need to answer queries using it. I thought I could write an MR job that takes as input a text file where each line is a query (for example, 20000 queries). To answer the queries efficiently, I need to check some information about the root nodes of the tree, which is stored in R files (R = the number of reducers of the previous job). These files are small and are read by every mapper, so the idea of the distributed cache fits, right?

    I build an ArrayList during setup() so that I can avoid opening all the files in the distributed cache and instead open only 3-4 of them, for example. I agree, though, that opening and closing these files so many times is a significant overhead. I think, however, that opening these files from HDFS rather than the distributed cache would be even worse, since file-access operations on HDFS are much more "expensive" than accessing files locally.
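    That lazy approach, parsing a root file only the first time a query needs it and reusing it afterwards, can be sketched in plain Java. This is only an illustration under assumed names and a line-based format; real code would hand SequenceFile.Reader the local paths obtained from the distributed cache.

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Opens each side file at most once, no matter how many map() calls touch it. */
    public class LazyRootCache {
        private final Map<String, List<String>> loaded = new HashMap<>();
        private final Map<String, Path> available;
        int opens = 0; // exposed so the open count can be observed in tests

        /** available maps a root-file name to its (local) path. */
        public LazyRootCache(Map<String, Path> rootFiles) {
            this.available = rootFiles;
        }

        /** First call for a name reads the file; later calls are in-memory. */
        public List<String> entries(String name) throws IOException {
            List<String> cached = loaded.get(name);
            if (cached == null) {
                opens++;
                cached = Files.readAllLines(available.get(name));
                loaded.put(name, cached);
            }
            return cached;
        }
    }
    ```

    With this shape, a run that touches only 3-4 of the R files pays for only those opens, and each of them exactly once across all map() calls.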

    Thank you all for your response, I would be glad to have more feedback.
    Sofia





    ________________________________
    From: "GOEKE, MATTHEW (AG/1000)" <matthew.goeke@monsanto.com>
    To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
    Sent: Friday, August 12, 2011 7:05 PM
    Subject: RE: Hadoop--store a sequence file in distributed cache?

    Sofia, correct me if I am wrong, but Mike, I think this thread was about using the output of a previous job, in this case already in sequence-file format, as in-memory join data for another job.

    Side note: does anyone know what the rule of thumb on file size is when using the distributed cache vs. just reading from HDFS (join data, not binary files)? I always thought that having a mapper's setup phase read directly from HDFS was asking for trouble and that you should always distribute to each node, but I am hearing more and more people say to just read directly from HDFS for larger file sizes to avoid the I/O cost of the distributed cache.

    Matt

    -----Original Message-----
    From: Ian Michael Gumby
    Sent: Friday, August 12, 2011 10:54 AM
    To: common-user@hadoop.apache.org
    Subject: RE: Hadoop--store a sequence file in distributed cache?


    This whole thread doesn't make a lot of sense.

    If your first m/r job creates the sequence files, which you then use as input files to your second job, you don't need to use distributed cache since the output of the first m/r job is going to be in HDFS.
    (Dino is correct on that account.)

    Sofia replied saying that she needed to open and close the sequence file to access the data in each Mapper.map() call.
    Without knowing more about the specific app, Ashook is correct that you could read the file in Mapper.setup() and then access it in memory.
    Joey is correct that you can put anything in the distributed cache, but you don't want to put an HDFS file into the distributed cache needlessly. The distributed cache is a tool for taking something from your job and distributing it to each task node as a local object. It does have a bit of overhead.

    A better example is if you're distributing binary objects that you want on each node, e.g. a C++ .so file that you want to call from within your Java m/r job.

    If you're not using all of the data in the sequence file, what about using HBase?

  • Michael Segel at Aug 13, 2011 at 5:37 pm
    Sofia,

    I was about to say that if your file is already on HDFS, you should just be able to open it.
    But as I type this, something is kicking me in the back of the head reminding me that you may not be able to access the HDFS file while someone else is accessing it. (Going from memory: is there an exclusive lock on the file when you open it in HDFS?)

    If not, you can just use your file.
    If so, you will need to use the distributed cache, which places a copy of the file somewhere local on each node running the task. Within your task you query the distributed cache for your file and get the local path so you can open it.
    Depending on the size of your index (which can get large), you may want to open the file once and just seek back to the beginning rather than reopening it.
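    In a real task the local paths come from DistributedCache.getLocalCacheFiles(conf), one localized path per file shipped to the node; picking out a particular cached file is then just a matter of matching file names. A plain-Java sketch of that resolution step, with the Hadoop call stubbed out and the paths purely illustrative:

    ```java
    import java.nio.file.Path;

    /** Picks the localized copy of a cached file by its original file name. */
    public class CacheResolver {
        /**
         * paths stands in for what DistributedCache.getLocalCacheFiles(conf)
         * would return in a real task: local paths to the distributed files.
         */
        public static Path resolve(Path[] paths, String fileName) {
            for (Path p : paths) {
                if (p.getFileName().toString().equals(fileName)) {
                    return p; // open this local path, e.g. with SequenceFile.Reader
                }
            }
            throw new IllegalArgumentException("not in cache: " + fileName);
        }
    }
    ```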

    My suggestion is to consider putting your RTree into HBase. So HBase contains your index.


Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Aug 12, '11 at 7:54a
active: Aug 13, '11 at 5:37p
posts: 16
users: 10
irc: #hadoop