  • Amit Kumar Singh at Sep 5, 2008 at 8:01 am
    Can we use something like ramfs to share static data across map tasks?

    Scenario:
    1) Quad-core machine
    2) Two 1 TB disks
    3) 8 GB RAM

    I need ~2.7 GB of RAM per map process to load some static data into memory,
    which I then use while processing the input (CPU-intensive jobs).

    Can I share memory across mappers on the same machine, so that the memory
    footprint is smaller and I can run more than 4 mappers simultaneously,
    utilizing all 4 cores?

    Can we use something like ramfs?
  • Devaraj Das at Sep 5, 2008 at 8:31 am
    Hadoop doesn't support this natively, so if you need this kind of
    functionality you'd have to build it into your application. I am worried,
    though, about the race condition in deciding which task should create the
    ramfs and load the data first.
    If you can make the check-and-load step atomic (check whether the ramfs has
    been created and the data loaded and, if not, do the creation and load),
    then things should work.
    If atomicity cannot be guaranteed, you might consider this:
    1) Run a map-only job that creates the ramfs and loads the data (if
    your cluster is small you can do this manually). You can use the distributed
    cache to store the data you want to load.
    2) Run your job that processes the data.
    3) Run a third job to delete the ramfs.
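
One possible way to get the atomic check-and-load described above is an OS-level file lock plus a marker file. The following is only a sketch under assumptions: the class `LoadOnce`, the `Loader` callback, and the file names `load.lock`, `READY` and `static.dat` are hypothetical, and the shared directory is assumed to be a ramfs/tmpfs mount that already exists on each node. It is not part of Hadoop.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class LoadOnce {
    /** Hypothetical callback that writes the static data to the given file. */
    interface Loader {
        void load(Path dataFile) throws IOException;
    }

    /**
     * Makes sure sharedDir/static.dat has been loaded exactly once.
     * Returns true if this caller did the load, false if it was already done.
     */
    static boolean ensureLoaded(Path sharedDir, Loader loader) throws IOException {
        Files.createDirectories(sharedDir);
        Path lockFile = sharedDir.resolve("load.lock"); // hypothetical name
        Path marker = sharedDir.resolve("READY");       // hypothetical name
        Path dataFile = sharedDir.resolve("static.dat"); // hypothetical name
        try (FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = ch.lock()) { // blocks while another process holds the lock
            if (Files.exists(marker)) {
                return false; // another task already loaded the data
            }
            loader.load(dataFile);
            Files.createFile(marker); // written last, so it implies a complete load
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("loadonce-demo");
        boolean first = ensureLoaded(dir, f -> Files.write(f, new byte[] {1, 2, 3}));
        boolean second = ensureLoaded(dir, f -> Files.write(f, new byte[] {9}));
        System.out.println(first + " " + second);
    }
}
```

Each map task would call `ensureLoaded` before touching the data. File locks are held per JVM, so this guards separate task processes on one machine, not several threads inside the same JVM.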

  • Andreas Kostyrka at Sep 5, 2008 at 12:14 pm
    Well, a classical solution to that on Linux would be to mmap a cache file into
    multiple processes. I have no idea whether you can do something like that in Java.

    Andreas
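
For what it's worth, Java does expose memory-mapped files through `FileChannel.map`. Below is a minimal sketch; the class name `SharedCache` and the demo file are hypothetical. When several processes map the same file read-only, the operating system can typically back all the mappings with the same physical pages. Note that a single mapping is limited to `Integer.MAX_VALUE` bytes (about 2 GB), so a 2.7 GB data set would need more than one mapping.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SharedCache {
    /**
     * Maps the whole file read-only. The mapping stays valid after the
     * channel is closed. Files larger than Integer.MAX_VALUE bytes must be
     * mapped in several pieces (not shown here).
     */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical cache file, created here only for the demo.
        Path file = Files.createTempFile("cache-demo", ".bin");
        Files.write(file, new byte[] {10, 20, 30});
        MappedByteBuffer buf = mapReadOnly(file);
        System.out.println(buf.get(1)); // absolute read, does not move the position
    }
}
```

The catch is that the static data has to be usable straight from a byte buffer (for example a flat binary index), because ordinary Java objects cannot live inside a mapped file.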
  • Owen O'Malley at Sep 5, 2008 at 4:15 pm

    On Fri, Sep 5, 2008 at 12:59 AM, Amit Kumar Singh wrote:

    Can we use something like RAM FS to share static data across map tasks.

    As others have said, this won't work right. You should probably look at
    MultithreadedMapRunner<http://hadoop.apache.org/core/docs/r0.17.2/api/org/apache/hadoop/mapred/lib/MultithreadedMapRunner.html>,
    which uses a thread pool to process the inputs. It is typically used for
    crawling or other map methods that take a long time per record. If you have
    substantial work inside the map, you can saturate the CPUs that way. Of course,
    the downside is that you have a single RecordReader feeding you inputs, so
    you are limited by the reading speed of a single HDFS client.

    -- Owen
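
The idea behind the thread-pool approach can be sketched with plain `java.util.concurrent`. This is not Hadoop's `MultithreadedMapRunner` and does not use its API; the class `SharedDataPool` and the lookup-table "static data" are hypothetical stand-ins. The point it illustrates is that all worker threads live in one JVM, so the static data is loaded once and shared, while a single loop (standing in for the single RecordReader) feeds records to the pool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedDataPool {
    /**
     * Processes every record on a pool of nThreads workers. All workers read
     * the same staticData map, which must not be modified while they run.
     * Results are returned in input order.
     */
    static List<String> process(List<String> records, Map<String, String> staticData,
                                int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String record : records) { // single feeder, like a single RecordReader
                Callable<String> task = () -> staticData.getOrDefault(record, "?");
                futures.add(pool.submit(task));
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                out.add(f.get()); // waits for that record's result
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> staticData = new HashMap<>(); // loaded once per JVM
        staticData.put("a", "1");
        staticData.put("b", "2");
        System.out.println(process(Arrays.asList("a", "b", "c"), staticData, 4));
    }
}
```

With 4 worker threads and one 2.7 GB copy of the data, a single task per node could in principle keep all 4 cores busy, subject to the single-reader limit mentioned above.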
  • Amit Simgh at Sep 6, 2008 at 7:05 pm
    Hi,

    I have thousands of web pages, each represented as a serialized tree object,
    compressed together with ZLIB (file sizes vary from 2.5 GB to 4.5GB).
    I have to do some heavy text processing on these pages.

    What is the best way to read/access these pages?

    Method 1
    ***************
    1) Write a custom splitter that
       1. uncompresses the file (2.5GB to 4GB) and then parses it (time:
          around 10 minutes)
       2. splits the binary data into 10-20 parts
    2) Implement specific readers to read a page and present it to the mapper

    OR

    Method 2
    ***************
    Read the entire file without splitting: one map task per file.
    Implement specific readers to read a page and present it to the mapper.

    Slight detour:
    I was browsing through the code in FileInputFormat and TextInputFormat. In
    the getSplits method the file is broken at arbitrary byte boundaries.
    So in the case of TextInputFormat, what happens if the last line of a
    mapper's split is truncated (an incomplete byte sequence)?
    Can someone explain and give pointers to where in the code this is handled?

    I also saw classes like Records. What are these used for?
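
On the split-boundary question: to the best of my knowledge, the handling sits in the record reader (LineRecordReader for TextInputFormat), not in the split computation. The usual convention is that a reader whose split does not start at byte 0 discards everything up to the first newline, and every reader finishes its last line even if that means reading past the end of its split. The sketch below simulates that convention on an in-memory byte array. It is a simplified illustration, not Hadoop's actual code, and the class `SplitLines` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SplitLines {
    /** Returns the lines owned by the split [start, end) of data, with '\n' as terminator. */
    static List<String> linesForSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Discard the (possibly partial) first line; the previous split's reader owns it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // A line is owned by this split if it starts at or before 'end';
        // reading it may run past 'end' into the next split.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncd\nef\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 4 falls in the middle of the line "cd".
        System.out.println(linesForSplit(data, 0, 4));
        System.out.println(linesForSplit(data, 4, data.length));
    }
}
```

Under this convention every line is read exactly once, wherever the byte boundary falls. That only works for formats with a recognizable record delimiter, which a single ZLIB stream does not have, so the compressed files above could not be split this way.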
