Hi all,

I am currently processing a lot of raw CSV data and producing a
summary text file which I load into MySQL. On top of this I have a
PHP application that generates tiles for Google Maps (sample tile:
http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
Here is a (dev server) example of the final map client:
http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
dynamic grids as you zoom are all pre-calculated.

I am considering (for better throughput, as maps generate huge request
volumes) pre-generating all my tiles (PNGs) and storing them in S3
behind CloudFront. There will be billions of PNGs produced, each
1-3 KB.

Could someone please recommend the best place to generate the PNGs,
and when to push them to S3, in an MR system?
If I did the PNG generation and upload to S3 in the reduce, the same
task running on multiple machines would compete with itself, right?
Should I generate the PNGs to a local directory and then, on task
success, push the lot up? I am assuming billions of 1-3 KB files on
HDFS is not a good idea.

I will use EC2 for the MR for the time being, but this will later be
moved to a local cluster that still pushes to S3...

Cheers,

Tim


  • Brian Bockelman at Apr 14, 2009 at 12:38 pm
    Hey Tim,

    Why don't you put the PNGs in a SequenceFile in the output of your
    reduce task? You could then have a post-processing step that unpacks
    the PNGs and places them onto S3. (If my numbers are correct, you're
    looking at around 3 TB of data; is this right? With that much, you
    might want a separate map-only job to unpack all the files in
    parallel ... it really depends on the throughput you get to Amazon.)

    Brian
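
    A minimal illustrative sketch (not the actual code) of what this could
    look like with the old mapred API: the reduce aggregates the counts
    for a tile, renders it, and emits the raw PNG bytes as a
    BytesWritable. With SequenceFileOutputFormat set on the job, each
    reducer then writes one large SequenceFile instead of billions of tiny
    HDFS files. The renderTilePng() helper and the assumption of one PNG
    per reduce key are hypothetical placeholders.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class TileRenderReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, BytesWritable> {

      public void reduce(Text tileKey, Iterator<IntWritable> counts,
                         OutputCollector<Text, BytesWritable> out,
                         Reporter reporter) throws IOException {
        int total = 0;
        while (counts.hasNext()) {
          total += counts.next().get();     // aggregate counts for this tile
        }
        byte[] png = renderTilePng(tileKey.toString(), total);
        // key = tile identifier, value = raw PNG bytes
        out.collect(tileKey, new BytesWritable(png));
      }

      // Hypothetical stand-in for the real Java PNG generation code.
      private byte[] renderTilePng(String tileId, int count) {
        return new byte[0];
      }
    }

    In the driver this would be wired up with
    conf.setOutputFormat(SequenceFileOutputFormat.class) plus the matching
    output key/value classes; the unpack-and-PUT step can then be a
    separate, map-only job as discussed below.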
  • Tim robertson at Apr 14, 2009 at 12:45 pm
    Thanks Brian,

    This is pretty much what I was looking for.

    Your calculations are correct, but based on the assumption that all
    tiles will be needed at all zoom levels. Given the sparsity of the
    data, it actually results in only a few hundred GBs. I'll run a
    second MR job, with the map pushing to S3, to make use of parallel
    loading.

    Cheers,

    Tim
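
    A minimal sketch of that second, map-only job, assuming the first job
    wrote (tile key, PNG bytes) pairs via SequenceFileOutputFormat. The
    S3Uploader class is a hypothetical wrapper around whatever S3 client
    library is in use (e.g. JetS3t); it is not a Hadoop or Amazon class.

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class S3PushMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, NullWritable, NullWritable> {

      // Hypothetical helper holding the S3 client and bucket name.
      private final S3Uploader uploader = new S3Uploader("tile-bucket");

      public void map(Text tileKey, BytesWritable png,
                      OutputCollector<NullWritable, NullWritable> out,
                      Reporter reporter) throws IOException {
        // One PUT per tile; the object key mirrors the tile id so the
        // tile can be fetched directly through CloudFront.
        uploader.put(tileKey.toString() + ".png",
                     png.getBytes(), png.getLength());
        reporter.incrCounter("s3", "tiles-uploaded", 1);
      }
    }

    With SequenceFileInputFormat as the input format, zero reduce tasks,
    and speculative execution disabled (see Todd's reply further down),
    each input split is uploaded exactly once.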

  • Tim robertson at Apr 14, 2009 at 2:10 pm
    Sorry Brian, can I just ask please...

    I have the PNGs in the SequenceFile for my sample set. If I use a
    second MR job and push to S3 in the map, surely I run into the
    scenario where multiple tasks are running on the same section of the
    SequenceFile and thus pushing the same data to S3. Am I missing
    something obvious (e.g. can I disable this behavior)?

    Cheers

    Tim


  • Tim robertson at Apr 16, 2009 at 8:28 am
    Hi Chuck,

    Thank you very much for this opportunity. I also think it would make
    a nice case study; it goes beyond the typical word-count example by
    generating something that people can actually see and play with
    immediately afterwards (e.g. maps). It also showcases nicely the
    community effort to collectively bring together information on the
    world's biodiversity - the GBIF network really is a nice example of a
    free and open-access community that is collectively addressing
    interoperability globally. Can you please tell me what kind of time
    frame you would need the case study in?

    I have just got my Java PNG generation code down to 130 ms on the
    Mac, so I am pretty much ready to start running on EC2 and do the
    volume tile generation; I will blog the whole thing on
    http://biodivertido.blogspot.com at some point soon. I have to travel
    to the US on Saturday for a week, so this will delay it somewhat.

    What is not 100% clear to me is when to push to S3:
    In the map I will output the TileId-ZoomLevel-SpeciesId as the key,
    along with the count, and in the reduce I group the counts into larger
    tiles and create the PNG. I could write to a SequenceFile here... but
    I suspect I could just push to the S3 bucket here as well - as long as
    the task tracker does not send the same keys to multiple reduce tasks.
    My Hadoop naivety is showing here (I wrote an in-memory, threaded
    MapReduceLite which does not run competing reducers, but I have not
    got into the Hadoop code that much yet).


    Cheers,

    Tim
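
    A minimal sketch of the keying scheme described above, assuming the
    summary records arrive as CSV lines of speciesId, lat, lng, count;
    the column layout, MAX_ZOOM and the tileIdFor() tile math are all
    hypothetical placeholders.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TileKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final int MAX_ZOOM = 10;   // hypothetical cut-off
      private final Text outKey = new Text();
      private final IntWritable outCount = new IntWritable();

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, IntWritable> out,
                      Reporter reporter) throws IOException {
        // Hypothetical record layout: speciesId,lat,lng,count
        String[] cols = line.toString().split(",");
        String speciesId = cols[0];
        double lat = Double.parseDouble(cols[1]);
        double lng = Double.parseDouble(cols[2]);
        int count = Integer.parseInt(cols[3]);
        for (int zoom = 0; zoom <= MAX_ZOOM; zoom++) {
          // Composite key: every record for one tile/zoom/species meets
          // in a single reduce call, where the PNG is rendered.
          outKey.set(tileIdFor(lat, lng, zoom) + "_" + zoom + "_" + speciesId);
          outCount.set(count);
          out.collect(outKey, outCount);
        }
      }

      // Placeholder for the real lat/lng -> tile x/y calculation.
      private String tileIdFor(double lat, double lng, int zoom) {
        return "0_0";
      }
    }

    The reduce side can then either write the rendered PNGs to a
    SequenceFile (Brian's suggestion above) or PUT them straight to S3
    once speculative execution is switched off.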


    On Thu, Apr 16, 2009 at 1:49 AM, Chuck Lam wrote:
    Hi Tim,

    I'm really interested in your application at gbif.org. I'm in the
    middle of writing Hadoop in Action ( http://www.manning.com/lam/ ) and
    think this may make for an interesting Hadoop case study, since you're
    taking advantage of a lot of different pieces (EC2, S3, CloudFront,
    SequenceFiles, PHP/streaming). Would you be interested in discussing
    making a 4-5 page case study out of this?

    As to your question, I don't know if it's been properly answered, but
    I don't know why you think that "multiple tasks are running on the
    same section of the sequence file." Maybe you can elaborate further
    and I'll see if I can offer any thoughts.



  • Todd Lipcon at Apr 16, 2009 at 5:28 pm

    On Thu, Apr 16, 2009 at 1:27 AM, tim robertson wrote:
    What is not 100% clear to me is when to push to S3:
    In the map I will output the TileId-ZoomLevel-SpeciesId as the key,
    along with the count, and in the reduce I group the counts into larger
    tiles and create the PNG. I could write to a SequenceFile here... but
    I suspect I could just push to the S3 bucket here as well - as long as
    the task tracker does not send the same keys to multiple reduce tasks.
    My Hadoop naivety is showing here (I wrote an in-memory, threaded
    MapReduceLite which does not run competing reducers, but I have not
    got into the Hadoop code that much yet).
    Hi Tim,

    If I understand what you mean by "competing reducers", then you're
    referring to the feature called "speculative execution", in which
    Hadoop may schedule the same task on multiple TaskTrackers. When one
    of the multiply-scheduled attempts finishes, the others are killed. As
    you seem to already understand, this can cause issues if your tasks
    have non-idempotent side effects on the outside world.

    The configuration variable you need to look at is
    mapred.reduce.tasks.speculative.execution. If this is set to false,
    only one attempt of each reduce task will be run. If it is true, some
    reduce tasks may be scheduled twice to reduce the variance in job
    completion times caused by slow machines.

    There's an equivalent configuration variable
    mapred.map.tasks.speculative.execution that controls this behavior for your
    map tasks.

    Hope that helps,
    -Todd
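
    For reference, a minimal driver fragment for turning this off; the
    TileUploadJob class name is just a placeholder:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TileUploadJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TileUploadJob.class);
        // Only one attempt per task, so no two attempts race to PUT the
        // same tiles to S3.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // ... mapper/reducer classes, input/output formats and paths ...
        JobClient.runJob(conf);
      }
    }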
  • Tim robertson at Apr 16, 2009 at 7:49 pm
    Thanks Todd and Chuck - sorry, my terminology was wrong... that is
    exactly what I was looking for.

    I am letting MySQL work through the zoom levels now to get some final
    numbers on the tile counts and the S3 PUT cost. Zoom level 8 looks
    feasible for our current data volume, but it is not a long-term
    option if the input data explodes in volume.

    Cheers,

    Tim


    On Thu, Apr 16, 2009 at 9:05 PM, Chuck Lam wrote:
    Ah... I totally missed the point you made about "competing reducers";
    it didn't occur to me that you were talking about Hadoop's speculative
    execution. Todd's solution of turning off speculative execution is
    correct.

    I'll respond to the rest of your email later today.


    On Thu, Apr 16, 2009 at 5:23 AM, tim robertson wrote:

    Thanks Chuck,
    I'm shooting for finishing the case studies by the end of May, but
    it'll be nice to have a draft done by mid-May so we can edit it to
    have a consistent style with the other case studies.
    I will do what I can!
    I read your blog and found a couple posts on spatial joining. It
    wasn't clear to me from reading the posts whether the work was just
    experimental or if it led to some application. If it led to an
    application, then we may incorporate that into the case study too.
    It led to http://widgets.gbif.org/test/PACountry.html#/area/2571, which
    shows a statistical summary of our data (latitude/longitude)
    cross-referenced with the polygons of the protected areas of the
    world. In truth, though, we processed it in both PostGIS and Hadoop
    and found that the PostGIS approach, while way slower, was fine for
    now, and we developed the scripts for it more quickly. So you could
    say it was experimental... I do have ambitions to do a basic
    geospatial join (points in polygons) for Pig, CloudBase or Hive 2.0,
    but alas have not found the time. Also, the blog is always a late
    Sunday night effort, so it really is not well written.
    BTW, where in the US are you traveling to? I'm in Silicon Valley, so
    maybe we can meet up if you'll happen to be in the area and can
    squeeze a little time out.
    I would have loved to... but I am in Boston and DC this time. In a
    few weeks I will be in Chicago, but for some reason I have never made
    it over to your neck of the woods.
    I don't know what data you need to produce a single PNG file, so I don't
    know whether having map output TileId-ZoomLevel-SpeciesId as key is the
    right factoring. To me it looks like each PNG represents one tile at one
    zoom level but includes multiple species.
    We do individual species and higher levels of taxa (up to all data).
    The link below shows all data, grouped into 1x1 degree cells (think
    100x100 km) with counts. It is currently preprocessed with MySQL, but
    it is another Hadoop candidate as we grow.
    http://maps.gbif.org/mapserver/draw.pl?dtype=box&imgonly=1&path=http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F13140803&extent=-180.0+-90.0+180.0+90.0&mode=browse&refresh=Refresh&layer=countryborders
    In any case, under Hadoop/MapReduce, all key/value pairs output by
    the mappers are grouped by key before being sent to the reducers, so
    it's guaranteed that the same key will not go to multiple reducers.
    That is good to know. I knew map tasks would get run on multiple
    machines if Hadoop detects an idle machine, but I wasn't sure whether
    it would also put reducers on machines to compete against each other
    and kill the ones that did not finish first.
    You may also want to think more about the actual volume and cost of
    all this. You initially said that you will have "billions of PNGs
    produced each at 1-3KB" but then later said the data size is only a
    few 100GB due to sparsity. Either you're not really creating billions
    of PNGs, or a lot of them are actually less than 1KB. Kevin brought
    up a good point that S3 charges $0.01 for every 1000 files ("objects")
    created, so generating 1 billion files will already set you back $10K
    plus storage cost (and transfer cost if you're not using EC2).
    Right - my bad... Having not processed all of this yet, I am not 100%
    sure what the size will be or to what zoom level I will preprocess.
    The challenge is that our data is growing continuously, so billions
    of PNGs was looking likely in the coming months. Sorry for the
    contradiction.

    You have clearly spotted that I am doing this as a project on the
    side (evenings, really) and not devoting enough time to it!!! By day
    I am still on MySQL and PostGIS, but I am hitting limits and looking
    at our scalability.
    I kind of overlooked the PUT cost on S3, stupidly thinking that
    EC2->S3 was free.

    I actually have the data processed for individual species only, using
    MySQL (http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800),
    but not the higher groupings of species (families of species, etc.).
    It could be that I end up only processing the summary data in Hadoop
    and then loading it back into a light DB to render the maps in real
    time, like the link I just provided. Tiles render in around 150 ms,
    so with some hardware we could probably scale...

    Thanks for your input - I appreciate it a lot, since I'm working
    mostly alone on the processing.

    Cheers,

    Tim


  • Kevin Peterson at Apr 16, 2009 at 12:21 am

    On Tue, Apr 14, 2009 at 2:35 AM, tim robertson wrote:
    I am considering (for better throughput, as maps generate huge request
    volumes) pre-generating all my tiles (PNGs) and storing them in S3
    behind CloudFront. There will be billions of PNGs produced, each
    1-3 KB.
    Storing billions of PNGs at 1-3 KB each in S3 will be perfectly fine;
    there is no need to generate them and then push them all at once, as
    long as you are storing each one in its own S3 object (which they
    must be if you intend to fetch them using CloudFront). Each S3 object
    is independent and can be written fully in parallel. If you are
    writing to the same S3 object twice, ... well, you're doing it wrong.

    However, do the math on the costs for S3. We were doing something
    similar and found that we were spending a fortune on our PUT requests
    at $0.01 per 1000, and next to nothing on storage. I've since moved
    to a more complicated model where I pack many small items into each
    object and store an index in SimpleDB. You'll need to partition your
    SimpleDB domains if you do this.
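
    A minimal sketch of the bookkeeping behind that packing approach:
    many PNGs are concatenated into one S3 object and an index of
    (pack key, offset, length) is kept per tile. The actual S3 PUT and
    SimpleDB writes are only indicated in comments, and all names here
    are illustrative.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TilePacker {

      static class IndexEntry {
        final String packKey;
        final long offset;
        final int length;

        IndexEntry(String packKey, long offset, int length) {
          this.packKey = packKey;
          this.offset = offset;
          this.length = length;
        }
      }

      private final String packKey;
      private final ByteArrayOutputStream pack = new ByteArrayOutputStream();
      private final Map<String, IndexEntry> index =
          new LinkedHashMap<String, IndexEntry>();

      public TilePacker(String packKey) {
        this.packKey = packKey;
      }

      public void add(String tileId, byte[] png) throws IOException {
        // Record where this tile starts inside the packed object.
        index.put(tileId, new IndexEntry(packKey, pack.size(), png.length));
        pack.write(png);
      }

      public void flush() {
        // One PUT for the whole pack instead of one PUT per tile:
        //   s3.put(packKey, pack.toByteArray());            (hypothetical)
        // plus one index row per tile, e.g. in SimpleDB:
        //   tileId -> packKey, offset, length               (hypothetical)
      }
    }

    A tile is then served by looking up its index row and issuing a
    ranged GET against the pack object - though, as noted above, tiles
    served directly through CloudFront still need to be individual
    objects.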
  • Tim robertson at Apr 16, 2009 at 8:15 am
    Thanks Kevin,

    "... well, you're doing it wrong." This is what I'm afraid of :o)

    I know the TaskTracker can, for the maps for example, run multiple
    tasks on the same part of the input file, but I am not so sure about
    the reduce. In the reduce, will the same keys be run on multiple
    machines in competition?



  • Tim robertson at Apr 16, 2009 at 12:38 pm

    However, do the math on the costs for S3. We were doing something
    similar and found that we were spending a fortune on our PUT requests
    at $0.01 per 1000, and next to nothing on storage. I've since moved
    to a more complicated model where I pack many small items into each
    object and store an index in SimpleDB. You'll need to partition your
    SimpleDB domains if you do this.
    Thanks a lot, Kevin, for this - I stupidly overlooked the S3 PUT cost,
    thinking EC2->S3 transfer was free, without realising there is still
    a per-request PUT charge...

    I will reconsider, and will look at copying your approach and
    comparing it with a few rendering EC2 instances running off MySQL or
    similar.

    Thanks again.

    Tim
  • Stuart Sierra at Apr 23, 2009 at 2:08 pm

    On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson wrote:
    However, do the math on the costs for S3. We were doing something similar,
    and found that we were spending a fortune on our put requests at $0.01 per
    1000, and next to nothing on storage.
    I made a similar discovery. The cost of PUT adds up fast. One
    billion PUTs will cost you $10 million!

    -Stuart Sierra
  • Andrew Hitchcock at Apr 23, 2009 at 9:16 pm
    How do you figure? Puts are one penny per thousand, so I think it'd
    only cost $10,000. Here's the math I'm using:

    1 billion * ($0.01 / 1000) = 10,000
    Math courtesy of Google:
    http://www.google.com/search?q=1+billion+*+(0.01+%2F+1000)

    Still expensive, but not unreasonably so.

    Andrew

  • Stuart Sierra at Apr 23, 2009 at 9:46 pm

    On Thu, Apr 23, 2009 at 5:02 PM, Andrew Hitchcock wrote:
    1 billion * ($0.01 / 1000) = 10,000
    Oh yeah, I was thinking $0.01 for a single PUT. Silly me.

    -S
  • Tim robertson at Apr 24, 2009 at 3:47 am
    If anyone is interested, I did finally get round to processing it
    all, and due to the sparsity of the data we have, for all 23 zoom
    levels and all species we have information on, the result was 807
    million PNGs, which is about $8,000 to PUT to S3 - too much for me
    to pay.

    So, like most things, I will probably go for a compromise and
    pre-process 10 zoom levels into S3, which comes in at only $457 (just
    the PUT into S3), and then render the rest on the fly. Only people
    browsing beyond zoom 10 then hit the real-time rendering servers, so
    I think this will work out OK performance-wise.

    Cheers,

    Tim



Discussion Overview
group: common-user @
categories: hadoop
posted: Apr 14, '09 at 9:35a
active: Apr 24, '09 at 3:47a
posts: 14
users: 6
website: hadoop.apache.org...
irc: #hadoop
