Hey everyone,
I've got a relatively basic question, and I was hoping to get some
guidance from the group as to the best path. I'm very familiar with Clojure
and also the Java Hadoop API, but very new to Cascading/Cascalog.

I've got a large number of JSON files (~5MB apiece) sitting on S3, and I'm
trying out Cascalog to do some basic processing on them. I think it's
easier to assume they're not splittable in the context of this problem. I'd
like to set up a source tap for them, but I'm not sure of the best way to do
that in Cascalog.

If I were using the vanilla Hadoop API and I didn't want to learn new
things, I would write a job that loaded these S3 objects into an HDFS
sequence file, specifying a FileInputFormat that is not splittable, in such
a way that my later mappers' inputs could look like Key=Filename, Value=Byte
Array Of File. However, this seems sub-optimal to me.

Looking at the potential source taps for Cascalog, it looks like there are
text-delimited and sequence-file taps. What strategy would y'all recommend for
setting up a source tap for a large number of small JSON files from S3? Is
there another technology that could help me here?

thanks in advance,

--paul

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


  • Paul Lam at Feb 22, 2013 at 11:38 am
    Is your JSON one record per line? If it is, that would be simpler. You
    can use hfs-textline to read each line, then parse it with a Clojure JSON
    library to return tuples.
    If it's not one record per line, then take a look
    at https://github.com/gmarabout/cascading.json, which I have never tried,
    but you'll need a custom Cascading scheme/tap to turn your multi-line JSON
    into Cascading tuples.

    Once you have your data in tuples one way or another, it's just regular
    Cascading/Cascalog afterward.
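
For the one-record-per-line case, the hfs-textline approach above could look roughly like the sketch below. It is only an illustration under assumptions that are not from this thread: Cheshire stands in for "a Clojure JSON library", the "id" and "name" keys are hypothetical, and the path is supplied by the caller (for example an S3 or HDFS directory).

```clojure
(ns example.json-lines
  (:require [cascalog.api :refer :all]
            [cheshire.core :as json]))   ; assumption: Cheshire is the JSON library

;; Pure helper: turn one line of JSON text into a two-field tuple.
;; "id" and "name" are hypothetical keys chosen for illustration.
(defn parse-record [line]
  (let [m (json/parse-string line)]
    [(get m "id") (get m "name")]))

;; Wrap the helper as a Cascalog map operation (one tuple out per line in).
(defmapop parse-line [line]
  (parse-record line))

;; Read every line under `path`, parse it, and print the resulting tuples.
(defn print-records [path]
  (let [lines (hfs-textline path)]
    (?<- (stdout)
         [?id ?name]
         (lines ?line)
         (parse-line ?line :> ?id ?name))))
```

Keeping the parsing in a plain function makes it easy to check at the REPL before running a query, e.g. `(parse-record "{\"id\":\"42\",\"name\":\"widget\"}")`.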


  • Paul Sanwald at Feb 22, 2013 at 3:24 pm
    Thanks, Paul. My source JSON files don't have newlines at all, so I think
    hfs-textline might work for me. If not, cascading.json looks exactly like
    what I was hoping for, so I'll have a look. Much appreciated! Looking
    forward to getting past this initial step so I can get to the interesting
    parts of Cascalog!

    --paul
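
If each file really is a single line, hfs-textline would hand the whole document to the query as one ?line value, and one way to fan it out into many tuples is a mapcat operation. The sketch below is a guess at how that might look, not something confirmed in this thread: it assumes each file holds a JSON array of objects, that the objects carry hypothetical "id" and "name" keys, and that Cheshire is the JSON library. It uses defmapcatop, the Cascalog 1.x macro for operations that emit zero or more tuples per input.

```clojure
(ns example.whole-file-json
  (:require [cascalog.api :refer :all]
            [cheshire.core :as json]))   ; assumption: Cheshire is the JSON library

;; Pure helper: parse one single-line file (assumed to be a JSON array of
;; objects) into a sequence of two-field tuples.
;; "id" and "name" are hypothetical keys chosen for illustration.
(defn file->records [line]
  (for [m (json/parse-string line)]
    [(get m "id") (get m "name")]))

;; mapcat operation: each input line may produce many output tuples.
(defmapcatop explode-records [line]
  (file->records line))

;; Read each single-line file under `path` and print one tuple per record.
(defn print-all-records [path]
  (let [files (hfs-textline path)]
    (?<- (stdout)
         [?id ?name]
         (files ?line)
         (explode-records ?line :> ?id ?name))))
```

Note that this holds each ~5MB document in memory as a single string while it is parsed, which is usually acceptable at that size but would not scale to much larger files.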

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Feb 21, '13 at 8:12p
active: Feb 22, '13 at 3:24p
posts: 3
users: 2
website: clojure.org
irc: #clojure
