I've got a relatively basic question, and I was hoping to get some
guidance from the group as to the best path. I'm very familiar with Clojure
and the Java Hadoop API, but very new to Cascading/Cascalog.
I've got a large number of JSON files (~5MB apiece) sitting on S3, and I'm
trying out Cascalog to do some basic processing on them. For this problem,
I think it's easiest to assume the files are not splittable. I'd like to
set up a source tap for them, but I'm not sure of the best way to do that
in Cascalog.
If I were using the vanilla Hadoop API and didn't want to learn new
things, I would write a job that loads these S3 objects into an HDFS
sequence file, specifying a FileInputFormat that is not splittable, so
that my later mappers' inputs would look like Key=Filename, Value=Byte
Array Of File. However, this seems sub-optimal to me.
Looking at the potential source taps for Cascalog, it looks like there are
text-delimited and sequence-file taps. What strategy would y'all recommend
for setting up a source tap for a large number of small JSON files from S3?
Is there another technology that could help me here?
Thanks in advance,