FAQ
My datasets consist of many Avro files in HDFS; I typically prototype by
copying one of these files locally and working against that. When the time
comes to run against the full dataset in HDFS, I flip a switch in my code.

Has anyone automated this process (or does anyone have a better one)?

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  • Paco Nathan at Jul 15, 2013 at 5:41 pm
    If I understand correctly, flipping that switch requires a recompile?
    You could pass the tap locations as command-line arguments; then you can
    swap out the files without changing code.

    Paco
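
The suggestion above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Cascalog code: the flag names `--input` and `--output`, the default sample path, and the `describe_run` stand-in are all assumptions made for the example. In a real job, the resolved locations would be handed to whatever tap constructors the query uses.

```python
import argparse

# Hypothetical default: a single Avro file copied out of HDFS for prototyping.
LOCAL_SAMPLE = "data/sample.avro"


def parse_args(argv=None):
    """Read the source and sink locations from the command line."""
    parser = argparse.ArgumentParser(
        description="Run the job against a chosen dataset.")
    parser.add_argument(
        "--input", default=LOCAL_SAMPLE,
        help="source location (local file or HDFS directory)")
    parser.add_argument(
        "--output", default="out/",
        help="sink location for the results")
    return parser.parse_args(argv)


def describe_run(args):
    """Stand-in for building the taps and running the query."""
    return f"reading {args.input}, writing {args.output}"


if __name__ == "__main__":
    print(describe_run(parse_args()))
```

Running the script with no arguments uses the local sample; passing `--input` with an HDFS location points the same code at the full dataset, with no code change or rebuild.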

  • Mason at Jul 15, 2013 at 5:51 pm
    Yes, that would be an improvement. The main thing I'm wondering about is
    if anyone has automated the process of pulling out a subset of data to
    prototype against. Pig has ILLUSTRATE, for instance, which does
    something like this.
  • Jeroen van Dijk at Jul 17, 2013 at 7:29 am
    Hi Mason,

    If you are feeling adventurous, you could try my library, which does the
    thing you describe (and a lot more):
    https://github.com/jeroenvandijk/cascalog-graph

    I've chosen to use "file" extensions such as my-data.hfs-seqfile to
    indicate the type of a tap. Where that's not possible (e.g. for previous
    taps), you can use a prefix such as hfs-seqfile:my-data. This lets you
    write queries (and in this case complete workflows) without ever writing
    the type of the output (or input). As for prototyping, this lets you
    easily use subsets, or sets of different types, to test before launching
    a job on the real set.

    The library is in use here in production by me and my team, but it might
    still be a bit young for external usage.

    Please let me know if you have feedback or need help using the library.

    Jeroen
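
To make the naming convention above concrete, here is a minimal sketch of how such names could be resolved into a tap type and a path. It is written in Python for illustration only and is not how cascalog-graph is implemented: the `TapSpec` type, the `parse_tap_spec` function, and the set of recognised type names are all hypothetical.

```python
from typing import NamedTuple, Optional

# Assumed set of recognised tap type names, for illustration only.
KNOWN_TAP_TYPES = {"hfs-seqfile", "hfs-textline"}


class TapSpec(NamedTuple):
    tap_type: Optional[str]  # None when no type could be inferred
    path: str


def parse_tap_spec(spec: str) -> TapSpec:
    """Infer a tap type from a 'type:path' prefix or a '.type' extension."""
    # Prefix form, e.g. "hfs-seqfile:my-data". A URI scheme such as
    # "hdfs" is not a known tap type, so it falls through to the next check.
    prefix, sep, rest = spec.partition(":")
    if sep and prefix in KNOWN_TAP_TYPES:
        return TapSpec(prefix, rest)

    # Extension form, e.g. "my-data.hfs-seqfile". The extension is part
    # of the file name, so the path is returned unchanged.
    _, dot, ext = spec.rpartition(".")
    if dot and ext in KNOWN_TAP_TYPES:
        return TapSpec(ext, spec)

    # Nothing recognised: leave the type for the caller to decide.
    return TapSpec(None, spec)
```

With a resolver of this kind, switching from a small local subset to the full dataset is only a change of name, since the type travels with the name rather than being written into the query.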

  • Mason at Aug 5, 2013 at 5:41 pm
    Thanks Jeroen. I'll check it out.

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Jul 15, '13 at 5:35p
active: Aug 5, '13 at 5:41p
posts: 5
users: 3
website: clojure.org
irc: #clojure
