Is there an example (or recommendation) of how to process (transform)
and query large XML files?

The scenario is the following:

a) hierarchical XML with a 5-level-deep hierarchy (meaning at most 4
levels of nodes underneath a level-1 node). Each node can contain up
to 50 values and around 100 attributes.

b) each file is about 40 GB; there are up to 100 such files, with an
expected processing time of 2 hours per file (use case c-1 below), and
all 100 files are processed concurrently. There is no relationship
across files, and the overhead of 'starting' a c-1 run is about 10
minutes.


c) Two types of processing:
    1) CSV output as the result of transformations, where a
transformation looks at one level-1 node (and its children) at a time.
The CSV is expected to be around 1 GB per 40 GB of XML.

    2) ad-hoc query capability


Another question (not specific to Cascalog per se, but I'd appreciate
thoughts): how should the cluster be sized to finish the 100 files in
under 4 hours?


Thank you in advance.


  • Paul Lam at Mar 25, 2012 at 8:54 am
    The problem here is that XML is multi-line. The easiest way I can think
    of is to pre-process the files so that the XML is one node per line
    before dumping it onto HDFS. Then use an XML parsing library as you read
    the file in line by line with hfs-textline.
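
    A minimal sketch of that pipeline (assumptions: Cascalog 1.x with
    clojure.data.xml on the classpath; the :id attribute and the output
    shape are hypothetical, since the real schema isn't shown):

    (ns xml-to-csv
      (:use cascalog.api)
      (:require [clojure.data.xml :as xml]
                [clojure.string :as str]))

    ;; Parse one level-1 node (one line after pre-processing) and emit
    ;; one tuple per child element.
    (defmapcatop parse-node
      [line]
      (let [node (xml/parse-str line)]
        (for [child (:content node)
              :when (map? child)]                 ; skip text/whitespace
          [(get-in node [:attrs :id] "?")         ; hypothetical :id attr
           (name (:tag child))
           (str/join "|" (filter string? (:content child)))])))

    (defn -main [in-path out-path]
      (?<- (hfs-textline out-path)
           [?csv]
           ((hfs-textline in-path) ?line)
           (parse-node ?line :> ?id ?tag ?vals)
           (str ?id "," ?tag "," ?vals :> ?csv)))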

    About the cluster size: you said 2 hours per file and have 100 files to
    finish within 4 hours, so each machine can work through 2 files in that
    window. 50 machines?
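
    As a back-of-the-envelope formula (assumption: each worker handles one
    file at a time, and the 10-minute startup overhead is paid per file):

    (defn workers-needed
      "Concurrent file slots needed to finish all files by the deadline."
      [files hours-per-file startup-hours deadline-hours]
      (long (Math/ceil (/ (* files (+ hours-per-file startup-hours))
                          deadline-hours))))

    (workers-needed 100 2 (/ 10.0 60) 4)
    ;; => 55, i.e. roughly 50-55 machines once startup overhead is included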


  • Vladislav p at Mar 26, 2012 at 3:11 am
    Hi, preprocessing the XML into one node per line (combining data from
    up to 4 levels deep) would add another IO read and another IO write
    (essentially doubling IO). Is it somehow possible in Cascalog to
    pre-declare an XML structure up to 5 levels deep and have it act as the
    'row-oriented container' for the rest of the operations?

    For the cluster size, thank you for the recommendation. I do not have a
    view of my own; I thought some capacity-planning formula or method
    might be out there.



  • Paul Lam at Mar 26, 2012 at 9:02 am
    Try this:
    https://github.com/apache/mahout/blob/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
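
    A hedged sketch of wiring it up from Clojure (assumptions: the Mahout
    example class linked above with its "xmlinput.start" / "xmlinput.end"
    configuration keys, and a hypothetical <record> level-1 tag). Each map
    input value then arrives as the full text of one element, newlines
    included, so no one-node-per-line pre-processing pass is needed:

    (import '[org.apache.hadoop.conf Configuration]
            '[org.apache.hadoop.mapreduce Job]
            '[org.apache.mahout.classifier.bayes XmlInputFormat])

    (defn xml-split-job
      "Hadoop job whose input records are whole <record>...</record> elements."
      []
      (let [conf (doto (Configuration.)
                   (.set "xmlinput.start" "<record>")   ; hypothetical tag
                   (.set "xmlinput.end" "</record>"))]
        (doto (Job. conf "xml-split")
          (.setInputFormatClass XmlInputFormat))))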


  • Sam Ritchie at Mar 26, 2012 at 4:51 pm
    I'd love an XML tap in cascalog-contrib if you guys tackle this one. Let me
    know if you're up for building it publicly and I'll add you to the
    Cascalog-Contrib repo.

    Cheers,
    Sam

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Kevin at Jul 18, 2013 at 10:34 pm
    Do you happen to have an example of how I would use this?
  • Tom_b at Mar 26, 2012 at 4:52 pm
    I have used nested sets <http://en.wikipedia.org/wiki/Nested_set_model>
    for transforming XML into a more relational (or tuple-friendly) format.
    My experience is with thousands of XML files that need to be easy to
    ad-hoc query and join with relational data. Nested sets preserve the
    hierarchical relationships by augmenting the nodes with a little extra
    data.

    You'll have to pre-process your XML into the nested-set structured data.
    Once you have done that, you can load it into your data store for
    querying with Cascalog. I did this
    <https://github.com/tom-b/xml2nestedsets/blob/master/xml2nestedset.lisp>
    in Common Lisp, but the XML parser I use won't work for you (it is
    in-memory and your 40 GB XML would probably choke it).
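
    A minimal sketch of that encoding (in Clojure rather than Common Lisp,
    to match the rest of this thread): a depth-first walk assigns each node
    a (left, right) pair, so a node's descendants are exactly the rows
    whose interval nests inside its own.

    (require '[clojure.data.xml :as xml])

    (defn nested-sets
      "Flatten an XML element tree into rows of [tag left right depth]."
      [root]
      (let [counter (atom 0)
            rows    (atom [])]
        (letfn [(walk [node depth]
                  (let [left (swap! counter inc)]
                    (doseq [child (:content node)
                            :when (map? child)]      ; skip text nodes
                      (walk child (inc depth)))
                    (swap! rows conj [(name (:tag node))
                                      left
                                      (swap! counter inc)  ; right bound
                                      depth])))]
          (walk root 0)
          (sort-by second @rows))))

    (nested-sets (xml/parse-str "<a><b/><c><d/></c></a>"))
    ;; => (["a" 1 8 0] ["b" 2 3 1] ["c" 4 7 1] ["d" 5 6 2])
    ;; descendants of "c": rows with left > 4 and right < 7, i.e. "d"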

    Transforming the output to CSV is too painful in my case. What I really
    wanted was to take the one-row-per-node-type-and-value output and apply
    a pivot-table operation that made the CSV more usable for end users.
    But my XML data is too variable: a parent node of type A could have
    from 0 to N child nodes (with each child node also having a similar
    number of child nodes), so the pivot had to preserve the relationships
    between node types and child nodes. Maybe you are luckier and have a
    known number of child nodes for each parent node type.

    I find that the nested-sets representation works really well for querying
    and indexing by node type, etc.

    BTW, although it is very SQL- and RDBMS-oriented, you might find Joe
    Celko's book on trees and hierarchies useful in your case.

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Mar 24, '12 at 11:57p
active: Jul 18, '13 at 10:34p
posts: 7
users: 6
website: clojure.org
irc: #clojure
