I have used nested sets <http://en.wikipedia.org/wiki/Nested_set_model> for
transforming XML into a more relational (or tuple-) friendly format. My
experience is with thousands of XML files that need to be easy to ad-hoc
query and join with relational data. Nested sets preserve the hierarchical
relationships by augmenting the nodes with a little extra data.
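To make that concrete, here is a rough sketch (in Python rather than the
Common Lisp I used; the tuple shape is illustrative, not a fixed schema) of
how the nested-set model assigns each node a (left, right) pair from a
depth-first walk:

```python
# Nested set model: a depth-first traversal numbers each node on the way
# in (left) and on the way out (right). A node's descendants are exactly
# the nodes whose bounds fall strictly inside its own.

def number_tree(node, counter=None, rows=None):
    """node is a (tag, [children]) tuple; returns [(tag, left, right), ...]."""
    if counter is None:
        counter = [0]
        rows = []
    counter[0] += 1
    left = counter[0]
    tag, children = node
    for child in children:
        number_tree(child, counter, rows)
    counter[0] += 1
    rows.append((tag, left, counter[0]))
    return rows

tree = ("book", [("chapter", [("section", [])]),
                 ("chapter", [])])
for tag, left, right in sorted(number_tree(tree), key=lambda row: row[1]):
    print(tag, left, right)
# book 1 8 / chapter 2 5 / section 3 4 / chapter 6 7
```

Each row is now flat and self-contained, which is what makes the result
easy to load into a tuple-oriented store.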
You'll have to pre-process your XML into the nested-set structure. Once you
have done that, you can load it into your data store for querying with
cascalog. I did this in Common Lisp
<https://github.com/tom-b/xml2nestedsets/blob/master/xml2nestedset.lisp>,
but the XML parser I use won't work for you (it is in-memory, and your
40 GB XML would probably choke it).
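For files that size, a streaming parser should work where an in-memory one
won't. Here is a sketch, under the assumption that Python's
xml.etree.ElementTree.iterparse is acceptable; it only keeps the stack of
currently open elements in memory while emitting nested-set rows:

```python
import io
import xml.etree.ElementTree as ET

def xml_to_nested_sets(source):
    """Stream an XML document and emit (tag, left, right) nested-set rows.

    iterparse fires start/end events as it reads, so only the open-element
    stack is held in memory, not the whole document.
    """
    counter = 0
    stack = []  # (tag, left bound) of currently open elements
    rows = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            counter += 1
            stack.append((elem.tag, counter))
        else:  # "end"
            counter += 1
            tag, left = stack.pop()
            rows.append((tag, left, counter))
            elem.clear()  # release the finished element's children
    return rows

doc = io.StringIO("<a><b><c/></b><b/></a>")
print(xml_to_nested_sets(doc))
# [('c', 3, 4), ('b', 2, 5), ('b', 6, 7), ('a', 1, 8)]
```

In a real pipeline you would write each row out as it is produced rather
than accumulating a list, but the numbering logic is the same.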
Transforming the output to CSV was too painful in my case: what I really
wanted was a pivot-table operation that would take the one-row-per-node-type-and-value
output and make the CSV more usable for end users. But my XML data is too
variable - a parent node of type A could have anywhere from 0 to N child
nodes (with each child node in turn having a similar number of children),
so the pivot had to preserve the relationships between node types and child
nodes. Maybe you are luckier and have a known number of child nodes for
each parent node type.
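If your schema really is that regular, the pivot is straightforward. A
small sketch (the row shape and child-type names are made up for
illustration) of going from one-row-per-value to one wide CSV row per
parent:

```python
import csv
import io
from collections import defaultdict

# One row per (parent id, child node type, value), as produced by
# flattening the XML. Child types are assumed fixed and known up front -
# the easy case; variable schemas are where this breaks down.
rows = [
    (1, "name", "widget"), (1, "price", "9.99"),
    (2, "name", "gadget"), (2, "price", "19.99"),
]
child_types = ["name", "price"]

# Pivot: group values by parent, then emit one column per child type.
by_parent = defaultdict(dict)
for parent, ctype, value in rows:
    by_parent[parent][ctype] = value

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["parent"] + child_types)
for parent in sorted(by_parent):
    writer.writerow([parent] + [by_parent[parent].get(t, "") for t in child_types])
print(out.getvalue())
```

With 0-to-N children per parent this dictionary-of-columns approach stops
working, which is exactly the trouble I ran into.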
I find that the nested-sets representation works really well for querying
and indexing by node type, etc.
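The reason it queries well is that "all descendants of node X" collapses
into a simple range test on the bounds - no recursion or joins per level.
A minimal illustration (pure Python, with the equivalent SQL shown as a
comment):

```python
# With nested-set bounds, the descendants of a node with bounds
# (left, right) are exactly the rows strictly inside that range.
rows = [("book", 1, 8), ("chapter", 2, 5), ("section", 3, 4), ("chapter", 6, 7)]

def descendants(rows, left, right):
    return [r for r in rows if left < r[1] and r[2] < right]

# Everything under the first chapter (bounds 2..5):
print(descendants(rows, 2, 5))
# → [('section', 3, 4)]

# The equivalent SQL over a nodes(tag, lft, rgt) table would be:
#   SELECT * FROM nodes WHERE lft > 2 AND rgt < 5;
```

Indexing the left/right columns makes those range scans cheap, which is
what makes ad-hoc querying by node type practical.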
BTW, although it is very SQL- and RDBMS-oriented, you might find Joe
Celko's book on trees and hierarchies useful in your case.
On Saturday, March 24, 2012 7:57:36 PM UTC-4, vladislav p wrote:
Is there an example (or recommendation) on how to process (transform)
and query large XML files?
Scenario is the following
a) have hierarchical XML with a 5-level-deep hierarchy (meaning max 4
levels of nodes underneath a level 1 node). Each node can contain up to
50 values and around 100 attributes
b) file size is about 40 GB; there are up to 100 files like that, with
expected processing time per file of 2 hours (use case c-1 below), and
all 100 files are processed concurrently. No relationship across files,
and the overhead of 'starting' a c-1 processing run is about 10
c) Two types of processing
1) CSV output as the result of transformations (where a transformation
looks at one level 1 node (and its children) at a time). CSV expected
to be around 1 GB in size per each 40 GB XML.
2) ad-hoc query capability
Another question (not relevant to cascalog per se, but I will
appreciate thoughts): the sizing of the cluster to finish the
100 files under 4 hours.
thank you in advance