Hi Rich,

  I am very curious about your work. With the Node 0.10 release I have
been searching for an ETL tool using Node streams.

  I am seeking the Node.js equivalent of http://pandas.pydata.org/ (an
amazing Python tool for ETL).

  Could you post an example of what you've accomplished already?

Thank you.

On Friday, April 22, 2011 at 22:58:41 UTC+2, Rich Schiavi wrote:
Luke,

Not sure if my other reply got sent, but the approach I've explored,
and which is working well, is to stage each part of the ETL process
instead of trying to interleave the db writes.

Basically, I'm ingesting about 300MB worth of XML that needs all sorts
of element transforms before it gets into mysql. I first parse it with
SAX, generating each set of objects. From those objects, instead of
doing row-by-row mysql inserts, I dump them out to TSV files, which can
then be loaded much more efficiently with MySQL's LOAD DATA INFILE.

For me, though, the key to not slamming the database (or even node) is
to make sure each stage completes before trying to do async/parallel
database writes. For ETL this is probably better anyway, to avoid
failures and partial writes if parsing fails or something like that.

Let me know if you have questions and I can post my example code. It's
surprisingly small for the amount of work it does: 300MB out to MySQL
load files in about 160 seconds.
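
For flavor, a minimal sketch of that staged shape, assuming the "sax"
npm module (the element name and columns here are made up for
illustration, not Rich's actual code):

var fs = require('fs');
var sax = require('sax');

// Stage 1: stream the XML through a SAX parser and write one TSV row
// per record. No database work happens yet.
var saxStream = sax.createStream(true); // strict mode
var tsv = fs.createWriteStream('items.tsv');

saxStream.on('opentag', function (node) {
  if (node.name === 'item') { // hypothetical element name
    var a = node.attributes;
    tsv.write([a.id, a.name, a.price].join('\t') + '\n');
  }
});

saxStream.on('end', function () {
  tsv.end(function () {
    // Stage 2: only now bulk-load the staged file, e.g.:
    //   LOAD DATA LOCAL INFILE 'items.tsv' INTO TABLE items
    //   FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
  });
});

fs.createReadStream('big.xml').pipe(saxStream);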

On Mar 15, 1:26 am, Luke Monahan wrote:
A quick update.

I figured out that I could very easily call Stream.pause() on the
internal stream in the CSV library. Once I had this in mind, it wasn't
hard to keep a "currentlyRunning" counter of CSV rows that had been
parsed but not yet inserted into the database. Once currentlyRunning
gets above 100k I pause the stream, and I resume it when the counter
drops below 10k.
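
Roughly like this (a sketch; csvStream is the CSV library's internal
stream, and saveRow is a stand-in for the async riak write):

var PAUSE_AT = 100000;  // pause above this many unsaved rows
var RESUME_AT = 10000;  // resume once the backlog drains below this
var currentlyRunning = 0;

csvStream.on('data', function (row) {
  currentlyRunning++;
  if (currentlyRunning > PAUSE_AT) csvStream.pause();

  saveRow(row, function (err) { // async write to the database
    currentlyRunning--;
    if (currentlyRunning < RESUME_AT) csvStream.resume();
  });
});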

This way everything runs smoothly without huge amounts of memory. I
can drop the number of riak client connections down to a more sane
level (4 or so works for me) and the whole lot is much faster.

Thanks,

Luke.

On Mar 10, 9:40 am, Luke Monahan wrote:

node-inspector is working well, thanks.

I've been able to dump the heap before processing and during
processing. I'm seeing large numbers of String, Object, and closure
constructors, which between them contribute 99% of the heap. The tool
doesn't let me drill down any further than that, as far as I can tell.
But the total heap size is well and truly less than a problem -- 20MB
or so with a largeish file, whereas the operating system is telling me
the entire process is using over 200MB at the same time. Am I
misunderstanding the output? Or is this pointing in another direction,
such as a recursive call that is creating a huge stack?
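
One quick sanity check here is process.memoryUsage(): Buffers and other
C++-side allocations show up in rss but never in a V8 heap snapshot,
which could account for a gap like this. A rough sketch:

// Log V8 heap usage vs. the OS-level resident set size every 5 seconds.
setInterval(function () {
  var m = process.memoryUsage();
  console.log('rss: ' + Math.round(m.rss / 1048576) + 'MB, ' +
              'heapUsed: ' + Math.round(m.heapUsed / 1048576) + 'MB');
}, 5000);
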
I've managed to make it work fine by cutting my CSV files into smaller
chunks (100MB each) and processing them serially, ensuring the number
of unsaved CSV rows is 0 before starting on the next file. I've also
worked out why I can't use the protobuf API (I need to install this:
http://code.google.com/p/protobuf-for-node/). I'll spend a little while
getting that working, but then I'll probably move on to the more
interesting parts of my idea than just loading up data.
Thanks again,
Luke.
On Mar 10, 2:51 am, Nicholas Campbell <nicholas.j.campb...@gmail.com>
wrote:
Valgrind is good, but there is also
https://github.com/dannycoates/node-inspector. I haven't yet had a need
for it, but I know others who have used it and love it.
- Nick Campbell
http://digitaltumbleweed.com
On Wed, Mar 9, 2011 at 2:08 AM, Luke Monahan wrote:
Thanks for the suggestions. It looks like a custom stream would be a
good idea, as the "pump" between streams already has throttling baked
in. I haven't looked at this in depth though, so I'll do that shortly
to figure out if it's viable. A custom CSV parser that specifically
supports throttling will be my next option.
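
For what it's worth, a rough streams1-style sketch of such a custom
stream (untested; Throttle, maxPending, and the done() hook are made-up
names): write() returns false once too many rows are in flight, which
makes pipe() pause the source, and emitting 'drain' resumes it.

var util = require('util');
var Stream = require('stream').Stream;

function Throttle(maxPending) {
  Stream.call(this);
  this.readable = true;
  this.writable = true;
  this.pending = 0;
  this.max = maxPending;
}
util.inherits(Throttle, Stream);

Throttle.prototype.write = function (row) {
  this.pending++;
  this.emit('data', row);
  return this.pending < this.max; // false => pipe() pauses the source
};

// Call this from the db callback once a row has been saved.
Throttle.prototype.done = function () {
  if (--this.pending < this.max) this.emit('drain'); // pipe() resumes
};

Throttle.prototype.end = function () {
  this.emit('end');
};

You'd pipe the csv stream into it, listen for 'data', and call done()
from the riak callback after each row is saved.
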
I'm using the REST API, as the protobuf support in riak-js doesn't seem
to work for me -- I just get an error trying to instantiate the
connection. I might have to switch to git head or something instead of
the npm package to try that out?
I'll see if I can profile to find where the memory is specifically
going before all of the above anyway. I just found I could only get
about 200MB of the file fully processed and saved before "Allocation
failed - process out of memory", which hits at 1GB I think due to V8.
There's obviously more to the story than just rows of CSV being kept in
memory to cause this. Is there a recommended tool/method to trace
memory leaks in nodejs? The valgrind method is all I could find in this
group, and it seemed beyond me...
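
(For reference, V8's default old-space limit is roughly 1GB on 64-bit
builds, though the exact figure varies by node/V8 version. It can be
raised from the command line if you need headroom while tracking the
leak down, e.g.:

node --max-old-space-size=2048 app.js
)
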
Luke.
On Mar 9, 5:11 pm, Nicholas Campbell <nicholas.j.campb...@gmail.com>
wrote:
Luke,

A couple GB of data shouldn't crush the db when writing to it (in
theory ;D).

You've identified that Riak is the slowdown? What backend to Riak are
you using? Could it be the module? Are you using proto_bufs or the REST
API?

Have you used the inspector? Where is the memory increase happening: in
Node, due to the slowdown/backup?
- Nick Campbell
On Tue, Mar 8, 2011 at 11:28 PM, Marco Rogers <marco.rog...@gmail.com>
wrote:
Luke. ETL is an interesting use case for node, and one I haven't heard
a lot of people talking about. Some quick thoughts.

- Your pooling tactic is a good idea in general. Use as many
connections as your db can handle for a high write load.
- It sounds like you're aware that your main problem is throttling the
rows coming from the CSV. Node fs streams are fast :) You're in the
right neighborhood with the pause() method. You want to couple this
with whatever method you're using to determine load on the DB.
- You could read a number of items from the csv stream, then pause it
and process that batch. Resume when the batch is done.
- Or you could let the stream flow until some alert tells you that the
database is saturated with writes, then call pause on the csv stream
until it cools down.

If this is an infrequent job, I would go with the second option. If you
want to run this longer term, I would probably explore the first (a
sketch follows below), and tweak the batch size and pool size to keep
throughput high.
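
A rough sketch of that first option (batching), with csvStream,
BATCH_SIZE, and saveBatch as hypothetical stand-ins for your own stream
and db code:

var BATCH_SIZE = 1000;
var batch = [];

csvStream.on('data', function (row) {
  batch.push(row);
  if (batch.length >= BATCH_SIZE) {
    csvStream.pause();                    // stop reading while we write
    saveBatch(batch, function (err) {     // async bulk write to the db
      batch = [];
      csvStream.resume();                 // batch done, read more rows
    });
  }
});

csvStream.on('end', function () {
  if (batch.length) saveBatch(batch, function () { /* all done */ });
});
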
Lastly, I don't think you're missing anything magical with node. It's
supposed to be pretty low level, and people are building handy
utilities on top of it. It might help to read more about streams and
maybe think about writing your own.

Your custom stream could sit between the csv file stream and the
database writing code. It would handle the batching or throttling any
way you wanted.

I'm not sure this is actually helpful. You can always gist some code
and ask more specific questions too. Feel free to share your
experiences with this as well.
:Marco