Howdy folks, I'm working on a JSON Python module [1] and I'm struggling with an
appropriate syntax for dealing with incrementally parsing streams of data as
they come in (off a socket or file object).

The underlying C-level parsing library that I'm using (Yajl [2]) already uses a
callback system internally for handling such things, but I'm worried about:
* Ease of use, simplicity
* Python method invocation overhead going from C back into Python

One of the ideas I've had is to "iterparse" a la:
>>> for k, v in yajl.iterloads(fp):
...     print('key, value', k, v)

Effectively, this builds a generator over the JSON string coming off the `fp`
object, reading more of the stream each time generator.next() is called.
This has some shortcomings, however:
* For JSON like: '''{"rc":0,"data":<large JSON object>}''' the iterloads()
function would block for some time when processing the value of the "data"
key.
* Presumes the developer has prior knowledge of the kind of JSON strings
being passed in

I've searched around, following this "iterloads" notion, for a tree-generator
and I came up with nothing.

Any suggestions on how to accomplish iterloads, or perhaps a suggestion for a
more sensible syntax for incrementally parsing objects from the stream and
passing them up into Python?

Cheers,
-R. Tyler Ballance
--------------------------------------
Jabber: rtyler at jabber.org
GitHub: http://github.com/rtyler
Twitter: http://twitter.com/agentdero
Blog: http://unethicalblogger.com



[1] http://github.com/rtyler/py-yajl
[2] http://lloyd.github.com/yajl/




  • Nobody at Dec 5, 2009 at 6:28 pm

    On Fri, 04 Dec 2009 13:51:15 -0800, tyler wrote:

    Howdy folks, I'm working on a JSON Python module [1] and I'm struggling with
    an appropriate syntax for dealing with incrementally parsing streams of data
    as they come in (off a socket or file object).

    The underlying C-level parsing library that I'm using (Yajl [2]) already uses
    a callback system internally for handling such things, but I'm worried about:
    * Ease of use, simplicity
    * Python method invocation overhead going from C back into Python

    One of the ideas I've had is to "iterparse" a la:
    >>> for k, v in yajl.iterloads(fp):
    ...     print('key, value', k, v)
    Effectively, this builds a generator over the JSON string coming off the `fp`
    object, reading more of the stream each time generator.next() is called.
    This has some shortcomings, however:
    * For JSON like: '''{"rc":0,"data":<large JSON object>}''' the iterloads()
    function would block for some time when processing the value of the "data"
    key.
    * Presumes the developer has prior knowledge of the kind of JSON strings
    being passed in

    I've searched around, following this "iterloads" notion, for a tree-generator
    and I came up with nothing.

    Any suggestions on how to accomplish iterloads, or perhaps a suggestion for a
    more sensible syntax for incrementally parsing objects from the stream and
    passing them up into Python?

    One option is to return values as opaque objects with .type() and .data()
    methods. The opaque object can be returned as soon as the parser starts to
    parse the value.

    If the user calls the .data() method for an atomic object (string,
    number, boolean, null) before parsing is complete, the call will block.
    For composite objects (array, object), the call will return an iterator
    immediately.

    If the user never calls the data() method, there's no need to convert the
    element to a Python value. If the object's refcount reaches zero while
    parsing is still ongoing, the parser can discard any existing data and
    discard further data as it is read.

    E.g. a program to read JSON data from stdin and print the data back to
    stdout in (approximately) JSON format would look like:

    def print_json(f, node):
        if node.type() == json.NULL:
            f.write("null")
        elif node.type() == json.BOOL:
            f.write("true" if node.data() else "false")
        elif node.type() == json.NUMBER:
            f.write(node.data())
        elif node.type() == json.STRING:
            f.write('"' + node.data() + '"')
        elif node.type() == json.ARRAY:
            f.write('[')
            for i, v in enumerate(node.data()):
                if i > 0: f.write(',')
                print_json(f, v)
            f.write(']')
        elif node.type() == json.OBJECT:
            f.write('{')
            for i, (k, v) in enumerate(node.data()):
                if i > 0: f.write(',')
                print_json(f, k)
                f.write(": ")
                print_json(f, v)
            f.write('}')

    root = json.parse(sys.stdin)
    print_json(sys.stdout, root)

    For greater pythonicity, you could make the composite types implement the
    iterator interface directly, so the data() method becomes redundant (if
    called, it would just return "self"), and use distinct classes for the
    distinct types (so that you can use type() or isinstance()), i.e.:

    def print_json(f, node):
        if isinstance(node, json.Null):
            f.write("null")
        elif isinstance(node, json.Bool):
            f.write("true" if node.data() else "false")
        elif isinstance(node, json.Number):
            f.write(node.data())
        elif isinstance(node, json.String):
            f.write('"' + node.data() + '"')
        elif isinstance(node, json.Array):
            f.write('[')
            for i, v in enumerate(node):
                if i > 0: f.write(',')
                print_json(f, v)
            f.write(']')
        elif isinstance(node, json.Object):
            f.write('{')
            for i, (k, v) in enumerate(node):
                if i > 0: f.write(',')
                print_json(f, k)
                f.write(": ")
                print_json(f, v)
            f.write('}')

    root = json.parse(sys.stdin)
    print_json(sys.stdout, root)

    If you think that some of the individual strings may be large, you could
    make the String class implement the iterator interface (or even the file
    interface with .read() etc), to allow the data to be read incrementally:

    elif isinstance(node, json.String):
        f.write('"')
        for s in node:
            f.write(s)
        f.write('"')
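    The incremental-string idea can be sketched with a queue between the parser
    thread and the consumer. The LazyString class below is hypothetical (not
    py-yajl's API); feed() stands in for yajl's string callback, which does
    deliver string data in fragments:

```python
import queue
import threading

class LazyString:
    """A string node whose content arrives incrementally.

    Interface sketch only: the parsing thread calls feed() once per
    fragment and close() when the closing quote is seen; a consumer
    iterating the node blocks until fragments arrive.
    """
    _DONE = object()

    def __init__(self):
        self._fragments = queue.Queue()

    def feed(self, fragment):
        self._fragments.put(fragment)

    def close(self):
        self._fragments.put(self._DONE)

    def __iter__(self):
        while True:
            item = self._fragments.get()  # blocks until a fragment arrives
            if item is self._DONE:
                return
            yield item
```

    A consumer can then do `''.join(node)` (or write each fragment straight
    to a file) without ever holding more than one fragment plus the output.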

    The main point is to allow the node object to be returned as soon as the
    type is known, without having to wait until the data has been fully
    parsed, and to require a separate step (for which there may be various
    choices) in order to actually retrieve the data.

Discussion Overview
Group: python-list @ python.org
Posted: Dec 4, 2009 at 9:51 PM
Active: Dec 5, 2009 at 6:28 PM
Posts: 2; users in discussion: 2 (Tyler: 1 post, Nobody: 1 post)
