Hello,

I'm pleased to announce pyxser-1.2r, a Python-Object to XML
serializer and deserializer. This package is written entirely in C
and licensed under the LGPLv3.

The tested Python versions are 2.5.X and 2.7.X.

* home page:
  http://coder.cl/software/pyxser

* hosted at:
  http://sourceforge.net/projects/pyxser/

* pypi entry:
  http://pypi.python.org/pypi?:action=display&name=pyxser&version=1.2r

The current ChangeLog is as follows:

- -----8<----------8<----------8<----------8<-----
1.2r (2009.08.23):

        Daniel Molina Wegener <dmw at coder.cl>
        * Added encoded serialization of Unicode strings, using
        the user-defined encoding passed to the serialization
        functions as the enc parameter
        * Refactored some functions into better-ordered code.
- -----8<----------8<----------8<----------8<-----

As you can see, Unicode strings are now serialized as encoded byte
strings, using the encoding that the user passes as the enc parameter
to the serialization function. This means that Unicode strings are
serialized in a human-readable form, allowing better interoperability
with other platforms.

Best regards...
- --
 .O. | Daniel Molina Wegener  | FreeBSD & Linux
 ..O | dmw [at] coder [dot] cl | Open Standards
 OOO | http://coder.cl/        | FOSS Developer


  • Stefan Behnel at Aug 24, 2009 at 7:16 am

    Daniel Molina Wegener wrote:
    * Added encoded serialization of Unicode strings, using
    the user-defined encoding passed to the serialization
    functions as the enc parameter

    As you can see, Unicode strings are now serialized as encoded byte
    strings, using the encoding that the user passes as the enc parameter
    to the serialization function. This means that Unicode strings are
    serialized in a human-readable form, allowing better interoperability
    with other platforms.
    You mean, the whole XML document is serialised with that encoding, right?

    Stefan
  • Daniel Molina Wegener at Aug 24, 2009 at 12:29 pm
    Stefan Behnel <stefan_ml at behnel.de>
    on Monday 24 August 2009 03:16
    wrote in comp.lang.python:

    Daniel Molina Wegener wrote:
    * Added encoded serialization of Unicode strings, using
    the user-defined encoding passed to the serialization
    functions as the enc parameter

    As you can see, Unicode strings are now serialized as encoded byte
    strings, using the encoding that the user passes as the enc parameter
    to the serialization function. This means that Unicode strings are
    serialized in a human-readable form, allowing better interoperability
    with other platforms.
    You mean, the whole XML document is serialised with that encoding, right?
    Well, if you call pyxser with:

    xmldocstr = pyxser.serialize(obj = someobj, enc = "utf-8", depth = 3)

    pyxser will serialize someobj into an XML document (xmldocstr) using
    the utf-8 encoding, with a depth of three levels in the object tree.
    If the object tree contains Unicode objects, those objects are encoded
    into utf-8 too. And yes, it means that Unicode objects are encoded with
    the document encoding, and as you say, the whole XML document has one
    encoding. There is no mixing of byte strings with different encodings
    in the output document.

    When the object is restored using pyxser.unserialize:

    pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")

    Unicode objects are restored as Unicode objects, by decoding the
    utf-8 strings in the document. Byte string objects are restored as
    byte string objects, and all other object types are restored as
    their original types.

    Another issue is that if you have byte strings with mixed encodings
    in your object tree, such as iso-8859-1 and utf-8, and you try to
    serialize that object, pyxser will print serialization errors to
    stdout while trying to handle those mixed encodings, which do not
    match the document encoding.
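    The failure mode described here can be reproduced in plain Python,
    independent of pyxser: bytes that are valid iso-8859-1 are generally
    not valid utf-8, so recoding them toward the document encoding fails.
    (A sketch; the byte values are made up for illustration.)

```python
# A Latin-1 encoded byte string: "a\xf1o" is "a", n-tilde, "o" in iso-8859-1.
latin1_bytes = u"a\u00f1o".encode("iso-8859-1")

# Treating it as utf-8 (the document encoding) fails: 0xf1 opens a
# four-byte UTF-8 sequence, and the following 'o' is not a continuation byte.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("recoding failed: %s" % exc.reason)
```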

    There are some restrictions: if the class of the object is declared
    in the __main__ module, pyxser cannot handle it... This is a bug that
    I expect to solve in the near future.

    A depth level of zero (depth = 0) will serialize the complete object
    tree with a fixed limit of 50 levels; pyxser can handle cross-referenced
    objects and circular references.

    Sorry if I don't answer more questions until tonight --- here in
    Chile --- I must go to my job right now, and there I don't have
    access to Usenet or Google Groups.
    Best regards,
    - --
    .O. | Daniel Molina Wegener | FreeBSD & Linux
    ..O | dmw [at] coder [dot] cl | Open Standards
    OOO | http://coder.cl/ | FOSS Developer
  • Stefan Behnel at Aug 24, 2009 at 1:00 pm

    Daniel Molina Wegener wrote:
    Unicode objects are encoded with the document encoding, and as you
    say, the whole XML document has one encoding. There is no mixing of
    byte strings with different encodings in the output document.
    Ok, that's what I hoped anyway. It just wasn't clear from your description.

    When the object is restored using pyxser.unserialize:

    pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")
    But this is XML, right? What do you need to pass the encoding for at this
    point?

    Another issue is that if you have byte strings with mixed encodings
    in your object tree, such as iso-8859-1 and utf-8, and you try to
    serialize that object, pyxser will print serialization errors to
    stdout while trying to handle those mixed encodings, which do not
    match the document encoding.
    There shouldn't be any serialisation errors (unless you try to recode byte
    strings on the way out, which is a no-no for arbitrary user input). All you
    have to do is properly escape the byte string so that it passes the XML
    encoding step.

    One trick to do that is to decode the byte string as ISO-8859-1 and
    serialise the result as a normal Unicode string. Then you can re-encode the
    unicode string on input back to ISO-8859-1.

    I chose ISO-8859-1 here because it has the well-defined side effect of
    mapping byte values directly to Unicode characters with an identical
    code point value. So you do not risk any failures or data loss.
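    That trick can be sketched in a few lines of plain Python (the byte
    values are arbitrary examples):

```python
# Any byte string decodes losslessly as ISO-8859-1: each byte maps to the
# Unicode code point with the same numeric value, so decoding never fails.
raw = b"caf\xe9 \x80\xfe"                # arbitrary bytes, encoding unknown

as_text = raw.decode("iso-8859-1")       # always succeeds, no data loss
assert all(ord(ch) < 256 for ch in as_text)

# ...serialise as_text into the XML document as a normal Unicode string...

# On the way back in, re-encoding recovers the exact original bytes.
restored = as_text.encode("iso-8859-1")
assert restored == raw
```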

    Stefan
  • Daniel Molina Wegener at Aug 25, 2009 at 4:03 am
    Stefan Behnel <stefan_ml at behnel.de>
    on Monday 24 August 2009 09:00
    wrote in comp.lang.python:

    Daniel Molina Wegener wrote:
    Unicode objects are encoded with the document encoding, and as you
    say, the whole XML document has one encoding. There is no mixing of
    byte strings with different encodings in the output document.
    Ok, that's what I hoped anyway. It just wasn't clear from your
    description.

    When the object is restored using pyxser.unserialize:

    pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")
    But this is XML, right? What do you need to pass the encoding for at this
    point?
    The user may want a different encoding, other than utf-8; it can
    be any encoding supported by libxml2.
    Another issue is that if you have byte strings with mixed encodings
    in your object tree, such as iso-8859-1 and utf-8, and you try to
    serialize that object, pyxser will print serialization errors to
    stdout while trying to handle those mixed encodings, which do not
    match the document encoding.
    There shouldn't be any serialisation errors (unless you try to recode byte
    strings on the way out, which is a no-no for arbitrary user input). All
    you have to do is properly escape the byte string so that it passes the
    XML encoding step.
    Yup, but if encodings are mixed inside Python byte strings, I think
    there is no way to know which encoding each of them uses. This may
    cause XML serialization errors, since the strings may not match the
    encoding the user has set as the document encoding.
    One trick to do that is to decode the byte string as ISO-8859-1 and
    serialise the result as a normal Unicode string. Then you can re-encode
    the unicode string on input back to ISO-8859-1.

    I chose ISO-8859-1 here because it has the well-defined side effect of
    mapping byte values directly to Unicode characters with an identical
    code point value. So you do not risk any failures or data loss.
    Sure, but if there are Python byte strings (not Unicode strings)
    inside the object tree, some encoded in big5 and others in
    iso-8859-1, the XML serialization would throw errors on the encoding
    conversion when placing those bytes inside the document...
    Thanks for commenting, and sorry for the late answer. This day was
    busy...

    Best regards,
    - --
    .O. | Daniel Molina Wegener | FreeBSD & Linux
    ..O | dmw [at] coder [dot] cl | Open Standards
    OOO | http://coder.cl/ | FOSS Developer
  • Stefan Behnel at Aug 25, 2009 at 5:11 am

    Daniel Molina Wegener wrote:
    Stefan Behnel wrote:
    Daniel Molina Wegener wrote:
    When the object is restored, by using pyxser.unserialize:

    pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")
    But this is XML, right? What do you need to pass the encoding for at this
    point?
    The user may want a different encoding, other than utf-8, it can
    be any encoding supported by libxml2.
    I really meant what I wrote: this is XML. The encoding is well defined in
    the XML declaration at the start of the document (and will default to UTF-8
    if not provided). Passing it externally will allow users to override that,
    which doesn't make any sense at all.

    if the encodings are mixed inside Python byte strings, I think
    that there is no way to know which encoding they are using.
    Correct.

    This may cause XML serialization errors
    Yes, but only if you try to recode the strings (which, as I said, is
    a no-no).

    One trick to do that is to decode the byte string as ISO-8859-1 and
    serialise the result as a normal Unicode string. Then you can re-encode
    the unicode string on input back to ISO-8859-1.
    I chose ISO-8859-1 here because it has the well-defined side effect of
    mapping byte values directly to Unicode characters with an identical
    code point value. So you do not risk any failures or data loss.
    Sure, but if there are Python byte strings (not Unicode strings)
    inside the object tree, some encoded in big5 and others in
    iso-8859-1, the XML serialization would throw errors on the encoding
    conversion when placing those bytes inside the document...
    No, I really meant: decoding from ISO-8859-1 to Unicode, for all byte
    strings, regardless of their encoding (since you can't even know if they
    represent encoded text at all). So you get a unicode string that you can
    serialise to the target encoding, although it may result in character
    references (&#xyz;) being output. But you won't get any errors, at least.
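    The character-reference fallback can be seen with Python's built-in
    xmlcharrefreplace error handler (a sketch only; pyxser itself
    serialises through libxml2):

```python
# Bytes of unknown origin, decoded as ISO-8859-1 into a Unicode string.
text = b"caf\xe9".decode("iso-8859-1")   # u"caf\xe9"

# Serialising to an ASCII-only target still works: characters outside the
# target charset become numeric character references instead of errors.
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'caf&#233;'
```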

    On the way in, you get a unicode string again, which you can encode to
    ISO-8859-1 to get the original byte string back.

    Stefan
  • Stefan Behnel at Aug 25, 2009 at 5:23 am

    Stefan Behnel wrote:
    for all byte
    strings, regardless of their encoding (since you can't even know if they
    represent encoded text at all).
    Hmm, having written that, I guess it's actually best to encode byte strings
    as base64 instead. Otherwise, null bytes and other special byte values
    won't pass.
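    A minimal sketch of the base64 approach: the encoded form is plain
    ASCII, so null bytes and every other byte value pass through XML
    character content unharmed.

```python
import base64

raw = b"\x00\x01 binary \xff\xfe data"   # bytes that raw XML text cannot carry

# Encode for the document: the result is ASCII-only, safe as XML content.
b64_text = base64.b64encode(raw).decode("ascii")

# Decode on the way back in to recover the exact original bytes.
assert base64.b64decode(b64_text) == raw
```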

    I also think that if the user wants readable output for text strings, it's
    reasonable to require Unicode input instead of byte strings. Handling text
    in byte strings is just too error prone.

    Still, you may have to sanitize text input to make sure it doesn't contain
    special characters either. Take a look at the way lxml does it in the
    apihelpers.pxi source file, or read the XML spec on character content.

    Stefan
  • Daniel Molina Wegener at Aug 25, 2009 at 12:08 pm
    Stefan Behnel <stefan_ml at behnel.de>
    on Tuesday 25 August 2009 01:23
    wrote in comp.lang.python:

    Stefan Behnel wrote:
    for all byte
    strings, regardless of their encoding (since you can't even know if they
    represent encoded text at all).
    Hmm, having written that, I guess it's actually best to encode byte
    strings as base64 instead. Otherwise, null bytes and other special byte
    values won't pass.
    Sure, base64 is a good option for byte string input.
    I also think that if the user wants readable output for text strings, it's
    reasonable to require Unicode input instead of byte strings. Handling text
    in byte strings is just too error prone.

    Still, you may have to sanitize text input to make sure it doesn't contain
    special characters either. Take a look at the way lxml does it in the
    apihelpers.pxi source file, or read the XML spec on character content.
    Thanks, I will look at that. I need a better implementation of byte
    string handling, since there will be many cases where encoded strings
    are mixed --- for example, different database inputs with different
    encodings --- if those byte strings are not read as Unicode strings.

    Both sanitizing and base64 encoding are good options, and both are
    readable from other platforms. The problem with previous
    implementations of pyxser was that they used *RawUnicodeEscape*,
    which is not readable from other platforms.
    Best regards,
    - --
    .O. | Daniel Molina Wegener | FreeBSD & Linux
    ..O | dmw [at] coder [dot] cl | Open Standards
    OOO | http://coder.cl/ | FOSS Developer

Discussion Overview
group: python-list @ python.org
posted: Aug 23, '09 at 4:35p
active: Aug 25, '09 at 12:08p
posts: 8
users: 2
