[Struts-user] [OT] How to handle non UTF characters in XML

Joe Germuska
Apr 16, 2007 at 10:42 pm
See, the problem is that you're not handling the character encoding
correctly in general. You should use String's getBytes method only when you
know what you're doing, because the whole point of character encodings is
that you can represent any given string with different sequences of bytes.
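
To make that concrete, here is a minimal sketch (the class name GetBytesDemo and the helper hex are made up for illustration, and StandardCharsets assumes Java 7 or later). It encodes the same one-character string with two explicitly named charsets and shows that the bytes differ. The no-argument getBytes() picks the platform default, which is why code can behave differently from one machine to another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    // Hypothetical helper: render the encoded bytes as hex so the
    // difference between encodings is visible.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00fc"; // the letter u with umlaut
        System.out.println(hex(text, StandardCharsets.UTF_8));      // C3BC
        System.out.println(hex(text, StandardCharsets.ISO_8859_1)); // FC

        // Decoding with a different charset than the one used for encoding
        // does not recover the original string.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        String wrong = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(text.equals(wrong)); // false
    }
}
```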

I'd suggest doing more research on encoding in general. Here's one popular
piece, although it is not Java-centric:
The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No
From there, you may want to review the APIs for java.io.Reader and
java.io.Writer, which are specifically designed to help smooth over the
issues involved in serializing Java strings to bytes.
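
As a rough sketch of what that looks like (the class and method names here are hypothetical, and in-memory streams stand in for real files), wrap the byte stream in a Writer or Reader and name the charset explicitly on both sides:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderWriterDemo {
    // Serialize text to bytes through a Writer with an explicit charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read the bytes back through a Reader using the same charset.
    static String read(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "M\u00fcller";
        byte[] bytes = write(text, StandardCharsets.UTF_8);
        System.out.println(text.equals(read(bytes, StandardCharsets.UTF_8))); // true
    }
}
```

With real files, a FileOutputStream or FileInputStream would take the place of the in-memory streams; the point is only that the charset is stated rather than left to the platform default.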

This looks like it's going way too far off topic to be something that should
be discussed much further on the Struts list.


On 4/16/07, Ashish Kulkarni wrote:

Here is the code where I read the DOM tree and then convert it to a String,
then convert this string into a byte array and then use
DocumentBuilder().parse to parse it.

I get an error in factory.newDocumentBuilder().parse(byteArray);

TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
StringWriter writer = new StringWriter();
DOMSource source = new DOMSource(doc);
transformer.transform(source, new StreamResult(writer));
String obj = writer.toString();
ByteArrayInputStream byteArray = new ByteArrayInputStream(obj.getBytes());
Document doc = factory.newDocumentBuilder().parse(byteArray);

On 4/16/07, Joe Germuska wrote:
On 4/16/07, Christopher Schultz wrote:

Ashish Kulkarni wrote:
I have a Java class which creates an XML file from an SQL result set.
It works fine in the USA, but I am having issues when this process runs in
Germany, where they have non-UTF characters in their database, like ü.

I think you mean non-lower-ASCII. These characters are certainly supported
by UTF-8.

How do we handle this kind of situation in an XML file? I set the XML to
be of UTF-8 type.

How do you set the file "type" to UTF-8?

I'm assuming Ashish is talking about the "encoding" attribute of the XML
declaration in the first line of the file.

Chris is correct that the real magic happens when you serialize the DOM to a
file, but you should be sure to use the same encoding with the writer that
actually creates the file as you do in the XML declaration. If your
characters aren't UTF-8 then don't use UTF-8. Any decent XML reading
software will recognize the encoding when the file is read.
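
One way to keep the declaration and the actual bytes in agreement is to let the serializer do both. The following is only a sketch under assumptions: the class XmlEncodingDemo and its methods are made up for illustration, the element name "name" is arbitrary, and an in-memory stream stands in for the file. It sets OutputKeys.ENCODING on the Transformer and hands it a byte stream rather than a StringWriter, then lets the parser pick the encoding up from the declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlEncodingDemo {
    // Serialize the DOM straight to bytes; the transformer writes the XML
    // declaration and encodes the output using the same encoding.
    static byte[] serialize(Document doc, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parse from bytes; the parser detects the encoding from the declaration.
    static Document parse(byte[] xml) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return b.parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Build a one-element document, serialize it, parse it back, and
    // return the text that survived the trip.
    static String roundTrip(String text, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("name");
            root.setTextContent(text);
            doc.appendChild(root);
            return parse(serialize(doc, encoding))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "M\u00fcller";
        System.out.println(text.equals(roundTrip(text, "ISO-8859-1"))); // true
        System.out.println(text.equals(roundTrip(text, "UTF-8")));      // true
    }
}
```

Writing to a real file should work the same way with a FileOutputStream passed to StreamResult. The design choice is to never pass through String.getBytes() with the platform default in between.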


Joe Germuska
Joe@Germuska.com * http://blog.germuska.com

"The truth is that we learned from João forever to be out of tune."
-- Caetano Veloso

