Hi,

I have already set NLS_LANG=.UTF8 in .bash_profile.

The data I am reading from $file is in UTF-8 format.
If I do not use the encode() function, the data inserted into the database is an
"inverted question mark" for any Japanese character.


If I run the code without the encode() function on Windows 2000 with ActivePerl
5.8.4.810 installed, I do not have any problem. But the same code does
not work on the Sun Solaris box: it inserts an "inverted question mark" into
the database.

Does DBD::Oracle use any OS-level functions? My Solaris version is 2.8.


Thanks,
Prashant Shelar



From: "Susan Cassidy" <cassidy@systransoft.com>
To: "'prashant shelar'" <prashantshelar@hotmail.com>
Subject: RE: DBD::Oracle unicode problem
Date: Tue, 2 Nov 2004 08:59:58 -0800

Are you setting the environment variable NLS_LANG to something with UTF8 in
it? E.g.: .UTF8

Also, if your data is already UTF8, you may be encoding it again and causing
a problem.
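
A minimal sketch of the double-encoding hazard mentioned above, using only the core Encode module. The variable names are illustrative, and the sample character (U+3044) is simply the one from the original report; this is not the poster's actual code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Raw UTF-8 bytes for U+3044, as they would arrive from a file opened
# without any decoding layer (three bytes: E3 81 84).
my $bytes = "\xE3\x81\x84";

# encode() expects a character string. Given raw bytes, it treats each byte
# as a Latin-1 character and encodes it again, so the result is doubly encoded.
my $double = encode("UTF-8", $bytes);

# Decoding first and then encoding round-trips cleanly.
my $round_trip = encode("UTF-8", decode("UTF-8", $bytes));

printf "original:   %d bytes\n", length $bytes;
printf "re-encoded: %d bytes\n", length $double;
printf "round trip: %s\n", $round_trip eq $bytes ? "same" : "different";
```

The re-encoded string is twice as long because each of the three high bytes becomes a two-byte sequence.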

Susan Cassidy
-----Original Message-----
From: prashant shelar
Sent: Tuesday, November 02, 2004 7:38 AM
To: dbi-users@perl.org
Subject: DBD::Oracle unicode problem

Hi,

I have an Oracle 9i Release 2 database with the following details. All
database string columns are VARCHAR2.

database_charset = AL32UTF8
national_charset = AL16UTF16

The database server and Perl scripts are running on a Sun Solaris 2.8 box.
Perl version = 5.8.5
DBI version = 1.45
DBD::Oracle version = 1.16 (downloaded from
http://homepage.eircom.net/~timbunce/DBD-Oracle-1.16-rc7-20040826.tar.gz)

I am inserting Japanese characters using a Perl script running on the same
DB box. When I retrieve the same data using the SQL*Plus spool command, many
characters do not match. Only some characters match.

For example, if I insert the character U+3044 into the database, then when I
spool it I get the character U+307F.
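
For anyone comparing the two code points, here is a small sketch that prints their UTF-8 bytes using the core Encode module. The hash name is just for illustration. The two encodings differ only in the last byte, and that byte (0xBF) is the inverted question mark in ISO-8859-1, which would be consistent with a single byte being replaced during a character set conversion somewhere along the way. The thread does not confirm where that happens, so treat this as an observation, not a diagnosis.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of the UTF-8 encoding of each code point from the report.
my %utf8_hex;
for my $cp (0x3044, 0x307F) {
    my $octets = encode("UTF-8", chr $cp);
    $utf8_hex{$cp} = join " ", map { sprintf "%02X", ord } split //, $octets;
    printf "U+%04X => %s\n", $cp, $utf8_hex{$cp};
}
```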

Any help is greatly appreciated.

The code snippet is below.

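use DBI;
use Encode qw(encode);

my $dbh = DBI->connect("dbi:Oracle:$sid", $uid, $pwd, {
    RaiseError => 1,
    AutoCommit => 1 });

open(INFILE, $file) or die "Cannot open $file: $!";
while (my $line = <INFILE>) {
    my @data = split(/\|/, $line);
    # encode() is Perl code, so it is called outside the quoted SQL text
    my $sql = "INSERT INTO table (col1, col2) values ('"
            . encode("utf8", $data[1]) . "', '"
            . encode("utf8", $data[2]) . "')";
    $dbh->do($sql);
}
close(INFILE);

$dbh->disconnect;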

  • Christian Merz at Nov 4, 2004 at 7:20 am
    Hi,
    a few days ago I got a note dealing with character set problems (with German
    umlauts).
    (I didn't cross-check the info.)

    In a Windows environment you have to cope with at least 4 character sets:
    1. internally, WinNT uses UCS-2 (2 bytes); Win2000 uses UTF-16 (2 or 4 bytes)
    2. graphical apps may use something like ISO Latin 1
    3. a German DOS prompt uses WE8PC850
    4. your HTML browser may use UTF8

    This also affects database access.
    You have to consider your NLS_LANG registry entry, for example
    AMERICAN_AMERICA.WE8ISO8859P1 or
    GERMAN_GERMANY.WE8ISO8859P1
    (see: v$nls_valid_values, v$nls_parameters, and nls_XXX_parameters, where XXX is
    one of database, instance, session).
    If you use graphical apps (SQLPLUSW.exe, Notepad, ...) you may get the correct
    output, according to your NLS_LANG.

    At your DOS prompt you may have to override the registry default:
    C:\> set NLS_LANG=american_america.we8pc850
    to process the German umlauts in an SQL script correctly.

    On Unix you may also have to set your NLS_LANG environment variable (and maybe
    ORA_NLS...).
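
Since NLS_LANG values follow the pattern LANGUAGE_TERRITORY.CHARSET, a quick way to see which parts a given setting actually supplies is to split it. The sketch below is only an illustration: `parse_nls_lang` is a hypothetical helper, not part of DBI or DBD::Oracle, and it does not validate the names against the database.

```perl
use strict;
use warnings;

# Split an NLS_LANG value of the form LANGUAGE_TERRITORY.CHARSET into its
# parts. Every part is optional, so missing parts come back as undef.
sub parse_nls_lang {
    my ($value) = @_;
    my ($language, $territory, $charset) =
        $value =~ /^(?:([^_.]+)(?:_([^.]+))?)?(?:\.(.+))?$/;
    return { language => $language, territory => $territory, charset => $charset };
}

for my $value ("AMERICAN_AMERICA.WE8ISO8859P1", "american_america.we8pc850", ".UTF8") {
    my $p = parse_nls_lang($value);
    printf "%-32s language=%s territory=%s charset=%s\n", $value,
        map { defined $p->{$_} ? $p->{$_} : "(not set)" } qw(language territory charset);
}
```

A value such as `.UTF8` therefore sets only the character set and leaves language and territory unspecified.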

    cu,
    ---------------------------------------------------------
    Landeshauptstadt München
    Direktorium - AFID 3.3 - Oracle DBA
    C.A. Merz

    ----- Original Message -----
    From: "prashant shelar" <prashantshelar@hotmail.com>
    To: <cassidy@systransoft.com>; <dbi-users@perl.org>
    Sent: Wednesday, November 03, 2004 5:33 AM
    Subject: RE: :Oracle unicode problem

  • Susan Cassidy at Nov 4, 2004 at 9:44 pm
    I finally got my large, complex cgi/Oracle application working with
    DBD::Oracle 1.16, using database character set AL32UTF8, NLS_LANG=.UTF8,
    etc.

    I was trying to write a small test program to have people use to validate
    whether their Oracle and Apache setups were correct, when I ran into a
    strange problem.

    Although I thought the program was doing the same basic functions as the
    large program (this is all on the same system, with the same perl, etc.),
    when I inserted and retrieved the data from Oracle, it did not match the
    original UTF8 data (although the big program does this successfully).

    The test program takes some English sentences, runs them through a
    translator (which produces utf8 output, works fine - data validates as utf8
    on multiple systems, etc.). I save the output in memory (hash where key is
    the original text, value is the translated value). I then insert it into
    the database, and retrieve it. The retrieved data did not match the
    translated data.

    Keep in mind that the large application goes through a few more gyrations,
    but performs the same basic functions.

    The INSERT into the database is nothing fancy, just:
        $stmt = "INSERT into test_trans_tbl (source_text, target_text) values (?, ?)";
        $sth = $dbh->prepare($stmt) ||
            errexit("bad prepare for stmt $stmt, error: $DBI::errstr");
    Then, inside a loop:
        $rc = $sth->execute($query, $textval) ||
            errexit("can't execute statement for source \"$query\"",
                    " return code $rc: DB error: $DBI::errstr");


    I added some tests in the code to check on the translated value like:
        if (Encode::is_utf8($textval)) {
            print "<p>&nbsp;is utf8!\n";
        } else {
            print "<p>&nbsp;is NOT utf8\n";
        }
    This prints "is NOT utf8" (when I know that it really is utf8).

    If I do the same thing to the retrieved data, it prints that the data IS
    utf8.

    However, if I turn off the utf8 flag explicitly after retrieving the data,
    before comparing the translated data with the retrieved data, it works:

        $stmt = "SELECT source_text, target_text from test_trans_tbl";
        print "<p>Running statement:\n\t$stmt\n";
        execute_db_statement($stmt, __LINE__);
        my %retrieved_text;
        while (@data = $sth->fetchrow_array) {
            foreach (@data) { $_ = '' unless defined }
            next if ($data[0] eq '');
            Encode::_utf8_off($data[1]);  # This makes it work, but makes no logical sense
            $retrieved_text{$data[0]} = $data[1];
        }

    Of course, where I print out the status of utf8 below this, it now says it
    is NOT utf8.

    But, the data is now correct, and matches the data inserted/retrieved from
    PostgreSQL (where utf8 stuff has been working for quite a while).

    I have re-read the Encode perldoc stuff several times. It seems to be
    working (on my system) backwards, sort of?

    In the DBD::Oracle 1.16 docs, Tim says:
        If the string passed to bind_param() is considered by perl to be a
        valid utf8 string ( utf8::is_utf8($string) returns true ), then
        DBD::Oracle will implicitly set csform SQLCS_NCHAR and csid AL32UTF8
        for you on insert.
    So, I think this may have something to do with it. However, I am
    "unset"ting it after retrieval, not before inserting it. ????

    The only thing I can think is that for some weird reason, the utf8 flag is
    not in the state expected on my particular installation.

    If anyone else is having this type of problem, maybe this will give them a
    hint.

    If anyone has ideas about why this is happening, I'd love to hear them. I
    hope I'm not missing something obvious, but of course, that is possible!

    By the way, the same program moved over to a different machine where we use
    PostgreSQL (DBD::Pg) (without the _utf8_off, of course) works fine (as I
    would expect).

    The only unusual thing that I know of on the weird system is that the Perl
    was built for threads (perl, v5.8.5 built for i686-linux-thread-multi),
    which is not the case on the "good" system (v5.8.3 built for i686-linux). (I
    would not have built it that way, but the sysadmin did).

    Susan Cassidy
  • Tim Bunce at Nov 5, 2004 at 10:10 am

    On Thu, Nov 04, 2004 at 01:42:13PM -0800, Susan Cassidy wrote:
    > I finally got my large, complex cgi/Oracle application working with
    > DBD::Oracle 1.16, using database character set AL32UTF8, NLS_LANG=.UTF8,
    > etc.

    And what are the _client_ CHAR and NCHAR character sets?
    And is the field you're inserting into a CHAR or NCHAR?

    > The test program takes some English sentences, runs them through a
    > translator (which produces utf8 output, works fine - data validates as utf8
    > on multiple systems, etc.).

    It's important to keep in mind that "validates as utf8" is ambiguous.

    It could mean *either or both* of:

    a) the sequence of bytes is a valid utf8 encoding.
    b) the perl scalar value has the perl SvUTF8 flag turned on.

    Much confusion is caused by not keeping those two separate points
    in mind. It's important to be clear what you're thinking about,
    and precise when communicating it to others.

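
To make the (a) versus (b) distinction concrete, here is a rough sketch using only core modules. `is_valid_utf8_bytes` is a hypothetical helper written for this illustration (it attempts a strict decode of a copy); `utf8::is_utf8` is the built-in that reports the SvUTF8 flag and nothing else.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE3\x81\x84";           # valid UTF-8 encoding of U+3044, flag off
my $chars = decode("UTF-8", $bytes);  # the same data as one character, flag on

# (a) Is the byte sequence a valid UTF-8 encoding? Try a strict decode of a copy.
sub is_valid_utf8_bytes {
    my ($octets) = @_;
    return eval { decode("UTF-8", $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# (b) Is the SvUTF8 flag on? That is all utf8::is_utf8 reports.
printf "bytes: valid=%d flag=%d length=%d\n",
    is_valid_utf8_bytes($bytes), utf8::is_utf8($bytes) ? 1 : 0, length $bytes;
printf "chars: flag=%d length=%d\n",
    utf8::is_utf8($chars) ? 1 : 0, length $chars;
```

So a string can pass test (a) while failing test (b), which is the situation where a flag check says "NOT utf8" about data that is in fact correctly encoded.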
    > I then insert it into the database, and retrieve it. The retrieved
    > data did not match the translated data.

    I'm afraid that "The retrieved data did not match the translated
    data" is another ambiguous statement.

    If a sequence of bytes that does not have the SvUTF8 flag turned
    on is compared with the same sequence of bytes that does, they won't
    match (unless the string is all ASCII).

    Perl will encode the sequence of bytes that does not have the SvUTF8
    flag turned on into UTF8 by treating each byte as a Latin1 character
    (by default). If the sequence of bytes was UTF8 encoded already
    (but not marked with the SvUTF8 flag) then treating each byte as a
    Latin1 character will produce garbage unless the string is all ASCII.

    So the two strings with the same sequence of bytes may not match!

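
A short sketch of that comparison, assuming nothing beyond core Perl and Encode. The sample string is arbitrary; `utf8::upgrade` is used only to show what the flag-off string looks like once Perl treats each byte as a Latin-1 character.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE3\x81\x84";           # UTF-8 encoded, SvUTF8 off
my $chars = decode("UTF-8", $bytes);  # same data decoded, SvUTF8 on

# eq compares characters. The flag-off string is read as three Latin-1
# characters, the flag-on string as the single character U+3044.
print "equal? ", ($bytes eq $chars ? "yes" : "no"), "\n";

# What Perl effectively sees when the byte string is upgraded:
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $upgraded_desc = join " ", map { sprintf "U+%04X", ord } split //, $upgraded;
print "upgraded: $upgraded_desc\n";

# All-ASCII data is unaffected by the flag.
my $ascii = "abc";
print "ascii equal? ", ($ascii eq decode("UTF-8", $ascii) ? "yes" : "no"), "\n";
```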
    > I added some tests in the code to check on the translated value like:
    >     if (Encode::is_utf8($textval)) {
    >         print "<p>&nbsp;is utf8!\n";
    >     } else {
    >         print "<p>&nbsp;is NOT utf8\n";
    >     }
    > This prints "is NOT utf8" (when I know that it really is utf8).

    Do you know which out of A and B above Encode::is_utf8 actually tests for?
    Do you know which out of A and B you mean by "it really is utf8"?

    > If I do the same thing to the retrieved data, it prints that the data IS
    > utf8.

    The returned data will be both valid utf8 and have the SvUTF8 flag on
    if your relevant (CHAR/NCHAR) client character set is UTF8 or AL32UTF8.

    But that doesn't mean it contains the same string you passed in! :)
    So I trust you're also checking if $inserted_value eq $fetched_value.

    > However, if I turn off the utf8 flag explicitly after retrieving the data,
    > before comparing the translated data with the retrieved data, it works:

    Probably because you're now comparing byte strings as byte strings.

    > Of course, where I print out the status of utf8 below this, it now says it
    > is NOT utf8.

    Of course.

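
As a hedged illustration of why the kludge appears to work: if the inserted value was a flag-off byte string and the fetched value comes back decoded, then switching the flag off exposes the same underlying bytes. The sketch below fakes the fetched value with `decode` instead of a real database round trip, so it shows only the Perl side of the story.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $inserted = "\xE3\x81\x84";              # UTF-8 bytes, SvUTF8 off
my $fetched  = decode("UTF-8", $inserted);  # stand-in for a decoded value from the driver

# The kludge: switch the flag off, exposing Perl's internal UTF-8 bytes.
my $kludge = $fetched;
Encode::_utf8_off($kludge);

# The explicit alternative: encode the character string back to bytes.
my $explicit = encode("UTF-8", $fetched);

print "flag on vs bytes: ", ($fetched  eq $inserted ? "match" : "differ"), "\n";
print "kludge vs bytes:  ", ($kludge   eq $inserted ? "match" : "differ"), "\n";
print "encode vs bytes:  ", ($explicit eq $inserted ? "match" : "differ"), "\n";
```

Either way the comparison is bytes against bytes. Decoding the original value before inserting it would make it a comparison of characters against characters instead.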
    > I have re-read the Encode perldoc stuff several times. It seems to be
    > working (on my system) backwards, sort of?
    >
    > In the DBD::Oracle 1.16 docs, Tim says:
    >     If the string passed to bind_param() is considered by perl to be a
    >     valid utf8 string ( utf8::is_utf8($string) returns true ), then
    >     DBD::Oracle will implicitly set csform SQLCS_NCHAR and csid AL32UTF8
    >     for you on insert.
    > So, I think this may have something to do with it. However, I am
    > "unset"ting it after retrieval, not before inserting it. ????

    But was it actually set on the value you inserted?

    [FYI, the output from trace() quotes strings with the SvUTF8 flag
    on with double quotes, and uses single quotes if SvUTF8 is off.
    That's a quick way to see what's going on.]

    > By the way, the same program moved over to a different machine where we use
    > PostgreSQL (DBD::Pg) (without the _utf8_off, of course) works fine (as I
    > would expect).

    I suspect DBD::Pg is doing something wrong that just happens to
    work for your view of how it ought to work. Of course, I may be wrong.

    Tim.
  • Susan Cassidy at Nov 5, 2004 at 6:26 pm
    Hi,
    Thanks Tim.

    I'm not sure what you mean by " the _client_ CHAR and NCHAR character sets".
    How do I check? (Obviously, I did not install the Oracle stuff myself, and
    we do not have our own DBA).

    By "validates as utf8", I meant that it is valid utf8 encoding (a).

    By "did not match" I meant that " if $saved_data eq $retrieved_data "
    returns false.

    Thanks,
    Susan
    -----Original Message-----
    From: Tim Bunce
    Sent: Friday, November 05, 2004 2:10 AM
    To: Susan Cassidy
    Cc: dbi-users@perl.org
    Subject: Re: more DBD::Oracle utf8 weirdness, and kludge that should not
    have worked, but did
  • Tim Bunce at Nov 5, 2004 at 11:45 pm

    On Fri, Nov 05, 2004 at 10:24:39AM -0800, Susan Cassidy wrote:
    > Hi,
    > Thanks Tim.
    >
    > I'm not sure what you mean by "the _client_ CHAR and NCHAR character sets".
    > How do I check? (Obviously, I did not install the Oracle stuff myself, and
    > we do not have our own DBA).

    What are your NLS_LANG and NLS_NCHAR environment variables set to?

    > By "validates as utf8", I meant that it is valid utf8 encoding (a).
    >
    > By "did not match" I meant that " if $saved_data eq $retrieved_data "
    > returns false.

    Okay. I've no time to re-examine your original email (got to
    sleep-n-pack for the MySQL conference in Frankfurt). From what I
    said in reply and what you've said here, you, or someone else, ought to be
    able to work out what's going on.

    A hint: it's important that your client NLS_LANG and NLS_NCHAR
    environment variables are set correctly, and that any UTF8 values
    you're using have the UTF8 flag set.

    Please reread the Unicode section of the DBD::Oracle docs.
    Let me know if there's anything that's not clear enough.
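
One way to end up with the UTF8 flag set on data read from a file is to decode it as it is read, using a PerlIO encoding layer. The sketch below is an illustration under stated assumptions: the temporary file and its pipe-delimited layout are made up to resemble the input described earlier in the thread, and no database is involved.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small pipe-delimited sample file containing the UTF-8 bytes for U+3044.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "1|\xE3\x81\x84|x\n";
close $fh or die "close: $!";

# Reading through an :encoding(UTF-8) layer decodes each line into characters.
open my $in, "<:encoding(UTF-8)", $path or die "open $path: $!";
my $line = <$in>;
close $in;
chomp $line;
my @data = split /\|/, $line;

printf "field: U+%04X, flag %s\n",
    ord $data[1], utf8::is_utf8($data[1]) ? "on" : "off";
```

Values read this way are character strings, so they would be passed to the driver as they are, without an extra encode() call.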

    Tim.
  • Tim Bunce at Nov 6, 2004 at 10:41 pm

    On Fri, Nov 05, 2004 at 10:09:52AM +0000, Tim Bunce wrote:
    > On Thu, Nov 04, 2004 at 01:42:13PM -0800, Susan Cassidy wrote:
    > > I finally got my large, complex cgi/Oracle application working with
    > > DBD::Oracle 1.16, using database character set AL32UTF8, NLS_LANG=.UTF8,
    > > etc.
    >
    > And what are the _client_ CHAR and NCHAR character sets?
    > And is the field you're inserting into a CHAR or NCHAR?
    >
    > > The test program takes some English sentences, runs them through a
    > > translator (which produces utf8 output, works fine - data validates as utf8
    > > on multiple systems, etc.).
    >
    > It's important to keep in mind that "validates as utf8" is ambiguous.
    >
    > It could mean *either or both* of:
    >
    > a) the sequence of bytes is a valid utf8 encoding.
    > b) the perl scalar value has the perl SvUTF8 flag turned on.
    >
    > Much confusion is caused by not keeping those two separate points
    > in mind. It's important to be clear what you're thinking about,
    > and precise when communicating it to others.

    I really should be doing other things, but this was on my mind so
    I've cooked up a couple of utility functions that'll probably be
    in the next release of the DBI:

    sub data_diff {
        my ($a, $b, $strict) = @_;
        require utf8;

        return '' if !$strict and $a eq $b;

        # hacks to cater for perl 5.6 for data_str_diff() & data_desc()
        *utf8::is_utf8 = sub {
            return (DBI::neat(shift) =~ /^"/); # XXX ugly hack, sufficient here
        } unless defined &utf8::is_utf8;
        *utf8::valid = sub { 1 } unless defined &utf8::valid;

        my $diff = data_str_diff($a, $b);
        my $a_desc = data_desc($a);
        my $b_desc = data_desc($b);

        return "" if !$diff && $a_desc eq $b_desc;

        return "\$a: $a_desc\n\$b: $b_desc\n$diff";
    }

    sub data_str_diff {
        my ($a, $b) = @_;
        my @a_chars = (utf8::is_utf8($a)) ? unpack("U*", $a) : unpack("C*", $a);
        my @b_chars = (utf8::is_utf8($b)) ? unpack("U*", $b) : unpack("C*", $b);
        my $i = 0;
        while (@a_chars && @b_chars) {
            ++$i, shift(@a_chars), shift(@b_chars), next
                if $a_chars[0] eq $b_chars[0];
            my @desc = map {
                $_ > 255 ?                      # if wide character...
                    sprintf("\\x{%04X}", $_) :  # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?      # else if control character ...
                    sprintf("\\x%02X", $_) :    # \x..
                    chr($_)                     # else as themselves
            } ($a_chars[0], $b_chars[0]);
            foreach my $c ( @desc ) {
                next unless $c =~ m/\\x\{08(..)}/;
                $c .= "='" .chr(hex($1)) ."'"
            }
            return sprintf "Strings differ at index $i: a[$i]=$desc[0], b[$i]=$desc[1]\n";
        }
        return "";
    }

    sub data_desc { # describe a data string
        my ($a) = @_;
        require utf8;
        require bytes;
        # Give sufficient info to help diagnose at least these kinds of situations:
        # - valid UTF8 byte sequence but UTF8 flag not set
        #   (might be ascii so also need to check for hibit to make it worthwhile)
        # - UTF8 flag set but invalid UTF8 byte sequence
        # could do better here, but this'll do for now
        my $is_ascii = $a =~ m/^[\000-\177]*$/;
        return sprintf "UTF8 %s, %s, %d bytes %d chars%s",
            utf8::is_utf8($a) ? "on" : "off",
            $is_ascii ? "ASCII" : "Non-ASCII",
            bytes::length($a), length($a),
            utf8::valid($a) ? "" : ", INVALID";
    }


    Basically, if you've got two strings you expect to be equal, you can
    call data_diff($a, $b) and get back a description of how they differ.
    So:
    print data_diff("abc\x{263a}e", "abcd");

    will say

    $a: UTF8 on, Non-ASCII, 7 bytes 5 chars
    $b: UTF8 off, ASCII, 4 bytes 4 chars
    Strings differ at index 3: a[3]=\x{263A}, b[3]=d

    I think this'll make life much easier for those poor souls trying
    to deal with unicode issues, and the other poor souls trying to
    help them.

    I've only hacked this together just now. I'd appreciate it if people
    could give it some testing. (Major kudos to anyone who sends me
    documentation and/or a test suite for them! :)

    Enjoy!

    Tim.

    p.s. They'll "work" on perl 5.6.1 but utf8::valid only exists in perl 5.8.x
    (but if you're using Unicode you *really* should be using perl 5.8.x).

Discussion Overview
group: dbi-users
categories: perl
posted: Nov 3, '04 at 4:34a
active: Nov 6, '04 at 10:41p
posts: 7
users: 4
website: dbi.perl.org
