From: "Peter Karman" <peter@peknet.com>
no. you must set ebit in new(), not after instantiation. I've added a
note to the docs to emphasize that.
my $tr = Search::Tools::Transliterate->new( ebit => 0 );
Thanks. This way it works fine.
The last 4 chars are 4 new Unicode chars in the Romanian language (U+0218,
U+0219, U+021A, U+021B). Can they be transliterated?
They are şŞţŢ but with a comma below, and not with a cedilla. Can they be
displayed as sStT?
sure. Just add them via the map() method. I believe that's documented
with an example, but here's another:
use strict;
use Search::Tools::Transliterate;
use utf8;
binmode STDERR, ':utf8';
my $string = "??????????";
# new Romanian characters (S and T with comma below)
$string .= "\x{0218}";
$string .= "\x{0219}";
$string .= "\x{021A}";
$string .= "\x{021B}";
my $tr = Search::Tools::Transliterate->new(ebit=>0);
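$tr->map->{"\x{0218}"} = 'S';    # U+0218 is the capital S with comma below
$tr->map->{"\x{0219}"} = 's';    # U+0219 is the small s with comma below
$tr->map->{"\x{021A}"} = 'T';    # U+021A is the capital T with comma below
$tr->map->{"\x{021B}"} = 't';    # U+021B is the small t with comma below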
print STDERR $tr->convert($string) . "\n";
I added the above code as part of a new test and just uploaded 0.19 to
CPAN.
If you have suggestions for permanent additions/changes to the character
mapping file, please open an RT ticket and I'll see that they get
reviewed for a future release.
Thanks for the feedback.
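For readers without the module at hand, the idea behind a per-character override table can be sketched in core Perl. This is only an illustration of the lookup technique, not how Search::Tools::Transliterate is implemented; the names %override and apply_overrides are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical override table: comma-below letters mapped to plain ASCII.
my %override = (
    "\x{0218}" => 'S',    # LATIN CAPITAL LETTER S WITH COMMA BELOW
    "\x{0219}" => 's',    # LATIN SMALL LETTER S WITH COMMA BELOW
    "\x{021A}" => 'T',    # LATIN CAPITAL LETTER T WITH COMMA BELOW
    "\x{021B}" => 't',    # LATIN SMALL LETTER T WITH COMMA BELOW
);

# Replace every non-ASCII character that has an override entry;
# characters without an entry are passed through unchanged.
sub apply_overrides {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $override{$1} ? $override{$1} : $1/ge;
    return $text;
}

binmode STDOUT, ':utf8';
print apply_overrides("\x{0218}tiin\x{021B}a"), "\n";    # prints "Stiinta"
```

Unmapped characters are left untouched here, so a real transliterator would still need a default table behind the overrides.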
Just as feedback, here is a short comparison I've made between these 2
modules:
Text::Unidecode is 5 or 6 times faster than S::T::T.
I haven't checked what S::T::T does internally, but Text::Unidecode uses many
other Perl modules which are loaded dynamically, and the current ActiveState
PDK can't load them automatically, so it is harder to use Text::Unidecode.
Because it is able to use the map hash, S::T::T is more flexible than
Text::Unidecode.
I found that Text::Unidecode gives "Bei Jing" for the string
"\x{5317}\x{4EB0}\n" while S::T::T just gives 2 spaces.
And I've tried to transliterate those 4 new Romanian chars using these 2
modules:
use Text::Unidecode;
print unidecode("\x{0218}\x{0219}\x{021A}\x{021B}");
#It printed: SsTt
use Search::Tools::Transliterate;
my $tr = Search::Tools::Transliterate->new(ebit => 0);
open(my $out, ">:utf8", "test.txt") or die "Cannot open test.txt: $!";
print {$out} $tr->convert("\x{0218}\x{0219}\x{021A}\x{021B}");
close($out);
It printed: ŞşŢţ
Well, without using the map hash, this doesn't print the "correct" string,
but it is interesting because it prints the corresponding characters with a
cedilla, which are still used now in place of the new characters that have
a comma below them.
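The two behaviours seen above (folding the comma-below letters to their cedilla counterparts, and reducing them to plain ASCII) can both be sketched with core Perl only. This is a rough illustration under my own assumptions, not what either module does internally; the function names comma_to_cedilla and strip_marks are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Fold the comma-below letters (U+0218..U+021B) to the older
# cedilla letters (U+015E, U+015F, U+0162, U+0163).
sub comma_to_cedilla {
    my ($text) = @_;
    $text =~ tr/\x{0218}\x{0219}\x{021A}\x{021B}/\x{015E}\x{015F}\x{0162}\x{0163}/;
    return $text;
}

# Decompose each character into base letter + combining marks (NFD),
# then drop the combining marks, leaving only the base letters.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

binmode STDOUT, ':utf8';
my $new = "\x{0218}\x{0219}\x{021A}\x{021B}";
print comma_to_cedilla($new), "\n";    # cedilla forms of the same letters
print strip_marks($new), "\n";         # prints "SsTt"
```

The NFD approach only helps for characters that have a canonical decomposition, so it is no substitute for a full mapping table.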
HTH.
Octavian