On Jan 23, 2015, at 12:57 AM, demerphq wrote:

>> Objects are not desirable. Character strings should be a true Perl type, like arrays and hashes.
> But you want to turn strings into objects here. Essentially you want
> to "decorate" strings with information about their encoding, and then
> provide a set of rules for how those strings can be used. Since it is
> perfectly possible to prototype this as a CPAN module, then I think
> p5p precedent says that you should do so first.
It would be awesome to have all the functionality of, say, Swift strings, yes (though they’re values, not objects). However, that’s probably not practical -- certainly in the short term. I think a decent 80% solution today would be:

1. A flag indicating that the bytes in the string do, in fact, conform to a Unicode string (as understood by the utf8-flag-checking operators).
2. A way to tell when that flag is set from Perl and from XS.

> So which one is the "text" that Perl will recognize?
The one that utf8-checking operations currently expect.
> IMO in any sane scenario for dealing with this there would be no
> "text" type. There would merely be types with valid interpretations,
> possibly more than one, such as certain strings being valid latin-1,
> utf-8, and ASCII data all at the same time.
IIUC, Perl already has an expectation of what bytes should be when the utf8 flag is on. So no need for all that other stuff.
> But the easy answer is "you can't use regexes on binary data".
It’s all binary, as you point out above. Even text. How do you know if the bytes are text or not, and therefore whether or not it’s appropriate to use a regex on it? The value I’m using should itself know, and be able to tell me.
> If we had a way of marking a string as "binary", which implies that it
> cannot be interpreted as a given codepoint, nor translated to a
> different encoding, then we probably would want to say that it is
> meaningless to match a high-code-point unicode character against such
> data. One would presumably be forced to decode the unicode pattern
> into a specific encoding and then decode that into binary, and then
> match that against the binary string.
Not really following, but it kind of sounds like what Encode does already.
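For what it's worth, here is a minimal sketch of how I read that suggestion, using Encode's `encode` to turn a character pattern into octets before matching it against an octet string. The variable names and sample data are made up for illustration, and UTF-8 is just an assumed choice of encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character-string pattern: one code point above 0xFF.
my $needle = "\x{263A}";

# Raw octets, e.g. read from a socket; they happen to contain
# the UTF-8 encoding of U+263A (E2 98 BA) in the middle.
my $haystack = "abc\xe2\x98\xbaxyz";

# Matching the character pattern directly compares code points,
# and no element of $haystack is U+263A.
my $direct = $haystack =~ /\Q$needle\E/ ? 1 : 0;

# Encode the pattern into the assumed encoding, then match octets.
my $needle_bytes = encode('UTF-8', $needle);
my $encoded = $haystack =~ /\Q$needle_bytes\E/ ? 1 : 0;
my $pos     = index($haystack, $needle_bytes);

print "direct match:  $direct\n";
print "encoded match: $encoded\n";
print "octet offset:  $pos\n";
```

Note that nothing in `$haystack` itself says it is undecoded octets; the caller has to know that and pick the encoding.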
> Anyway, I think this whole debate is muddled by sloppy thinking about
> what "text" is. And I think almost nobody would like the world where
> Perl was excessively strict about how text and binary can be mixed.
> Mostly there is no problem here. Occasionally people get bitten, but I
> feel that is a small minority of the amount of things people do in
Maybe terms have been used loosely, but the issue is quite easy to define:

*There is no way to know, at compile time or runtime, whether or not any given string value has been decoded into the Perl Unicode string format.*

That's it.
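
To make that concrete, here is a small sketch of what the existing utf8 flag can and cannot tell you. It only uses the built-in `utf8::is_utf8` and `utf8::upgrade` functions; the sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# Decoded text: four characters, the last one is U+00E9.
my $text = "caf\xe9";

# Undecoded octets: the same word as UTF-8 bytes off the wire.
my $bytes = "caf\xc3\xa9";

# The same characters as $text, stored internally as UTF-8.
my $upgraded = $text;
utf8::upgrade($upgraded);

my %flag = (
    text     => utf8::is_utf8($text)     ? 1 : 0,
    bytes    => utf8::is_utf8($bytes)    ? 1 : 0,
    upgraded => utf8::is_utf8($upgraded) ? 1 : 0,
);

# Report the flag and length of each value.
printf "%-8s flag=%d length=%d\n", 'text',     $flag{text},     length $text;
printf "%-8s flag=%d length=%d\n", 'bytes',    $flag{bytes},    length $bytes;
printf "%-8s flag=%d length=%d\n", 'upgraded', $flag{upgraded}, length $upgraded;

# $text and $upgraded are the same string as far as Perl is concerned.
my $same = $text eq $upgraded ? 1 : 0;
print "text eq upgraded: $same\n";
```

The flag is off for both `$text` and `$bytes`, and `$text` and `$upgraded` compare equal even though their flags differ. So the flag describes the internal storage, not whether the value has been decoded.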


