I can see why much of this functionality belongs in the text package, but
the boundaries are still a little fuzzy. It's hard to tell what belongs
where. Perhaps the unicode package should just be for decoding/encoding
Unicode text, and any tables for classifying code points belong in text.
The data for wordbreak properties
(http://www.unicode.org/Public/6.2.0/ucd/auxiliary/WordBreakProperty.txt)
are in the same format as those for Scripts and Properties, so I modified
unicode/maketables.go to include the wordbreak tables in addition to the
other ones that it already makes. This caused the resulting output file
(tables.go) to change in size from 158 kb to 178 kb (an increase of 12.7%).
So it's not that much bigger, but maybe all (or some) of them belong in
another package anyway.
Regarding iterating over all the tables - this was a concern I had after I
wrote my initial email as well. I think that we will probably have to
generate some sort of different data structure better capable of
determining which of a group of categories a rune belongs to, without
having to try unicode.Is(*RangeTable, rune) on each one. I'll have to think
about this, but I think there is a better solution. In any case, I could
probably add the table data tonight or tomorrow if we decide to keep it in
the current RangeTable format. Otherwise, we'll need to decide how and where
(and if) to store that table data.
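
One possible shape for such a structure, as a rough sketch only: merge the per-category range tables into a single sorted list of ranges, each tagged with its category, so that one binary search yields a rune's category. The names breakClass, classRange and classOf are made up for illustration, and the table holds a few hand-picked ASCII entries (intended to match the 6.2.0 file linked above), not the real generated data.

```go
package main

import (
	"fmt"
	"sort"
)

// breakClass is a hypothetical enum of word-break property values.
type breakClass uint8

const (
	other breakClass = iota
	aLetter
	midNumLet
	numeric
	extendNumLet
)

// classRange maps an inclusive range of code points to a single class.
type classRange struct {
	lo, hi rune
	class  breakClass
}

// classRanges must be sorted by lo and must not overlap.
// Sample entries only; a generator would emit the full table.
var classRanges = []classRange{
	{'\'', '\'', midNumLet},
	{'.', '.', midNumLet},
	{'0', '9', numeric},
	{'A', 'Z', aLetter},
	{'_', '_', extendNumLet},
	{'a', 'z', aLetter},
}

// classOf returns the class of r using one binary search over all ranges,
// instead of one unicode.Is call per category table.
func classOf(r rune) breakClass {
	// Find the first range whose upper bound is >= r.
	i := sort.Search(len(classRanges), func(i int) bool {
		return classRanges[i].hi >= r
	})
	if i < len(classRanges) && classRanges[i].lo <= r {
		return classRanges[i].class
	}
	return other
}

func main() {
	for _, r := range "a'9_ " {
		fmt.Printf("%q -> class %d\n", r, classOf(r))
	}
}
```

The lookup cost is logarithmic in the total number of ranges and does not grow with the number of categories, which is the property the per-table approach lacks.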
The title case issue is a little more tricky. The main reason I care is
because I was trying to add that functionality to a text editor app being
written in Go (https://github.com/limetext/lime/). Many text editors have
this kind of functionality where you can highlight text and cycle through
different casing options (including title case). I tried out the sample in
MS Word and Notepad++ (with the TextFX plugin). Word changed "conan o'brien"
to "Conan O'brien" and "shouldn't" to "Shouldn't" (the way Unicode word
breaking rules would), and Notepad++ changed them to "Conan O'Brien" and
"Shouldn'T" (the way Go's Title function would). I would favor Word's
implementation, because it follows the Unicode standard (even if it didn't
catch all the exceptional cases). In my case, I might just write my own
title case function for the text editor. Honestly, unless I were to add a
full-blown word-break to this project, I would probably just do the same
thing and exclude the apostrophe from the separator category (the way
underscore is excluded in Go's implementation).
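
To make the apostrophe workaround concrete, here is a minimal sketch. It is not the strings package code: isSeparator is a simplified stand-in for the unexported helper there, and titleKeepApostrophe is a made-up name. The only intended difference from Title's behavior is that the apostrophe, like the underscore, does not start a new word.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSeparator is a simplified, hypothetical version of the separator test:
// letters, digits, underscore and apostrophe are part of a word, and
// everything else separates words.
func isSeparator(r rune) bool {
	if r == '\'' || r == '_' {
		return false
	}
	return !unicode.IsLetter(r) && !unicode.IsDigit(r)
}

// titleKeepApostrophe maps the first rune after each separator (and the
// first rune of the string) to title case and leaves the rest unchanged.
func titleKeepApostrophe(s string) string {
	prev := ' ' // treat the start of the string as following a separator
	return strings.Map(func(r rune) rune {
		startsWord := isSeparator(prev)
		prev = r
		if startsWord {
			return unicode.ToTitle(r)
		}
		return r
	}, s)
}

func main() {
	for _, s := range []string{"shouldn't", "conan o'brien"} {
		fmt.Printf("%q -> %q\n", s, titleKeepApostrophe(s))
	}
}
```

With this change "shouldn't" becomes "Shouldn't" and "conan o'brien" becomes "Conan O'brien", which matches what Word produced, including the lowercase "b".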
Finally, I've come to realize my IsBreak function would be quite
inadequate. After reading up a bit more on this, I realize that there is
more to determining word break than an intersection of two characters.
Sometimes a third character comes into play (a second character on either
side of the proposed break). The proper implementation is certainly not
trivial, but it's not out of the realm of possibility. I like the proposed
segmentation interface, since it allows for other segmenters based on
locale or preference.
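
As an illustration of why two characters are not enough, here is a small sketch of a segmenter that needs one rune of extra context. The Segmenter interface and the apostropheWords type are invented for this example and are not taken from the proposal document. The rules are a drastic simplification of the Unicode ones: an apostrophe stays inside a word only when it has a letter on both sides.

```go
package main

import (
	"fmt"
	"unicode"
)

// Segmenter is a hypothetical interface; other implementations could be
// chosen by locale or preference.
type Segmenter interface {
	// Segments splits s into consecutive pieces that concatenate back to s.
	Segments(s string) []string
}

// apostropheWords is a toy word segmenter, not a UAX #29 implementation.
type apostropheWords struct{}

func isWordRune(r rune) bool { return unicode.IsLetter(r) || unicode.IsDigit(r) }

// isBreak reports whether there is a boundary before rs[i], for 0 < i < len(rs).
func isBreak(rs []rune, i int) bool {
	prev, cur := rs[i-1], rs[i]
	// letter ' letter: no break before the apostrophe (one rune of lookahead).
	if cur == '\'' && unicode.IsLetter(prev) && i+1 < len(rs) && unicode.IsLetter(rs[i+1]) {
		return false
	}
	// letter ' letter: no break after the apostrophe (one rune of lookbehind).
	if prev == '\'' && unicode.IsLetter(cur) && i >= 2 && unicode.IsLetter(rs[i-2]) {
		return false
	}
	// Otherwise break everywhere except between two word runes.
	return !(isWordRune(prev) && isWordRune(cur))
}

func (apostropheWords) Segments(s string) []string {
	rs := []rune(s)
	var out []string
	start := 0
	for i := 1; i < len(rs); i++ {
		if isBreak(rs, i) {
			out = append(out, string(rs[start:i]))
			start = i
		}
	}
	if start < len(rs) {
		out = append(out, string(rs[start:]))
	}
	return out
}

func main() {
	var seg Segmenter = apostropheWords{}
	fmt.Printf("%q\n", seg.Segments("shouldn't go"))
	fmt.Printf("%q\n", seg.Segments("dogs' toys"))
}
```

The pair (letter, apostrophe) occurs in both inputs, yet it is a boundary in "dogs' toys" and not in "shouldn't", which a two-rune IsBreak cannot express.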
On Thu, Nov 21, 2013 at 5:24 AM, Marcel van Lohuizen wrote:
That said, if we decide to fix Title, something along these lines would be
the right approach:
- add the break data to go.text/unicode/break.
- write a table generator to distill only the data relevant to Title (I
expect this to be considerably smaller.)
- add this to core. (Where depends on the size, I would say)
On Thursday, November 21, 2013, Marcel van Lohuizen wrote:
Hi Andrew,
Something along these lines has been proposed in
https://docs.google.com/a/golang.org/document/d/1Q64ktYh7XptpEI3L2G7xYqsusohOeLUft865Zd7fbGU/view
to be part of the go.text repository.
A few comments:
- I don't think it is that trivial to use the existing unicode
tables. The Unicode breaking algorithm requires you to look up the breaking
category of two runes and then compare. The existing tables are more for
determining whether a rune is in a certain set. To use these tables for the
former would require iterating over all sets, which is rather inefficient.
- The Unicode word break algorithms only give a decent default. But
actual breaking behavior may differ from language to language. For example,
word breaking in Chinese is a notoriously hard problem.
- Aside from that, title case behavior is different per language (and
word breaking is not always the same as breaking to do proper title case).
- It is not always obvious what to do, even for English: compare
don't -> Don't with conan o'brien -> Conan O'Brien.
Obviously, putting this in go.text won't help fix the broken Title in the
core libraries. I'm not sure if it is worth fixing it. The tables don't
belong in core, imho. It may be an option to put a simple solution in core
just to solve Title, but I'm not sure it is worth it (plus you're just
moving problems, see above example). Quite frankly, I think *.Title should
not have been in core in the first place. (Python made the same mistake.)
That being said, it will be useful to have the basic tables in
go.text/unicode/break, for example, so that other packages in go.text can
build on it. This should probably have the form of a rune (or rather UTF-8,
as most of go.text packages do) to break property function, or something
like that.
Marcel
On Thu, Nov 21, 2013 at 12:11 AM, Andrew Brown wrote:
Currently Go's unicode package does not have the ability to properly
identify word breaks according to the Unicode specification. I first
noticed this, because the strings.Title function (which ought to turn the
first rune of every word to title case) actually determines word boundaries
based on the 'Separator' class of character. In other words, it thinks
that "shouldn't" is two words, 'shouldn' and 't' and capitalizes it like
so: "Shouldn'T"
https://code.google.com/p/go/issues/detail?id=6801
There are lots of potential uses for properly determining Unicode word
breaks, and I would like to propose a solution. I would like to extend
unicode/maketables.go
(https://code.google.com/p/go/source/browse/src/pkg/unicode/maketables.go)
to produce table data for word break classifications in the same way it
produces tables for scripts, properties, etc. Most of the functionality
is already there, and the source data for word breaks is already in the
same format as that of other tables currently being parsed:
http://www.unicode.org/Public/6.2.0/ucd/auxiliary/WordBreakProperty.txt
compare to:
http://www.unicode.org/Public/6.2.0/ucd/Scripts.txt
Then I will add a new file to the unicode package, wordbreak.go, which
will contain one function (to start):
    // IsWordBoundary reports whether there is a word boundary
    // between a and b.
    // See http://www.unicode.org/reports/tr29/#Word_Boundaries
    func IsWordBoundary(a rune, b rune) bool
This function will identify which word boundary class the runes belong
to (according to the new tables added in maketables) and then look up in a
rule table whether the boundary between them is a word boundary. See
http://unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.html#r999.0 for
a visual representation.
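
For readers who want to picture the pair-table idea, here is a rough sketch. The classify function is a crude stand-in for a lookup in generated tables (unicode.IsLetter is broader than the ALetter property), the wb* names are invented, and only the pairwise letter, digit and underscore rules are encoded. As the reply at the top of this thread notes, a pure two-rune lookup cannot express the rules that need more context.

```go
package main

import (
	"fmt"
	"unicode"
)

// wbClass is a hypothetical, heavily reduced set of word-break classes.
type wbClass int

const (
	wbOther wbClass = iota
	wbALetter
	wbNumeric
	wbExtendNumLet
	numClasses
)

// classify approximates the class of r; real code would consult tables
// generated from WordBreakProperty.txt.
func classify(r rune) wbClass {
	switch {
	case r == '_':
		return wbExtendNumLet
	case unicode.IsLetter(r):
		return wbALetter
	case unicode.IsDigit(r):
		return wbNumeric
	}
	return wbOther
}

// noBreak[a][b] is true when the pair rules forbid a boundary between a
// rune of class a and a following rune of class b. Unlisted pairs break.
var noBreak = [numClasses][numClasses]bool{
	wbALetter:      {wbALetter: true, wbNumeric: true, wbExtendNumLet: true},
	wbNumeric:      {wbALetter: true, wbNumeric: true, wbExtendNumLet: true},
	wbExtendNumLet: {wbALetter: true, wbNumeric: true, wbExtendNumLet: true},
}

// IsWordBoundary reports whether the pair table allows a boundary
// between a and b.
func IsWordBoundary(a, b rune) bool {
	return !noBreak[classify(a)][classify(b)]
}

func main() {
	fmt.Println(IsWordBoundary('a', 'b'))  // two letters
	fmt.Println(IsWordBoundary('a', ' '))  // letter, then space
	fmt.Println(IsWordBoundary('n', '\'')) // letter, then apostrophe
}
```

The last call reports a boundary between 'n' and the apostrophe, so "shouldn't" would still be split; handling that case needs a look at the rune after the apostrophe.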
Later, perhaps, more functionality could be added. For instance, the
strings package might want to separate a string into a slice of individual
word strings based on the unicode spec. In either case, my proposed
changes represent a solid step forward from what is there now, and they
allow me to fix the title case bug that caused me to look into this in the
first place.
Any thoughts?
--
You received this message because you are subscribed to the Google
Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit
https://groups.google.com/groups/opt_out.
--
Trying this for a while:
http://go/OnlyCheckEmailTwiceADay.
Marcel van Lohuizen -- Google Switzerland GmbH -- Identifikationsnummer:
CH-020.4.028.116-1