Currently Go's unicode package cannot properly identify word breaks
according to the Unicode specification. I first noticed this because the
strings.Title function (which ought to turn the first rune of every word
to title case) actually determines word boundaries based on the
'Separator' class of character. In other words, it thinks that
"shouldn't" is two words, 'shouldn' and 't', and capitalizes it like so:
"Shouldn'T"

https://code.google.com/p/go/issues/detail?id=6801

There are lots of potential uses for properly determining Unicode word
breaks, and I would like to propose a solution. I would like to extend
unicode/maketables.go (
https://code.google.com/p/go/source/browse/src/pkg/unicode/maketables.go)
to produce table data for word break classifications in the same way it
produces tables for scripts, properties, etc. Most of the functionality
is already there, and the source data for word breaks is already in the
same format as that of other tables currently being parsed:

http://www.unicode.org/Public/6.2.0/ucd/auxiliary/WordBreakProperty.txt
compare to:
http://www.unicode.org/Public/6.2.0/ucd/Scripts.txt

Then I will add a new file to the unicode package, wordbreak.go, which will
contain one function (to start):

// IsWordBoundary reports whether there is a word boundary
// between a and b
// See http://www.unicode.org/reports/tr29/#Word_Boundaries
func IsWordBoundary(a rune, b rune) bool


This function will identify which word boundary class the runes belong to
(according to the new tables added in maketables) and then look up in a
rule table whether the boundary between them is a word boundary. See
http://unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.html#r999.0 for a
visual representation.
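
A minimal sketch of that idea, under loud assumptions: the wbClass type, the classify helper, and the noBreak table are all hypothetical, the class set is reduced to four values, and unicode.IsLetter and unicode.IsDigit stand in for the generated ALetter and Numeric tables (they are only approximations of those properties). The table encodes a handful of the pairwise "do not break" rules from UAX #29.

```go
package main

import (
	"fmt"
	"unicode"
)

// wbClass is a hypothetical, heavily reduced stand-in for the UAX #29
// Word_Break property; the real property has many more values.
type wbClass int

const (
	wbOther wbClass = iota
	wbALetter
	wbNumeric
	wbExtendNumLet
	numClasses
)

// classify approximates the class of r with existing unicode predicates.
// A real implementation would consult tables generated from
// WordBreakProperty.txt instead.
func classify(r rune) wbClass {
	switch {
	case unicode.IsLetter(r):
		return wbALetter
	case unicode.IsDigit(r):
		return wbNumeric
	case r == '_':
		return wbExtendNumLet
	}
	return wbOther
}

// noBreak[a][b] is true when there is no boundary between a rune of
// class a and a following rune of class b. Unlisted pairs break.
var noBreak = [numClasses][numClasses]bool{
	wbALetter:      {wbALetter: true, wbNumeric: true, wbExtendNumLet: true},
	wbNumeric:      {wbALetter: true, wbNumeric: true, wbExtendNumLet: true},
	wbExtendNumLet: {wbALetter: true, wbNumeric: true, wbExtendNumLet: true},
}

// IsWordBoundary reports whether there is a word boundary between a and b,
// looking only at the two runes.
func IsWordBoundary(a, b rune) bool {
	return !noBreak[classify(a)][classify(b)]
}

func main() {
	fmt.Println(IsWordBoundary('a', 'b'))  // false
	fmt.Println(IsWordBoundary('a', ' '))  // true
	fmt.Println(IsWordBoundary('n', '\'')) // true
}
```

The last call shows the limit of a purely pairwise table: with only two runes to look at, the apostrophe in "shouldn't" still produces a boundary.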

Later, perhaps, more functionality could be added. For instance, the
strings package might want to separate a string into a slice of individual
word strings based on the Unicode spec. Either way, my proposed changes
represent a solid step forward from what is there now, and they allow me
to fix the title case bug that caused me to look into this in the first
place.

Any thoughts?

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


  • Marcel van Lohuizen at Nov 21, 2013 at 11:58 am
    Hi Andrew,

    Something along these lines has been proposed in
    https://docs.google.com/a/golang.org/document/d/1Q64ktYh7XptpEI3L2G7xYqsusohOeLUft865Zd7fbGU/view
    to be part of the go.text repository.

    A few comments:

        - I don't think it is that trivial to use the existing unicode tables.
        The Unicode breaking algorithm requires you to look up the breaking
        category of two runes and then compare. The existing tables are more for
        determining whether a rune is in a certain set. To use these tables for the
        former would require iterating over all sets, which is rather inefficient.
        - The Unicode word break algorithms only give a decent default. But
        actual breaking behavior may differ from language to language. For example,
        word breaking in Chinese is a notoriously hard problem.
        - Aside from that, title case behavior is different per language (and
        word breaking is not always the same as breaking to do proper title case).
        - It is not always obvious what to do, even for English: compare don't
        -> Don't with conan o'brien -> Conan O'Brien.
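
The first point can be illustrated with tables that already exist. The sketch below uses unicode.Scripts rather than word break tables (which the package does not have), and scriptOf is a hypothetical helper. Finding which set a rune belongs to means testing the sets one by one.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptOf finds the script of r by testing every table in unicode.Scripts
// in turn. Each unicode.Is call is a search within a single table, so the
// total cost grows with the number of tables.
func scriptOf(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

func main() {
	for _, r := range "aя中" {
		fmt.Printf("%q: %s\n", r, scriptOf(r))
	}
}
```

A breaking algorithm would pay this cost for every rune of the input, which is why a single rune-to-category lookup is preferable.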

    Obviously, putting this in go.text won't help fix the broken Title in the
    core libraries. I'm not sure it is worth fixing. The tables don't
    belong in core, imho. It may be an option to put a simple solution in core
    just to solve Title, but I'm not sure it is worth it (plus you're just
    moving problems, see the example above). Quite frankly, I think *.Title
    should not have been in core in the first place. (Python made the same
    mistake.)

    That being said, it will be useful to have the basic tables in
    go.text/unicode/break, for example, so that other packages in go.text can
    build on them. This should probably take the form of a function that maps
    a rune (or rather UTF-8, as most go.text packages do) to its break
    property, or something like that.
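
One possible shape for such a function, as a sketch only: Property, its three values, and LookupString are hypothetical names, and the classification uses unicode.IsLetter and unicode.IsDigit as rough stand-ins for real break tables. The point is the signature, which takes UTF-8 input and returns the property together with the encoded size.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Property is a hypothetical, reduced break property type.
type Property uint8

const (
	Other Property = iota
	ALetter
	Numeric
)

// LookupString returns the property of the first rune encoded in s and the
// number of bytes it occupies, so callers can walk a string without first
// converting it to []rune. For an empty string it returns (Other, 0).
func LookupString(s string) (Property, int) {
	r, size := utf8.DecodeRuneInString(s)
	switch {
	case size == 0:
		return Other, 0
	case unicode.IsLetter(r):
		return ALetter, size
	case unicode.IsDigit(r):
		return Numeric, size
	}
	return Other, size
}

func main() {
	s := "año 42"
	for i := 0; i < len(s); {
		p, n := LookupString(s[i:])
		fmt.Printf("offset %d: property %d, %d byte(s)\n", i, p, n)
		i += n
	}
}
```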

    Marcel

    --
    Trying this for a while: http://go/OnlyCheckEmailTwiceADay.
    Marcel van Lohuizen -- Google Switzerland GmbH -- Identifikationsnummer:
    CH-020.4.028.116-1

  • Marcel van Lohuizen at Nov 21, 2013 at 12:24 pm
    That said, if we decide to fix Title, something along these lines would be
    the right approach:
    - add the break data to go.text/unicode/break.
    - write a table generator to distill only the data relevant to Title (I
    expect this to be considerably smaller.)
    - add this to core (where exactly depends on the size, I would say).


  • Andrew Brown at Nov 21, 2013 at 4:32 pm
    I can see why much of this functionality belongs in the text package, but
    the boundaries are still a little fuzzy. It's hard to tell what belongs
    where. Perhaps the unicode package should just be for decoding/encoding
    Unicode text, and any tables for classifying code points belong in text.

    The data for word break properties (
    http://www.unicode.org/Public/6.2.0/ucd/auxiliary/WordBreakProperty.txt)
    are in the same format as those for Scripts and Properties, so I modified
    unicode/maketables.go to include the word break tables in addition to the
    other ones that it already makes. This caused the resulting output file
    (tables.go) to grow from 158 KB to 178 KB (an increase of 12.7%).
    So it's not that much bigger, but maybe all (or some) of them belong in
    another package anyway.

    Regarding iterating over all the tables: this was a concern I had after I
    wrote my initial email as well. I think that we will probably have to
    generate a different data structure, one better suited to determining
    which of a group of categories a rune belongs to without having to try
    Is(*RangeTable, rune) on each one. I'll have to think about this, but I
    think there is a better solution. In any case, I could probably add the
    table data tonight or tomorrow if we decided to keep it in the current
    RangeTable format. Otherwise, we'll need to decide how and where (and if)
    to store that table data.
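
One candidate for such a structure, sketched under assumptions: the span type and the merge and lookup helpers are hypothetical, and three existing script tables stand in for the word break categories. Several disjoint range tables are flattened into one sorted slice, so a single binary search answers "which category?" instead of one search per table.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// span is one contiguous range of runes labeled with its category.
type span struct {
	lo, hi rune
	name   string
}

// merge flattens several disjoint range tables into one slice sorted by lo.
// Ranges with a stride other than 1 are expanded into single-rune spans.
func merge(tables map[string]*unicode.RangeTable) []span {
	var out []span
	add := func(lo, hi, stride rune, name string) {
		if stride == 1 {
			out = append(out, span{lo, hi, name})
			return
		}
		for r := lo; r <= hi; r += stride {
			out = append(out, span{r, r, name})
		}
	}
	for name, t := range tables {
		for _, r := range t.R16 {
			add(rune(r.Lo), rune(r.Hi), rune(r.Stride), name)
		}
		for _, r := range t.R32 {
			add(rune(r.Lo), rune(r.Hi), rune(r.Stride), name)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].lo < out[j].lo })
	return out
}

// lookup returns the category of r with one binary search, or "" if r is
// not covered by any span.
func lookup(spans []span, r rune) string {
	i := sort.Search(len(spans), func(i int) bool { return spans[i].hi >= r })
	if i < len(spans) && spans[i].lo <= r {
		return spans[i].name
	}
	return ""
}

// merged is built once from three script tables used as stand-in categories.
var merged = merge(map[string]*unicode.RangeTable{
	"Latin":    unicode.Latin,
	"Greek":    unicode.Greek,
	"Cyrillic": unicode.Cyrillic,
})

func main() {
	for _, r := range "aλя1" {
		fmt.Printf("%q: %q\n", r, lookup(merged, r))
	}
}
```

In practice the merged table would more likely be emitted directly by the generator than built at run time, but the lookup side would look much the same.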

    The title case issue is a little more tricky. The main reason I care is
    that I was trying to add that functionality to a text editor app being
    written in Go (https://github.com/limetext/lime/). Many text editors have
    this kind of functionality, where you can highlight text and cycle through
    different casing options (including title case). I tried out the sample in
    MS Word and Notepad++ (with the TextFX plugin). Word changed conan o'brien
    to Conan O'brien and shouldn't to Shouldn't (the way Unicode word breaking
    rules would), and Notepad++ changed them to Conan O'Brien and Shouldn'T
    (the way Go's Title function would). I would favor Word's implementation,
    because it follows the Unicode standard (even if it doesn't catch all the
    exceptional cases). In my case, I might just write my own title case
    function for the text editor. Honestly, unless I were to add a full-blown
    word-break implementation to this project, I would probably just take the
    same approach as Go's Title and exclude the apostrophe from the separator
    category (the way underscore is excluded in Go's implementation).
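
A minimal sketch of that workaround, assuming a simplified separator rule: isSeparator and titleCase are hypothetical names, and this is not the standard library's code. Anything that is not a letter, a digit, an underscore, or an apostrophe counts as a separator.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSeparator is a simplified rule for this sketch: the apostrophe and the
// underscore are treated as part of a word, as are letters and digits.
func isSeparator(r rune) bool {
	if r == '\'' || r == '_' {
		return false
	}
	return !unicode.IsLetter(r) && !unicode.IsDigit(r)
}

// titleCase title-cases every rune that follows a separator (or starts s).
func titleCase(s string) string {
	var b strings.Builder
	prev := ' ' // treat the start of the string as following a separator
	for _, r := range s {
		if isSeparator(prev) {
			b.WriteRune(unicode.ToTitle(r))
		} else {
			b.WriteRune(r)
		}
		prev = r
	}
	return b.String()
}

func main() {
	fmt.Println(titleCase("shouldn't"))     // Shouldn't
	fmt.Println(titleCase("conan o'brien")) // Conan O'brien
}
```

This matches the Word results described above, including the O'brien case, which it does not try to special-case.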

    Finally, I've come to realize that my proposed IsWordBoundary function
    would be quite inadequate. After reading up a bit more on this, I realize
    that there is more to determining a word break than looking at the two
    characters on either side of it. Sometimes a third character comes into
    play (a second character on either side of the proposed break). The
    proper implementation is certainly not trivial, but it's not out of the
    realm of possibility. I like the proposed segmentation interface, since
    it allows for other segmenters based on locale or preference.
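
To make the three-character case concrete, here is a rough sketch of the mid-word rules in UAX #29 (a letter, then a mid-word character, then a letter). The helper names are hypothetical, only the ASCII apostrophe is treated as a mid-word character, and unicode.IsLetter is an approximation of the ALetter class. All other rules are omitted.

```go
package main

import (
	"fmt"
	"unicode"
)

// isMid is an assumption for this sketch: only the ASCII apostrophe is
// treated as a mid-word character.
func isMid(r rune) bool { return r == '\'' }

// boundaryBefore reports whether there is a word boundary between
// rs[i-1] and rs[i]. The start and end of the text always count as
// boundaries.
func boundaryBefore(rs []rune, i int) bool {
	if i <= 0 || i >= len(rs) {
		return true
	}
	a, b := rs[i-1], rs[i]
	switch {
	case unicode.IsLetter(a) && unicode.IsLetter(b):
		return false
	case unicode.IsLetter(a) && isMid(b):
		// No break before the apostrophe only if a letter follows it.
		return !(i+1 < len(rs) && unicode.IsLetter(rs[i+1]))
	case isMid(a) && unicode.IsLetter(b):
		// No break after the apostrophe only if a letter precedes it.
		return !(i >= 2 && unicode.IsLetter(rs[i-2]))
	}
	return true
}

func main() {
	for _, s := range []string{"don't", "don' t"} {
		rs := []rune(s)
		for i := 1; i < len(rs); i++ {
			fmt.Printf("%q between %q and %q: boundary=%v\n",
				s, rs[i-1], rs[i], boundaryBefore(rs, i))
		}
	}
}
```

The same pair of runes ('n' followed by an apostrophe) gives different answers in the two inputs, which is exactly what a two-argument IsWordBoundary cannot express.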


Discussion Overview
group: golang-nuts
categories: go
posted: Nov 20, '13 at 11:11p
active: Nov 21, '13 at 4:32p
posts: 4
users: 2
website: golang.org
