FAQ
Hi,

If I execute the following code:

package main

import "fmt"

func main() {
     input := "யாதும் ஊரே"

     for _, letters := range input {
         fmt.Println(string(letters))
     }
}

I get an output as:

ய
ா
த
ு
ம
்
 
ஊ
ர
ே

Now, instead of the above output, I want to iterate over the individual
characters (each of which is a combination of one or more runes).
How can I do that? I want the diacritics
(http://en.wikipedia.org/wiki/Diacritic) merged with their base
character. The example output I am looking for is:

யா
து
ம்

ஊ
ரே


What is the way to get this? I googled around for a bit but could not
find much help. I could manually list the diacritics in my program and
parse them, but I see that Go already maintains a database of diacritics
for almost all languages (http://golang.org/pkg/unicode/), so I
wanted to know the best way to parse the individual characters
along with their diacritics.

Thanks.

--
Sankar P
http://psankar.blogspot.com

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  • RickyS at Oct 3, 2013 at 7:10 am
    I don't have deep knowledge of this subject.
    A for-range loop over a string only decodes the UTF-8 one code point at a
    time into runes (32-bit values), consuming as many bytes as each code
    point occupies.
    I think you'll have to write a sort of iterator that uses Unicode's
    knowledge of combining characters to generate your own output, probably
    groups of runes that the basic for-range loop would not generate. I doubt
    that you'd need a 64-bit return value, but who knows?

    The easiest way to structure this is probably a goroutine that sends the
    results out on a simple channel, so you could read the results with a
    'for-range' loop.
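
The goroutine-and-channel idea can be sketched roughly as follows. This is only an illustration, not code from the thread: the function name clusters is made up, and the rule it applies (attach every rune in the Unicode Mark category, unicode.M, to the rune before it) is an assumption that approximates grapheme clusters without implementing the full Unicode segmentation rules.

```go
package main

import (
	"fmt"
	"unicode"
)

// clusters (hypothetical name) sends each base rune together with the
// combining marks that follow it. It only checks the Mark category
// (Mn, Mc, Me), so it approximates grapheme clusters.
func clusters(s string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		var cur []rune
		for _, r := range s {
			// A non-mark rune starts a new cluster; flush the previous one.
			if !unicode.Is(unicode.M, r) && len(cur) > 0 {
				out <- string(cur) // string(cur) copies, so cur can be reused
				cur = cur[:0]
			}
			cur = append(cur, r)
		}
		if len(cur) > 0 {
			out <- string(cur)
		}
	}()
	return out
}

func main() {
	for c := range clusters("யாதும் ஊரே") {
		fmt.Println(c)
	}
}
```

One caveat with this design: if the reader stops before the channel is drained, the sending goroutine stays blocked, so a plain function returning a slice may be the simpler choice for short strings.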

  • Andy Balholm at Oct 3, 2013 at 2:21 pm
    I don't know how to do it, but I do know that what you are looking for are
    called "grapheme clusters". Try searching for that phrase; you can probably
    find an algorithm somewhere.

  • Sonia Keys at Oct 3, 2013 at 8:27 pm
    I think you want to test the character category with unicode.Is(). For
    your test string you need to test for Mn and Mc. I think it's good to test
    for Me as well. One way, http://play.golang.org/p/OX1383Qvbs. Another,
    http://play.golang.org/p/ZXSmjzgvCK.
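
The playground snippets linked above are not reproduced here. As an independent sketch of the same idea, assuming it is enough to attach every rune in Mn, Mc, or Me to the rune before it, something like the following could work. The names isMark and splitClusters are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// isMark reports whether r is in one of the three mark categories:
// Mn (nonspacing), Mc (spacing combining), or Me (enclosing).
func isMark(r rune) bool {
	return unicode.In(r, unicode.Mn, unicode.Mc, unicode.Me)
}

// splitClusters groups each rune with the marks that follow it.
// The index i from range is a byte offset, so s[start:i] slices
// the original string without copying runes.
func splitClusters(s string) []string {
	var out []string
	start := 0
	for i, r := range s {
		if i > start && !isMark(r) {
			out = append(out, s[start:i])
			start = i
		}
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	for _, c := range splitClusters("யாதும் ஊரே") {
		fmt.Printf("%q\n", c)
	}
}
```

This handles the test string, but it is not a complete grapheme cluster algorithm; scripts with joiners or conjoining sequences need more rules than a category test.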
  • Kevin Gillette at Oct 4, 2013 at 2:15 am
    What you want is the precise opposite (though luckily with nearly identical
    code) of what's happening at
    <https://github.com/xtgo/slug/blob/master/slug.go#L23>. In my case, I'm
    normalizing each character so that diacritics are split apart from the base
    rune. All you should need is:

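    import "code.google.com/p/go.text/unicode/norm" // now golang.org/x/text/unicode/norm
    // ...
    for _, r := range norm.NFC.String(yourUncombinedInput) {
        // r is each successive rune after composition; sequences that have
        // no precomposed form remain separate runes
    }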

    Depending on your exact needs, you may want to use norm.NFKC instead of
    NFC. See http://www.unicode.org/reports/tr15/#Compatibility_Composite_Figure for
    a diagram illustrating the difference. NFC keeps compatibility characters
    such as ligatures in their "literary" form, while NFKC first replaces them
    with their plain equivalents (often ASCII, if the base isn't already in the
    ASCII/Latin range) and then grafts the correct diacritics back onto the
    result.

    See also <http://godoc.org/code.google.com/p/go.text/unicode/norm#Form> for
    the unicode normalization package documentation.
  • Kevin Gillette at Oct 4, 2013 at 2:22 am
    A quick followup: NFC appears to sometimes split non-canonical (yet
    combined) characters into canonical, semi-split forms. There is some other
    useful stuff in the norm package that you can use alongside norm.Form, such
    as norm.Properties, to analyze groups of characters and determine how they
    relate. Also, norm.Iter should be more efficient than norm.Form.String if
    you're processing rune by rune, especially over a large input, but it's
    also less convenient for simple tasks.
  • Sankar P at Oct 4, 2013 at 4:23 am
    Thank you so much everyone. I did not expect so many good answers to
    be honest :) The go community is amazing.

    I will pick Sonia Keys' answer for my case as it looks to be the
    simplest and easiest to my eyes and the most suitable for my need. I
    actually looked at the unicode.Is function before asking here but
    could not find out what to give for the RangeTable that followed it.
    The godoc page for the unicode package had a variables section where some
    of the variables, like _Diacritic, were mentioned, but I was not sure which
    one I should be using (Ideographic, Grapheme, etc.),
    as I do not understand everything listed there.

    Now I see that Sonia Keys has used Mn, Mc, Me etc. but I could not
    find the documentation for these either in the godoc. I will use this
    code happily for now, but it will be good if we can get the
    explanation of these functions added to the godoc.

    Thank you so much.

    Sankar
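
For readers with the same question: Mn, Mc, and Me are Unicode General Category values, meaning "Mark, nonspacing", "Mark, spacing combining", and "Mark, enclosing", and the unicode package exposes each one as a *unicode.RangeTable. A small sketch that reports which of the three a rune falls into may help when exploring a string; the helper name markCategory is made up for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// markCategory (hypothetical helper) names the mark category of r,
// or returns "" if r is not a mark at all.
func markCategory(r rune) string {
	switch {
	case unicode.Is(unicode.Mn, r):
		return "Mn" // Mark, nonspacing
	case unicode.Is(unicode.Mc, r):
		return "Mc" // Mark, spacing combining
	case unicode.Is(unicode.Me, r):
		return "Me" // Mark, enclosing
	}
	return ""
}

func main() {
	// Print the code point and mark category of every rune in the input.
	for _, r := range "யாதும் ஊரே" {
		fmt.Printf("%U %q\n", r, markCategory(r))
	}
}
```

Base letters and the space print an empty category, while the vowel signs and the virama print Mc or Mn, which is why testing only those categories is enough for this particular string.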


  • Husylvan at Oct 4, 2013 at 10:21 am
    You may want to convert the input string into a slice of Unicode code
    points before beginning the for loop
    and then convert each code point back to a string.

    for _, letter := range []rune(input) {
        fmt.Println(string(letter))
    }


  • Luzon83 at Oct 4, 2013 at 10:45 am

    Your code is equivalent to his code (range already iterates over runes by
    default) and doesn't do what he wants:
    http://play.golang.org/p/t2SKX2-VWE

    A user-perceived character (grapheme) can consist of more than one Unicode
    code point as his example demonstrates.


Discussion Overview
group: golang-nuts
categories: go
posted: Oct 3, '13 at 6:43a
active: Oct 4, '13 at 10:45a
posts: 9
users: 7
website: golang.org
