FAQ
I'm doing an analysis of a large data set (120-150M) records which
represent spatial data (high thoughput genome sequence data with
consideration of the, for those who are interested [1]).

The string fields of the records are from a small set of values (not
determinable prior to the analysis) - the data structure that describes
the fields is here [2] - but the record reading makes it likely
(certain?) that each instance of the string values will be distinct
items. I need to store the entire set of records, so would like to be
able to intern the string values. The application heap size is on the
order of 100GB when running the analysis - at the moment I just discard
these strings.

How would I go about this in a reasonably efficient manner?

I'm thinking something on the lines of this, but it feels magical:

type store map[string]string

func (is store) intern(s string) string {
  t, ok := is[s]
  if ok {
   return t
  }
  is[s] = s
  return s
}

The rationale for this is that if the string has been seen, the returned
string is the stored string, rather than the query. Ideally, I'd like to
not store the string except in the key space, but I can't think of a way
to get the string representation corresponding to the key rather than
the query.

Dan

[1]http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
[2]http://godoc.org/code.google.com/p/biogo.illumina#Metadata

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Search Discussions

  • Liam LeFey at May 22, 2013 at 3:23 am
    It looks good to me. If you're trying to reduce the memory cost, this might
    help depending on the internal representation of a map; (does it point to
    values or hold them?):

    type store map[string]*string

    func (is store) intern(s string) string {
             t, ok := is[s]
             if ok {
                     return *t
             }
             is[s] = &s
             return s
    }
    On Tuesday, May 21, 2013 8:21:16 PM UTC-6, kortschak wrote:

    I'm doing an analysis of a large data set (120-150M) records which
    represent spatial data (high thoughput genome sequence data with
    consideration of the, for those who are interested [1]).

    The string fields of the records are from a small set of values (not
    determinable prior to the analysis) - the data structure that describes
    the fields is here [2] - but the record reading makes it likely
    (certain?) that each instance of the string values will be distinct
    items. I need to store the entire set of records, so would like to be
    able to intern the string values. The application heap size is on the
    order of 100GB when running the analysis - at the moment I just discard
    these strings.

    How would I go about this in a reasonably efficient manner?

    I'm thinking something on the lines of this, but it feels magical:

    type store map[string]string

    func (is store) intern(s string) string {
    t, ok := is[s]
    if ok {
    return t
    }
    is[s] = s
    return s
    }

    The rationale for this is that if the string has been seen, the returned
    string is the stored string, rather than the query. Ideally, I'd like to
    not store the string except in the key space, but I can't think of a way
    to get the string representation corresponding to the key rather than
    the query.

    Dan

    [1]
    http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
    [2]http://godoc.org/code.google.com/p/biogo.illumina#Metadata
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dan Kortschak at May 22, 2013 at 3:34 am

    On Tue, 2013-05-21 at 20:23 -0700, Liam LeFey wrote:
    If you're trying to reduce the memory cost, this might help depending
    on the internal representation of a map; (does it point to values or
    hold them?)
    The map holds the values, but strings are equivalent to a struct
    { uintpr; int }, so storing a string is two words; your approach would
    save one word at the cost of a deref. This is probably worth it in my
    case.

    Ideally, you'd be able to have a map[string]struct{} which means you
    save 2 words - but I don't think I can get that.


    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Jesse McNelis at May 22, 2013 at 3:40 am

    On Wed, May 22, 2013 at 1:34 PM, Dan Kortschak wrote:

    your approach would save one word at the cost of a deref.
    A deref and an additional allocation.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dan Kortschak at May 22, 2013 at 4:28 am

    On Wed, 2013-05-22 at 13:40 +1000, Jesse McNelis wrote:
    A deref and an additional allocation.
    Mmmm, are you sure? Where? s escapes, but it escapes anyway when it's
    stored in the map.

    Though engaging the brain just for a second, given I'm expecting a small
    set of unique strings, the size of the individual elements is not that
    important. So the basic model probably wins - and an additional test for
    "" would improve things for my expected data too.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Jesse McNelis at May 22, 2013 at 4:46 am

    On Wed, May 22, 2013 at 2:28 PM, Dan Kortschak wrote:
    On Wed, 2013-05-22 at 13:40 +1000, Jesse McNelis wrote:
    A deref and an additional allocation.
    Mmmm, are you sure? Where? s escapes, but it escapes anyway when it's
    stored in the map.
    s doesn't escape when it's stored in a map, it's copied when it's stored in
    a map.
    If you take the address of s then s escapes and the pointer is copied in to
    the map.


    Though engaging the brain just for a second, given I'm expecting a small
    set of unique strings, the size of the individual elements is not that
    important. So the basic model probably wins - and an additional test for
    "" would improve things for my expected data too.

    --
    =====================
    http://jessta.id.au

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dan Kortschak at May 22, 2013 at 4:53 am

    On Wed, 2013-05-22 at 14:46 +1000, Jesse McNelis wrote:
    s doesn't escape when it's stored in a map, it's copied when it's
    stored in a map.
    If you take the address of s then s escapes and the pointer is copied
    in to the map.
    Ahh yes. Otherwise you would be able to alter the key via the pointer.
    True?


    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Jesse McNelis at May 22, 2013 at 5:10 am

    On Wed, May 22, 2013 at 2:53 PM, Dan Kortschak wrote:
    On Wed, 2013-05-22 at 14:46 +1000, Jesse McNelis wrote:
    s doesn't escape when it's stored in a map, it's copied when it's
    stored in a map.
    If you take the address of s then s escapes and the pointer is copied
    in to the map.
    Ahh yes. Otherwise you would be able to alter the key via the pointer.
    True?
    I don't even know how that would make any sense.
    For this to happen the compiler would have had to detect that you were
    taking the
    address of a variable that you were also copying and instead of taking the
    address
      of the variable it would take the address of the copy.
    That sounds like madness.

    --
    =====================
    http://jessta.id.au

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dan Kortschak at May 22, 2013 at 5:13 am

    On Wed, 2013-05-22 at 15:10 +1000, Jesse McNelis wrote:
    I don't even know how that would make any sense.
    For this to happen the compiler would have had to detect that you were
    taking the address of a variable that you were also copying and
    instead of taking the address of the variable it would take the
    address of the copy.
    That sounds like madness.
    Maybe, I'm trying to understand why the string escapes. So far I don't.
    I don't see how any of the above helps.


    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Liam LeFey at May 22, 2013 at 5:25 am
    I wouldn't try to guess what the compiler is doing. If you're interested,
    go dig through the source. Personally all I learn from that is that it
    gives me a headache.

    If your set of unique strings is small, I wouldn't worry about using
    pointers. It adds one mote of complexity (and those add up).

    If it's bigger than you expect, I'd just run some quick benchmarks, and
    make a decision based on the results, and what you need to optimize. Quick
    and dirty benchmarks are pretty easy using the testing package.
    On Tuesday, May 21, 2013 11:13:53 PM UTC-6, kortschak wrote:
    On Wed, 2013-05-22 at 15:10 +1000, Jesse McNelis wrote:
    I don't even know how that would make any sense.
    For this to happen the compiler would have had to detect that you were
    taking the address of a variable that you were also copying and
    instead of taking the address of the variable it would take the
    address of the copy.
    That sounds like madness.
    Maybe, I'm trying to understand why the string escapes. So far I don't.
    I don't see how any of the above helps.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Jesse McNelis at May 22, 2013 at 5:31 am

    On Wed, May 22, 2013 at 3:13 PM, Dan Kortschak wrote:
    On Wed, 2013-05-22 at 15:10 +1000, Jesse McNelis wrote:
    I don't even know how that would make any sense.
    For this to happen the compiler would have had to detect that you were
    taking the address of a variable that you were also copying and
    instead of taking the address of the variable it would take the
    address of the copy.
    That sounds like madness.
    Maybe, I'm trying to understand why the string escapes. So far I don't.
    I don't see how any of the above helps.
    is[s] = &s

    The value of s is copied to the map as the key.
    The value for this key is the address of the variable s (not the value, the
    variable).
    Thus the variable s escapes because there are pointers to it
    that will exist outside of the scope of the function it was declared in.

    --
    =====================
    http://jessta.id.au

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dan Kortschak at May 22, 2013 at 5:35 am

    On Wed, 2013-05-22 at 15:31 +1000, Jesse McNelis wrote:
    The value of s is copied to the map as the key.
    The value for this key is the address of the variable s (not the
    value, the variable).
    Thus the variable s escapes because there are pointers to it
    that will exist outside of the scope of the function it was declared
    in.
    Thanks. That is much clearer.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dmitry Vyukov at May 22, 2013 at 6:23 am

    On Wed, May 22, 2013 at 6:21 AM, Dan Kortschak wrote:
    I'm doing an analysis of a large data set (120-150M) records which
    represent spatial data (high thoughput genome sequence data with
    consideration of the, for those who are interested [1]).

    The string fields of the records are from a small set of values (not
    determinable prior to the analysis) - the data structure that describes
    the fields is here [2] - but the record reading makes it likely
    (certain?) that each instance of the string values will be distinct
    items. I need to store the entire set of records, so would like to be
    able to intern the string values. The application heap size is on the
    order of 100GB when running the analysis - at the moment I just discard
    these strings.

    How would I go about this in a reasonably efficient manner?

    I'm thinking something on the lines of this, but it feels magical:

    type store map[string]string

    func (is store) intern(s string) string {
    t, ok := is[s]
    if ok {
    return t
    }
    is[s] = s
    return s
    }

    The rationale for this is that if the string has been seen, the returned
    string is the stored string, rather than the query. Ideally, I'd like to
    not store the string except in the key space, but I can't think of a way
    to get the string representation corresponding to the key rather than
    the query.

    If it satisfies you wrt performance/memory, looks fine.

    Another option would be to use a sorted []string, but I don't think it
    will be any better.

    If the strings are short and have an upper bound on length, then it's
    possible to embed them into something line [16]byte. Then, of course,
    these are not strings, so one will need to implement the necessary
    operations on these [16]byte. But e.g. compare can be very efficient
    (2 uint64 compares).

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dan Kortschak at May 22, 2013 at 6:28 am

    On Wed, 2013-05-22 at 10:23 +0400, Dmitry Vyukov wrote:
    If it satisfies you wrt performance/memory, looks fine.
    It seems to be doing the job.
    Another option would be to use a sorted []string, but I don't think it
    will be any better.
    The simplicity of the map trumps this at the moment.
    If the strings are short and have an upper bound on length, then it's
    possible to embed them into something line [16]byte. Then, of course,
    these are not strings, so one will need to implement the necessary
    operations on these [16]byte. But e.g. compare can be very efficient
    (2 uint64 compares).
    Interesting idea. Again - simplicity, but will keep in mind for later if
    performance is an issue; the strings are short.

    thanks

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupgolang-nuts @
categoriesgo
postedMay 22, '13 at 2:21a
activeMay 22, '13 at 6:28a
posts14
users4
websitegolang.org

People

Translate

site design / logo © 2022 Grokbase