FAQ
I have a structure which is section of a json file parsed into memory, and
this section is array of maps.

dataMap []map[string]interface{}

The 'interface{}' is because the json has no specific structure/schema and
this is just one of the section retrieved after applying rules from
template.

The 'interface{}' value will be seen as string and hence will be converted
to string before comparison. I think using 'fmt.Sprint' would be fine to
convert the value to string. hence the above map can be think of array of
map[string]string.

There are over 100000 array elements and I have to quickly find out unquie
values for certain keys across maps in array. Such that if maps have value
{"F1":"value a" ,"F2":"value x"}
in another map
{"F1":"value a", "F2":"value y"}
in some other map
{"F1":"value b", "F2":"value x"}

Processing for unique keys F1 & F2 & after processing I should get:
"F1":["value a", "value b"]
"F2":["value x", "value y"]

*Approach:*
1 For each key "F1" have a map like *kmap *map[string}bool. Create go
routine for each key. Each routine will iterate over array and look for
values inside maps for key like "F1"

2a Then store the "F1" values as keys inside kmap.
kmap["values a"]=true ? what will be the most efficient dummy type for
value?

2b Also before inserting values check if the values exists in kmap. Once
done the kmap keys are the unique values for the desired key in the
original data structure.

Is the above approach efficient because I have 100000+ elements? I thought
of using slice to collects values however look up will not be fast. Maps
would be good at that. But then when I retrieve values from map they are
not in proper sequence. This may not be necessary however if it turns out
to be then is there any way to track the sequence?

Also is it worth to have several routines work to find unique values for
single key "F1". In such case I need to put read- write lock on each kmap s?


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Search Discussions

  • Ahmed Waheed at Apr 29, 2014 at 2:15 pm
    You could use something like this <http://play.golang.org/p/fApRmJ4PqK>,
    not sure how "slow" it would be with a bigger dataset though
    On Tuesday, April 29, 2014 7:13:58 AM UTC+2, Abhijit Kadam wrote:

    I have a structure which is section of a json file parsed into memory, and
    this section is array of maps.

    dataMap []map[string]interface{}

    The 'interface{}' is because the json has no specific structure/schema and
    this is just one of the section retrieved after applying rules from
    template.

    The 'interface{}' value will be seen as string and hence will be converted
    to string before comparison. I think using 'fmt.Sprint' would be fine to
    convert the value to string. hence the above map can be think of array of
    map[string]string.

    There are over 100000 array elements and I have to quickly find out unquie
    values for certain keys across maps in array. Such that if maps have value
    {"F1":"value a" ,"F2":"value x"}
    in another map
    {"F1":"value a", "F2":"value y"}
    in some other map
    {"F1":"value b", "F2":"value x"}

    Processing for unique keys F1 & F2 & after processing I should get:
    "F1":["value a", "value b"]
    "F2":["value x", "value y"]

    *Approach:*
    1 For each key "F1" have a map like *kmap *map[string}bool. Create go
    routine for each key. Each routine will iterate over array and look for
    values inside maps for key like "F1"

    2a Then store the "F1" values as keys inside kmap.
    kmap["values a"]=true ? what will be the most efficient dummy type for
    value?

    2b Also before inserting values check if the values exists in kmap. Once
    done the kmap keys are the unique values for the desired key in the
    original data structure.

    Is the above approach efficient because I have 100000+ elements? I thought
    of using slice to collects values however look up will not be fast. Maps
    would be good at that. But then when I retrieve values from map they are
    not in proper sequence. This may not be necessary however if it turns out
    to be then is there any way to track the sequence?

    Also is it worth to have several routines work to find unique values for
    single key "F1". In such case I need to put read- write lock on each kmap s?

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Egon at Apr 29, 2014 at 2:41 pm
    Explain the full story, talk about the actual context<https://code.google.com/p/go-wiki/wiki/HowToAsk>; it
    allows people to orient to the actual problem much faster.

    What do you mean very large? 1GB/1TB/1PB?
    How many times do you need to find those?
    How will it be used?
    Does it need to be offline/online? How often is the data updated?
    What algorithm are you trying to implement on top of that?

    As a recommendation, do the the simplest approach first and see whether it
    works or not; maybe it's already sufficient.

    Also, without proper context it's difficult to recommend any advanced data
    structures and/or algorithms to solve the actual problem.

    Most efficient dummy type is struct{}.

    + egon
    On Tuesday, April 29, 2014 8:13:58 AM UTC+3, Abhijit Kadam wrote:

    I have a structure which is section of a json file parsed into memory, and
    this section is array of maps.

    dataMap []map[string]interface{}

    The 'interface{}' is because the json has no specific structure/schema and
    this is just one of the section retrieved after applying rules from
    template.

    The 'interface{}' value will be seen as string and hence will be converted
    to string before comparison. I think using 'fmt.Sprint' would be fine to
    convert the value to string. hence the above map can be think of array of
    map[string]string.

    There are over 100000 array elements and I have to quickly find out unquie
    values for certain keys across maps in array. Such that if maps have value
    {"F1":"value a" ,"F2":"value x"}
    in another map
    {"F1":"value a", "F2":"value y"}
    in some other map
    {"F1":"value b", "F2":"value x"}

    Processing for unique keys F1 & F2 & after processing I should get:
    "F1":["value a", "value b"]
    "F2":["value x", "value y"]

    *Approach:*
    1 For each key "F1" have a map like *kmap *map[string}bool. Create go
    routine for each key. Each routine will iterate over array and look for
    values inside maps for key like "F1"

    2a Then store the "F1" values as keys inside kmap.
    kmap["values a"]=true ? what will be the most efficient dummy type for
    value?

    2b Also before inserting values check if the values exists in kmap. Once
    done the kmap keys are the unique values for the desired key in the
    original data structure.

    Is the above approach efficient because I have 100000+ elements? I thought
    of using slice to collects values however look up will not be fast. Maps
    would be good at that. But then when I retrieve values from map they are
    not in proper sequence. This may not be necessary however if it turns out
    to be then is there any way to track the sequence?

    Also is it worth to have several routines work to find unique values for
    single key "F1". In such case I need to put read- write lock on each kmap s?

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Abhijit Kadam at Apr 29, 2014 at 6:08 pm
    File is 50 MB size and the array size is 100000. The data is not updated,
    it is a dump. The unique values to be found out from the columns of the
    matrix.
    "F1":["value a", "value b"]
    "F2":["value x", "value y"]

    Then printed like :
    F1 F2
    --------------------------
    value a value x
    value b value y

    Here is how I have done it and it takes few seconds to do it. Below is code
    extract. "rowMeta.ColsInfo" has information on what is to be processed,
    i.e. which keys to extract unique values for as there are many other
    columns that should be ignored. For each column to be looked for unique
    values I am starting a routine. That routine will find the unique values
    using a temp map and appends it to the allocated internal slice inside
    *rowsTowrite.*
    //Certain t

    *func ProcessArrayForUniqueValues(repeatData []interface{}, rowMeta
    *excel.RowInfo, csvfile *csvfile.CSVReport) {*
    * var rowsTowrite = make([][]string, len(rowMeta.ColsInfo),
    len(rowMeta.ColsInfo))*
    * var wg sync.WaitGroup*
    * for cn, col := range rowMeta.ColsInfo {*
    * if col.Key != "" {*
    * wg.Add(1)*
    * go func(index int, colInfo excel.ColInfo, data []interface{}) {*
    * defer wg.Done()*
    * kmap := make(map[string]bool)*
    * uvalues := make([]string, 0, 500)*
    * for _, rowRefTmp := range data {*
    * rowref := rowRefTmp.(map[string]interface{})*
    * uvalue := fmt.Sprint(rowref[colInfo.Key])*
    * _, ok := kmap[uvalue]*
    * if !ok {*
    * kmap[uvalue] = true*
    * uvalues = append(uvalues, uvalue)*
    * }*
    * }*
    * rowsTowrite[index] = uvalues*
    * }(cn, *col, repeatData)*
    * }*
    * }*
    * wg.Wait()*

    * //write to file*
    * maxColLength := 0*
    * for cn, _ := range rowsTowrite {*
    * slicelen := len(rowsTowrite[cn])*
    * if slicelen > maxColLength {*
    * maxColLength = slicelen*
    * }*
    * }*

    * for i := 0; i < maxColLength; i++ {*
    * var record []string*
    * for j := 0; j < len(rowsTowrite); j++ {*
    * tmpslice := rowsTowrite[j]*
    * if len(tmpslice) > i {*
    * record = append(record, tmpslice[i])*
    * } else {*
    * record = append(record, "")*
    * }*

    * }*
    * csvfile.Write(&record)//todo: send in batches of [][]string to WriteAll*
    * }*
    *}*

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Abhijit Kadam at Apr 29, 2014 at 6:10 pm
    When I said it takes few seconds to do it above, I mean it takes few
    seconds to run the code and find unique values from data structure :)

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Egon at Apr 29, 2014 at 6:27 pm
    That's still not the context of the problem. If I'm not able to write the
    whole program instead of you, then it's not the full context. Please
    read https://code.google.com/p/go-wiki/wiki/HowToAsk it will make your
    question and problem clearer. I can't see the business value for your
    question, which means I have no clue what the best code for solving the
    problem is. Nor do I have any clue what the actual input data looks like,
    or what domain it's related to.

    Anyways, things I noticed... you are passing in []interface{} instead of
    concrete type. Why won't you send in a specific type where the conversion
    to string has already been done?

    You can use map[string]struct{} instead of map[string]bool; the first one
    uses less memory.

    Did you intend to pass in *col instead of col? You are copying the data
    inside excel.ColInfo.

    (lot of assumptions made in this suggestion) I expect most time you are
    spending on fmt.Sprint, just normalize the input json and don't unmarshal
    the whole content by using json.RawMessage.

    You are spinning up a lot of goroutines, it might be faster to split the
    data (i.e. batching) between a few goroutines.

    + egon
    On Tuesday, April 29, 2014 9:07:59 PM UTC+3, Abhijit Kadam wrote:

    File is 50 MB size and the array size is 100000. The data is not updated,
    it is a dump. The unique values to be found out from the columns of the
    matrix.
    "F1":["value a", "value b"]
    "F2":["value x", "value y"]

    Then printed like :
    F1 F2
    --------------------------
    value a value x
    value b value y

    Here is how I have done it and it takes few seconds to do it. Below is
    code extract. "rowMeta.ColsInfo" has information on what is to be
    processed, i.e. which keys to extract unique values for as there are many
    other columns that should be ignored. For each column to be looked for
    unique values I am starting a routine. That routine will find the unique
    values using a temp map and appends it to the allocated internal slice
    inside *rowsTowrite.*
    //Certain t

    *func ProcessArrayForUniqueValues(repeatData []interface{}, rowMeta
    *excel.RowInfo, csvfile *csvfile.CSVReport) {*
    * var rowsTowrite = make([][]string, len(rowMeta.ColsInfo),
    len(rowMeta.ColsInfo))*
    * var wg sync.WaitGroup*
    * for cn, col := range rowMeta.ColsInfo {*
    * if col.Key != "" {*
    * wg.Add(1)*
    * go func(index int, colInfo excel.ColInfo, data []interface{}) {*
    * defer wg.Done()*
    * kmap := make(map[string]bool)*
    * uvalues := make([]string, 0, 500)*
    * for _, rowRefTmp := range data {*
    * rowref := rowRefTmp.(map[string]interface{})*
    * uvalue := fmt.Sprint(rowref[colInfo.Key])*
    * _, ok := kmap[uvalue]*
    * if !ok {*
    * kmap[uvalue] = true*
    * uvalues = append(uvalues, uvalue)*
    * }*
    * }*
    * rowsTowrite[index] = uvalues*
    * }(cn, *col, repeatData)*
    * }*
    * }*
    * wg.Wait()*

    * //write to file*
    * maxColLength := 0*
    * for cn, _ := range rowsTowrite {*
    * slicelen := len(rowsTowrite[cn])*
    * if slicelen > maxColLength {*
    * maxColLength = slicelen*
    * }*
    * }*

    * for i := 0; i < maxColLength; i++ {*
    * var record []string*
    * for j := 0; j < len(rowsTowrite); j++ {*
    * tmpslice := rowsTowrite[j]*
    * if len(tmpslice) > i {*
    * record = append(record, tmpslice[i])*
    * } else {*
    * record = append(record, "")*
    * }*

    * }*
    * csvfile.Write(&record)//todo: send in batches of [][]string to WriteAll*
    * }*
    *}*
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Abhijit Kadam at Apr 29, 2014 at 6:53 pm
    Egon, I apologize for not being much clear, because of certain reasons
    cannot talk on domain and business.
    Replies:
    Anyways, things I noticed... you are passing in []interface{} instead of
    concrete type. Why won't you send in a specific type where the conversion
    to string has already been done?
    ==> The json file has no specific structure/schema. A template has some
    information about the schema. The json and template comes as input. They
    can deviate little here and there.

    You can use map[string]struct{} instead of map[string]bool; the first one
    uses less memory.
    ==> ok.

    Did you intend to pass in *col instead of col? You are copying the data
    inside excel.ColInfo.
    ==> yes it is little data. will change it to use pointers. Thanks for
    catching that.

    (lot of assumptions made in this suggestion) I expect most time you are
    spending on fmt.Sprint, just normalize the input json and don't unmarshal
    the whole content by using json.RawMessage.
    ===> The processing is directed by template data and not the json file
    itself. The template can direct to access any data from any struct. Besides
    I have little understanding of json.RawMessage. Will see that

    You are spinning up a lot of goroutines, it might be faster to split the
    data (i.e. batching) between a few goroutines.
    ==> yes, will think on this, will take time for me. Right now I have
    limited columns in the reports to look for uniqueness.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Egon at Apr 29, 2014 at 7:29 pm

    On Tuesday, April 29, 2014 9:53:16 PM UTC+3, Abhijit Kadam wrote:
    Egon, I apologize for not being much clear, because of certain reasons
    cannot talk on domain and business.
    No problem... just mention that you cannot talk about the domain next time.
    That way, people will know that it's not because you forgot to mention it
    or just skipping it; and they won't bother asking for it :).

    Replies:
    Anyways, things I noticed... you are passing in []interface{} instead of
    concrete type. Why won't you send in a specific type where the conversion
    to string has already been done?
    ==> The json file has no specific structure/schema. A template has some
    information about the schema. The json and template comes as input. They
    can deviate little here and there.
    Then probably the best solution would implement a different kind of json
    parser; one that takes into account your template as well. This would be
    specially tailored approach for the problem that you are describing. As a
    rough sketch what I would do:

    First implement a json Decoder that allows you to do something like:
    type Decoder {
    Index() int
    Key() string
    Enter() Decoder, error
    AsString() string
    Skip() error
    }

    Then write your json decoding into "unique" maps directly on that... Of
    course the idea is a rough sketch; I'm not sure what the proper interface
    would look like. Maybe, you can start with
    the http://golang.org/src/pkg/encoding/json/scanner.go and implementing a
    custom "step" on the scanner struct. (by copying the scanner code)...

    tl;dr; the "json -> go structures -> map[string]struct{}" is unnecessary...
    you would probably want "json -> map[string]struct{}" for best
    performance... there's no builtin stuff for that, so you would need to
    build it yourself.

    You can use map[string]struct{} instead of map[string]bool; the first one
    uses less memory.
    ==> ok.

    Did you intend to pass in *col instead of col? You are copying the data
    inside excel.ColInfo.
    ==> yes it is little data. will change it to use pointers. Thanks for
    catching that.
    If it's very small amounts of data then using a pointer might be also
    slower... just measure :)

    (lot of assumptions made in this suggestion) I expect most time you are
    spending on fmt.Sprint, just normalize the input json and don't unmarshal
    the whole content by using json.RawMessage.
    ===> The processing is directed by template data and not the json file
    itself. The template can direct to access any data from any struct. Besides
    I have little understanding of json.RawMessage. Will see that
    json.RawMessage just means it unserializes rest of the structure as a
    string.

    You are spinning up a lot of goroutines, it might be faster to split the
    data (i.e. batching) between a few goroutines.
    ==> yes, will think on this, will take time for me. Right now I have
    limited columns in the reports to look for uniqueness.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Abhijit Kadam at May 6, 2014 at 10:19 am
    Egon, Any example of batching between the go routines?

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Egon at May 6, 2014 at 10:36 am

    On Tuesday, May 6, 2014 1:19:09 PM UTC+3, Abhijit Kadam wrote:
    Egon, Any example of batching between the go routines?
    v1. http://play.golang.org/p/JaaQjmdZNv
    v2. http://play.golang.org/p/pwbneX25uX

    I didn't bother doing all the tiny variations; see what performs better or
    looks nicer and then just profile. (you probably want to increase the size
    of buffers and batches in real world)

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Bakul Shah at Apr 29, 2014 at 6:45 pm

    On Mon, 28 Apr 2014 22:13:58 PDT Abhijit Kadam wrote:
    ------=_Part_55_20676219.1398748438065
    Content-Type: text/plain; charset=UTF-8

    I have a structure which is section of a json file parsed into memory, and
    this section is array of maps.

    dataMap []map[string]interface{}

    The 'interface{}' is because the json has no specific structure/schema and
    this is just one of the section retrieved after applying rules from
    template.

    The 'interface{}' value will be seen as string and hence will be converted
    to string before comparison. I think using 'fmt.Sprint' would be fine to
    convert the value to string. hence the above map can be think of array of
    map[string]string.

    There are over 100000 array elements and I have to quickly find out unquie
    values for certain keys across maps in array. Such that if maps have value
    {"F1":"value a" ,"F2":"value x"}
    in another map
    {"F1":"value a", "F2":"value y"}
    in some other map
    {"F1":"value b", "F2":"value x"}

    Processing for unique keys F1 & F2 & after processing I should get:
    "F1":["value a", "value b"]
    "F2":["value x", "value y"]
    How about something like this? [Not a solution, just to give
    you an idea]

    // create a map of string to string-bag
    for _, m := range dataMap {
      for k, v := range m {
       append(m1[k], v)
      }
    }
    // replace bags with sets
    for k, v := range m1 {
      m1[k] = bagToSet(v)
    }

    bag == where unlike in a set an object may appear more than once.

    Now you have a simpler (but more generally useful) problem to solve : )

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupgolang-nuts @
categoriesgo
postedApr 29, '14 at 5:14a
activeMay 6, '14 at 10:36a
posts11
users4
websitegolang.org

People

Translate

site design / logo © 2022 Grokbase