I need some guidance on a small project that I am working on for hybrid
storage.

I want to process a flat file by:
a) sorting the flat file on a certain field/column
b) splitting the file into multiple files based on that field/column
c) moving the files to separate storage boxes (can be MongoDB, Box, Oracle, etc.).

I can do a and b easily with awk shell scripts, but I am finding it a bit
difficult to connect to cloud-based storage for point c.
I am writing a small project in Go that will consume a file and sort/split
it. You can consider the input a flat file with multiple records.

I started with a struct type and ioutil, but I am unable to move the file
data into the predefined structure or to sort it. Now I am stuck and
wondering whether Go is the right language for what I want to do. Any
help or samples would be great.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


  • Lars Seipel at Jul 26, 2015 at 9:36 pm

    What's the point where you get stuck when doing it in Go? How big are
    those files, anyway?

    If you're happy with your existing awk solution for a and b, you might
    also consider writing a Go program that takes the output from that and
    uploads it to your network storage.

  • Morpheus82 at Jul 26, 2015 at 11:44 pm
    Hi Lars, thanks. I am trying to call native sed/awk from Go, passing
    arguments from the program and reading the results back into it.

    With native Go, I am stuck at sorting a file that does not have a
    delimiter. I am unable to define the sort key as a substring of the whole
    line. Is applying a struct the only way?
    (Here is my equivalent in Python: lines = sorted(lines, key=lambda x: str(x[:8])),
    where "lines" is a list holding each line from the file.)

    The overall volume could be around 3 to 5 million records (150 bytes per
    record) in a single file, which could create 200K files (or, in S3 terms,
    objects).
    So parallel processing and the right choice of language matter, as the
    SLAs are quite important.

    I am new to file processing in Golang.


    PS:
    If it were done manually, the commands below would do it on Unix (though I
    have not tested the s3cmd part from a shell script):

        sed 's/\(.\{16\}\)/&|/' files | awk -F "|" '{print $1,$2 > ("sortedfile"$1)}' OFS='|'
        s3cmd -P put ~/Desktop/sorted* s3://mybucket/

    I have also written the code in Perl, but it might need another facade
    program for moving the files into S3. I am in the process of trying the
    same in Python.

    The way I am trying to construct the program is to split the file into two
    or three files, sort them in parallel, merge them, and then split on the
    key. After the split, the files are moved to S3 in parallel.

    There are ETL tools such as Talend and Ab Initio on the market, but they
    are too expensive for a startup.
  • Dustin at Jul 27, 2015 at 12:19 am
    Any reason you aren't using a database for this? It sounds like a good case
    for one, though I don't know what your plan is for the 200K files.

  • Srivatsan TA at Jul 27, 2015 at 12:39 am
    = Dustin,
  • Tamás Gulácsi at Jul 27, 2015 at 5:01 am
    // Python: lines = sorted(lines, key=lambda x: str(x[:8]))

    type byFirstEight []string

    func (s byFirstEight) Less(i, j int) bool { return s[i][:8] < s[j][:8] }
    ... // Len and Swap, the other two sort.Interface methods, are omitted here

    sort.Sort(byFirstEight(lines))


    Don't forget to use buffered input/output (bufio pkg) and Flush it!

  • Srivatsan TA at Jul 27, 2015 at 5:10 am
    Gracias, this helps. Let me digest, understand and try it out.

    Any tips for performance fine-tuning with bufio?

    Srivats

  • Tamás Gulácsi at Jul 27, 2015 at 5:40 am
    On Monday, July 27, 2015 at 7:11:11 UTC+2, Morpheus82 wrote:
    Gracias, this helps. Let me digest, understand and try it out.

    Any tips for performance fine-tuning with bufio?

    Yes: benchmark it with different buffer sizes
    (bufio.NewReaderSize(os.Stdin, 1<<20), for example).

    Instead of the sed + awk combo, you could use
    http://play.golang.org/p/rlSPGVG8oE
    Maybe you could skip writing files and upload directly with goamz.
    If you want or need to, you could use goroutines to parallelize those
    uploads, too.


  • Kiki Sugiaman at Jul 27, 2015 at 1:46 am
    I don't see the problem with using redis with persistence. Or take a
    look at ledisdb (redis-like semantics with storage backend of your choice).

  • Caleb at Jul 28, 2015 at 3:03 am
    Go is great at these kinds of tasks. Hopefully Tamas' code will help you.
    Here are a few suggestions for the S3 part:

        - You should probably throttle the number of concurrent goroutines you
        have processing the uploads - you wouldn't want to upload all 200 files
        simultaneously. If you use a channel you can pass along the file you'd like
        to upload and have a few goroutines churning away on the channel.
        - S3 requests can and do fail. Make sure to retry a few times on error
        before really giving up. (and maybe use time.Sleep between retries)
        - If you can rent a virtual machine on EC2 you can improve performance
        interacting with S3. This is particularly helpful if the data you are
        working with is already on S3.

Discussion Overview
group: golang-nuts
categories: go
posted: Jul 26, '15 at 8:39p
active: Jul 28, '15 at 3:03a
posts: 10
users: 6
website: golang.org
