Forwarding this to CDH User. They'll be able to answer the questions about
HDFS better.

On the Impala side, we don't do anything special to detect this setup. In
addition to however this might affect HDFS, Impala will only see one
replica, which limits our ability to schedule the query fragments. You
might see more hotspots.

---------- Forwarded message ----------
From: Paul Birnie <pbirnie@gmail.com>
Date: Fri, Jul 12, 2013 at 5:36 AM
Subject: Streaming append to a text file and in parallel querying from
impala at the same time
To: impala-user@cloudera.org


Hi,

I wanted to try and see if I could append to a file on hdfs and use impala
to query the external table

result: it works (in that I can append to a text file and, in parallel, run
impala queries whose results include the appended lines in the text file)

in order to get it to work i had to:
     enable dfs.support.append on hdfs
     accept that only one writer per file is supported (see the sketch
after this list)
     set replication of the appended file to 1 (not 3)
     note: i really don't know how impala queries will perform once
multiple impalad nodes are accessing the single, continually appended-to
data1.txt file
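
To see the single-writer rule in action, here is a minimal sketch (class
name and structure are mine, untested): while the FileAppend1 program
below holds data1.txt open, a second append() from another process should
fail, because HDFS grants the write lease on a file to one client at a time.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SecondWriterCheck {

     public static void main(String[] args) throws IOException {

         String uri = "hdfs://localhost/user/cloudera/mydb/day1/data1.txt";
         FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());

         try {
             // attempt to open a second concurrent writer on the same file
             fs.append(new Path(uri));
             System.out.println("unexpectedly obtained a second write lease");
         } catch (IOException e) {
             // typically a RemoteException naming AlreadyBeingCreatedException
             System.out.println("append refused as expected: " + e.getMessage());
         } finally {
             fs.close();
         }
     }
}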

investigate further:
     Q) is it correct that I had to set replication to 1 (in order to get
this to work)?
     Q) what are the performance implications of using impala in this way?

implementation details:

## Use cloudera manager safety valve to enable append on hdfs
## http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Free/4.5.4/Cloudera-Manager-Free-Edition-User-Guide/cmfeug_topic_5_3.html
##
<property>
     <name>dfs.support.append</name>
     <value>true</value>
</property>
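
As an aside, here is a minimal sketch (class name is mine) showing the same
switch set programmatically on the client Configuration; assumption: the
server side must still have dfs.support.append enabled, so this alone is
not sufficient for fs.append() to succeed:

import org.apache.hadoop.conf.Configuration;

public class AppendConfCheck {

     public static void main(String[] args) {

         Configuration conf = new Configuration();
         // client-side view of the append switch; the namenode must also allow appends
         conf.setBoolean("dfs.support.append", true);
         System.out.println("dfs.support.append = "
                 + conf.getBoolean("dfs.support.append", false));
     }
}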

## wrote some code capable of appending to an hdfs file
package com.test.hdfs;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileAppend1 {

     public static void main(String[] args) throws IOException {

         if (args.length < 3) {
             System.err.println("usage: FileAppend1 <hdfs-uri> <line> <iteration-count>");
             System.exit(1);
         }

         String uri = args[0];

         // the line the user wants to append, e.g. "tradeappend0,EQUITY,book1,-6449"
         String content = args[1];

         int iterationCount = Integer.parseInt(args[2]);

         // instantiate a configuration class
         Configuration conf = new Configuration();

         // get an HDFS filesystem instance
         FileSystem fs = FileSystem.get(URI.create(uri), conf);

         // open the existing file for append (requires dfs.support.append)
         FSDataOutputStream fsout = fs.append(new Path(uri));

         for (int i = 0; i < iterationCount; i++) {

             fsout.writeBytes(content);
             fsout.writeBytes("\n");

             // flush the client-side buffer, then sync so readers see the bytes
             // (sync() is the 0.20-era API; newer clients call hflush())
             fsout.flush();
             fsout.sync();

             System.out.println("wrote:" + i);

             try {
                 Thread.sleep(100);
             } catch (InterruptedException ie) {
                 // ignore and keep appending
             }
         }

         fsout.close();
         fs.close();
     }
}


## Deployed writer into the cloudera vm as hdfsexperiement-1.0.1.jar
##
## note: use hdfs://localhost, as the hdfs service does not listen on a
## public ip address by default
## (I assume to prevent writes/reads from a hacker into the cloudera vm)
##
## note: it's important to reference the exact same jar dependencies
## that are running on the cluster
##
java -cp "/usr/lib/hadoop/client-0.20/*:hdfsexperiement-1.0.1.jar" com.test.hdfs.FileAppend1 "hdfs://localhost/user/cloudera/mydb/day1/data1.txt" "tradeappend1,EQUITY,book1,-6449" 1000000

## In parallel ran the following in the impala-shell
##
[localhost.localdomain:21000] > select count(*) from mydb.day1;
Query: select count(*) from mydb.day1
Query finished, fetching results ...
+----------+
| count(*) |
+----------+
| 43940    |
+----------+
Returned 1 row(s) in 7.36s
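
## note: impala caches table and block metadata, so if repeated counts
## ever stop increasing, my assumption is that a refresh in impala-shell
## will pick up the newly appended data:
[localhost.localdomain:21000] > refresh mydb.day1;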



## Encountered ERROR: java.io.IOException: Failed to add a datanode.
## Solution:
## http://stackoverflow.com/questions/15347799/java-io-ioexception-failed-to-add-a-datanode-hdfs-hadoop
## (my understanding: on a single-datanode vm the append pipeline cannot
## replace a datanode when replication is 3, hence setting replication to 1)


[cloudera@localhost ~]$ hadoop dfs -setrep -R -w 1 /user/cloudera

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Replication 1 set: /user/cloudera/mydb/day1/data1.txt
Waiting for /user/cloudera/mydb/day1/data1.txt ... done
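
The same change can presumably be made from code rather than the shell; a
minimal sketch (class name is mine) using FileSystem.setReplication on the
appended file:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationOne {

     public static void main(String[] args) throws IOException {

         String uri = "hdfs://localhost/user/cloudera/mydb/day1/data1.txt";
         FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());

         // setReplication returns true if the namenode accepted the request
         boolean ok = fs.setReplication(new Path(uri), (short) 1);
         System.out.println("replication set to 1: " + ok);

         fs.close();
     }
}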


## Question: Is it possible to write to a single hdfs file from multiple locations?
## Answer: no, use a jms queue or an in-memory queue to prepare and write the data
## http://stackoverflow.com/questions/6389594/is-it-possible-to-append-to-hdfs-file-from-multiple-clients-in-parallel
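
A minimal sketch of that pattern (my interpretation of the advice, all
names mine, untested): producers on any thread hand lines to a
BlockingQueue, and a single thread owns the append stream, so the
single-writer rule is respected:

import java.io.IOException;
import java.net.URI;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QueuedAppender implements Runnable {

     // producers block if the writer falls more than 10000 lines behind
     private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
     private final String uri;

     public QueuedAppender(String uri) {
         this.uri = uri;
     }

     // called by producers (e.g. jms listeners) from any thread
     public void submit(String line) throws InterruptedException {
         queue.put(line);
     }

     // the only thread that ever touches the append stream
     public void run() {
         try {
             FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
             FSDataOutputStream fsout = fs.append(new Path(uri));
             try {
                 while (true) {
                     fsout.writeBytes(queue.take());
                     fsout.writeBytes("\n");
                     fsout.flush();
                     fsout.sync(); // 0.20-era API; newer clients call hflush()
                 }
             } catch (InterruptedException ie) {
                 // shut down cleanly when the writer thread is interrupted
                 Thread.currentThread().interrupt();
             } finally {
                 fsout.close();
                 fs.close();
             }
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
}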


  • Nong Li at Jul 15, 2013 at 6:08 pm
    Messed that up, forwarding to cdh-user again.
