FAQ
I see that Impala 1.0.1 supports "insert into values" - but it's not
recommended for large volumes of data.

Q) I am looking for a simple mechanism to append data to a file and have
this data "immediately" available as part of the Impala table.

(The data is already structured and doesn't need Pig ETL, etc.)

I believe that if I create a new file inside the table's directory, I will need
to call REFRESH.


Can I use a simple piece of Java code that simply appends to a sequence
file?

writer = new SequenceFile.Writer(fs, conf, out, Text.class, Text.class);

Q) Also, do I need to explicitly enable HDFS file append functionality?
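
For reference, a minimal sketch of this add-a-file-then-refresh flow; the table name is hypothetical, and the new data file is assumed to already have been copied into the table's HDFS directory (for example with hdfs dfs -put):

-- Hypothetical table; a new data file has just been placed in its
-- HDFS directory outside of Impala.
REFRESH events;

-- Rows from the new file are now visible to queries:
SELECT COUNT(*) FROM events;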


  • Gerrard Mcnulty at Jun 6, 2013 at 11:32 am
    So with a columnar format such as Parquet, what's the best way to deal with
    streaming data? The data doesn't need to be real time; I can do something
    like load the data every 10 minutes, every hour, etc.

    I noticed that each time I call INSERT ... SELECT into a Parquet table, Impala
    creates new files. So when streaming/near-streaming, I end up with lots of
    small files. I presume this is bad for compression and performance?
    On Thursday, June 6, 2013 2:06:23 AM UTC+1, Alex Behm wrote:

    Hi Paul,

    You are correct:
    - We don't recommend "insert into values" for large volumes of data
    (either lots of data via a single query, or lots of data via many
    small such queries).
    - After adding a new file to a table directory, you'll need to refresh
    Impala.

    My understanding is that the main issue will be whether the desired
    file format supports appends.
    SequenceFile, for example, does not support appending to the same file
    (see https://issues.apache.org/jira/browse/HADOOP-7139).

    You'll need to check which file formats support appending to the same
    file, but my guess is that many don't (especially since the columnar
    formats benefit from bulk data loads).

    I believe you don't need to explicitly enable append in HDFS since
    version 0.2.1 (https://issues.apache.org/jira/browse/HDFS-1107).

    Cheers,

    Alex

  • Alex Behm at Jun 6, 2013 at 8:14 pm
    A reasonable approach is to ingest your streaming data into HDFS in
    whatever format is convenient for ingestion. Then you periodically
    convert the data in batch to Parquet for faster querying via Impala.
    Obviously, the devil is in the details, but those depend heavily on
    your setup, data, ingestion rate, etc.

    Cheers,

    Alex
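
    A minimal sketch of this staging-then-convert pattern, with hypothetical table and column names (newer Impala releases also accept STORED AS PARQUET in place of PARQUETFILE):

    -- Append-friendly text-format staging table that ingestion writes files into.
    CREATE TABLE events_staging (ts STRING, payload STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

    -- Parquet table used for querying.
    CREATE TABLE events_parquet (ts STRING, payload STRING)
      STORED AS PARQUETFILE;

    -- Periodic batch job: convert whatever has accumulated in the staging table.
    INSERT INTO events_parquet SELECT ts, payload FROM events_staging;

    In practice you would typically partition both tables (for example by hour) and convert one partition at a time, so that already-converted data is not reinserted.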
  • Gerrard Mcnulty at Jun 7, 2013 at 8:14 am
    Hm, yeah, but then you can't query that data in Impala until you batch it in.
  • Alex Behm at Jun 7, 2013 at 4:37 pm
    Sure you can. Two possible solutions to that problem:
    1. Use a table with mixed partition formats. All partitions except a
    special ingestion partition would use Parquet. When you query the
    table, you'll get all the data (including the non-Parquet data from the
    special ingestion partition).
    2. Use two tables. One table is completely non-Parquet and used for
    ingestion, and the other table is in Parquet. You can write queries
    against the UNION ALL of those two tables.

    Cheers,

    Alex
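
    A minimal sketch of option 2, reusing the hypothetical events_parquet and events_staging tables from the earlier example; queries simply read the UNION ALL of the two tables:

    -- The query spans both the converted Parquet data and the freshly ingested data.
    SELECT COUNT(*)
    FROM (
      SELECT ts, payload FROM events_parquet
      UNION ALL
      SELECT ts, payload FROM events_staging
    ) all_events
    WHERE ts >= '2013-06-07';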
  • Gerrard Mcnulty at Jun 10, 2013 at 9:26 am
    Option 1 (mixed partition formats) looks interesting. Can you do that in
    Impala? I don't see it in the docs.

  • John Russell at Jun 10, 2013 at 5:39 pm
    The mixed partition business is an omission in the partitioning section of the Impala docs. I'll add the info for the next Impala doc refresh.

    Thanks,
    John
    On Jun 10, 2013, at 10:14 AM, Alex Behm wrote:

    Yes.
    1. Use CREATE TABLE to create a partitioned table
    2. Use ALTER TABLE to add a partition
    3. Use ALTER TABLE to change the format of that partition
    4. Any operation against that table (SELECT/INSERT) will read/write
    data in the appropriate format.

    Our syntax for CREATE/ALTER table is like Hive's. You may refer to their docs.

    Cheers,

    Alex
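
    A minimal sketch of those four steps, assuming a hypothetical table named events; the ALTER TABLE ... SET FILEFORMAT syntax follows the Hive-style DDL referred to above:

    -- 1. Partitioned table whose default format is Parquet.
    CREATE TABLE events (ts STRING, payload STRING)
      PARTITIONED BY (day STRING)
      STORED AS PARQUETFILE;

    -- 2. Add a dedicated ingestion partition.
    ALTER TABLE events ADD PARTITION (day='ingest');

    -- 3. Switch just that partition to an append-friendly text format.
    ALTER TABLE events PARTITION (day='ingest') SET FILEFORMAT TEXTFILE;

    -- 4. SELECT and INSERT against the table now handle both the Parquet
    --    partitions and the text-format ingestion partition.
    SELECT COUNT(*) FROM events;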
