FAQ
Hi,
I have a large table with 800M records in RCFile format.
I am creating another table with 'STORED as PARQUETFILE', with the same
schema as the first table.
insert overwrite table pq_network_Fact partition (day_key)
select .... from rc_network_fact;
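
(For reference, a minimal sketch of the kind of DDL and insert described
above; the column names are placeholders, not from the original post:)

-- Hypothetical Parquet copy of the RCFile table, partitioned by day_key.
CREATE TABLE pq_network_fact (
  col1 INT,
  col2 STRING
)
PARTITIONED BY (day_key INT)
STORED AS PARQUETFILE;

-- Dynamic-partition insert: the partition column goes last in the SELECT list.
INSERT OVERWRITE TABLE pq_network_fact PARTITION (day_key)
SELECT col1, col2, day_key FROM rc_network_fact;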

When I try to insert data into the Parquet table, the query fails after a
while with 'Unknown Exception : [Errno 104] Connection reset by peer
Query failed'

Any guess as to why this could be happening?

Thanks,
Jaideep




  • Miklos Christine at Apr 24, 2013 at 7:33 pm
    Hello Jaideep,

    Which version of impala are you using?
    If possible, can you provide the impalad logs on the host you are
    connecting to?
    The log should be located at /var/log/impalad/impalad.INFO

    Thanks,
    Miklos

  • Jaideep Dhok at Apr 25, 2013 at 8:14 am
    Hi Miklos,
    I am using version 0.7.1

    Attaching the impalad logs. I am also seeing that after I get the error,
    impala-shell loses its connection, and a subsequent connect command fails
    with a Thrift error.

    Thanks,
    Jaideep


  • Miklos Christine at Apr 25, 2013 at 6:00 pm
    I see that you are running this on an AWS cluster.
    From the statestore logs, it looks like you are having connectivity issues
    to all the hosts that are running impalad instances.

    Please check your network settings and verify that the host running your
    statestore can reach all of the other nodes in the network.

    Are any impala queries working?
    Do queries on the 800M record RCFile work from impala?

    Thanks,
    Miklos


  • Mike Mansell at May 6, 2013 at 7:07 pm
    I'm encountering the same issue with the latest impala release:

    Shell version: Impala Shell v1.0 (d1bf0d1) built on Sun Apr 28 15:33:52 PDT
    2013
    Server version: impalad version 1.0 RELEASE (build
    d1bf0d1dac339af3692ffa17a5e3fdae0aed751f)


    I have a large amount of test data in gzip text files in S3 that I'm trying
    to get into Impala / Parquet for testing. I've created an external table
    in Hive and pulled it into a partitioned RCFile-format table that I can
    access and query fine from Impala. Attempting to INSERT OVERWRITE into an
    equivalent table in PARQUETFILE format results in the shell error 'Unknown
    Exception : [Errno 104] Connection reset by peer Query failed'. The same
    query (without the INSERT) works fine in impala-shell.

    The impalad.INFO log looks like it's making progress but just stops. The
    statestored.INFO log shows that it starts getting connection refused on
    port 23000. This is running on AWS.

    impalad.INFO log:

    INFO0505 19:22:45.016000 Thread-9 com.cloudera.impala.service.JniFrontend]
    PLAN FRAGMENT 0
       PARTITION: HASH_PARTITIONED: account_id

       WRITE TO HDFS table=default.datacube_oct
         overwrite=true
         partitions: account_id

       1:EXCHANGE
          tuple ids: 0

    PLAN FRAGMENT 1
       PARTITION: RANDOM

       STREAM DATA SINK
         EXCHANGE ID: 1
         HASH_PARTITIONED: account_id

       0:SCAN HDFS
          table=default.datacube_rc #partitions=253 size=25.74GB
          predicates: month(time_id) = 10
          tuple ids: 0
    ...
    I0505 19:22:45.173516 24631 plan-fragment-executor.cc:213] Open():
    instance_id=d710fdcc1afe4652:b09975f58897eb3e
    I0505 19:22:45.173894 24632 coordinator.cc:571] Coordinator waiting for
    backends to finish, 2 remaining
    I0505 19:23:02.175143 24426 progress-updater.cc:55] Query
    d710fdcc1afe4652:b09975f58897eb3c: 2% Complete (13 out of 649)
    I0505 19:23:27.176903 24426 progress-updater.cc:55] Query
    d710fdcc1afe4652:b09975f58897eb3c: 4% Complete (28 out of 649)
    I0505 19:24:07.180315 24426 progress-updater.cc:55] Query
    d710fdcc1afe4652:b09975f58897eb3c: 6% Complete (41 out of 649)
    I0505 19:25:12.185873 24544 progress-updater.cc:55] Query
    d710fdcc1afe4652:b09975f58897eb3c: 8% Complete (56 out of 649)
    I0505 19:25:52.188451 24544 progress-updater.cc:55] Query
    d710fdcc1afe4652:b09975f58897eb3c: 10% Complete (65 out of 649)
    I0505 19:27:07.194136 24426 progress-updater.cc:55] Query
    d710fdcc1afe4652:b09975f58897eb3c: 12% Complete (79 out of 649)
    I0505 19:27:42.196530 24426 progress-updater.cc:55] Query
    d710fdcc1afe4652:b09975f58897eb3c: 14% Complete (93 out of 649)
    <log just ends here>


    statestored.INFO log:

    I0506 18:22:53.527773 8763 client-cache.cc:98] CreateClient(): adding new
    client for ip-<ipaddress>.ec2.internal:23000
    I0506 18:22:53.528506 8763 thrift-util.cc:85] TSocket::open() connect()
    <Host: ip-<ipaddress>.ec2.internal Port: 23000>Connection refused
    I0506 18:22:53.573663 8763 status.cc:42] Couldn't open transport for ip-
    <ipaddress>.ec2.internal:23000(connect() failed: Connection refused)
         @ 0x545fce impala::Status::Status()
         @ 0x520256 impala::ThriftClientImpl::Open()
         @ 0x4f1ed0 impala::ClientCacheHelper::CreateClient()
         @ 0x4f21ac impala::ClientCacheHelper::ReopenClient()
         @ 0x533f3d impala::StateStore::ProcessOneSubscriber()
         @ 0x536dc4 impala::StateStore::SubscriberUpdateLoop()
         @ 0x5d8953 thread_proxy
         @ 0x7fd5857e5e9a start_thread
         @ 0x7fd5846eb4bd (unknown)


  • Ricky Saltzer at May 6, 2013 at 7:15 pm
    Hey Mike -

    How many nodes are in the cluster? Try connecting to the backend that is
    going offline via SSH and run the following command while you run the
    INSERT.

    $ watch -d -n1 "free -m"

    Monitor the free column in the -/+ buffers/cache row to see whether the
    node is running out of memory.
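
    (A rough logging variant, sketched here in case the watch output is lost
    when the node goes down; the log path is arbitrary:)

    $ while true; do date; free -m; sleep 1; done >> /tmp/free_during_insert.log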

    Ricky

    --
    Ricky Saltzer
    Customer Operations Engineer
    http://www.cloudera.com
  • Mike Mansell at May 6, 2013 at 7:32 pm
    I suspected memory may have been the issue, and this confirmed it. Memory
    runs low quickly, and then we get the shell error once it exhausts the
    buffers.


    It's running on an embarrassingly small cluster of one: a single AWS
    m2.2xlarge instance, only because we were evaluating some other
    technologies that had benchmarked results on a single node of that size.

    I'll try a real cluster with more memory. Are there any easier ways to load
    a large amount of data into a Parquet-backed table? It seems restrictive to
    have to populate the table with an INSERT ... SELECT that must fit in
    memory.

    Thanks for your speedy response; that was helpful!

      - Mike


    --

    Mike Mansell | Chief Technical Architect
    T: 510-653-8963 | M: 925-262-7830 | F: 510-653-0461
    mike.mansell@tubemogul.com | Twitter: @m_mansell <http://www.twitter.com/m_mansell>
  • Ricky Saltzer at May 6, 2013 at 8:05 pm
    Hey Mike -

    This may be due to the fact that it's partitioned; is your insert using
    dynamic partitions?

    Ricky

  • Mike Mansell at May 6, 2013 at 8:18 pm
    Yes it's using dynamic partitions.


  • Ricky Saltzer at May 6, 2013 at 8:39 pm
    You may be hitting this bug

    https://issues.cloudera.org/browse/IMPALA-257

    While not the best long-term solution, you might be able to split this up
    into a few inserts, using some sort of predicate to reduce the amount of
    data being materialized at once.

    For example, your key is accountId. Maybe you could split the insert into
    a few queries like:

    INSERT INTO <table> PARTITION (accountId) SELECT col1,col2,col3,accountId
    FROM <other_table> WHERE accountId BETWEEN 0 AND 50;

    INSERT INTO <table> PARTITION (accountId) SELECT col1,col2,col3,accountId
    FROM <other_table> WHERE accountId BETWEEN 51 AND 100;

    and so on..
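
    (Another option, sketched here using the day_key partition column from the
    original question with placeholder data columns: a static partition clause
    writes a single partition per statement, so each node keeps only one
    Parquet writer buffer open at a time.)

    INSERT OVERWRITE TABLE pq_network_fact PARTITION (day_key=20130401)
    SELECT col1, col2 FROM rc_network_fact WHERE day_key = 20130401;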






Discussion Overview
group: impala-user
category: hadoop
posted: Apr 24, 2013 at 10:06 AM
active: May 6, 2013 at 8:39 PM
posts: 10
users: 4
website: cloudera.com
irc: #hadoop
