Grokbase Groups Hive user April 2011
Hello,

We have a situation where the data coming from our source systems into Hive may
contain common characters and delimiters such as |, tabs, newline
characters, etc.

We may have to use multi-character delimiters such as "|#" for columns and "||#"
for rows.

How can we achieve this? In that case a single record may look like the example
below ("|#" is the column delimiter and "||#" is the row delimiter):

row 1 col1 |# row 1 col2 |# row 1 col 3 has
two
new line characters |# and this is
the last column of row 1 ||# row 2 col1 |# row 2 col2 |# row 2 col 3 has
one tab and one new line character |# and this is
the last column of row 2 ||#

Would custom SerDe help us handle this situation?

Thanks and Regards,
Shantian


  • Shantian Purkad at Apr 28, 2011 at 5:29 pm
    Any suggestions?



  • Richard Nadeau at Apr 28, 2011 at 6:03 pm
    A custom SerDe would be your best bet. We're using one to do exactly that.
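    A minimal sketch of the column-splitting half of this (table and column names
    are hypothetical): the RegexSerDe that ships in hive-contrib can split each
    record on the "|#" delimiter, but it operates on one line at a time, so
    handling the "||#" row delimiter with embedded newlines would still need a
    custom InputFormat or a fully custom SerDe.

    ```sql
    -- Hypothetical table using the hive-contrib RegexSerDe; all columns must be STRING.
    -- Requires the hive-contrib jar to be on the classpath (ADD JAR ...).
    CREATE TABLE raw_rows (c1 STRING, c2 STRING, c3 STRING, c4 STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      -- Non-greedy groups so an early column cannot swallow a later "|#" delimiter;
      -- the final (.*) captures the rest of the record.
      "input.regex" = "(.*?)\\|#(.*?)\\|#(.*?)\\|#(.*)"
    );
    ```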

    Regards,
    Rick
  • Rosanna Man at Apr 28, 2011 at 6:17 pm
    Hi all,

    We are using the capacity scheduler to schedule resources among different queues
    for a single user (hadoop). We have set the queues to have equal shares of the
    resources. However, when the 1st task starts in the first queue and consumes
    all the resources, the 2nd task, started in the 2nd queue, is starved of
    reducers until the first task finishes. A lot of processing gets stuck
    while a large query is executing.

    We are using Hive on Hadoop 0.20.2 in Amazon AWS. We tried the Fair Scheduler
    before, but it gives an error when a mapper produces no output (which is fine
    in our use cases).

    Anyone can give us some advice?

    Thanks,
    Rosanna
  • Sreekanth Ramakrishnan at Apr 29, 2011 at 3:10 am
    Hi

    Currently the CapacityScheduler does not have preemption, so Job2's tasks will only start getting scheduled as Job1 begins finishing and freeing up slots. Queue capacities are elastic in nature; one way to keep a job from expanding beyond its queue is to set max task limits on the queues. That way Job1 will never exceed the first queue's capacity.





    --
    Sreekanth Ramakrishnan
  • Rosanna Man at Apr 29, 2011 at 7:44 pm
    Hi Sreekanth,

    Thank you very much for the clarification. Setting max task limits on
    queues will work, but can we do something with the max user limit? Is it
    preemptible as well? We are exploring the possibility of running the
    queries as different users under the capacity scheduler to maximize use of
    the resources.

    Basically, our goal is to maximize resource usage (mappers and reducers)
    while giving the short tasks a fair share when a big task is running.
    How do you normally achieve that?

    Thanks,
    Rosanna
  • Sreekanth Ramakrishnan at May 2, 2011 at 3:43 am
    The design goal of the CapacityScheduler is to maximize the utilization of cluster resources; it does not fairly allocate shares among the total number of users present in the system.

    The user limit states the number of concurrent users who can use the slots in a queue. These limits are elastic in nature: since there is no preemption, new tasks are allotted slots to meet the user limit only as slots get freed up.

    For your requirement, you could submit the large jobs to a queue that has a max task limit set, so your long-running jobs don't take up the whole cluster capacity, and submit shorter, smaller jobs to a fast-moving queue with something like a 10% user limit, which allows 10 concurrent users per queue.

    The actual distribution of capacity across longer/shorter jobs depends on your workload.
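
    On the Hive side, jobs are routed per session via the standard MapReduce
    queue property (the queue names below are hypothetical), along the lines of:

    ```sql
    -- Send this session's MapReduce jobs to the capped queue for big work:
    SET mapred.job.queue.name=bigjobs;
    -- ...run the large query here...

    -- Switch the session to the fast-moving queue for short queries:
    SET mapred.job.queue.name=shortjobs;
    -- ...run the short queries here...
    ```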




    --
    Sreekanth Ramakrishnan
  • Rosanna Man at May 2, 2011 at 9:22 pm
    Hi Sreekanth,

    When you mention setting the max task limit, do you mean executing

    set mapred.capacity-scheduler.queue.<queue-name>.maximum-capacity = <a
    percentage> ?

    Is it only available on hadoop 0.21?

    Thanks,
    Rosanna
  • Sreekanth Ramakrishnan at May 3, 2011 at 4:10 am
    The queue-specific configurations are not Hive-client specific; they have to be configured on the JobTracker before the JT is started up. All that the Hive CLI should set is which queue the DAG from the Hive query will be submitted to.

    So your capacity-scheduler.xml in $HADOOP_CONF_DIR should have:
    <property>
      <name>mapred.capacity-scheduler.queue.myqueue.maximum-capacity</name>
      <value>50</value>
    </property>

    Also, sorry for confusing you: this feature is only available in 0.21 and the Yahoo! distribution of Hadoop. You could try taking the capacity scheduler jar from the Yahoo! Hadoop distribution, dropping it into the normal Hadoop distribution on your cluster, and restarting it. AFAIK the scheduler contract between the JT and the scheduler has not changed between Apache Hadoop 0.20 and Yahoo! Hadoop 0.20, but I would suggest you try it out at your own risk :-)





    --
    Sreekanth Ramakrishnan
  • Shantian Purkad at May 2, 2011 at 5:07 am
    Hi,

    I am getting the error below when I try to run Hive queries.

    Any idea what may be going wrong? The query was working fine until I restarted
    the HDFS and MapReduce services.

    hive> select geo_name from its_geo_code;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    java.io.IOException: Call to <hostname>/<ipaddress>:8020 failed on local exception: java.io.EOFException
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1139)
    at org.apache.hadoop.ipc.Client.call(Client.java:1107)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
    at $Proxy4.setPermission(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy4.setPermission(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:855)
    at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:560)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:123)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:839)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
    at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:657)
    at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:164)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
    Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:812)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:720)
    Job Submission failed with exception 'java.io.IOException(Call to
    <hostname>/<ipaddress>:8020 failed on local exception: java.io.EOFException)
  • Ankit bhatnagar at May 7, 2011 at 11:51 pm
    Hi,
    Please check your hive-site.xml.

    It looks like an issue with the cluster or its configuration.

    Ankit

Discussion Overview
group: user @ hive.apache.org
categories: hive, hadoop
posted: Apr 27, '11 at 6:06a
active: May 7, '11 at 11:51p
posts: 11
users: 5
website: hive.apache.org
