Not sure how much this has to do with Cascalog per se, but I have a
really confounding issue and maybe someone can help. I have a job
that is failing; the stack trace in the job logs looks like this:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: views.visit-facts, compiling:(views/visit_facts.clj:1)
at clojure.lang.Compiler.analyze(Compiler.java:6235)
at clojure.lang.Compiler.analyze(Compiler.java:6177)
...
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: views.visit-facts
at clojure.lang.Util.runtimeException(Util.java:165)
at clojure.lang.RT.classForName(RT.java:2017)
...
Caused by: java.lang.ClassNotFoundException: views.visit-facts
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
...
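
Independent of the replication question, one thing that can be checked locally is whether the jar really contains the namespace from the trace. Clojure maps hyphens in a namespace name to underscores when it looks up files and classes, so views.visit-facts should show up in the jar as views/visit_facts.clj, or as views/visit_facts__init.class if the jar was AOT-compiled. The following is only a rough Python sketch of that check; the function names are made up for illustration, and the jar path you pass in is whatever your build produces.

```python
import zipfile
from typing import List


def munge_namespace(ns: str) -> str:
    """Map a Clojure namespace name to its resource path stem.

    Clojure replaces '.' with '/' and '-' with '_' when locating
    source files and compiled classes.
    """
    return ns.replace(".", "/").replace("-", "_")


def expected_entries(ns: str) -> List[str]:
    """Jar entries that would let Clojure load the namespace."""
    stem = munge_namespace(ns)
    # Source file (non-AOT jar) and loader class (AOT-compiled jar).
    return [stem + ".clj", stem + "__init.class"]


def find_namespace_entries(jar_path: str, ns: str) -> List[str]:
    """Return the expected entries that are actually present in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        names = set(jar.namelist())
    return [entry for entry in expected_entries(ns) if entry in names]
```

An empty result from find_namespace_entries("job.jar", "views.visit-facts") would point at the packaging rather than at HDFS; a non-empty result means the namespace is in the jar and the problem lies elsewhere.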

I suspect that the job jar is not being replicated correctly.
Looking at the daemon logs, I see that the namenode has errors
replicating jobtracker.info:

INFO org.apache.hadoop.ipc.Server (IPC Server handler 6 on 9000): IPC
Server handler 6 on 9000, call
addBlock(/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_1731950709)
from 10.194.15.165:51308: error: java.io.IOException: File
/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info
could only be replicated to 0 nodes, instead of 1

On the datanode side I see errors with receiving the job jar.

The namenode log says:

2012-04-28 19:05:55,582 INFO org.apache.hadoop.hdfs.StateChange (IPC
Server handler 11 on 9000): DIR* NameSystem.completeFile: file
/mnt/var/lib/hadoop/tmp/mapred/system/job_201204281904_0001/job.jar
is closed by DFSClient_-387163361

The datanode log says:

INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace
(PacketResponder 0 for Block blk_832986479003239919_1004):
src: /10.195.89.124:53001, dest: /10.76.91.41:9200, bytes: 39825229,
op: HDFS_WRITE, cliID: DFSClient_-387163361,
srvID: DS-304531098-10.76.91.41-9200-1335639918679,
blockid: blk_832986479003239919_1004

WARN org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@1a5d08):
DatanodeRegistration(10.76.91.41:9200,
storageID=DS-304531098-10.76.91.41-9200-1335639918679, infoPort=9102,
ipcPort=9201):Failed to transfer blk_138586677137070325_1009 to
10.37.67.149:9200 got java.net.SocketException: Original Exception :
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
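
The clienttrace line above reports 39825229 bytes written by the same client ID (DFSClient_-387163361) that the namenode shows closing job.jar, so one sanity check is to compare that number with the size of the jar on the submitting machine. Below is a rough Python sketch of that comparison. It assumes the jar fits in a single HDFS block, so that the logged byte count covers the whole file, which I have not verified; the function names and the local jar path are hypothetical.

```python
import os
import re

# Matches the "key: value" pairs of interest in a clienttrace log line.
CLIENTTRACE_FIELD = re.compile(r"\b(bytes|op|cliID|blockid):\s*([^,\s]+)")


def parse_clienttrace(line: str) -> dict:
    """Extract bytes, op, cliID and blockid from a DataNode clienttrace line."""
    fields = dict(CLIENTTRACE_FIELD.findall(line))
    if "bytes" in fields:
        fields["bytes"] = int(fields["bytes"])
    return fields


def size_matches(local_jar: str, trace: dict) -> bool:
    """True if the local jar has exactly the byte count the datanode logged.

    Assumption: the jar was written as a single block, so the logged
    count is the size of the whole file.
    """
    return os.path.getsize(local_jar) == trace.get("bytes")
```

If the sizes agree, the initial write of the jar into HDFS probably went through intact, which would shift suspicion to the later datanode-to-datanode transfer or to something other than replication.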


My first gut reaction was that the jar might be too big to
replicate. However, the oddest part of all this is that:

(1) Other scripts in this jar work fine; I don't see the replication
problems with jobtracker.info or the job.jar.
(2) Portions of the visit-facts script work fine as well: the
subqueries it depends on run without the above issues.

So this seems to suggest that something specific to this script is
affecting how Hadoop replicates its jobtracker.info and job.jar,
which does not make a whole lot of sense to me.

I am running this on AWS EMR and get the same problem with Hadoop
versions 0.20 and 0.20.205.

Any insight or guesses on this issue are welcome.


  • Andrew Xue at Apr 28, 2012 at 11:08 pm
    Actually, I think I read the logs wrong: the failure seems to happen
    when one datanode is transmitting the job.jar to another
    datanode ...
  • Andrew Xue at Apr 28, 2012 at 11:15 pm
    ... and the receiving datanode is throwing a
    BlockAlreadyExistsException.

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Apr 28, '12 at 10:58 pm
active: Apr 28, '12 at 11:15 pm