Since I was counting terms instead of just words, it was slightly more
complicated than just calling sort.
A simple solution might have been to reverse the order in which the
Hadoop job outputs results, i.e.

value key instead of key value

But instead, I used the following set of commands to process the file
and sort it appropriately:

cat * | tr '\t' ' ' | sed 's/^\(.*\) \([0-9]*\)$/\2 \1/' | sort -n >

Which basically takes the number at the end of each line, and puts it
at the beginning, then calls sort.
Works like a charm. Thanks for the help.
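
For anyone who wants to try that pipeline without a cluster, here is a minimal sketch on made-up data. The function name sort_by_count and the sample terms are placeholders for illustration, not taken from the real job output. It assumes each input line is a term (possibly containing spaces), a tab, and a count.

```shell
# Hypothetical helper wrapping the pipeline above: turn tabs into spaces,
# move the trailing count to the front of each line, then sort numerically.
sort_by_count() {
  tr '\t' ' ' | sed 's/^\(.*\) \([0-9]*\)$/\2 \1/' | sort -n
}

# Made-up sample input in "term<TAB>count" form.
printf 'new york\t12\nhadoop\t3\nmap reduce job\t7\n' | sort_by_count
```

The last line should list the terms in ascending order of count. Swapping sort -n for sort -rn inside the function would put the most frequent terms first.
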
On Sep 19, 2007, at 4:03 PM, Ted Dunning wrote:

I use something like this:

bin/hadoop -getmerge <output-directory> | sort +1n

This works very well because the final counts are small relative
to the original input. There is nothing that says you can't mix MR
programming with conventional code.
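
A related sketch, in case it helps: the +1n form is the older zero-based key syntax, and current versions of sort spell the same key as -k2n. If keys can contain spaces, setting the field separator to a tab keeps the count in field 2, so the lines can be sorted without rearranging them first. The function name and the sample data below are made up for illustration.

```shell
# Hypothetical helper: sort "key<TAB>count" lines numerically on the count,
# using a tab as the field separator so keys may contain spaces.
sort_by_tab_count() {
  sort -t "$(printf '\t')" -k2,2n
}

# Made-up sample input in "key<TAB>count" form.
printf 'new york\t12\nhadoop\t3\nmap reduce job\t7\n' | sort_by_tab_count
```
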
On 9/19/07 3:42 PM, "Ross Boucher" wrote:

This problem seems to have gone away by itself.

Now I have my job running, but I'm not entirely sure how to get the
output into something useful to me.

I'm counting word frequencies, and I would like the output sorted by
frequency, rather than alphabetically. I would also like the final
output to be in one file, though I'm not sure if this is possible
given that it's computed separately. I suppose it wouldn't be too
difficult to post process the files to get them sorted the way I
would like and in one file, but if anyone has some tips on how to do
this in my job itself, that would be great.


Ross Boucher

On Sep 19, 2007, at 2:59 PM, Owen O'Malley wrote:

On Sep 19, 2007, at 2:30 PM, Ross Boucher wrote:

Specifically, the job starts, and then each task that is scheduled
fails, with the following error:

Error initializing task_0007_m_000063_0:
java.io.IOException: /DFS_ROOT/tmp/mapred/system/submit_i849v1/job.xml: No such file or directory

Look at the configuration of your mapred.system.dir. It MUST be the
same on both the cluster and the submitting node. Also,
mapred.system.dir must be in the default file system, which must
likewise be the same on the cluster and the submitting node. There
is a jira (HADOOP-1100) that would have the cluster pass the
system directory to the client, which would get rid of this issue.
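
One rough way to check for such a mismatch is to pull the property out of each node's configuration file and compare the values. The sketch below is only an illustration: get_prop and write_conf are made-up helper names, the file names and directory values are invented, and the parsing assumes each property's name and value appear in that order with at most one property per line.

```shell
# Hypothetical helper: print the <value> of the named property from a
# Hadoop-style XML config file ($1 = property name, $2 = file).
get_prop() {
  awk -v name="$1" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$2"
}

# Hypothetical helper: write a tiny sample config ($1 = file, $2 = value).
write_conf() {
  cat > "$1" <<EOF
<configuration>
  <property>
    <name>mapred.system.dir</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}

# Invented stand-ins for the cluster's config and the submitting node's config.
tmp=$(mktemp -d)
write_conf "$tmp/cluster.xml" /hadoop/mapred/system
write_conf "$tmp/client.xml" /tmp/hadoop/mapred/system

cluster_dir=$(get_prop mapred.system.dir "$tmp/cluster.xml")
client_dir=$(get_prop mapred.system.dir "$tmp/client.xml")

if [ "$cluster_dir" = "$client_dir" ]; then
  echo "mapred.system.dir matches: $cluster_dir"
else
  echo "mismatch: cluster=$cluster_dir client=$client_dir"
fi
rm -rf "$tmp"
```

With the two invented values above, the script reports a mismatch. On a real setup the two files would be copies of the configuration from the cluster and from the submitting node.
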

-- Owen

Group: common-user @
Posted: Sep 19, '07 at 9:31p
Active: Sep 21, '07 at 6:01p


