I am working with a large dataset of logs (roughly 1.5 TB per month).
Each log record contains a list of fields, and a common daily query is
to filter records whose particular field matches a given value. Right
now the log data is not organized in any particular way, which makes
these queries very slow. I am re-structuring the log data by splitting
it into multiple smaller buckets so that this common field-based query
is sped up.
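To make the bucketing idea concrete, here is a minimal sketch of hash-partitioning records on the queried field, so a filter on that field only has to scan one bucket instead of the whole dataset. The field name `host`, the records, and `NUM_BUCKETS` are all hypothetical; this is an illustration of the scheme, not the poster's actual layout.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical number of buckets

def bucket_for(value):
    # Hash the field value so all records sharing a value land in one bucket.
    # md5 is used only for a stable, platform-independent hash.
    h = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(h, 16) % NUM_BUCKETS

# Toy records; in practice each bucket would be a separate file/directory.
records = [{"host": "web01", "msg": "ok"}, {"host": "db01", "msg": "slow"}]
buckets = {}
for rec in records:
    buckets.setdefault(bucket_for(rec["host"]), []).append(rec)

# A query for host == "web01" now reads only bucket_for("web01"),
# not every record.
```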

I have completed the re-structuring and now want to measure the
performance improvement from this effort. Can someone suggest a way to
compare the performance of the map-reduce jobs? I can think of the
following metrics:

1. CPU time spent
2. Wall clock time taken
3. #bytes shuffled

Also, is there a way to get these metric values from the command line
rather than the web interface? I appreciate any help. Thanks!
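For the command-line part: Hadoop exposes per-job counters via `mapred job -status <job-id>` (or `hadoop job -status <job-id>` on 1.x), which prints counters including CPU time and shuffle bytes, and `mapred job -counter <job-id> <group> <counter>` for a single value. As a sketch of pulling the numbers out of that text for comparison, here is a small parser; the sample output below is hypothetical (exact counter names and formatting vary by Hadoop version), but the `Name=value` counter lines follow the general shape of the CLI output.

```python
import re

# Hypothetical excerpt of `mapred job -status <job-id>` counter output.
sample = """\
        CPU time spent (ms)=184200
        Reduce shuffle bytes=52428800
"""

def parse_counters(text):
    # Collect "Counter name=integer" lines into a dict for easy diffing
    # between the old and new job runs.
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)=(\d+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

before = parse_counters(sample)
# Run the same parse on the restructured job's output, then compare
# e.g. before["CPU time spent (ms)"] against the new value.
```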

Prabu D

Discussion Overview
group: user @
categories: pig, hadoop
posted: Feb 17, '13 at 6:45p
active: Feb 17, '13 at 6:45p
1 user in discussion — Prabu Dhakshinamurthy: 1 post
