DistributedFileSystem#listStatus is very slow when listing a directory with a size of 1300

Key: HADOOP-6502
URL: https://issues.apache.org/jira/browse/HADOOP-6502
Project: Hadoop Common
Issue Type: Bug
Components: util
Affects Versions: 0.20.0
Reporter: Hairong Kuang
Priority: Critical
Fix For: 0.20.2, 0.21.0, 0.22.0

Listing a directory with around 1300 children takes hundreds of milliseconds. The slowdown is caused by the change made in HADOOP-4187. The return value of listStatus is an array of FileStatus. When deserializing each element of the array, ReflectionUtils#newInstance(Class<T>, Configuration) is called, which calls setConf, which in turn calls setJobConf. setJobConf checks whether JobConf is on the class path by calling Configuration#getClassByName. Although Configuration#getClassByName tries to optimize the lookup with a cached map, JobConf is not on the class path, so it never enters the cache. Every check therefore ends up calling Class.forName, which is very expensive. Deserializing an array of 1300 entries requires 1300 calls to Class#forName!
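
The cost described above comes from a cache that remembers only successful lookups, so a class that is absent is searched for again on every call. A minimal sketch of one way to avoid that is to cache misses as well. This is an illustration only, not necessarily the fix applied for this issue; NegativeClassCache, lookup, and forNameCalls are hypothetical names and are not part of Hadoop.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not Hadoop code): caches both hits and misses of class lookups. */
public class NegativeClassCache {
    /** Private marker type stored for names that failed to load. */
    private static final class Missing {}

    // ConcurrentHashMap rejects null values, so a sentinel class records a miss.
    private static final Class<?> MISSING = Missing.class;

    private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();
    private final AtomicInteger forNameCalls = new AtomicInteger();

    /** Returns the class for the given name, or null if it is not on the class path. */
    public Class<?> lookup(String name) {
        Class<?> cached = cache.get(name);
        if (cached == null) {
            // First request for this name: pay for the class-path search once.
            // Two threads racing here may both search; the result is the same either way.
            forNameCalls.incrementAndGet();
            try {
                cached = Class.forName(name);
            } catch (ClassNotFoundException e) {
                cached = MISSING; // remember the miss so it is not searched for again
            }
            cache.put(name, cached);
        }
        return cached == MISSING ? null : cached;
    }

    /** Number of times Class.forName was actually invoked. */
    public int forNameCalls() {
        return forNameCalls.get();
    }
}
```

With this sketch, 1300 lookups of the same absent class name trigger a single Class.forName call, and the remaining lookups are answered from the map.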

Group: common-dev
Posted: Jan 22, '10 at 1:25a