Confused. What do you mean by "will the query be distributed over all datanodes or just one node"? If your data is small enough to fit in a single block (and is replicated by Hadoop), then just one map task will run (assuming the default input split size).
If the data is spread across multiple blocks, you can still make it run on just one compute node by setting the input split size large enough (yes, there are use cases for this, e.g. when the whole dataset must be fed to a single mapper). Otherwise, the job will be scheduled across numerous nodes, with each task getting one block / chunk (the configured input split size) of your data. The nodes picked to run your tasks depend on data locality, which reduces network latency.
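For reference, a minimal job-configuration sketch of the "single mapper" case: raising the minimum split size above the total input size so FileInputFormat produces one split. The property name `mapred.min.split.size` is the 0.20-era name, and the value here is just an illustrative assumption; pick something larger than your actual dataset.

```xml
<!-- Job configuration fragment (sketch): force one big input split.
     Setting the minimum split size larger than the total input makes
     FileInputFormat emit a single split, hence a single mapper. -->
<property>
  <name>mapred.min.split.size</name>
  <!-- 10 GB, assumed to exceed the whole dataset (illustrative value) -->
  <value>10737418240</value>
</property>
```

The same property can also be passed on the command line with `-D mapred.min.split.size=...` if your job uses the Tool/GenericOptionsParser conventions.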
From: Divij Durve
Sent: Wednesday, July 15, 2009 2:32 AM
To: email@example.com; firstname.lastname@example.org
Subject: Question about job distribution
If I have a query that I would normally fire on a database, and I want to fire
it against data loaded onto multiple nodes in Hadoop, will the query be
distributed over all the datanodes so it returns results faster, or will it
just be sent to one node? If the latter, is there a way to get it to
distribute the query instead of sending it to one node?