FAQ
Hello Sir!

I am new to Hadoop. I have a project based on web services. My information is
spread across 4 databases, each holding a different kind of file: images in one,
videos, documents, etc. My task is to develop a web service that accepts a
keyword from the client, processes the request, and sends the requested file
back to the user. I have to use the Hadoop Distributed File System (HDFS) in
this project.

I have the following questions:

1) How should I start with the design?
2) Should I upload all the files and create the Map, Reduce and Driver code, and
once I run my application, will it automatically go to the file system and get
the results back to me?
3) How do I handle the binary data? I want to store binary-format data using
MTOM in my database.

Please let me know how I should proceed. I don't know much about Hadoop
and am searching for some help. It would be great if you could assist me.
Thanks again


  • Sudha sadhasivam at Oct 15, 2009 at 6:40 am
    Dear Shwitzu

    The steps are listed below:

    Kindly go through the WordCount and multi-file word count examples for your project.

    Modify the program to list the files containing the keywords along with the file names. Use the file names as keys.

    Store the files in 4 different input directories, one for each file type, if needed. Alternatively, you can keep them all in a single input directory.

    Use the WordCount example, with the extensions suggested, to retrieve the names of the files containing the keywords, and store the result in an output directory or display the links.

    Map: parallelized reading of multiple files.
    Input key-value pair: (filename, file contents).
    Output key-value pair: (keyword, filename and count).

    Reduce: combines the key-value pairs produced by the map function.

    Input key-value pair: (keyword, list of filename-and-count values).
    Output key-value pair: (keyword, names of the files containing that keyword).
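
    A rough sketch of such a job in Java (only an illustration, assuming the Hadoop 0.20 org.apache.hadoop.mapreduce API and plain-text inputs; class names and paths are made up) could look like the following. Here the file name is taken from the input split instead of being passed in as the map key:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KeywordIndex {

        // Mapper: for every word in a line, emit (word, name of the file the line came from).
        public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
            private final Text word = new Text();
            private final Text fileName = new Text();

            @Override
            protected void setup(Context context) {
                // The input split tells us which file this map task is reading.
                fileName.set(((FileSplit) context.getInputSplit()).getPath().getName());
            }

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken().toLowerCase());
                    context.write(word, fileName);
                }
            }
        }

        // Reducer: collapse the file names seen for each keyword into one de-duplicated list.
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text keyword, Iterable<Text> fileNames, Context context)
                    throws IOException, InterruptedException {
                Set<String> unique = new HashSet<String>();
                for (Text name : fileNames) {
                    unique.add(name.toString());
                }
                context.write(keyword, new Text(unique.toString()));
            }
        }

        // Driver: wires the mapper and reducer together and points the job at HDFS paths.
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "keyword index");
            job.setJarByClass(KeywordIndex.class);
            job.setMapperClass(IndexMapper.class);
            job.setReducerClass(IndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    The web service could then look a keyword up in the job's output and stream the matching file back to the client from HDFS.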

    The answers to your questions are:
    1) How should I start with the design?
    Identify the files to be saved in the HDFS input directory.
    Go through the WordCount example.
    2) Should I upload all the files and create Map, Reduce and Driver code, and
    once I run my application will it automatically go to the file system and get
    the results back to me?
    Move the files from the local file system to HDFS, or save them directly to HDFS, using a suitable DFS command such as -copyFromLocal. Go through the DFS commands.
    3) How do I handle the binary data? I want to store binary-format data using
    MTOM in my database.
    It can be handled in the same way as a conventional file.
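
    For example (only a sketch; the class name, method name and target path below are made up), the raw bytes decoded from an MTOM attachment could be written into HDFS with the ordinary FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BinaryStore {
        // Writes a binary payload (e.g. an image or video received via MTOM) to a file in HDFS.
        public static void store(Configuration conf, byte[] payload, String name) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/data/bin/" + name)); // illustrative target path
            try {
                out.write(payload);
            } finally {
                out.close();
            }
        }
    }

    Reading the file back for the client works the same way through fs.open(path).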

    G Sudha Sadasivam


  • Shwitzu at Oct 21, 2009 at 6:27 pm
    Thanks for responding.

    I read about HDFS and understood how it works. I also installed Hadoop on
    Windows using Cygwin, tried a sample driver code, and made sure it works.

    But my concern is: given the problem statement, how should I proceed?

    Could you please give me some clue, pseudo code, or a design?

    Thanks in anticipation.



    Doss_IPH wrote:
    First and foremost, you need to understand the Hadoop platform
    infrastructure.
    Currently, I am working on a real-time application using Hadoop. I think
    that Hadoop will fit your requirements.
    Hadoop is mainly about three things:
    1. Scalability: no limit on storage.
    2. Processing petabytes of data in distributed, parallel mode.
    3. Fault tolerance (automatic block replication): recovering data from
    failures.


  • Sudha sadhasivam at Oct 22, 2009 at 3:43 am
    Is it a modification to Hadoop or an application?

    If it is an application, write the basic algorithm and see which portions can be parallelised.
    Then put the parallelisable code into the map portion.
    For this, try out the sample examples in Hadoop: wordcount, grep, sort, etc. Then you will understand the input (K,V) pairs and the output collector format (see the sketch below).

    The rest of the code is plain Java.
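
    For reference, the map and reduce signatures those examples use (the older org.apache.hadoop.mapred API; the grep-like logic below is purely illustrative) look roughly like this:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class GrepLikeExample {

        // map() gets one (key, value) record and emits intermediate pairs through the OutputCollector.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                if (line.toString().contains("keyword")) { // "keyword" stands in for the search term
                    output.collect(new Text("keyword"), ONE);
                }
            }
        }

        // reduce() gets one key plus an iterator over all the values emitted for it.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }
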
    G Sudha Sadasivam

  • Doss_IPH at Oct 22, 2009 at 3:49 am
    Hi,
    you can use the following sample code for loading data into HDFS.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    /**
     * @author Arockia Doss S
     * @email doss@intellipowerhive.com
     * @url http://www.intellipowerhive.com, http://www.dossinfotech.com
     * @comments You can use and modify this code for your own use.
     * This code works on the hadoop-0.19.0 platform. If you want to test it,
     * you have to put the Hadoop libraries on your classpath and set a few
     * parameters before running it (the Hadoop path, host and users below).
     */
    public class HadoopConfiguration {

        // Hadoop absolute path and configuration files
        private static final String CLUSTERPATH = "/home/hadoop-0.19.0/";
        private static final String SITEFILE = "conf/hadoop-site.xml";
        private static final String DEFAULTFILE = "conf/hadoop-default.xml";

        // Hadoop NameNode host
        private static final String HADOOPHOST = "192.168.1.11";

        // Hadoop root and its list of users
        private static final String HOSTUSERS = "root,doss";

        private static Configuration conf = new Configuration();
        private static DistributedFileSystem dfs = new DistributedFileSystem();

        public HadoopConfiguration() throws java.lang.Exception {
            Path sitepath = new Path(CLUSTERPATH + SITEFILE);
            Path defaultpath = new Path(CLUSTERPATH + DEFAULTFILE);
            getConf().set("fs.default.name", "hdfs://" + HADOOPHOST + ":9000/");
            getConf().addResource(sitepath);
            getConf().addResource(defaultpath);
            getConf().set("hadoop.job.ugi", HOSTUSERS);
            dfs.initialize(new URI("hdfs://" + HADOOPHOST + ":9000/"), conf);
        }

        public static Configuration getConf() {
            return conf;
        }

        public static void main(String[] args) {
            try {
                HadoopConfiguration h = new HadoopConfiguration();
                FileSystem fs = FileSystem.get(h.getConf());

                // Copy sample.xls to HDFS; the local file is still there after copying.
                fs.copyFromLocalFile(new Path("/home/sample.xls"), new Path("/home/xls/"));

                // Move sample.doc to HDFS; the local file is gone after moving.
                fs.moveFromLocalFile(new Path("/home/sample.doc"), new Path("/home/doc/"));

                // List the files in an HDFS directory.
                FileStatus[] fileStatus = fs.listStatus(new Path("/home/xls"));
                for (int i = 0; i < fileStatus.length; i++) {
                    Path path = fileStatus[i].getPath();
                    System.out.println(path);
                }
            } catch (java.lang.Exception e) {
                System.out.println(e);
            }
        }
    }




  • Steve Loughran at Oct 29, 2009 at 11:42 am

    shwitzu wrote:
    Could you please give me some clue, pseudo code, or a design?
    I would start the design process the way you would with any other large
    project:

    - understand the problem
    - get an understanding of the available solution space and their limitations
    - come up with one or more possible solutions
    - identify the risky bits of the system, the assumptions you may have,
    the requirements you have of other things, and the bits you don't really
    understand
    - prototype something that tests those assumptions and acts as a first demo
    of what is possible
    - start with the basic tests and automated deployment
    - evolve

    Plus all the scheduling stuff that goes with it.

    Asking a mailing list for pseudo code or a design is doomed. Really.
    This is a major distributed application and you need to be thinking
    about it at scale, and you need to understand both the user needs and
    the capabilities of the underlying technologies. Nobody else on this
    list understands the needs, and there is no way for you to be sure that
    any of us understand the technologies. Which means there is no way you
    can trust any of us to produce a design that works, even if anyone were
    prepared to sit down and do your work for you.

    Sorry, but that's not how open source communities tend to work. Help
    with problems and bugs, yes. Design your application, no. You are on your
    own there.

    -steve
