design question - multiple artifact sets and privacy
Hi,
I am currently working on a system design and I am interested in hearing
some ideas on how Hadoop/HBase can be used to solve a couple of tricky issues.

Basically, I have a data set of roughly 1-5 million documents, growing at a
rate of about 100-500 thousand a year. Each document is fairly small (say
100 KB). Documents describe dependencies that are analyzed; intermediate
results are cached as new documents, and both documents and intermediate
results are indexed. Intermediate results outnumber the documents by about a
factor of three. This seems like a perfect fit for map-reduce to handle
analytics and indexing, both on an initial set and when changes occur.

My question concerns handling multiple sets of artifacts, and privacy.

I would like to be able to handle multiple sets of private documents and
private intermediate results. When these are analyzed and indexed, the work
must run on servers that are private to the client, and the documents and
partial results must naturally also be stored in a fashion that is private
and secure from the client's point of view.
A client would have far less private data than what is stored in our global,
publicly available data set.

I'd be interested in hearing ideas on how this can be done.

Regards
- henrik


  • Ted Dunning at Nov 19, 2007 at 5:54 pm
    Hadoop currently has no sense of user-private data. I believe that this
    capability is under development, but I don't know what the timeline for
    completion is (if any). You should not expect the capabilities to be fully
    usable when they are first released.

    In spite of this lack, I think you could meet your requirements in a few
    different ways. For instance, you could emulate user isolation by running
    multiple Hadoop clusters on the same machines under different uids.
    Virtualization would be another option, but I would guess that you would
    rather share disk space, so that large clusters like your public data set
    could expand easily to nearly all of the available disk, and so that
    clusters creating large numbers of temporary files could use more disk
    space. These approaches only address user isolation, not the general
    problem of user permissions. In particular, the idea that you might have
    some public files and some private files accessed by the same program is
    not addressed.

    Running multiple clusters sounds much more difficult than it actually is.
    There is a single command that can be used to launch a cluster, and all of
    the instances could share a configuration.

    Another approach would be to use document encryption and build a custom
    input format. This is very easy to do. You would leave public files in
    plain text and encrypt private files with customer-specific keys. That way,
    programs accessing private files could only access the files that you want
    them to. Standard encryption utilities available in Java are fast enough
    that you shouldn't pay a major speed penalty. We use AES for a lot of our
    input files, and while I would prefer a plain-text format for speed, our
    systems go at a respectable pace.
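
    [A minimal sketch of what such an input format could look like, written
    against the classic org.apache.hadoop.mapred API. The class name and the
    "customer.aes.key" property are illustrative assumptions, not part of
    Hadoop; a real deployment would also want proper key distribution and a
    cipher mode with an IV rather than ECB. Because the file must be decrypted
    as one stream, each file is read as a single split.]

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import javax.crypto.Cipher;
        import javax.crypto.CipherInputStream;
        import javax.crypto.spec.SecretKeySpec;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.*;

        public class EncryptedTextInputFormat
                extends FileInputFormat<LongWritable, Text> {

            // The whole file is decrypted as one stream, so never split it.
            protected boolean isSplitable(FileSystem fs, Path file) {
                return false;
            }

            public RecordReader<LongWritable, Text> getRecordReader(
                    InputSplit split, JobConf job, Reporter reporter)
                    throws IOException {
                return new EncryptedLineReader((FileSplit) split, job);
            }

            static class EncryptedLineReader
                    implements RecordReader<LongWritable, Text> {
                private final BufferedReader in;
                private long lineNo = 0;

                EncryptedLineReader(FileSplit split, JobConf job)
                        throws IOException {
                    Path path = split.getPath();
                    FileSystem fs = path.getFileSystem(job);
                    try {
                        // Per-customer key handed to the job via its config;
                        // real key management is a deployment concern.
                        byte[] key = job.get("customer.aes.key")
                                .getBytes("ISO-8859-1");
                        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
                        cipher.init(Cipher.DECRYPT_MODE,
                                new SecretKeySpec(key, "AES"));
                        in = new BufferedReader(new InputStreamReader(
                                new CipherInputStream(fs.open(path), cipher),
                                "UTF-8"));
                    } catch (java.security.GeneralSecurityException e) {
                        throw new IOException("cannot init AES cipher: " + e);
                    }
                }

                public boolean next(LongWritable key, Text value)
                        throws IOException {
                    String line = in.readLine();
                    if (line == null) return false;
                    key.set(lineNo++);
                    value.set(line);
                    return true;
                }

                public LongWritable createKey() { return new LongWritable(); }
                public Text createValue() { return new Text(); }
                public long getPos() { return lineNo; }
                public float getProgress() { return 0.0f; }
                public void close() throws IOException { in.close(); }
            }
        }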

    You are right, btw, that your problem sounds ideally suited for map-reduce.
    I would recommend that you batch your documents, many to a single file.
    That will massively improve throughput, since you avoid many seeks.
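
    [A minimal sketch of such batching: a small driver that packs many small
    documents into one Hadoop SequenceFile, one record per document. The class
    name and argument handling are made up for illustration.]

        import java.io.DataInputStream;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class DocumentPacker {
            // Usage: DocumentPacker <output.seq> <doc1,doc2,...>
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                SequenceFile.Writer writer = SequenceFile.createWriter(
                        fs, conf, new Path(args[0]),
                        Text.class, BytesWritable.class);
                try {
                    for (String doc : args[1].split(",")) {
                        Path p = new Path(doc);
                        byte[] bytes =
                                new byte[(int) fs.getFileStatus(p).getLen()];
                        DataInputStream in = fs.open(p);
                        try {
                            in.readFully(bytes);
                        } finally {
                            in.close();
                        }
                        // One record per document:
                        // key = document path, value = raw bytes.
                        writer.append(new Text(doc), new BytesWritable(bytes));
                    }
                } finally {
                    writer.close();
                }
            }
        }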

  • Glashammer at Nov 19, 2007 at 10:56 pm
    Thanks Ted,
    I was thinking of something like what you suggest with "running multiple
    clusters", possibly making all the "global public data" read-only. All
    writes would go into disk space owned/controlled by the client, i.e. each
    client would run their own map/reduce cluster but use the global data as
    "precalculated input".

    Regards
    - henrik

