Hi Erlang/CouchDB,

Recently I have been reading the source code of CouchDB, and I have
learned a bit about how CouchDB boots up.

Now I want to learn what happens when I send a request, for example a
POST of a document: which parts of the CouchDB code are involved, from
handling the request to saving the data to disk / the filesystem?

In other words, POST doc ===> save data to disk / filesystem: which parts
of the code do the work for the whole procedure?

Regards & Thanks!
David

  • Jan Lehnardt at Nov 1, 2012 at 11:23 am
    Heya David,
    On Nov 1, 2012, at 08:39 , 高大为 wrote:

    Hi Erlang/CouchDB,

    Recently I have been reading the source code of CouchDB, and I have
    learned a bit about how CouchDB boots up.

    Now I want to learn what happens when I send a request, for example a
    POST of a document: which parts of the CouchDB code are involved, from
    handling the request to saving the data to disk / the filesystem?

    In other words, POST doc ===> save data to disk / filesystem: which parts
    of the code do the work for the whole procedure?

    Regards & Thanks!
    David

    I’ve been waiting for an excuse to do a deep-dive like this, thanks! :)

    Given that you already dug around some yourself, I omit the “how to get
    to the code” section. I am on current master 1a9143e.

    Let’s start at the HTTP API: src/couchdb/couch_httpd.erl

    `couch_httpd` is the main entry point for all request handling in
    CouchDB. Its responsibilities are:

    - Read the CouchDB configuration to configure itself with all
    settings a user wishes to have for handling requests.
    - Set up a socket to listen on for incoming requests.
    - Set up a list of request handlers that map API actions to
    internal module calls that do actual work.
    - Start Mochiweb to handle everything related to HTTP.
    - Export a number of functions that the request handler sub
    modules can use to handle requests.

    The sub-modules are all in src/couchdb/:

    - couch_httpd.erl
    - couch_httpd_auth.erl
    - couch_httpd_db.erl
    - couch_httpd_external.erl
    - couch_httpd_misc_handlers.erl
    - couch_httpd_oauth.erl
    - couch_httpd_proxy.erl
    - couch_httpd_rewrite.erl
    - couch_httpd_stats_handlers.erl
    - couch_httpd_vhost.erl

    The mapping of request handlers to URLs happens in the CouchDB
    configuration. The defaults are set in etc/couchdb/default.ini,
    which in source form is called etc/couchdb/default.ini.tpl.in,
    meaning that there are two layers of replacing variables going
    on until we get a final default.ini. For the request handlers,
    we can look at default.ini.tpl.in.

    The mapping of URLs to request handlers happens on three layers:

    - Global handlers for things like `/`, `/_utils`, `/_config` etc.
    - Database handlers like `/db/_all_docs` or `/db/_compact`.
    - Design document handlers like `/db/_design/docid/_view`


    With this knowledge, let’s trace this HTTP request:

    PUT /db/docid
    ...
    {"a":1}

    Or in `curl`:

    $ curl -X PUT http://127.0.0.1:5984/db/docid -d '{"a":1}'


    The request goes to a `/db` URL, so we’ll have a look at the
    `[httpd_db_handlers]` section of default.ini.tpl.in:


    [httpd_db_handlers]
    _all_docs = {couch_mrview_http, handle_all_docs_req}
    _changes = {couch_httpd_db, handle_changes_req}
    _compact = {couch_httpd_db, handle_compact_req}
    _design = {couch_httpd_db, handle_design_req}
    _temp_view = {couch_mrview_http, handle_temp_view_req}
    _view_cleanup = {couch_mrview_http, handle_cleanup_req}

    Hm, nothing that looks like a handler for creating documents.

    Let’s go back to couch_httpd.erl. In line 138 we see that we
    start Mochiweb with a list of handlers, the first of which is
    `DefaultFun`. Maybe we need to look at that. Tracking it back
    to line 102, there’s a bit of gibberish about “arity” that
    we’ll ignore for now. Then we see that we *do* rely on
    the config system:

    couch_config:get("httpd", "default_handler"…).

    So let’s look at the `[httpd]` section of default.ini.tpl.in:

    default_handler = {couch_httpd_db, handle_request}

    That looks promising, let’s find that in code, at
    src/couchdb/couch_httpd_db.erl, line 36.
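
    As an aside, the “arity” business is about turning a
    `{Module, Function}` string from the config into a fun that can be
    called with a request. Here is a minimal sketch of that idea using
    only the standard `erl_scan` and `erl_parse` modules. The module name
    `handler_spec` and the function `to_fun/1` are made up for this
    illustration and are not CouchDB’s own helpers.

```erlang
-module(handler_spec).
-export([to_fun/1]).

%% Turn a config string such as "{couch_httpd_db, handle_request}"
%% into a fun of arity 1 that calls Module:Function(Req).
%% (Hypothetical helper, not part of CouchDB.)
to_fun(SpecStr) ->
    %% erl_scan needs a terminating dot to tokenize a complete term.
    {ok, Tokens, _End} = erl_scan:string(SpecStr ++ "."),
    %% parse_term/1 turns the tokens into a plain Erlang term.
    {ok, {Mod, Fun}} = erl_parse:parse_term(Tokens),
    fun(Req) -> Mod:Fun(Req) end.
```

    For example, `(handler_spec:to_fun("{lists, reverse}"))([1,2,3])`
    ends up calling `lists:reverse([1,2,3])`.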

    `handle_request()` first checks whether we want to create or
    delete a database. When it sees we don’t, it passes our
    request along to `do_db_req()` (line 230), which turns out
    to be just a wrapper that opens a database and calls a callback.
    Back where `do_db_req()` is called, we see that `db_req/2` is
    passed as the callback.
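
    The “open something, call a callback, clean up” shape of such a
    wrapper can be sketched like this. It is only an illustration: the
    module `db_wrapper`, the function `with_db/3` and the use of a plain
    file handle in place of a database are all assumptions, not what
    `do_db_req()` literally does.

```erlang
-module(db_wrapper).
-export([with_db/3]).

%% Open a "database", run Fun(Req, Db), and always close it again.
%% In this sketch the database is just a file handle; the real wrapper
%% opens a CouchDB database instead.
with_db(Req, Path, Fun) ->
    case file:open(Path, [read, write, binary]) of
        {ok, Db} ->
            try
                Fun(Req, Db)
            after
                %% Runs whether Fun returns normally or throws.
                file:close(Db)
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```

    The callback never has to care about opening or closing, which is
    why the interesting code lives in the function that is passed in.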

    Now `db_req()` has various clauses to differentiate the different
    HTTP request methods it is called with and to allow for all sorts
    of special URLs to be called. We are interested in PUT, but we
    don’t find that PUT is handled anywhere in particular. We do see,
    however, that all the clauses before the last-but-one handle
    something that is *not* PUT, so we know that our clause is on
    line 464:

    db_req(#httpd{path_parts=[_, DocId]}=Req, Db) ->
        db_doc_req(Req, Db, DocId);

    This turns out to be yet another indirection, so let’s go with it.
    `db_doc_req` again has a number of clauses to deal with various
    request types. Lucky for us, there is a PUT clause on line 563:

    db_doc_req(#httpd{method='PUT'}=Req, Db, DocId) ->

    First, the function checks whether we have a valid `DocId`.
    Assuming we do, it checks whether the request is an HTTP multipart
    request or a regular one. We have a regular one and are lucky
    again: our part of the code here is rather small:

    Body = couch_httpd:json_body(Req),
    Doc = couch_doc_from_req(Req, DocId, Body),
    update_doc(Req, Db, DocId, Doc)

    The first line fetches the JSON document body from the `Req`
    variable and decodes it. At this point, `Body` should equal
    `{[{<<"a">>,1}]}`, the Erlang term that represents the JSON body
    we passed in with the request.

    The second line turns the JSON, together with the `DocId`, into
    a CouchDB document.

    Finally, we pass all we have now to the `update_doc` function, which
    we will check out later.

    `couch_doc_from_req()` figures out whether we are trying to update
    an existing doc with our PUT request, or whether we want to create
    a new one. In our case, not much is done. In the update case, we
    need to pass in a `rev=` query parameter, and that is checked here.

    In either case though, this function returns a value of the type
    `#doc{}`, which is a record that is defined in src/couchdb/couch_db.hrl,
    line 99, if you are curious.

    With all that in place, we can finally visit `update_doc()`. It again
    has a few clauses starting in line 716 (we are still in couch_httpd_db.erl).

    `update_doc` deals with a number of query parameters again until it finally
    calls `couch_db:update_doc()`.

    This is our entry into the innards of CouchDB.

    Enter `couch_db` in src/couchdb/couch_db.erl. Our function
    `update_doc()` is defined in line 422, and it ultimately seems to
    be a wrapper around `update_docs()` (plural) in the lines starting
    at 688. `update_docs()` has two independent clauses:

    update_docs(Db, Docs, Options, replicated_changes) ->

    and

    update_docs(Db, Docs, Options, interactive_edit) ->

    The first one handles replications that can create conflicts in
    document revision lists. The second one deals with regular
    database operations. So that one is for us.

    Our `update_docs()` does a number of things:

    - prepare for yet more request parameters.
    - separate our `_local` docs and regular docs (ours is a regular one).
    - validate our document against `validate_doc_update` functions, if they exist.
    - check whether we provided the correct `rev` in case of updates.

    And finally:

    {ok, CommitResults} = write_and_commit(Db, DocBuckets4, NonRepDocs, Options2),

    Let’s jump there, line 831:

    After doing some more preparations that I will gloss over, we see
    that CouchDB keeps around an `UpdatePid` in the `#db{}` record that
    has been passed down with us so far. This `UpdatePid` is the process
    ID of a process that deals with database updates.

    In CouchDB, each database has a single process handling writes to the
    database, to ensure a consistent database file.
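
    If the single-writer idea is new to you, here is a stripped-down
    sketch of it in plain Erlang. One process owns all writes and takes
    one message at a time, so two writes can never interleave. The
    module name `single_writer`, the reply format and the in-memory list
    standing in for the database file are assumptions made up for this
    sketch; the real updater does a lot more.

```erlang
-module(single_writer).
-export([start/0, update/2]).

%% Spawn the one process that owns all writes. Its state is the list
%% of documents "written" so far (a stand-in for the database file).
start() ->
    spawn(fun() -> loop([]) end).

%% Client side: send the update and block until the writer answers.
update(WriterPid, Doc) ->
    Ref = make_ref(),
    WriterPid ! {update_docs, self(), Ref, [Doc]},
    receive
        {Ref, Result} -> Result
    after 5000 ->
        {error, timeout}
    end.

%% Writer side: messages are handled strictly one after another.
loop(Written) ->
    receive
        {update_docs, From, Ref, Docs} ->
            NewWritten = Written ++ Docs,
            From ! {Ref, {ok, length(NewWritten)}},
            loop(NewWritten)
    end.
```

    Any number of request processes can call `update/2` at the same
    time; the writer’s mailbox puts their messages in a queue for us.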

    In `write_and_commit()` we send a message to that process with the message
    `update_docs` (in line 839):

    UpdatePid ! {update_docs, self(), DocBuckets, NonRepDocs, MergeConflicts, FullCommit},

    Let’s see where that message is handled.

    We need to know that the process behind `UpdatePid` runs the
    `couch_db_updater` module. We would have found that
    out in `couch_db:init()`.

    The `update_docs` message is handled in src/couchdb/couch_db_updater.erl
    in line 223.

    After the whole message has been received, the list of all docs (in our
    case, a list with just our document) is sent to `update_docs_int()` (line 672).

    `update_docs_int()` handles access to CouchDB’s main database data structure,
    the B+-tree. In fact, there are two B+-trees in each database at the same
    time: the fulldocinfo_by_id_btree and the docinfo_by_seq_btree. The first
    one contains all document data indexed by document id. The second one
    includes pointers to the fulldocinfo btree indexed by update sequence. The
    by_seq btree is what drives CouchDB’s /_changes feature which in turn
    powers replication, compaction and view creation.
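
    To make the two-index idea more concrete, here is a small in-memory
    sketch that uses `gb_trees` from the standard library in place of
    CouchDB’s on-disk B+-trees. The module `two_indexes` and its
    functions are invented for illustration. The point to notice is that
    an update removes the document’s old sequence entry, so every
    document shows up only once in the by-sequence index.

```erlang
-module(two_indexes).
-export([new/0, save/3, changes_since/2]).

%% State: {ById, BySeq, LastSeq}
%%   ById  maps DocId -> {Seq, Body}
%%   BySeq maps Seq   -> DocId
new() ->
    {gb_trees:empty(), gb_trees:empty(), 0}.

%% Store Body under Id with the next update sequence.
save(Id, Body, {ById, BySeq, LastSeq}) ->
    Seq = LastSeq + 1,
    %% If the doc already exists, drop its stale by-seq entry.
    BySeq1 = case gb_trees:lookup(Id, ById) of
        {value, {OldSeq, _OldBody}} -> gb_trees:delete(OldSeq, BySeq);
        none -> BySeq
    end,
    ById2 = gb_trees:enter(Id, {Seq, Body}, ById),
    BySeq2 = gb_trees:insert(Seq, Id, BySeq1),
    {ById2, BySeq2, Seq}.

%% Ids of all docs changed after sequence Since, in update order.
changes_since(Since, {_ById, BySeq, _LastSeq}) ->
    [Id || {Seq, Id} <- gb_trees:to_list(BySeq), Seq > Since].
```

    `changes_since/2` is the toy version of what a changes feed needs:
    walk the by-sequence index from a known point and report what is new.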

    A new document is inserted in both indexes in lines 705 and 706:

    {ok, DocInfoByIdBTree2} = couch_btree:add_remove(DocInfoByIdBTree, IndexFullDocInfos, []),
    {ok, DocInfoBySeqBTree2} = couch_btree:add_remove(DocInfoBySeqBTree, IndexDocInfos, RemoveSeqs),

    At this point, our doc lives in the database structure and has been
    assigned a new `rev`, but it has not yet been written to disk. The
    last operation in `update_docs_int()` is `commit_data()`, which
    sounds promising. Let’s jump down.

    The definition starts in line 781, the relevant bit for us in line 785.
    The way CouchDB writes changes to disk is in this fashion:

    1. write all changes to the data and index trees to the disk.
    3. write a header to disk that has the current pointers to the index
    trees that we wrote in 1.

    Writing to disk does not yet mean that the data actually arrived on
    disk. It might, but we only know for sure after we call the `fsync`
    system call. From Erlang, we call `couch_file:sync()`.

    Now there are different classes of behaviour possible in the list above.
    Notice how I left out 2.

    Writing a CouchDB file (which can be either a database file or a view index)
    can give different storage guarantees. The options are to fsync before
    the header is written, or after, or both. An fsync is a potentially
    expensive operation, so we have fine-grained control over this here.

    The full list is:

    1. write all changes to the data and index trees to the disk.
    2. fsync.
    3. write a header to disk that has the current pointers to the index
    trees that we wrote in 1.
    4. fsync.
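
    The four steps can be sketched with nothing but the standard `file`
    module. This is an illustration under assumptions, not CouchDB’s
    code: the module `commit_sketch`, the option names
    `sync_before_header` and `sync_after_header`, and the header format
    are all made up here.

```erlang
-module(commit_sketch).
-export([commit/3]).

%% Append Data (a binary) to the file at Path, then append a header
%% that points at it. Opts may contain the (hypothetical) flags
%% sync_before_header and sync_after_header.
commit(Path, Data, Opts) ->
    {ok, Fd} = file:open(Path, [read, write, raw, binary]),
    try
        %% 1. write the data at the end of the file
        {ok, Pos} = file:position(Fd, eof),
        ok = file:write(Fd, Data),
        %% 2. optional fsync before the header
        maybe_sync(Fd, proplists:get_bool(sync_before_header, Opts)),
        %% 3. write a header that points at the data from step 1
        Header = term_to_binary({header, Pos, byte_size(Data)}),
        ok = file:write(Fd, Header),
        %% 4. optional fsync after the header
        maybe_sync(Fd, proplists:get_bool(sync_after_header, Opts)),
        {ok, Pos}
    after
        file:close(Fd)
    end.

maybe_sync(Fd, true) -> ok = file:sync(Fd);
maybe_sync(_Fd, false) -> ok.
```

    If the machine dies between steps 1 and 3, the file just has some
    unreferenced bytes at the end and the previous header is still the
    newest valid one. That is the reason for writing the header last.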

    2.-4. happen in `commit_data()`, but wait, where did 1. happen?

    For that, we need to jump back to `update_docs_int()`, line 697:

    % Write out the document summaries (the bodies are stored in the nodes of
    % the trees, the attachments are already written to disk)
    {ok, FlushedFullDocInfos} = flush_trees(Db2, NewFullDocInfos, []),

    `flush_trees()` is defined in line 519. It iterates over the new data
    in the database and recursively writes it to disk in line 547:

    {ok, NewSummaryPointer, SummarySize} =
        couch_file:append_raw_chunk(Fd, Summary),

    Finally, we drop into `couch_file`, the lowest level of CouchDB.
    `append_raw_chunk()` is defined in line 111 and it is just a small
    wrapper that sends the `append_bin` message to the process that
    manages the file descriptor for our database file.

    `append_bin` is handled in line 373. It takes the data to be
    written and pads it out to make it a multiple of `?SIZE_BLOCK`
    (which is 4096 bytes).

    In line 376 our data is finally written to disk:

    file:write(Fd, Blocks)

    From here on out we go back up into `couch_db_updater` and
    deal with the header business we looked at earlier. From there
    it jumps back up into `couch_db`, which waits for a success in
    writing the data. When that shows up, it hands it back to
    `couch_httpd_db`, which uses `couch_httpd` to report the successful
    write of the document as an HTTP response.

    This concludes our little tour.

    I hope this was helpful! Let us know if there are any questions.

    Jan
    --
  • Garren Smith at Nov 1, 2012 at 12:48 pm
    This is a brilliant explanation, Thanks Jan. Nice way to learn how everything fits together.

    Cheers
    Garren
  • Ryan Graham at Nov 1, 2012 at 4:58 pm
    Great, but wasn't David asking about POST, not PUT? *ducks*

    Seriously, though, great post. This will really help with learning CouchDB
    and Erlang. Thanks!

    ~Ryan

    On Thu, Nov 1, 2012 at 5:47 AM, Garren Smith wrote:

    This is a brilliant explanation, Thanks Jan. Nice way to learn how
    everything fits together.

    Cheers
    Garren
    On 01 Nov 2012, at 1:22 PM, Jan Lehnardt wrote:

    Heya David,
    On Nov 1, 2012, at 08:39 , 高大为 wrote:

    Hi Erlang/CouchDB,

    Recently I am trying to read the source code of CouchDB, and got some
    knowledge that how the CouchDB booting up.

    Right now I want to learn, when send a request, for example POST a
    document, what parts of CouchDB code will do from handle the request to
    save the data into disk / filesystem.

    In other words, POST doc ===> save data to disk / filesystem, what
    parts of
    code will work for the whole procedure?

    Regards & Thanks!
    David

    I’ve been waiting for an excuse to do a deep-dive like this, thanks! :)

    Given that you already dug around some yourself, I omit the “how to get
    to the code” section. I am on current master 1a9143e.

    Let’s start at the HTTP API: src/couchdb/couch_httpd.erl

    `couch_httpd` is the main entry point for all request handling in
    CouchDB. Its responsibilities are:

    - Read the CouchDB configuration to configure itself with all
    settings a user wishes to have for handling requests.
    - Set up a socket to listen on for incoming requests.
    - Set up a list of request handlers that map API actions to
    internal module calls that do actual work.
    - Start Mochiweb to handle everything related to HTTP.
    - Export a number of functions that the request handler sub
    modules can use to handle requests.

    The sub-modules are all in src/couchdb/:

    - couch_httpd.erl
    - couch_httpd_auth.erl
    - couch_httpd_db.erl
    - couch_httpd_external.erl
    - couch_httpd_misc_handlers.erl
    - couch_httpd_oauth.erl
    - couch_httpd_proxy.erl
    - couch_httpd_rewrite.erl
    - couch_httpd_stats_handlers.er
    - couch_httpd_vhost.erl

    The mapping of request handlers to URLs happens in the CouchDB
    configuration. The defaults are set in etc/couchdb/default.ini,
    which in source form is called etc/couchdb/default.ini.tpl.in,
    meaning that there are two layers of replacing variables going
    on until we get a final default.ini. For the request handlers,
    we can look at default.ini.tpl.in.

    The mapping of URLs to request handlers happen on three layers:

    - Global handlers for things like `/`, `/_utils`, `_config` etc.
    - Database handlers like `/db/_all_docs` or `/db/_compact`.
    - Design document handlers like `/db/_design/docid/_view`


    With this knowledge, let’s trace this HTTP Requst:

    POST /db/docid
    ...
    {"a":1}

    Or in `curl`:

    $ curl -X PUT http://127.0.0.1:5984/db/docid -d '{"a":1}'


    The request goes to a `/db` URL, so we’ll have a look at the
    `[httpd_database_handlers]` section of default.ini.tpl.in:


    [httpd_db_handlers]
    _all_docs = {couch_mrview_http, handle_all_docs_req}
    _changes = {couch_httpd_db, handle_changes_req}
    _compact = {couch_httpd_db, handle_compact_req}
    _design = {couch_httpd_db, handle_design_req}
    _temp_view = {couch_mrview_http, handle_temp_view_req}
    _view_cleanup = {couch_mrview_http, handle_cleanup_req}

    Hm, nothing that looks like a handler for creating documents.

    Let’s go back to couch_httpd.erl. In line 138 we see that we
    start Mochiweb with a list of handlers, first of all the
    `DefaultFun`, maybe we need to look at that. We are tracking
    it back to line 102. There’s a bit of gibberish about “arity”,
    we’ll ignore that for now. Then we see that we *do* rely on
    the config system:

    couch_config:get("httpd", "default_handler"…).

    So let’s look at the `[httpd]` section of default.ini.tpl.in:

    default_handler = {couch_httpd_db, handle_request}

    That looks promising, let’s find that in code, at
    src/couchdb/couch_httpd_db.erl, line 36.

    `handle_request()` first checks whether we want to create or
    delete a database, but when it sees we don’t, it passes our
    request along to `do_db_req()` (line 230), which turns out
    just to be a wrapper that opens a database and calls a callback,
    so back to where `do_db_req()` is called, we see `db_req/2` is
    passed as a callback.

    Now `db_req()` has various clauses to differentiate the different
    HTTP request methods it is called with and to allow for all sorts
    of special URLs to be called. We are interested in PUT, but we
    don‘t find that PUT is handled anywhere in particular. We do see
    however, that all the clauses before the last-but-one handle
    something that is *not* put, so we know that our clause is on
    line 464:

    db_req(#httpd{path_parts=[_, DocId]}=Req, Db) ->
    db_doc_req(Req, Db, DocId);

    Which turns to be yet another indirection, so let’s go with it.
    `do_doc_req` again has a number of clauses to deal with various
    request types. Lucky for us, there is a PUT clause on line 563:

    db_doc_req(#httpd{method='PUT'}=Req, Db, DocId) ->

    First, the function checks whether we have a valid `DocId`.
    Assuming we do, it checks whether the request is a HTTP multipart
    request or a regular one. We have a regular one and are lucky
    again, our part of code here is rather small:

    Body = couch_httpd:json_body(Req),
    Doc = couch_doc_from_req(Req, DocId, Body),
    update_doc(Req, Db, DocId, Doc)

    The first line fetches the JSON document body from the `Req`
    variable. At this point, `Body` should equal: `<<"{\"a\":1}">>,
    an Erlang binary that encodes the JSON body we passed in as
    a request.

    The second line turns the JSON, together with the `DocId` into
    a CouchDB document.

    Finally, we pass all we have now to the `update_doc` function we
    check out later.

    `couch_doc_from_req()` figures out whether we are trying to update
    and existing doc with our PUT request, or whether we want to create
    a new one. In our case, not much is done, in the update case, we
    need to pass in a `rev=` query parameter and that is checked here.

    In either case though, this function returns a value of the type
    `#doc{}`, which is a record that is defined in src/couchdb/couch_db.hrl,
    line 99, if you are curious.

    With all that in place, we can finally visit `update_doc()`. It again
    has a few clauses, starting in line 716 (we are still in
    couch_httpd_db.erl). `update_doc` deals with a number of query
    parameters again until it finally calls `couch_db:update_doc()`.

    This is our entry into the innards of CouchDB.

    Enter `couch_db` in src/couchdb/couch_db.erl. Our function
    `update_doc()` is defined in line 422, and it ultimately seems to
    be a wrapper around `update_docs()` (plural) in the lines starting
    at 688. Update docs has two independent clauses:

    update_docs(Db, Docs, Options, replicated_changes) ->

    and

    update_docs(Db, Docs, Options, interactive_edit) ->

    The first one handles replications that can create conflicts in
    document revision lists. The second one deals with regular
    database operations, so that is the one for us.

    Our `update_docs()` does a number of things:

    - prepare for yet more request parameters.
    - separate our `_local` docs and regular docs (ours is a regular one).
    - validate our document against `validate_update_function`s, if they exist.
    - check whether we provided the correct `rev` in case of updates.

    And finally:

    {ok, CommitResults} = write_and_commit(Db, DocBuckets4, NonRepDocs,
        Options2),

    Let’s jump there, line 831:

    After doing some more preparations that I will gloss over, we see
    that CouchDB keeps around an `UpdatePid` in the `#db{}` record that
    is passed down with us so far. This `UpdatePid` is the process ID of
    a process that deals with database updates.

    In CouchDB, each database has a single process handling writes to the
    database, to ensure a consistent database file.

    In `write_and_commit()` we send a message to that process with the message
    `update_docs` (in line 839):

    UpdatePid ! {update_docs, self(), DocBuckets, NonRepDocs,
        MergeConflicts, FullCommit},

    Let’s see where that message is handled.

    We need to know that the process behind `UpdatePid` runs the
    `couch_db_updater` module. We would have found that out in
    `couch_db:init()`.
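
    The single-writer idea can be sketched in a few lines of plain Erlang.
    This is a toy under stated assumptions, not `couch_db_updater`: the
    module name, the `{done, ...}` reply, the in-memory list standing in
    for the database file and the 5 second timeout are all invented for
    the example.

```erlang
-module(single_writer_sketch).
-export([start/0, update_docs/2, demo/0]).

%% Start the one process that owns all writes.
start() ->
    spawn(fun() -> loop([]) end).

%% Client side: send the docs, then block until the writer replies.
update_docs(UpdaterPid, Docs) ->
    UpdaterPid ! {update_docs, self(), Docs},
    receive
        {done, UpdaterPid, Count} -> {ok, Count}
    after 5000 ->
        {error, timeout}
    end.

%% Writer side: messages are taken from the mailbox one at a time,
%% so two updates can never interleave.
loop(Stored) ->
    receive
        {update_docs, Client, Docs} ->
            NewStored = Stored ++ Docs,
            Client ! {done, self(), length(NewStored)},
            loop(NewStored)
    end.

demo() ->
    Pid = start(),
    {ok, 1} = update_docs(Pid, [doc_a]),
    {ok, 3} = update_docs(Pid, [doc_b, doc_c]),
    ok.
```

    However many request processes call `update_docs/2` at the same time,
    the writer handles one message after the other, which is the property
    the text above relies on for a consistent database file.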

    The `update_docs` message is handled in src/couchdb/couch_db_updater.erl
    in line 223.

    After the whole message has been received, the list of docs (in our
    case, a list with just our document) is sent to `update_docs_int()`
    (line 672).

    `update_docs_int()` handles access to CouchDB’s main database data
    structure, the B+-tree. In fact, there are two B+-trees in each
    database at the same time: the fulldocinfo_by_id_btree and the
    docinfo_by_seq_btree. The first one contains all document data indexed
    by document id. The second one includes pointers to the fulldocinfo
    btree indexed by update sequence. The by_seq btree is what drives
    CouchDB’s /_changes feature, which in turn powers replication,
    compaction and view creation.
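
    To make the two-index idea concrete, here is a small sketch that uses
    `gb_trees` from the Erlang standard library in place of `couch_btree`.
    It assumes a much simpler layout than the real one: the by-id tree
    maps an id to `{Seq, Body}`, the by-seq tree maps a sequence number
    back to the id, and all names in it are made up for the example.

```erlang
-module(two_index_sketch).
-export([new/0, save/3, demo/0]).

%% State: {ById, BySeq, LastSeq}.
new() ->
    {gb_trees:empty(), gb_trees:empty(), 0}.

save(Id, Body, {ById, BySeq, LastSeq}) ->
    Seq = LastSeq + 1,
    %% An updated document gives up its old slot in the by-seq index.
    BySeq1 = case gb_trees:lookup(Id, ById) of
        {value, {OldSeq, _OldBody}} -> gb_trees:delete(OldSeq, BySeq);
        none -> BySeq
    end,
    {gb_trees:enter(Id, {Seq, Body}, ById),
     gb_trees:insert(Seq, Id, BySeq1),
     Seq}.

demo() ->
    S1 = save(<<"docid">>, <<"{\"a\":1}">>, new()),
    S2 = save(<<"other">>, <<"{}">>, S1),
    {ById, BySeq, 3} = save(<<"docid">>, <<"{\"a\":2}">>, S2),
    {value, {3, <<"{\"a\":2}">>}} = gb_trees:lookup(<<"docid">>, ById),
    [{2, <<"other">>}, {3, <<"docid">>}] = gb_trees:to_list(BySeq),
    ok.
```

    Walking the by-seq tree in order gives "what changed, oldest first",
    which is the kind of question /_changes has to answer.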

    A new document is inserted in both indexes in lines 705 and 706:

    {ok, DocInfoByIdBTree2} = couch_btree:add_remove(DocInfoByIdBTree,
        IndexFullDocInfos, []),
    {ok, DocInfoBySeqBTree2} = couch_btree:add_remove(DocInfoBySeqBTree,
        IndexDocInfos, RemoveSeqs),

    At this point, our doc lives in the database structure and has been
    assigned a new `rev`, but it has not yet been written to disk. The
    last operation in `update_docs_int()` is `commit_data()`, which
    sounds promising. Let’s jump down.

    The definition starts in line 781, the relevant bit for us in line 785.
    CouchDB writes changes to disk in this fashion:

    1. write all changes to the data and index trees to the disk.
    3. write a header to disk that has the current pointers to the index
    trees that we wrote in 1.

    Writing to disk does not yet mean that the data actually arrived on
    disk. It might, but we only know for sure after we call the `fsync`
    system call. From Erlang, we call `couch_file:sync()`.

    Now there are different classes of behaviour possible in the list above.
    Notice how I left out 2.

    Writing a CouchDB file (which can be either a database file or a view index)
    can give different storage guarantees. The options are to fsync before
    the header is written, or after, or both. An fsync is a potentially
    expensive operation, so we have fine-grained control over this here.

    The full list is:

    1. write all changes to the data and index trees to the disk.
    2. fsync.
    3. write a header to disk that has the current pointers to the index
    trees that we wrote in 1.
    4. fsync.
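
    The four steps can be sketched with the plain `file` module from the
    Erlang standard library. This is an illustration of the ordering
    only, not what `commit_data()` looks like: the module name, the file
    name and the data and header contents are made up, and both fsyncs
    are always performed here, whereas CouchDB lets you choose.

```erlang
-module(commit_sketch).
-export([commit/3, demo/0]).

%% Data and Header are opaque binaries supplied by the caller.
commit(Path, Data, Header) ->
    {ok, Fd} = file:open(Path, [append, raw, binary]),
    ok = file:write(Fd, Data),      % 1. data and index tree changes
    ok = file:sync(Fd),             % 2. fsync
    ok = file:write(Fd, Header),    % 3. header pointing at what step 1 wrote
    ok = file:sync(Fd),             % 4. fsync
    file:close(Fd).

demo() ->
    Path = "commit_sketch.tmp",
    _ = file:delete(Path),
    ok = commit(Path, <<"tree-nodes">>, <<"header">>),
    {ok, <<"tree-nodes", "header">>} = file:read_file(Path),
    file:delete(Path).
```

    The first fsync makes sure a header never reaches the disk before the
    data it points to; the second one makes sure the header itself is on
    disk before success is reported.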

    2.-4. happen in `commit_data()`, but wait, where did 1. happen?

    For that, we need to jump back to `update_docs_int()`, line 697:

    % Write out the document summaries (the bodies are stored in the nodes of
    % the trees, the attachments are already written to disk)
    {ok, FlushedFullDocInfos} = flush_trees(Db2, NewFullDocInfos, []),

    `flush_trees()` is defined in line 519. It iterates over the new data
    in the database and recursively writes it to disk in line 547:

    {ok, NewSummaryPointer, SummarySize} =
        couch_file:append_raw_chunk(Fd, Summary),

    Finally, we drop into `couch_file`, the lowest level of CouchDB.
    `append_raw_chunk()` is defined in line 111 and it is just a small
    wrapper that sends the `append_bin` message to the process that
    manages the file descriptor for our database file.

    `append_bin` is handled in line 373. It takes the data to be
    written and pads it out to make it a multiple of `?SIZE_BLOCK`
    (which is 4096 bytes).
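
    The padding arithmetic as described here can be sketched like this.
    It only illustrates "round up to a multiple of the block size" and
    is not a copy of what `couch_file` does; the module name, the
    `pad/1` helper and the choice of zero bytes as filler are assumptions
    made for the example.

```erlang
-module(pad_sketch).
-export([pad/1, demo/0]).

-define(SIZE_BLOCK, 4096).

%% Append zero bytes until the size is a multiple of ?SIZE_BLOCK.
pad(Bin) when is_binary(Bin) ->
    case byte_size(Bin) rem ?SIZE_BLOCK of
        0 -> Bin;
        Rem ->
            PadBits = (?SIZE_BLOCK - Rem) * 8,
            <<Bin/binary, 0:PadBits>>
    end.

demo() ->
    4096 = byte_size(pad(<<"{\"a\":1}">>)),
    4096 = byte_size(pad(binary:copy(<<0>>, 4096))),
    8192 = byte_size(pad(binary:copy(<<0>>, 4097))),
    ok.
```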

    In line 376 our data is finally written to disk:

    file:write(Fd, Blocks)

    From here on out we go back up into `couch_db_updater` and
    deal with the header business we looked at earlier. From there
    it jumps back up into `couch_db`, which waits for a success in
    writing the data. When that shows up, it hands it back to
    `couch_httpd_db`, which uses `couch_httpd` to report the successful
    writing of the document as an HTTP response.

    This concludes our little tour.

    I hope this was helpful! Let us know if there are any questions.

    Jan
    --

    --
    http://twitter.com/rmgraham
  • Jan Lehnardt at Nov 1, 2012 at 6:34 pm

    On Nov 1, 2012, at 17:58 , Ryan Graham wrote:

    Great, but wasn't David asking about POST, not PUT? *ducks*
    Ah sorry, I didn't comment on that. Aside from an indirection in
    couch_httpd_db the code path is the same, and I chose this one
    because it is a little simpler.

    Seriously, though, great post. This will really help with learning CouchDB
    and Erlang. Thanks!
    Thanks, to the others as well, I’m glad you like it!

    Cheers
    Jan
    --
    ~Ryan

    On Thu, Nov 1, 2012 at 5:47 AM, Garren Smith wrote:

    This is a brilliant explanation, Thanks Jan. Nice way to learn how
    everything fits together.

    Cheers
    Garren
  • Binbin Wang at Nov 2, 2012 at 1:11 am
    Hi Jan,

    That's a great post, and it will greatly help those of us who want to
    dive into the CouchDB source. Thank you for sharing!

    Regards & Thanks!
    Binbin

    --
    Wang.bupt

Discussion Overview
group: erlang @
categories: couchdb
posted: Nov 1, '12 at 8:16a
active: Nov 2, '12 at 1:11a
posts: 6
users: 5
website: couchdb.apache.org
irc: #couchdb
