As you might have picked up, I'm working on a REST API that uses JSON in the
request. I also need to allow large file uploads.

HTTP::Body::OctetStream will chunk the request body to a temp file, but
Catalyst::Action::Deserialize::JSON will then load that temp file into
memory. Obviously, I want to limit that.
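A minimal sketch of the kind of guard I have in mind -- checking the
declared length before the deserializer runs (the 10MB cap and the begin
action are mine, not anything Catalyst or HTTP::Body provide):

    # Reject oversized JSON bodies up front, before the deserializer
    # slurps the temp file into memory. MAX_JSON_BYTES is arbitrary.
    use constant MAX_JSON_BYTES => 10 * 1024 * 1024;    # 10MB

    sub begin : Private {
        my ( $self, $c ) = @_;
        if ( ( $c->req->content_type || '' ) eq 'application/json'
            && ( $c->req->content_length || 0 ) > MAX_JSON_BYTES )
        {
            $c->res->status(413);    # Request Entity Too Large
            $c->res->body('JSON request body too large');
            $c->detach;
        }
    }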

AFAIK, there's no way to stream parse JSON (so that only part is in memory
at any given time). What would be the recommended serialization for
uploaded files -- just use multipart/form-data for the uploads?

BTW -- I don't see any code in HTTP::Body to limit body size. Doesn't that
seem like a pretty easy DoS for Catalyst apps? I do set a request size
limit in the web server, but if I need to allow 1/2GB uploads or so then
that could kill the machine pretty easily, no?



--
Bill Moseley
moseley@hank.org

  • Tomas Doran at Feb 6, 2010 at 4:54 am

    On 5 Feb 2010, at 20:54, Bill Moseley wrote:
    AFAIK, there's no way to stream parse JSON (so that only part is in
    memory at any given time). What would be the recommended
    serialization for uploaded files -- just use multipart/form-data for
    the uploads?
    Don't?

    Why not just do a PUT request with all the data as unmangled binary?
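    Something like this, say (just a sketch -- the storage path is made
    up and filename sanitizing is omitted; the point is that
    $c->req->body hands you the parser's temp-file handle, so nothing
    below pulls the upload into memory):

        # Copy the raw PUT body straight from the parser's temp file.
        use File::Copy ();

        sub put_file : Path('files') {
            my ( $self, $c, $filename ) = @_;
            my $body_fh = $c->req->body;    # filehandle, not a string
            File::Copy::copy( $body_fh, "/srv/uploads/$filename" )
                or die "copy failed: $!";
            $c->res->status(201);           # Created
        }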
    BTW -- I don't see any code in HTTP::Body to limit body size.
    Doesn't that seem like a pretty easy DoS for Catalyst apps? I do
    set a request size limit in the web server, but if I need to allow
1/2GB uploads or so then that could kill the machine pretty easily, no?
Well, you set it at the web server... That stops both overlarge
content-length requests and bodies that exceed the specified content
length.

    But yes, you have to provision temp file space for n files in flight x
    max file size...

    (I have an HTTP::Body subclass I use to stream stuff directly into
    mogilefs rather than getting a temp file - code on request)...
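    The guts are roughly this (simplified from memory -- the tracker
    host, domain and key naming are placeholders):

        # Override spin() so chunks are printed to a MogileFS handle
        # as they arrive, instead of being written to a File::Temp.
        package HTTP::Body::MogileFS;
        use base 'HTTP::Body';
        use MogileFS::Client;

        sub spin {
            my $self = shift;

            $self->{mogile_fh} ||= do {
                my $mogc = MogileFS::Client->new(
                    domain => 'myapp',               # placeholder
                    hosts  => ['127.0.0.1:7001'],    # placeholder
                );
                $mogc->new_file( 'upload-' . time(), 'uploads' );
            };

            # Drain whatever the parser has buffered so far.
            if ( length $self->{buffer} ) {
                print { $self->{mogile_fh} }
                    substr( $self->{buffer}, 0, length( $self->{buffer} ), '' );
            }

            if ( $self->length == $self->content_length ) {
                close $self->{mogile_fh};    # commits the file
                $self->state('done');
            }
        }

        1;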

    Cheers
    t0m
  • Bill Moseley at Feb 6, 2010 at 4:28 pm
    On Fri, Feb 5, 2010 at 8:56 PM, Tomas Doran wrote:
    On 5 Feb 2010, at 20:54, Bill Moseley wrote:

    AFAIK, there's no way to stream parse JSON (so that only part is in memory
    at any given time). What would be the recommended serialization for
    uploaded files -- just use multipart/form-data for the uploads?
    Don't?
    Why not just do a PUT request with all the data as unmangled binary?

    As in don't provide a way to upload meta data along with the file (name,
    date, description, author, title, reference id) like the web upload allows
    with multipart/form-data? Or invent some new serialization where the meta
    data is embedded in the upload? Or do a POST with the file, then flag the
    new upload as incomplete until a PUT is done to set associated meta data?

    The API is supposed to offer much of the same functionality as the web
    interface. JSON is somewhat nice because, well, customers have requested
    it, and also because it lends itself to more complex (not flat) data
    representations. Of course, urlencoded doesn't have to be flat -- we have
    some YUI-based AJAX code that sends JSON in $c->req->params->{data}. But
    I digress.

    The 'multipart/form-data' is nice because, if the client is well behaved,
    uploads are chunked to disk. XML can do this, too (I have an HTTP::Body
    subclass for XML-RPC that chunks base64 elements to disk).
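    The heart of that subclass is just an incremental parse that decodes
    complete base64 groups as they arrive (stripped-down sketch; the real
    thing hangs off HTTP::Body rather than standing alone):

        # Stream <base64> element content to a temp file so the decoded
        # payload never sits in memory. Only complete 4-character
        # base64 groups are decoded; the remainder waits for more data.
        use XML::Parser;
        use MIME::Base64 ();
        use File::Temp ();

        my $tmp     = File::Temp->new;
        my $pending = '';
        my $in_b64  = 0;

        my $expat = XML::Parser->new(
            Handlers => {
                Start => sub { $in_b64 = 1 if $_[1] eq 'base64' },
                Char  => sub {
                    return unless $in_b64;
                    ( my $chunk = $_[1] ) =~ s/\s+//g;
                    $pending .= $chunk;
                    my $whole = length($pending) - length($pending) % 4;
                    print {$tmp} MIME::Base64::decode_base64(
                        substr( $pending, 0, $whole, '' ) ) if $whole;
                },
                End => sub {
                    if ( $_[1] eq 'base64' ) {
                        print {$tmp} MIME::Base64::decode_base64($pending);
                        ( $pending, $in_b64 ) = ( '', 0 );
                    }
                },
            },
        )->parse_start;

        # Then feed the request body in chunks:
        # $expat->parse_more($chunk), and finally $expat->parse_done.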


    BTW -- I don't see any code in HTTP::Body to limit body size. Doesn't
    that seem like a pretty easy DoS for Catalyst apps? I do set a request size
    limit in the web server, but if I need to allow 1/2GB uploads or so then
    that could kill the machine pretty easily, no?
    Well, you set it at the web server... That stops both overlarge
    content-length requests and bodies that exceed the specified content
    length.
    Yes, for example in Apache LimitRequestBody can be set, and if you send a
    content-length header larger than that value the request is rejected right
    away. And, IIRC, Apache will just discard any data over what is specified
    in the content-length header (i.e. Catalyst won't see any data past the
    content length from Apache).
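    For example, in the Apache config, a 700MB cap would be:

        # Reject any request body over 700MB (700 * 1024 * 1024 bytes)
        LimitRequestBody 734003200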


    But yes, you have to provision temp file space for n files in flight x max
    file size...
    You are making an assumption that the request body actually makes it to a
    temp file.

    Imagine you allow uploads of CD ISO files, say 700MB, so you set the web
    server's limit to that. Normally, when someone uploads, you expect
    HTTP::Body to see OctetStream or form-data posts, which end up buffering
    to disk.

    Now, if someone sets their content type to Urlencoded then HTTP::Body just
    gathers up that 700MB in memory. MaxClients is 50, so do the math: 50
    workers x 700MB is 35GB of RAM.

    Granted, someone would have to work very hard to get enough data at once
    all to the same web server, and if an attacker is that determined they
    could find other equally damaging attacks. And a good load balancer can
    monitor memory and disk space on the web servers and stop sending
    requests to a server low on resources.


    Most applications don't have this problem since uploads that large are
    likely rare. Well, that assumes that everyone is using something in front
    of Catalyst that limits upload size (like Apache's LimitRequestBody).

    It's unusual to have a very large valid Urlencoded (or non-upload
    form-data) body in a normal request (that's a lot of radio buttons and
    text to type!), so would it not be wise for HTTP::Body to limit the size
    of $self->{buffer} to something sane? I suppose it could flush to disk
    after getting too big, but that doesn't really help because some
    serializations require reading the entire thing into memory to parse.
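    Something like this is what I'm picturing (untested, the 1MB limit is
    arbitrary, and getting HTTP::Body to use it means poking at its
    undocumented $TYPES map):

        # Cap how much of a urlencoded body will be buffered in memory.
        package HTTP::Body::UrlEncoded::Limited;
        use base 'HTTP::Body::UrlEncoded';

        our $MAX_BUFFER = 1024 * 1024;    # 1MB

        sub add {
            my $self = shift;
            die "urlencoded body exceeded $MAX_BUFFER bytes\n"
                if length( $self->{buffer} || '' ) + length( $_[0] || '' )
                    > $MAX_BUFFER;
            return $self->SUPER::add(@_);
        }

        # Swap ourselves into HTTP::Body's content-type dispatch:
        $HTTP::Body::TYPES->{'application/x-www-form-urlencoded'}
            = 'HTTP::Body::UrlEncoded::Limited';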



    --
    Bill Moseley
    moseley@hank.org
  • Aristotle Pagaltzis at Feb 6, 2010 at 7:29 pm

    * Bill Moseley [2010-02-06 17:30]:
    As in don't provide a way to upload meta data along with the
    file (name, date, description, author, title, reference id)
    like the web upload allows with multipart/form-data? Or invent
    some new serialization where the meta data is embedded in the
    upload?
    Neither, depending on your metadata. The things you did mention
    could quite well be sent as request headers. No need to put
    another envelope inside the HTTP request envelope.
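    For instance (header names and URL invented, of course):

        # The metadata rides in request headers; the body is nothing
        # but the raw file bytes. Slurping is fine for a small file;
        # a big one would want a streaming content provider.
        use LWP::UserAgent;
        use HTTP::Request;

        my $req = HTTP::Request->new(
            PUT => 'http://api.example.com/users/42/documents' );
        $req->header( 'Content-Type'      => 'application/pdf' );
        $req->header( 'X-MyApp-Filename'  => 'report.pdf' );
        $req->header( 'X-MyApp-Author'    => 'moseley' );
        $req->header( 'X-MyApp-Timestamp' => '2010-02-06T19:29:00Z' );

        open my $fh, '<:raw', 'report.pdf' or die $!;
        my $data = do { local $/; <$fh> };
        $req->content($data);

        my $res = LWP::UserAgent->new->request($req);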

    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
  • Bill Moseley at Feb 6, 2010 at 10:36 pm

    On Sat, Feb 6, 2010 at 11:29 AM, Aristotle Pagaltzis wrote:

    * Bill Moseley [2010-02-06 17:30]:
    As in don't provide a way to upload meta data along with the
    file (name, date, description, author, title, reference id)
    like the web upload allows with multipart/form-data? Or invent
    some new serialization where the meta data is embedded in the
    upload?
    Neither, depending on your metadata. The things you did mention
    could quite well be sent as request headers. No need to put
    another envelope inside the HTTP request envelope.

    Could you be more specific? For example, API requests to:

    1) create a new user in account #1234 with name, email, etc.
    2) create a user but also provide a photo when creating the user
    3) upload a document for the user and the document must include an
    associated collection of meta data (e.g. filename, timestamp, author etc.).
    The uploaded document must include
    this meta data before it can be accepted.





    --
    Bill Moseley
    moseley@hank.org
  • Aristotle Pagaltzis at Feb 9, 2010 at 10:36 am

    * Bill Moseley [2010-02-06 23:35]:
    1) create a new user in account #1234 with name, email, etc.
    This is just a normal form POST.
    2) create a user but also provide a photo when creating the user
    I might separate this out into two requests -- whatever the POST
    request returns would contain a link to which the client can PUT
    the photo.
    3) upload a document for the user and the document must include
    an associated collection of meta data (e.g. filename,
    timestamp, author etc.). The uploaded document must include
    this meta data before it can be accepted.
    That sounds like the case I was thinking about: just do a PUT
    request with X-MyApp-Filename, X-MyApp-Timestamp etc headers.

    (Another option, which is better in some ways I think, would be
    the two-request approach as above, though that would be more
    complicated. I.e. the client POSTs the metadata, the server files
    the data away temporarily and returns a link to which the client
    can PUT the file, and only once that request has succeeded does
    the server store both metadata and file in their proper place.)
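    Client-side, that flow is just two calls (URLs and field names
    invented for the sake of the example):

        # 1) POST the metadata; the server parks it and answers with
        #    a Location header saying where the file bytes should go.
        use LWP::UserAgent;
        use HTTP::Request;
        use JSON ();

        my $ua  = LWP::UserAgent->new;
        my $res = $ua->post(
            'http://api.example.com/users/42/documents',
            'Content-Type' => 'application/json',
            Content        => JSON::encode_json(
                { filename => 'report.pdf', author => 'moseley' }
            ),
        );
        my $upload_url = $res->header('Location');

        # 2) PUT the raw bytes there; only once this succeeds does the
        #    server file both metadata and content in their proper place.
        open my $fh, '<:raw', 'report.pdf' or die $!;
        my $data = do { local $/; <$fh> };
        my $put  = HTTP::Request->new( PUT => $upload_url );
        $put->content($data);
        $ua->request($put);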

    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
  • Bill Moseley at Feb 9, 2010 at 3:12 pm

    On Tue, Feb 9, 2010 at 2:36 AM, Aristotle Pagaltzis wrote:

    3) upload a document for the user and the document must include
    an associated collection of meta data (e.g. filename,
    timestamp, author etc.). The uploaded document must include
    this meta data before it can be accepted.
    That sounds like the case I was thinking about: just do a PUT
    request with X-MyApp-Filename, X-MyApp-Timestamp etc headers.
    Of course, I left out the ability to upload multiple files at once. Doing
    that with headers could get ugly (X-MyApp-Filename-01,
    X-MyApp-Filename-02, ...). Of course, I could just not provide that
    multiple-file upload ability to API users and limit it to web users. That
    would work ok.

    With XML-RPC we just have multiple <upload> struct elements that are
    containers for the meta data and the base64 file contents.
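    Roughly like this on the wire for each file (member names trimmed to
    the relevant ones, base64 truncated):

        <value><struct>
          <member><name>filename</name>
            <value><string>report.pdf</string></value></member>
          <member><name>title</name>
            <value><string>Q4 report</string></value></member>
          <member><name>content</name>
            <value><base64>JVBERi0xLjQK...</base64></value></member>
        </struct></value>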


    (Another option, which is better in some ways I think, would be
    the two-request approach as above, though that would be more
    complicated. Ie. the client POSTs the metadata, the server files
    the data away temporarily and returns a link to which the client
    can PUT the file, and only once that request has succeeded does
    the server store both metadata and file in their proper place.)
    That's a bit of a redesign of the application for a two-phase upload. It
    seems a shame to have to add new database tables and cron jobs to clean up
    incomplete uploads just because of my choice of serialization. I agree
    that's probably the cleanest design, though. From past experience I can
    assume some customers will have trouble adding request headers with the
    libraries they are using.

    form-data is a possible serialization, but it's flat, so I'd also need
    fields like filename_01, title_01, filename_02, title_02 to handle
    multiple uploads at once. (Plus, the app already handles that form-data.)
    I'm not sure how much meta data can be associated with an upload in
    form-data (other than filename, content-disposition, and content-type),
    or if the libraries clients use to create a request can be that creative.

    XML-RPC is ugly but nicely handles multiple uploads with associated meta
    data for each, and can be stream parsed so that the base64 file data is
    chunked to a temp file and not stored in memory.

    JSON provides the nice nested structures but, IIUC, has to be in-memory to
    parse. I hate those "out of memory!" messages, so it would be very nice to
    not have the file uploads in JSON.

    Not pretty at all, but maybe using form-data with a JSON-encoded "meta"
    field containing a list of uploads, each with its associated meta data and
    a field_name tying it to the form field that carries the uploaded file.
    Most client libraries have a way to send form-data, so that would be easy
    for customers to implement.
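    Sketched with HTTP::Request::Common (field names and URL invented),
    it's one ordinary multipart POST:

        # A JSON "meta" field describes each upload and names the form
        # field that carries its bytes; the files ride as normal parts.
        use HTTP::Request::Common qw(POST);
        use JSON ();

        my $req = POST 'http://api.example.com/documents',
            Content_Type => 'form-data',
            Content      => [
                meta => JSON::encode_json( { uploads => [
                    { field_name => 'file_1',
                      filename   => 'report.pdf',
                      title      => 'Q4 report' },
                ] } ),
                file_1 => ['report.pdf'],    # LWP streams this from disk
            ];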

    None of those are great options.


    --
    Bill Moseley
    moseley@hank.org
  • Aristotle Pagaltzis at Feb 9, 2010 at 7:27 pm

    * Bill Moseley [2010-02-09 16:10]:
    On Tue, Feb 9, 2010 at 2:36 AM, Aristotle Pagaltzis wrote:
    That sounds like the case I was thinking about: just do a PUT
    request with X-MyApp-Filename, X-MyApp-Timestamp etc headers.
    Of course, I left out the ability to upload multiple files at
    once. Doing that with headers could get ugly
    (X-MyApp-Filename-01, X-MyApp-Filename-02, ...). Of course,
    I could just not provide that multiple-file upload ability to API
    users and limit it to web users. That would work ok.
    I would seriously just not provide multiple uploads via the API.
    For the browser UI they're a necessity because it's so awkward to
    upload files one at a time, but the API is a completely different
    category. This falls under "batching", and all the HTTP sages
    will tell you "don't do that". It makes both the server and the
    client more complicated without any discernible upsides. (In
    fact, if you do pipelining, then separate PUT requests are
    actually more efficient in terms of roundtrips and overhead.)
    From past experience I can assume some customers will have
    trouble adding request headers with the libraries they are
    using.
    That would be a problem, yes. (Damn people treating HTTP as
    a transport protocol... *mutter*)
    form-data is a possible serialization, but it's flat, so I'd also
    need fields like filename_01, title_01, filename_02, title_02 to
    handle multiple uploads at once. (Plus, the app already handles
    that form-data.)
    Just don't do batch uploads in the API.
    XML-RPC is ugly but nicely handles multiple uploads with
    associated meta data for each.
    Yuck.
    JSON provides the nice nested structures but, IIUC, has to be
    in-memory to parse.
    Not in principle, although it may well be that there isn't any
    library that implements a streaming parser yet.
    Not pretty at all, but maybe using form-data with a JSON-encoded
    "meta" field containing a list of uploads, each with its
    associated meta data and a field_name tying it to the form field
    that carries the uploaded file. Most client libraries have a way
    to send form-data, so that would be easy for customers to
    implement.

    None of those are great options.
    Actually, that sounds like a decent option if you really need
    a nested data structure and can't use headers. (I'd still not
    do batch uploads, though.)

    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
