Skip to content

defnull/multipart

Repository files navigation

Parser for multipart/form-data

Tests Status Latest Version License

This module provides multiple parsers for RFC-7578 multipart/form-data, both low-level for framework authors and high-level for WSGI application developers:

  • PushMultipartParser: A low-level incremental SansIO (non-blocking) parser suitable for asyncio and other time or memory constrained environments.
  • MultipartParser: A streaming parser emitting memory- and disk-buffered MultipartPart instances.
  • parse_form_data: A helper function to parse both multipart/form-data and application/x-www-form-urlencoded form submissions from a WSGI environment.

Installation

pip install multipart

Features

  • Pure python single file module with no dependencies.
  • 100% test coverage. Tested with inputs as seen from actual browsers and HTTP clients.
  • Parses multiple GB/s on modern hardware (see benchmarks).
  • Quickly rejects malicious or broken inputs and emits useful error messages.
  • Enforces configurable memory and disk resource limits to prevent DoS attacks.

Limitations: This parser implements multipart/form-data as it is used by actual modern browsers and HTTP clients, which means:

  • Just multipart/form-data, not suitable for email parsing.
  • No multipart/mixed support (deprecated in RFC 7578).
  • No base64 or quoted-printable transfer encoding (deprecated in RFC 7578).
  • No encoded-word or name=_charset_ encoding markers (discouraged in RFC 7578).
  • No support for clearly broken input (e.g. invalid line breaks or header names).

Usage and Examples

Here are some basic examples for the most common use cases. There are more parameters and features available than shown here, so check out the docstrings (or your IDEs built-in help) to get a full picture.

Helper function for WSGI or CGI

For WSGI application developers we strongly suggest using the parse_form_data helper function. It accepts a WSGI environ dictionary and parses both types of form submission (multipart/form-data and application/x-www-form-urlencoded) based on the actual content type of the request. You'll get two MultiDict instances in return, one for text fields and the other for file uploads:

from multipart import parse_form_data

def wsgi(environ, start_response):
  if environ["REQUEST_METHOD"] == "POST":
    forms, files = parse_form_data(environ)

    title  = forms["title"]  # string
    upload = files["upload"] # MultipartPart
    upload.save_as(...)

Note that form fields that are too large to fit into memory will end up as MultipartPart instances in the files dict instead. This is to protect your app from running out of memory or crashing. MultipartPart instances are buffered to temporary files on disk if they exceed a certain size. The default limits should be fine for most use cases, but can be configured if you need to. See MultipartParser for details.

Flask, Bottle & Co

Most WSGI web frameworks already have multipart functionality built in, but you may still get better throughput for large files (or better limits control) by switching parsers:

forms, files = multipart.parse_form_data(flask.request.environ)

Legacy CGI

If you are in the unfortunate position to have to rely on CGI, but can't use cgi.FieldStorage anymore, it's possible to build a minimal WSGI environment from a CGI environment and use that with parse_form_data. This is not a real WSGI environment, but it contains enough information for parse_form_data to do its job. Do not forget to add proper error handling.

import sys, os, multipart

environ = dict(os.environ.items())
environ['wsgi.input'] = sys.stdin.buffer
forms, files = multipart.parse_form_data(environ)

Stream parser: MultipartParser

The parse_form_data helper may be convenient, but it expects a WSGI environment and parses the entire request in one go before it returns any results. Using MultipartParser directly gives you more control and also allows you to process MultipartPart instances as soon as they arrive:

from multipart import parse_options_header, MultipartParser

def wsgi(environ, start_response):
  assert environ["REQUEST_METHOD"] == "POST"

  content_type = environ["CONTENT_TYPE"]
  content_type, content_params = parse_options_header(content_type)
  assert content_type == "multipart/form-data"

  stream = environ["wsgi.input"]
  boundary = content_params.get("boundary")
  charset = content_params.get("charset", "utf8")

  parser = MultipartParser(stream, boundary, charset)
  for part in parser:
    if part.filename:
      print(f"{part.name}: File upload ({part.size} bytes)")
      part.save_as(...)
    elif part.size < 1024:
      print(f"{part.name}: Text field ({part.value!r})")
    else:
      print(f"{part.name}: Test field, but too big to print :/")

Non-blocking parser: PushMultipartParser

The MultipartParser handles IO and file buffering for you, but relies on blocking APIs. If you need absolute control over the parsing process and want to avoid blocking IO at all cost, then have a look at PushMultipartParser, the low-level non-blocking incremental multipart/form-data parser that powers all the other parsers in this library:

from multipart import PushMultipartParser, MultipartSegment

async def process_multipart(reader: asyncio.StreamReader, boundary: str):
  with PushMultipartParser(boundary) as parser:
    while not parser.closed:

      chunk = await reader.read(1024*64)
      for result in parser.parse(chunk):

        if isinstance(result, MultipartSegment):
          print(f"== Start of segment: {result.name}")
          if result.filename:
            print(f"== Client-side filename: {result.filename}")
          for header, value in result.headerlist:
            print(f"{header}: {value}")
        elif result:  # Result is a non-empty bytearray
          print(f"[received {len(result)} bytes of data]")
        else:         # Result is None
          print(f"== End of segment")

License

Code and documentation are available under MIT License (see LICENSE).