Hi,
This looks like a great project, and I'm trying to use it with a Tornado server to handle files uploaded as multipart/form-data. It looks like it could be a perfect fit. However, I've run into a show-stopping crash that I've spent some time trying to solve, hoping it would be obvious, to no avail.
Tornado somewhat recently introduced a new feature: a decorator for its `RequestHandler` class called `@stream_request_body`. If you're not familiar with Tornado, just know that the decorator streams received data chunks to a method where you can write the code to handle those chunks. In my case, I'm immediately writing the chunks to a `MultipartParser` object. However, if there is more than one chunk (which seems to happen for payloads whose Content-Length is larger than 64 KiB), there seems to be a "random chance" that the following error will occur:
```
Traceback (most recent call last):
  [ ... Tornado & my stuff ... ]
  File "...src/multipart/multipart/multipart.py", line 1055, in write
    l = self._internal_write(data, data_len)
  File "...src/multipart/multipart/multipart.py", line 1314, in _internal_write
    c = data[i]
IndexError: string index out of range
```
In the above traceback, `i` always seems to be equal to `len(data)`, so it is an off-by-one error that occurs only sometimes. Here is a link to the line where this occurs. If I catch this exception, the same error may occur an arbitrary number of additional times, and the final file will have lost data (I haven't examined exactly how much).
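For context, here is roughly how I'm feeding the data into the parser. This is a simplified sketch from memory: the boundary extraction is abbreviated, and the callback wiring reflects my assumptions about the `callbacks` dict. I'll post my actual code once I've cleaned it up.

```python
import re

import tornado.web
from multipart.multipart import MultipartParser


@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Pull the boundary out of the Content-Type header
        # (error handling omitted for brevity).
        content_type = self.request.headers.get("Content-Type", "")
        boundary = re.search(r"boundary=(.+)$", content_type).group(1)
        self.parser = MultipartParser(boundary, callbacks={
            "on_part_data": self.on_part_data,
        })

    def data_received(self, chunk):
        # Each chunk Tornado hands me goes straight into the parser;
        # the IndexError above is raised from inside this call.
        self.parser.write(chunk)

    def on_part_data(self, data, start, end):
        # Handle the part's bytes here (real destination omitted).
        pass

    def post(self):
        self.finish()
```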
Remember that I said there is a "random chance" of it occurring: sometimes it doesn't occur at all for the same file that has been seen to fail at other times.
I also occasionally, though more rarely, get logger `WARNING` messages that look like this:

```
Did not find boundary character 'i' at index 2
```

I get a message like the above for each character in the boundary, and always at index 2. I think this is just a different manifestation of the same error.
I have taken a top-down approach to debugging this, but I've not been able to find anything yet. When I write the chunks of data directly to a file (instead of into the `MultipartParser` object), the end result is consistent and correct: the only differences between uploads are at the head and tail of the file, where the boundaries are, and no data is missing from the original file. This rules out a Tornado bug. I'm also fairly sure I'm not using the parser incorrectly, though I can furnish example code. At this point, I suspect the error is in the `_internal_write` function, specifically in the Boyer-Moore-Horspool algorithm implementation. I also think the bug only occurs for certain sizes of the first few chunks, which differ between requests. In my tests, I've been using a ~6 MiB text file, and I've confirmed that the data is chunked differently each time: the beginning chunks have different sizes, then the sizes even out, and the last chunk is usually smaller. This chunking happens with both Firefox and cURL, uploading locally to the Tornado server.
I plan to delve into the algorithm code, but I figured I would try to get your input first, in hopes that you can think of or implement a solution quicker than I can. As I said, I can also provide a working example after I clean up the code a little. I think my next step is to emulate different-sized chunking from a file, just to confirm my suspicion; a sketch of what I have in mind follows. Since your consistently-chunked example code never has this issue, with any file I try, I am really leaning toward the idea that chunk consistency is a factor here.
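Here is the kind of harness I have in mind. Treat it as an untested sketch: the payload layout, the boundary handling, and the `on_part_data` callback signature are all my assumptions about the API rather than verified behavior.

```python
import random

from multipart.multipart import MultipartParser

BOUNDARY = "testboundary"


def build_payload(body):
    # A minimal single-part multipart/form-data payload with a known boundary.
    return (
        "--" + BOUNDARY + "\r\n"
        'Content-Disposition: form-data; name="f"; filename="f.txt"\r\n'
        "\r\n" + body + "\r\n"
        "--" + BOUNDARY + "--\r\n"
    )


def parse_in_random_chunks(payload, seed):
    # Feed the payload to the parser in unevenly sized chunks, the way
    # Tornado appears to deliver them, and collect what comes back out.
    received = []
    parser = MultipartParser(BOUNDARY, callbacks={
        "on_part_data": lambda data, start, end: received.append(data[start:end]),
    })
    rng = random.Random(seed)
    i = 0
    while i < len(payload):
        size = rng.randint(1, 64 * 1024)
        parser.write(payload[i:i + size])
        i += size
    return "".join(received)


body = "x" * (6 * 1024 * 1024)  # stand-in for my ~6 MiB test file
payload = build_payload(body)
for seed in range(100):
    assert parse_in_random_chunks(payload, seed) == body, "data lost, seed %d" % seed
```

If some seeds raise the `IndexError` (or trip the data-loss assertion) while others pass, that would confirm that the chunk sizes are the trigger.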