Hi,
This looks like a great project, and I'm trying to use it with a Tornado server to handle files uploaded as multipart/form-data. It looks like it could be a perfect fit. However, I've run into a show-stopping crash that I've spent some time trying to solve, hoping it would be obvious, to no avail.
Tornado somewhat recently introduced a new feature: a decorator for its `RequestHandler` class called `@stream_request_body`. If you're not familiar with Tornado, just know that the decorator streams received data chunks to a method where you can write the code to handle those chunks. In my case, I'm immediately writing the chunks to a `MultipartParser` object. However, if there is more than one chunk (which seems to happen for payloads whose Content-Length is larger than 64 KiB), there seems to be a "random chance" that the following error will occur:
```
Traceback (most recent call last):
  [ ... Tornado & my stuff ... ]
  File "...src/multipart/multipart/multipart.py", line 1055, in write
    l = self._internal_write(data, data_len)
  File "...src/multipart/multipart/multipart.py", line 1314, in _internal_write
    c = data[i]
IndexError: string index out of range
```
In the above traceback, `i` always seems to be equal to `len(data)`, so it is an off-by-one error that occurs only sometimes. Here is a link to the line where this occurs. If I catch this exception, the same error may occur an arbitrary number of additional times, and the final file will have lost data (I haven't examined exactly how much).
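For context, here is roughly how I'm feeding the data into the parser. This is a simplified sketch from memory: the boundary extraction is abbreviated, and the callback wiring reflects my assumptions about the `callbacks` dict. I'll post my actual code once I've cleaned it up.

```python
import re

import tornado.web
from multipart.multipart import MultipartParser


@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Pull the boundary out of the Content-Type header
        # (error handling omitted for brevity).
        content_type = self.request.headers.get("Content-Type", "")
        boundary = re.search(r"boundary=(.+)$", content_type).group(1)
        self.parser = MultipartParser(boundary, callbacks={
            "on_part_data": self.on_part_data,
        })

    def data_received(self, chunk):
        # Each chunk Tornado hands me goes straight into the parser;
        # the IndexError above is raised from inside this call.
        self.parser.write(chunk)

    def on_part_data(self, data, start, end):
        # Handle the part's bytes here (real destination omitted).
        pass

    def post(self):
        self.finish()
```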
Remember that I said there is a "random chance" of it occurring: sometimes it doesn't occur at all for the same file that has been seen to fail at other times.
I also occasionally, though more rarely, get logger `WARNING` messages that look like this:

```
Did not find boundary character 'i' at index 2
```

I get a message like the above for each character in the boundary, and always at index 2. I think this is just a different manifestation of the same error.
I have taken a top-down approach to debugging this, but I've not been able to find anything yet. When I write the chunks of data directly to a file (instead of into the `MultipartParser` object), the end result is consistent and correct: the only differences between uploads are at the head and tail of the file, where the boundaries are, and no data is missing from the original file. This rules out a Tornado bug. I'm also fairly sure I'm not using the parser incorrectly, though I can furnish example code. At this point, I suspect the error is in the `_internal_write` function, specifically in the Boyer-Moore-Horspool algorithm implementation. I also think the bug only occurs for certain sizes of the first few chunks, which differ between requests. In my tests, I've been using a ~6 MiB text file, and I've confirmed that the data is chunked differently each time: the beginning chunks have different sizes, then the sizes even out, and the last chunk is usually smaller. This chunking happens with both Firefox and cURL, uploading locally to the Tornado server.
I plan to delve into the algorithm code, but I figured I would try to get your input first, in hopes that you can think of or implement a solution quicker than I can. As I said, I can also provide a working example after I clean up the code a little. I think my next step is to emulate different-sized chunking from a file, just to confirm my suspicion; a sketch of what I have in mind follows. Since your consistently-chunked example code never has this issue, with any file I try, I am really leaning toward the idea that chunk consistency is a factor here.
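Here is the kind of harness I have in mind. Treat it as an untested sketch: the payload layout, the boundary handling, and the `on_part_data` callback signature are all my assumptions about the API rather than verified behavior.

```python
import random

from multipart.multipart import MultipartParser

BOUNDARY = "testboundary"


def build_payload(body):
    # A minimal single-part multipart/form-data payload with a known boundary.
    return (
        "--" + BOUNDARY + "\r\n"
        'Content-Disposition: form-data; name="f"; filename="f.txt"\r\n'
        "\r\n" + body + "\r\n"
        "--" + BOUNDARY + "--\r\n"
    )


def parse_in_random_chunks(payload, seed):
    # Feed the payload to the parser in unevenly sized chunks, the way
    # Tornado appears to deliver them, and collect what comes back out.
    received = []
    parser = MultipartParser(BOUNDARY, callbacks={
        "on_part_data": lambda data, start, end: received.append(data[start:end]),
    })
    rng = random.Random(seed)
    i = 0
    while i < len(payload):
        size = rng.randint(1, 64 * 1024)
        parser.write(payload[i:i + size])
        i += size
    return "".join(received)


body = "x" * (6 * 1024 * 1024)  # stand-in for my ~6 MiB test file
payload = build_payload(body)
for seed in range(100):
    assert parse_in_random_chunks(payload, seed) == body, "data lost, seed %d" % seed
```

If some seeds raise the `IndexError` (or trip the data-loss assertion) while others pass, that would confirm that the chunk sizes are the trigger.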