ptpb / pb
pb is a formerly-lightweight pastebin and url shortener
License: Other
#43 broke this. We need to do something cleverer than the string validator.
Though we proved last night that all of our current procedures other than paste_get_stats
have something approaching constant-time complexity, I think the complexity of our current code is rather ridiculous.
Here are the problems so far:
We did this because it made more sense than breaking indexing, or making special metadata columns that ultimately identify what kind of paste this is, then having to twiddle metadata bits in queries.
There are multiple reasons for this, mostly:

1. SQL injection is impossible, because no SQL is ever executed from the application
2. as a result, we also get the benefit of only parsing SQL once (on schema load), which makes queries faster
Here's why we got rid of the previous ORM (sqlalchemy):
Enough said.
However, as you'll notice, we've only actually fixed the 'slow' part; clumsy is back, only in a different form. The way to fix the clumsiness (and the problem in its entirety), in my opinion, is to replace SQL entirely.
Mongo in particular fits our data model very well. First read the terminology comparison. In the first few seconds we learn:
If you're not convinced, here's how MySQL doesn't help us whatsoever:
All of these facts combined, I'd like to replace entirely our use of MySQL with MongoDB.
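As a sketch of the fit: a paste collapses naturally into a single Mongo-style document, with no join tables or metadata-bit twiddling. Everything below is illustrative; the field names are hypothetical and a plain dict stands in for a pymongo document:

```python
import hashlib
import uuid
from datetime import datetime, timezone

def make_paste_document(content: bytes) -> dict:
    """Sketch of a paste as one self-contained document; field names are hypothetical."""
    return {
        "digest": hashlib.sha1(content).hexdigest(),  # digests are already indexed today
        "content": content,
        "uuid": str(uuid.uuid4()),                    # handed to the client for PUT/DELETE
        "date": datetime.now(timezone.utc),
    }

doc = make_paste_document(b"boats and hoes")
# With pymongo this would simply be: db.pastes.insert_one(doc)
```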
As we've demonstrated quite prophetically, when large changes like pastes → paste+urls and b64 → b66 happen, things tend to break in completely preventable ways.
The problem is that testing is tedious and boring. This shouldn't be the case.
Problem, meet solution: automate testing. All pull requests should ideally never introduce regressions, as all the possibilities would have been exhaustively checked before a merge is even possible.
Flask has a few suggestions on how to do this too.
Low-hanging fruit:

- /id.ext
- change /<ID>/<LEXER> to /<id>/<lexer>
We should also probably rewrite the whole page so the first-level item is the endpoint, and then we talk about the verbs and what they do.
Something like:
/: GET, POST
=============
== GET ==
This is what get does.
== POST ==
This is what post does.
``/<id>``: GET, PUT, DELETE
=======================
Blah blah
Hopefully you get the idea. We might also add an initial SUMMARY at the beginning that gives full examples on just making/getting pastes.
I'm not sure if we want CSS for this in general, or if index.rst should be adjusted, or if this isn't actually a problem (if we're ok with the default browser pre style).
See the https://ptpb.pw/QQQP.jpg and https://ptpb.pw/QQQ_/py#L-24 lines from the rendered index.
First reported by DimeShake on reddit.
Create a new set of /<sha1> routes with the same features as referencing pastes by base66 id.
This will be extremely trivial to do--digests are already present and indexed.
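A minimal sketch of the idea, with an in-memory dict standing in for the real indexed digest column:

```python
import hashlib

pastes = {}  # digest -> content; stand-in for the real indexed digest column

def store(content: bytes) -> str:
    """Store a paste keyed by its sha1 hexdigest, as the index already does."""
    digest = hashlib.sha1(content).hexdigest()
    pastes[digest] = content
    return digest

def get_by_sha1(digest: str) -> bytes:
    # The /<sha1> route would do exactly this lookup against the existing index
    return pastes[digest]

d = store(b"hello")
assert get_by_sha1(d) == b"hello"
```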
I hate the ridiculous index.html that we have now; rewrite this nonsense in something less ridiculous. We also get fun things like table of contents, cross-references, etc…
Use the reStructuredText stuff from #67 for HTML rendering. We then rely on caching to avoid having to regenerate the HTML (from the perspective of the client) on each request.
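A sketch of the caching side, with a trivial renderer standing in for docutils so the example stays self-contained (pb's real cache wiring is not decided here; keying on a content digest is one option):

```python
import functools
import hashlib

@functools.lru_cache(maxsize=1024)
def _render_cached(digest: str, source: str) -> str:
    # Real pb would call docutils here, e.g.
    # publish_parts(source, writer_name='html')['html_body'];
    # a trivial renderer keeps this sketch self-contained.
    return "<pre>{}</pre>".format(source)

def render(source: str) -> str:
    # Key the cache on a content digest so identical pastes share one render.
    digest = hashlib.sha1(source.encode()).hexdigest()
    return _render_cached(digest, source)

html = render("Title\n=====\n")
assert render("Title\n=====\n") is html  # second call is a cache hit
```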
I hear people want this.
The idea is:
Why I would be ok with not doing this:
Currently it takes a few seconds to upload screencaps; any chance we can speed things up?
@buhman had linked the Python Profilers docs; I suppose I should try them out
EDIT: [nuked stupid 'sanic' image]
21:31:12 +buhman> I'll probably just make a 'shorturl' button on /f that re-uses the same form thing
sgtm
Some pastebins offer the ability for users to set a time in the future when the paste should be deleted (though some of them offer this only in preset intervals; just accepting the count in seconds would be preferable in my opinion). Additionally, an alternative method could be introduced: accepting a short passphrase which, when passed to the server, would request the paste's deletion.
Neither of these would override the need for the server to delete pastes as necessary, rather, they would just offer a simple avenue for users to request the deletion of their pastes prematurely.
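A sketch of both mechanisms, with hypothetical field names (expires, delete_key) and the passphrase stored hashed and compared in constant time:

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

def make_paste(content, ttl_seconds=None, passphrase=None):
    """Hypothetical fields: an expiry timestamp and a hashed deletion passphrase."""
    now = datetime.now(timezone.utc)
    return {
        "content": content,
        "expires": now + timedelta(seconds=ttl_seconds) if ttl_seconds else None,
        "delete_key": hashlib.sha256(passphrase.encode()).hexdigest()
                      if passphrase else None,
    }

def is_expired(paste, now=None):
    now = now or datetime.now(timezone.utc)
    return paste["expires"] is not None and now >= paste["expires"]

def may_delete(paste, passphrase):
    # Constant-time comparison, since the passphrase acts as a shared secret.
    key = hashlib.sha256(passphrase.encode()).hexdigest()
    return paste["delete_key"] is not None and hmac.compare_digest(key, paste["delete_key"])

p = make_paste("hi", ttl_seconds=60, passphrase="s3cret")
```

The server would still sweep expired pastes as necessary; these checks only let users request early deletion.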
Hi,
I was just curious if there is any special reason you went with a yaml configuration file instead of using the approach Flask supports: http://flask.pocoo.org/docs/0.10/config/#configuring-from-files
Thanks
I would expect https://ptpb.pw/AABV/bf#-109 to take me to line 109 of the paste, but it always takes me to the top.
buhman said everything is in scope, and since a URL shortener is supposedly offered now, the opposite would be nice (limited to ~450 characters so they can be used in IRC messages).
One of the most critical things about a pastebin is that it must always work, even through the zombie apocalypse.
People want to put their shit up on a pastebin yesterday, not fuck around with 6 alternatives trying to find the one that happens to be working now. We need to deliver on that, and be the pastebin that everyone uses because it is the only pastebin that works when all of the others fail.
There are a few considerations:
If nothing else fails, one inevitability is that we will run out of paste IDs or disk space or both. We should make sure that this never gets in the way of people being able to make new pastes--this should be actively monitored and actioned upon proactively, and not only after someone reports actual breakage.
Even the brief time it takes to deploy code and restart gunicorn is too long, let alone any possible un-planned breakage; outages should be treated as if people die for every second ptpb is unavailable. We should decide how we want to handle this: for code deployments, we should have an automated system that drains all incoming connections and shifts load elsewhere. We should decide the exact mechanisms we want to use to accomplish this.
Despite my mixed feelings about Datacate, it's a serious problem if our/their infrastructure goes down, which it has at least twice since September 2014, both times for at least an hour. We should have ptpb deployed in multiple geographically separate locations, and have automatic failover and monitoring mechanisms to prepare for the catastrophic failure of an entire datacenter.
When I try to upload a file (tried jpg and png) via the html form, I receive "503 Backend fetch failed."
22:14:10 +GermainZ pomf.se supports 50M!
22:14:15 +GermainZ ptpb.pw must be superior!!!11
00:07:54 +krosos "bug report: ptpb lacks superiority. needs to outnumber mega.com. -- GermainZ"
In the rewrite, we'll need to specially handle >16M files and store them in GridFS instead.
I'd also like to see c= removed from the POST request before we do this, which would remove the need to parse the entire request body (avoiding loading files entirely in memory before writing to the DB).
This should be generated dynamically by pygments when there is a cache miss rather than a static file.
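For reference, pygments can already emit the stylesheet on demand via HtmlFormatter.get_style_defs; the cache-miss wiring around it is left open here:

```python
from pygments.formatters import HtmlFormatter

def highlight_css(style: str = "default") -> str:
    # Generate the stylesheet for a given pygments style on demand;
    # a cache in front (varnish, or an in-process lru_cache) keeps this
    # from running on every request.
    return HtmlFormatter(style=style).get_style_defs(".highlight")

css = highlight_css()
```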
#48 breaks the current vanity paste-id mechanism/hack. There are two ways to un-fuck this (provided we want to un-fuck it):
- roll our own auto_increment that has the behavior we want (hard, likely slow)
- make another secondary index (easy, fast, could cause collisions)
% curl -F c=@- https://ptpb.pw <<< 'boats and hoes asdf'
url: https://ptpb.pw/QQQe?filename=-
uuid: 4556bd33-4cc3-4aad-bf82-58f1de76de7a
This is not a cache invalidation thing (though I suspect cache invalidation may be simultaneously broken for this), but a db-actually-never-gets-updated thing.
Needs better tests to catch this.
We should intelligently cache things. We currently have varnish in front of ptpb, but this will cause problems if somebody modifies a paste: varnish will continue serving the cached copy until the max time expires, which will look weird and feel clunky to users.
We should do something like this:
http://flask.pocoo.org/snippets/95/
It would be a nice feature to highlight the specified line number (e.g. https://ptpb.pw/AABV/bf#-96) once #37 is fixed.
There are a few websites (which are essentially limited pastebins) which have allowed you to upload a session recorded with ttyrec and share it (just like you would share a paste in pb now). The features these sites offer include play/pause/restart (some even include options to make playback faster or slower than real-time) along with the full text of the session being copyable (this should be one of the easiest features, considering that the format ttyrec outputs actually includes all the text).

pb, as it does not truly care about what kind of data is pasted to it, can already take ttyrec files as pastes. And users could fetch one of these files and pipe it into ttyplay, but having a handler (similar to the rst handler) that allows for in-browser playback of these pastes could be incredibly helpful.
This here pastey thing y'all have made is quite fancy. Yet, I find it lacking the ability to host my whole website. It would be great if I could paste some HTML, or RestructuredText, and then use a special URL parameter when viewing the paste to have it automagically render as a webpage rather than the content of the paste.
Fantabulous!
I can't seem to find the license. Would you add one, please? See http://choosealicense.com/ for more information.
Use a hash of all data to make some sort of favicon
Fixit.
Add support for these verbs.
We'll do 'authentication' by giving the user a one-time UUID that they need to spit back at us to do the delete/put.
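A minimal sketch of that flow, with an in-memory dict standing in for the real table:

```python
import hmac
import uuid

pastes = {}  # id -> (uuid, content); stand-in for the real table

def post(paste_id, content):
    token = str(uuid.uuid4())      # handed back to the client exactly once
    pastes[paste_id] = (token, content)
    return token

def delete(paste_id, token):
    stored, _ = pastes[paste_id]
    # compare_digest avoids leaking the token through timing differences
    if not hmac.compare_digest(stored, token):
        return False
    del pastes[paste_id]
    return True

t = post("QQQe", b"asdf")
assert not delete("QQQe", str(uuid.uuid4()))  # wrong token is rejected
assert delete("QQQe", t)
```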
Example request:
POST / HTTP/1.1
User-Agent: curl/7.39.0
Host: localhost:9000
Accept: */*
Content-Length: 141
Expect: 100-continue
Content-Type: multipart/form-data; boundary=------------------------d5b5f1dac8e938df
--------------------------d5b5f1dac8e938df
Content-Disposition: form-data; name="c"
asdf
--------------------------d5b5f1dac8e938df--
Generated via:
curl -F "c=<-" localhost:9000 <<< asdf
multipart messages consist of boundaries that each contain some Content-Disposition.

There are two problems with the way Werkzeug handles form-data in particular:

- Without the filename disposition extension parameter, the form value is decoded entirely into memory (a BytesIO), regardless of size--though overriding this behavior is possible, doing this in a useful way would be fairly difficult.
- Handing the stream to sqlalchemy or a naive mysql-python-connector will nevertheless result in a userspace copy of the entirety of the stream to memory.

While with hackery (and a longer curl command), the somewhat-more-favorable filename form decode method could be used without breaking HTML form input, doing so is not worth it, as Werkzeug is still inherently inefficient and broken.
At the moment, pb supports pastes up to 60 MiB in size. That is, 62914560 bytes. Look how awful that number is! Would it not be dramatically rounder and prettier to allow for 64 MiB; i.e., 67108864 (or 2^26) bytes?
Because @polyzen is too shy to open this himself, I'm opening it.
It's desired to execute code and provide stdout similar to how codepad does things.
I imagine we'd sandbox execution in containers, where each execution enters a copy-on-write btrfs snapshot. We could limit resource utilization and execution time via normal container mechanisms.
I've heard something about the html form being bloat; so this wouldn't involve writing anything AJAXy, nor would it involve blocking the POST until the paste is complete. Instead we'd return http code 202 ACCEPTED on POST. It's likely also an appropriate response for GET when we're (still) not ready yet.
We'll also need to make sure we do cache invalidation when we're ready. Likely the container master general would do an internal callback request to give the results.
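The sandboxing itself would live in containers as described above; as a much weaker illustrative stand-in, here's a subprocess sketch with just a wall-clock limit (python's -I isolated mode and a timeout are the only isolation shown, nothing like a real container boundary):

```python
import subprocess
import sys

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    """Toy stand-in for the container sandbox: run the code in a separate
    interpreter process with a hard wall-clock limit. A real deployment
    would add the container + copy-on-write btrfs snapshot isolation
    described above, plus memory/cpu limits."""
    try:
        out = subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode
            capture_output=True, timeout=timeout, text=True,
        )
        return out.stdout
    except subprocess.TimeoutExpired:
        return "(killed: time limit exceeded)"

print(run_untrusted("print(2 + 2)"))
```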
Questions:
Traceback (most recent call last):
File "/root/pbenv/src/flask/flask/app.py", line 1969, in __call__
return self.wsgi_app(environ, start_response)
File "/root/pbenv/src/flask/flask/app.py", line 1953, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/root/pbenv/src/flask/flask/app.py", line 1530, in handle_exception
reraise(exc_type, exc_value, tb)
File "/root/pbenv/src/flask/flask/_compat.py", line 33, in reraise
raise value
File "/root/pbenv/src/flask/flask/app.py", line 1950, in wsgi_app
response = self.full_dispatch_request()
File "/root/pbenv/src/flask/flask/app.py", line 1604, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/root/pbenv/src/flask/flask/app.py", line 1507, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/root/pbenv/src/flask/flask/_compat.py", line 33, in reraise
raise value
File "/root/pbenv/src/flask/flask/app.py", line 1602, in full_dispatch_request
rv = self.dispatch_request()
File "/root/pbenv/src/flask/flask/app.py", line 1588, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/srv/pb/pb/paste/views.py", line 51, in post
paste = model.insert(stream, label=label)
File "/srv/pb/pb/paste/model.py", line 47, in insert
get_db().pastes.insert(d)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/collection.py", line 1926, in insert
check_keys, manipulate, write_concern)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/collection.py", line 436, in _insert
self.codec_options, sock_info)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/pool.py", line 237, in legacy_write
return helpers._check_gle_response(response)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/helpers.py", line 227, in _check_gle_response
raise DuplicateKeyError(details["err"], code, result)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error index: pb.pastes.$label_1 dup key: { : "~polyzen" }
Reported by @HalosGhost, though he was too shy to make an issue for it.
The main problem is that Generic.Output is too dull, which makes it hard to read in large quantities. Ideally we wouldn't add any extra color styles to null-lexer text at all.
Should basically look at how other lexers spit out un-parseable text (they do not use Generic.Output).
The reference host of pb (ptpb) is great, but there are some cases where I would find it helpful to be able to install my own copy of pb to be hosted wherever I so choose; e.g., I could even imagine deploying pb as a private service to be used on an internal network in the company I work for.

Offering a PKGBUILD to build and package pb for Arch would be deeply appreciated.
This predates #104, but #104 made this worse:
There are multiple ways to reference the same paste--on PUT/DELETE we would pick one of those, put it in the Location header, and invalidate_cache() would then use that to send a new BAN request to varnish.
We should replace the current invalidation mechanism with something actually sane.
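One hypothetical 'sane' shape: derive a single ban pattern that covers every alias of a paste, rather than banning whichever URL happened to land in Location. The BAN method and X-Ban-Url header below are assumptions that would need matching VCL on the varnish side:

```python
import http.client

def ban_pattern(paste_id: str) -> str:
    # One regex covering every way the paste can be referenced:
    # /id, /id.ext, /id/lexer, ...
    return r"^/{}([./].*)?$".format(paste_id)

def invalidate_cache(paste_id: str, host: str = "127.0.0.1", port: int = 80) -> int:
    """Sketch only: assumes a vcl_recv snippet that maps the custom BAN
    method + X-Ban-Url header onto varnish's ban() function."""
    conn = http.client.HTTPConnection(host, port, timeout=2)
    conn.request("BAN", "/", headers={"X-Ban-Url": ban_pattern(paste_id)})
    status = conn.getresponse().status
    conn.close()
    return status
```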
It would be interesting to enable some kind of compartmentalized synchronizing of pb databases; this way the pb database at machine X could share certain content with machine Y, allowing redundancy in the event of, say, a network failure. You could mark certain files or directories as synchronizing with pb databases on multiple other machines. This could really allow for a lightweight, scalable, and easily implemented network for, say, projects and collaborations.
After #104, we lost test coverage, and also exposed a few tests that were broken.
This should be fixed.
+buhman │ GermainZ: I also anticipate allowing css overrides
⤷ │ not sure how to make a sensible http verb thing to do that though
⤷ │ or if it should just be a query string or some shit
⤷ │ ?css=QQxy
⤷ │ or similar
⤷ │ the coolest shit though
⤷ │ is that we could make everything query string shit
⤷ │ then make shorthand for 'turn this back into a shorturl that points to my long query string'
So basically:
- ptpb.pw/r/pasteid?css=someotherpasteid to use someotherpasteid as a stylesheet for rendering pasteid.
- ptpb.pw/QQQQ → ptpb.pw/r/pasteid?css=someotherpasteid
Here's what we aren't testing (yet):
pb.url.*
pb.paste.handler.*
pb.paste.views.{highlight_css,list_lexers,stats,delete,put,form,filter_rst}
pb.util.highlight
Actually a fairly short list.
For things like {delete,put}, we need to be able to reliably parse the UUID from responses, which is not currently possible.
All of these tests should probably be 'indirect' tests via what we expose in the API (and not necessarily directly calling handlers for example).
From #62:
https://www.varnish-cache.org/docs/4.0/reference/vcl.html#backend-definition
https://www.varnish-cache.org/docs/4.0/reference/vcl.html#probes
https://www.varnish-cache.org/docs/4.0/reference/vmod_directors.generated.html
Examples: https://www.varnish-cache.org/docs/4.0/users-guide/vcl-backends.html#directors
rfc3986:
2.3. Unreserved Characters
Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
These will be used in our custom base66 encoder. Ideally we'd also use a simple replacement cipher or similar so we get 'cool-looking' urls: instead of '001', '002', etc., we'd get 2xz and 8y~ and similar.

This would also remove the dependency on bitstring.
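A sketch of such an encoder over the unreserved alphabet, plus a toy substitution step (the multiplier choice is illustrative, not a proposal):

```python
# The full rfc3986 unreserved alphabet: 26 + 26 + 10 + 4 = 66 symbols.
B66 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
       "abcdefghijklmnopqrstuvwxyz"
       "0123456789-._~")

def b66_encode(n: int) -> str:
    if n == 0:
        return B66[0]
    out = []
    while n:
        n, r = divmod(n, 66)
        out.append(B66[r])
    return "".join(reversed(out))

def b66_decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 66 + B66.index(ch)
    return n

def scramble(n: int, modulus: int = 66**4) -> int:
    # Toy replacement cipher so sequential ids don't look sequential;
    # 48271 is coprime to 66**4 (no factor of 2, 3, or 11), so this is
    # invertible via the modular inverse.
    return (n * 48271) % modulus

assert b66_decode(b66_encode(123456)) == 123456
```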
Opening because @GermainZ is too shy.
Questions:
The answer to that question basically depends on whether we want to make a second uuid column, or re-use the current uuid column. There are two ways I see this working:
1. Store private in request.form.keys() or similar into the db. GET then allows both by-uuid and by-id: by-uuid always matches, whereas by-id matches only if not private. The concern with this method is that it would allow anyone with the secret URL to also modify/delete the paste. This could be avoided by making PUT/DELETE match only not-secret pastes, making secret pastes immutable.
2. PRIMARY_KEY is still id, but instead of mediumint, it's binary(16) again, like things were way back in ebe35cd. There is still a separate uuid column, which could be renamed to something more descriptive. GET would have two separate routes, one for id and another for uuid; these would query the different tables. PUT/DELETE would probably need/use fancy union stuff.

Both ideas have their strengths and complexities. What do.
As it was revealed that people depend on their precious existing pastes for things like factoids, I'd like to avoid losing these. I quickly set up a nginx hack that uses a 404 handler which then sends the request to an instance of the pre-mongo pb. But that's not particularly good.
Better:
Convert all existing pastes (by base66 id) to vanity pastes. This will take quite a bit of work.
I get asked about this a lot, so I should probably document it.
One of the issues in solving #107 is URL routing.
Currently we rely on regular expressions to figure out which kind of get we want, but this is actually no longer necessary.
Instead, this should be refactored into a single route/converter that accepts all paths, and finds any paste (or redirect, which would be refactored into a special type of paste) that matches in any field in any way.
As a result of merging redirects with pastes, we could go back to everything being 4-character IDs (as returned by the API), while allowing retrieval with any number of characters.
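A sketch of what the single catch-all lookup might look like, with plain dicts standing in for documents and entirely made-up matching rules (prefix length, field set):

```python
pastes = [
    # Stand-ins for documents; a redirect would just be a paste with a
    # 'redirect' field. The digest here is arbitrary sample data.
    {"id": "QQQe", "digest": "3da541559918a808c2402bba5012f6c60b27661c",
     "label": "~polyzen", "content": b"asdf"},
]

def find_paste(path: str):
    """One catch-all lookup instead of per-pattern routes: strip any
    trailing lexer/extension, then match the remainder against every
    identifying field (id prefix, digest, vanity label)."""
    name = path.lstrip("/").split("/")[0].split(".")[0]
    for p in pastes:
        if (p["id"].startswith(name) and len(name) >= 4) \
                or p["digest"] == name or p["label"] == name:
            return p
    return None

assert find_paste("/QQQe/py") is not None   # id + lexer
assert find_paste("/~polyzen") is not None  # vanity label
assert find_paste("/nope") is None
```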
I've heard requests for:
LaTeX would be pretty hard, but the rest should be easy.
Do it.
We should instead be using werkzeug.wsgi.get_host() to calculate it in some places, and flask.helpers.url_for() in others.
Not 9000% sure if we actually want this everywhere.