ptpb / pb
pb is a formerly-lightweight pastebin and url shortener
License: Other
#43 broke this. We need to do something cleverer than the string validator.
Though we proved last night that all of our current procedures other than paste_get_stats
have something approaching constant-time complexity, I think the complexity of our current code is rather ridiculous.
Here are the problems so far:
We did this because it made more sense than breaking indexing, or making special metadata columns that ultimately identify what kind of paste this is, then having to twiddle metadata bits in queries.
There are multiple reasons for this, mostly:

1. SQL injection is impossible, because no SQL is ever executed from the application
2. as a result, we also get the benefit of only parsing SQL once (on schema load), which makes queries faster
Here's why we got rid of the previous ORM (sqlalchemy):
Enough said.
However, as you'll notice, we've only actually fixed the 'slow' part; clumsy is back, only in a different form. The way to fix the clumsiness (and the problem in its entirety), in my opinion, is to replace SQL entirely.
Mongo in particular fits our data model very well. First read the terminology comparison. In the first few seconds we learn:
If you're not convinced, here's how MySQL doesn't help us whatsoever:
All of these facts combined, I'd like to replace entirely our use of MySQL with MongoDB.
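As a sketch of the fit: a paste collapses naturally into a single Mongo-style document, with no join tables or metadata-bit twiddling. Everything below is illustrative; the field names are hypothetical and a plain dict stands in for a pymongo document:

```python
import hashlib
import uuid
from datetime import datetime, timezone

def make_paste_document(content: bytes) -> dict:
    """Sketch of a paste as one self-contained document; field names are hypothetical."""
    return {
        "digest": hashlib.sha1(content).hexdigest(),  # digests are already indexed today
        "content": content,
        "uuid": str(uuid.uuid4()),                    # handed to the client for PUT/DELETE
        "date": datetime.now(timezone.utc),
    }

doc = make_paste_document(b"boats and hoes")
# With pymongo this would simply be: db.pastes.insert_one(doc)
```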
As we've demonstrated quite prophetically, when large changes like pastes → paste+urls and b64 → b66 happen, things tend to break in completely preventable ways.
The problem is that testing is tedious and boring. This shouldn't be the case.
Problem, meet solution: automate testing. All pull requests should ideally never introduce regressions, as all the possibilities would have been exhaustively checked before a merge is even possible.
Flask has a few suggestions on how to do this too.
Low-hanging fruit:

- /id.ext
- change /<ID>/<LEXER> to /<id>/<lexer>
We should also probably rewrite the whole page so the first-level item is the endpoint, and then we talk about the verbs and what they do.
Something like:
/: GET, POST
=============
== GET ==
This is what get does.
== POST ==
This is what post does.
``/<id>``: GET, PUT, DELETE
=======================
Blah blah
Hopefully you get the idea. We might also add an initial SUMMARY at the beginning that gives full examples on just making/getting pastes.
I'm not sure if we want CSS for this in general, or if index.rst should be adjusted, or if this isn't actually a problem (if we're ok with the default browser pre style).
See the https://ptpb.pw/QQQP.jpg and https://ptpb.pw/QQQ_/py#L-24 lines from the rendered index.
First reported by DimeShake on reddit.
Create a new set of /<sha1> routes with the same features as referencing pastes by base66 id.
This will be extremely trivial to do--digests are already present and indexed.
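A minimal sketch of the idea, with an in-memory dict standing in for the real indexed digest column:

```python
import hashlib

pastes = {}  # digest -> content; stand-in for the real indexed digest column

def store(content: bytes) -> str:
    """Store a paste keyed by its sha1 hexdigest, as the index already does."""
    digest = hashlib.sha1(content).hexdigest()
    pastes[digest] = content
    return digest

def get_by_sha1(digest: str) -> bytes:
    # The /<sha1> route would do exactly this lookup against the existing index
    return pastes[digest]

d = store(b"hello")
assert get_by_sha1(d) == b"hello"
```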
I hate the ridiculous index.html that we have now; rewrite this nonsense in something less ridiculous. We also get fun things like table of contents, cross-references, etc…
Use the reStructuredText stuff from #67 for HTML rendering. We then rely on caching to avoid having to regenerate the HTML (from the perspective of the client) on each request.
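A sketch of the caching side, with a trivial renderer standing in for docutils so the example stays self-contained (pb's real cache wiring is not decided here; keying on a content digest is one option):

```python
import functools
import hashlib

@functools.lru_cache(maxsize=1024)
def _render_cached(digest: str, source: str) -> str:
    # Real pb would call docutils here, e.g.
    # publish_parts(source, writer_name='html')['html_body'];
    # a trivial renderer keeps this sketch self-contained.
    return "<pre>{}</pre>".format(source)

def render(source: str) -> str:
    # Key the cache on a content digest so identical pastes share one render.
    digest = hashlib.sha1(source.encode()).hexdigest()
    return _render_cached(digest, source)

html = render("Title\n=====\n")
assert render("Title\n=====\n") is html  # second call is a cache hit
```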
I hear people want this.
The idea is:
Why I would be ok with not doing this:
Currently it takes a few seconds to upload screencaps; any chance we can speed things up?
@buhman had linked the Python Profilers docs; I suppose I should try them out
EDIT: [nuked stupid 'sanic' image]
21:31:12 +buhman> I'll probably just make a 'shorturl' button on /f that re-uses the same form thing
sgtm
Some pastebins offer the ability for users to set a time in the future when the paste should be deleted (though some of them offer this only in preset intervals; just accepting the count in seconds would be preferable in my opinion). Additionally, an alternative method could be introduced: accepting a short passphrase which, when passed to the server, would request the paste's deletion.
Neither of these would override the need for the server to delete pastes as necessary, rather, they would just offer a simple avenue for users to request the deletion of their pastes prematurely.
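A sketch of both mechanisms, with hypothetical field names (expires, delete_key) and the passphrase stored hashed and compared in constant time:

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

def make_paste(content, ttl_seconds=None, passphrase=None):
    """Hypothetical fields: an expiry timestamp and a hashed deletion passphrase."""
    now = datetime.now(timezone.utc)
    return {
        "content": content,
        "expires": now + timedelta(seconds=ttl_seconds) if ttl_seconds else None,
        "delete_key": hashlib.sha256(passphrase.encode()).hexdigest()
                      if passphrase else None,
    }

def is_expired(paste, now=None):
    now = now or datetime.now(timezone.utc)
    return paste["expires"] is not None and now >= paste["expires"]

def may_delete(paste, passphrase):
    # Constant-time comparison, since the passphrase acts as a shared secret.
    key = hashlib.sha256(passphrase.encode()).hexdigest()
    return paste["delete_key"] is not None and hmac.compare_digest(key, paste["delete_key"])

p = make_paste("hi", ttl_seconds=60, passphrase="s3cret")
```

The server would still sweep expired pastes as necessary; these checks only let users request early deletion.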
Hi,
I was just curious if there is any special reason you went with a yaml configuration file instead of using the approach Flask supports: http://flask.pocoo.org/docs/0.10/config/#configuring-from-files
Thanks
I would expect https://ptpb.pw/AABV/bf#-109 to take me to line 109 of the paste, but it always takes me to the top.
buhman said everything is in scope, and since a URL shortener is supposedly offered now, the opposite would be nice (limited to ~450 characters so they can be used in IRC messages).
One of the most critical things about a pastebin is that it must always work, even through the zombie apocalypse.
People want to put their shit up on a pastebin yesterday, not fuck around with 6 alternatives trying to find the one that happens to be working now. We need to deliver on that, and be the pastebin that everyone uses because it is the only pastebin that works when all of the others fail.
There are a few considerations:
If nothing else fails, one inevitability is that we will run out of paste IDs or disk space or both. We should make sure that this never gets in the way of people being able to make new pastes--this should be actively monitored and actioned upon proactively, and not only after someone reports actual breakage.
Even the brief time it takes to deploy code and restart gunicorn is too long, let alone any possible un-planned breakage; outages should be treated as if people die for every second ptpb is unavailable. We should decide how we want to handle this: for code deployments, we should have an automated system that drains all incoming connections and shifts load elsewhere. We should decide the exact mechanisms we want to use to accomplish this.
Despite my mixed feelings about Datacate, it's a serious problem if our/their infrastructure goes down, which it has at least twice since September 2014, both times for at least an hour. We should have ptpb deployed in multiple geographically separate locations, and have automatic failover and monitoring mechanisms to prepare for the catastrophic failure of an entire datacenter.
When I try to upload a file (tried jpg and png) via the html form, I receive "503 Backend fetch failed."
22:14:10 +GermainZ pomf.se supports 50M!
22:14:15 +GermainZ ptpb.pw must be superior!!!11
00:07:54 +krosos "bug report: ptpb lacks superiority. needs to outnumber mega.com. -- GermainZ"
In the rewrite, we'll need to specially handle >16M files and store them in GridFS instead.
I'd also like to see c= removed from the POST request before we do this, which would remove the need to parse the entire request body (avoiding loading files entirely in memory before writing to the DB).
This should be generated dynamically by pygments when there is a cache miss rather than a static file.
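For reference, pygments can already emit the stylesheet on demand via HtmlFormatter.get_style_defs; the cache-miss wiring around it is left open here:

```python
from pygments.formatters import HtmlFormatter

def highlight_css(style: str = "default") -> str:
    # Generate the stylesheet for a given pygments style on demand;
    # a cache in front (varnish, or an in-process lru_cache) keeps this
    # from running on every request.
    return HtmlFormatter(style=style).get_style_defs(".highlight")

css = highlight_css()
```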
#48 breaks the current vanity paste-id mechanism/hack. There are two ways to un-fuck this (provided we want to un-fuck it):
- roll our own auto_increment that has the behavior we want (hard, likely slow)
- make another secondary index (easy, fast, could cause collisions)
% curl -F c=@- https://ptpb.pw <<< 'boats and hoes asdf'
url: https://ptpb.pw/QQQe?filename=-
uuid: 4556bd33-4cc3-4aad-bf82-58f1de76de7a
This is not a cache invalidation thing (though I suspect cache invalidation may be simultaneously broken for this), but a db-actually-never-gets-updated thing.
Needs better tests to catch this.
We should intelligently cache things. We currently have varnish in front of ptpb, but this will cause problems if somebody modifies a paste: varnish will continue serving the cached copy until the max time expires, which will look weird and feel clunky to users.
We should do something like this:
http://flask.pocoo.org/snippets/95/
It would be a nice feature to highlight the specified line number (e.g. https://ptpb.pw/AABV/bf#-96) once #37 is fixed.
There are a few websites (which are essentially limited pastebins) which have allowed you to upload a session recorded with ttyrec and share it (just like you would share a paste in pb now). The features these sites offer include play/pause/restart (some even include options to make playback faster or slower than real-time) along with the full text of the session being copyable (this should be one of the easiest features, considering that the format ttyrec outputs actually includes all the text).

pb, as it does not truly care about what kind of data is pasted to it, can already take ttyrec files as pastes. And users could fetch one of these files and pipe it into ttyplay, but having a handler (similar to the rst handler) that allows for in-browser playback of these pastes could be incredibly helpful.
This here pastey thing y'all have made is quite fancy. Yet, I find it lacking the ability to host my whole website. It would be great if I could paste some HTML, or RestructuredText, and then use a special URL parameter when viewing the paste to have it automagically render as a webpage rather than the content of the paste.
Fantabulous!
I can't seem to find the license. Would you add one, please? See http://choosealicense.com/ for more information.
Use a hash of all data to make some sort of favicon
Fixit.
Add support for these verbs.
We'll do 'authentication' by giving the user a one-time UUID that they need to spit back at us to do the delete/put.
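A minimal sketch of that flow, with an in-memory dict standing in for the real table:

```python
import hmac
import uuid

pastes = {}  # id -> (uuid, content); stand-in for the real table

def post(paste_id, content):
    token = str(uuid.uuid4())      # handed back to the client exactly once
    pastes[paste_id] = (token, content)
    return token

def delete(paste_id, token):
    stored, _ = pastes[paste_id]
    # compare_digest avoids leaking the token through timing differences
    if not hmac.compare_digest(stored, token):
        return False
    del pastes[paste_id]
    return True

t = post("QQQe", b"asdf")
assert not delete("QQQe", str(uuid.uuid4()))  # wrong token is rejected
assert delete("QQQe", t)
```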
Example request:
POST / HTTP/1.1
User-Agent: curl/7.39.0
Host: localhost:9000
Accept: */*
Content-Length: 141
Expect: 100-continue
Content-Type: multipart/form-data; boundary=------------------------d5b5f1dac8e938df
--------------------------d5b5f1dac8e938df
Content-Disposition: form-data; name="c"
asdf
--------------------------d5b5f1dac8e938df--
Generated via:
curl -F "c=<-" localhost:9000 <<< asdf
multipart messages consist of boundaries that each contain some Content-Disposition.

There are two problems with the way Werkzeug handles form-data in particular:

- Without the filename disposition extension parameter, the form value is decoded entirely into memory (a BytesIO), regardless of size--though overriding this behavior is possible, doing this in a useful way would be fairly difficult.
- Handing the stream to sqlalchemy or a naive mysql-python-connector will nevertheless result in a userspace copy of the entirety of the stream to memory.

While with hackery (and a longer curl command), the somewhat-more-favorable filename form decode method could be used without breaking HTML form input, doing so is not worth it, as Werkzeug is still inherently inefficient and broken.
At the moment, pb supports pastes up to 60 MiB in size. That is, 62914560 bytes. Look how awful that number is! Would it not be dramatically rounder and prettier to allow for 64 MiB; i.e., 67108864 (or 2^26) bytes?
Because @polyzen is too shy to open this himself, I'm opening it.
It's desired to execute code and provide stdout similar to how codepad does things.
I imagine we'd sandbox execution in containers, where each execution enters a copy-on-write btrfs snapshot. We could limit resource utilization and execution time via normal container mechanisms.
I've heard something about the html form being bloat; so this wouldn't involve writing anything AJAXy, nor would it involve blocking the POST until the paste is complete. Instead we'd return http code 202 ACCEPTED on POST. It's likely also an appropriate response for GET when we're (still) not ready yet.
We'll also need to make sure we do cache invalidation when we're ready. Likely the container master general would do an internal callback request to give the results.
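The sandboxing itself would live in containers as described above; as a much weaker illustrative stand-in, here's a subprocess sketch with just a wall-clock limit (python's -I isolated mode and a timeout are the only isolation shown, nothing like a real container boundary):

```python
import subprocess
import sys

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    """Toy stand-in for the container sandbox: run the code in a separate
    interpreter process with a hard wall-clock limit. A real deployment
    would add the container + copy-on-write btrfs snapshot isolation
    described above, plus memory/cpu limits."""
    try:
        out = subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode
            capture_output=True, timeout=timeout, text=True,
        )
        return out.stdout
    except subprocess.TimeoutExpired:
        return "(killed: time limit exceeded)"

print(run_untrusted("print(2 + 2)"))
```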
Questions:
Traceback (most recent call last):
File "/root/pbenv/src/flask/flask/app.py", line 1969, in __call__
return self.wsgi_app(environ, start_response)
File "/root/pbenv/src/flask/flask/app.py", line 1953, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/root/pbenv/src/flask/flask/app.py", line 1530, in handle_exception
reraise(exc_type, exc_value, tb)
File "/root/pbenv/src/flask/flask/_compat.py", line 33, in reraise
raise value
File "/root/pbenv/src/flask/flask/app.py", line 1950, in wsgi_app
response = self.full_dispatch_request()
File "/root/pbenv/src/flask/flask/app.py", line 1604, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/root/pbenv/src/flask/flask/app.py", line 1507, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/root/pbenv/src/flask/flask/_compat.py", line 33, in reraise
raise value
File "/root/pbenv/src/flask/flask/app.py", line 1602, in full_dispatch_request
rv = self.dispatch_request()
File "/root/pbenv/src/flask/flask/app.py", line 1588, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/srv/pb/pb/paste/views.py", line 51, in post
paste = model.insert(stream, label=label)
File "/srv/pb/pb/paste/model.py", line 47, in insert
get_db().pastes.insert(d)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/collection.py", line 1926, in insert
check_keys, manipulate, write_concern)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/collection.py", line 436, in _insert
self.codec_options, sock_info)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/pool.py", line 237, in legacy_write
return helpers._check_gle_response(response)
File "/root/pbenv/lib/python3.4/site-packages/pymongo/helpers.py", line 227, in _check_gle_response
raise DuplicateKeyError(details["err"], code, result)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error index: pb.pastes.$label_1 dup key: { : "~polyzen" }
Reported by @HalosGhost, though he was too shy to make an issue for it.
The main problem is that Generic.Output is too dull, which makes it hard to read in large quantities. Ideally we wouldn't add any extra color styles to null-lexer text at all.
Should basically look at how other lexers spit out un-parseable text (they do not use Generic.Output).
The reference host of pb (ptpb) is great, but there are some cases where I would find it helpful to be able to install my own copy of pb to be hosted wherever I so choose; e.g., I could even imagine deploying pb as a private service to be used on an internal network in the company I work for.

Offering a PKGBUILD to build and package pb for Arch would be deeply appreciated.
This predates #104, but #104 made this worse:
There are multiple ways to reference the same paste--on PUT/DELETE we would pick one of those, put it in the Location header, and invalidate_cache() would then use that to send a new BAN request to varnish.
We should replace the current invalidation mechanism with something actually sane.
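One hypothetical 'sane' shape: derive a single ban pattern that covers every alias of a paste, rather than banning whichever URL happened to land in Location. The BAN method and X-Ban-Url header below are assumptions that would need matching VCL on the varnish side:

```python
import http.client

def ban_pattern(paste_id: str) -> str:
    # One regex covering every way the paste can be referenced:
    # /id, /id.ext, /id/lexer, ...
    return r"^/{}([./].*)?$".format(paste_id)

def invalidate_cache(paste_id: str, host: str = "127.0.0.1", port: int = 80) -> int:
    """Sketch only: assumes a vcl_recv snippet that maps the custom BAN
    method + X-Ban-Url header onto varnish's ban() function."""
    conn = http.client.HTTPConnection(host, port, timeout=2)
    conn.request("BAN", "/", headers={"X-Ban-Url": ban_pattern(paste_id)})
    status = conn.getresponse().status
    conn.close()
    return status
```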
It would be interesting to enable some kind of compartmentalized synchronizing of pb databases; this way the pb database at machine X could share certain content with machine Y, allowing redundancy in the event of, say, a network failure. You could mark certain files or directories as synchronizing with pb databases on multiple other machines. This could really allow for a lightweight, scalable, and easily implemented network for, say, projects and collaborations.
After #104, we lost test coverage, and also exposed a few tests that were broken.
This should be fixed.
+buhman │ GermainZ: I also anticipate allowing css overrides
⤷ │ not sure how to make a sensible http verb thing to do that though
⤷ │ or if it should just be a query string or some shit
⤷ │ ?css=QQxy
⤷ │ or similar
⤷ │ the coolest shit though
⤷ │ is that we could make everything query string shit
⤷ │ then make shorthand for 'turn this back into a shorturl that points to my long query string'
So basically:
- ptpb.pw/r/pasteid?css=someotherpasteid to use someotherpasteid as a stylesheet for rendering pasteid.
- ptpb.pw/QQQQ → ptpb.pw/r/pasteid?css=someotherpasteid
Here's what we aren't testing (yet):
pb.url.*
pb.paste.handler.*
pb.paste.views.{highlight_css,list_lexers,stats,delete,put,form,filter_rst}
pb.util.highlight
Actually a fairly short list.
For things like {delete,put}, we need to be able to reliably parse the UUID from responses, which is not currently possible.
All of these tests should probably be 'indirect' tests via what we expose in the API (and not necessarily directly calling handlers for example).
From #62:
https://www.varnish-cache.org/docs/4.0/reference/vcl.html#backend-definition
https://www.varnish-cache.org/docs/4.0/reference/vcl.html#probes
https://www.varnish-cache.org/docs/4.0/reference/vmod_directors.generated.html
Examples: https://www.varnish-cache.org/docs/4.0/users-guide/vcl-backends.html#directors
rfc3986:
2.3. Unreserved Characters
Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
These will be used in our custom base66 encoder. Ideally we'd also use a simple replacement cipher or similar so we get 'cool-looking' urls: instead of '001', '002', etc., we'd get 2xz and 8y~ and similar.

This would also remove the dependency on bitstring.
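A sketch of such an encoder over the unreserved alphabet, plus a toy substitution step (the multiplier choice is illustrative, not a proposal):

```python
# The full rfc3986 unreserved alphabet: 26 + 26 + 10 + 4 = 66 symbols.
B66 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
       "abcdefghijklmnopqrstuvwxyz"
       "0123456789-._~")

def b66_encode(n: int) -> str:
    if n == 0:
        return B66[0]
    out = []
    while n:
        n, r = divmod(n, 66)
        out.append(B66[r])
    return "".join(reversed(out))

def b66_decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 66 + B66.index(ch)
    return n

def scramble(n: int, modulus: int = 66**4) -> int:
    # Toy replacement cipher so sequential ids don't look sequential;
    # 48271 is coprime to 66**4 (no factor of 2, 3, or 11), so this is
    # invertible via the modular inverse.
    return (n * 48271) % modulus

assert b66_decode(b66_encode(123456)) == 123456
```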
Opening because @GermainZ is too shy.
Questions:
The answer to that question basically depends on whether we want to make a second uuid column, or re-use the current uuid column. There are two ways I see this working:
1. Store private in request.form.keys() or similar into the db. GET then allows both by-uuid and by-id: by-uuid always matches, whereas by-id matches only if not private. The concern with this method is that it would allow anyone with the secret URL to also modify/delete the paste. This could be avoided by making PUT/DELETE match only not-secret pastes, making secret pastes immutable.
2. PRIMARY_KEY is still id, but instead of mediumint, it's binary(16) again, like things were way back in ebe35cd. There is still a separate uuid column, which could be renamed to something more descriptive. GET would have two separate routes, one for id and another for uuid; these would query the different tables. PUT/DELETE would probably need/use fancy union stuff.

Both ideas have their strengths and complexities. What do.
As it was revealed that people depend on their precious existing pastes for things like factoids, I'd like to avoid losing these. I quickly set up a nginx hack that uses a 404 handler which then sends the request to an instance of the pre-mongo pb. But that's not particularly good.
Better:
Convert all existing pastes (by base66 id) to vanity pastes. This will take quite a bit of work.
I get asked about this a lot, so I should probably document it.
One of the issues in solving #107 is URL routing.
Currently we rely on regular expressions to figure out which kind of get we want, but this is actually no longer necessary.
Instead, this should be refactored into a single route/converter that accepts all paths, and finds any paste (or redirect, which would be refactored into a special type of paste) that matches in any field in any way.
As a result of merging redirects with pastes, we could go back to everything being 4-character IDs (as returned by the API), while allowing retrieval with any number of characters.
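A sketch of what the single catch-all lookup might look like, with plain dicts standing in for documents and entirely made-up matching rules (prefix length, field set):

```python
pastes = [
    # Stand-ins for documents; a redirect would just be a paste with a
    # 'redirect' field. The digest here is arbitrary sample data.
    {"id": "QQQe", "digest": "3da541559918a808c2402bba5012f6c60b27661c",
     "label": "~polyzen", "content": b"asdf"},
]

def find_paste(path: str):
    """One catch-all lookup instead of per-pattern routes: strip any
    trailing lexer/extension, then match the remainder against every
    identifying field (id prefix, digest, vanity label)."""
    name = path.lstrip("/").split("/")[0].split(".")[0]
    for p in pastes:
        if (p["id"].startswith(name) and len(name) >= 4) \
                or p["digest"] == name or p["label"] == name:
            return p
    return None

assert find_paste("/QQQe/py") is not None   # id + lexer
assert find_paste("/~polyzen") is not None  # vanity label
assert find_paste("/nope") is None
```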
I've heard requests for:
LaTeX would be pretty hard, but the rest should be easy.
Do it.
We should instead be using werkzeug.wsgi.get_host() to calculate it in some places, and flask.helpers.url_for() in others.
Not 9000% sure if we actually want this everywhere.