cvandeplas / pystemon
Monitoring tool for PasteBin-alike sites written in Python. Inspired by pastemon http://github.com/xme/pastemon
License: GNU Affero General Public License v3.0
Hoping this is just a temporary issue, as I have just set pystemon up:
Error 521 Ray ID: 4508f07cc11ba6d7 • 2018-08-26
Web server is down
Failed to download the page because of other HTTPlib error proxy error http://pastie.org/pastes trying again.
[2018-08-26 ] Retry 1/100 for http://pastie.org/pastes
[2018-08-26 ] Failed to download the page because of other HTTPlib error proxy error http://pastie.org/pastes trying again.
Is this a regular issue with the above website?
pystemon[19253]: No last pasties matches for regular expression site:pastebin.com regex:<a href="/(\w{8})">.+</a></td>. Error in your regex? Dumping htmlPage #012 <!DOCTYPE HTML>#012#011<head>#012#011#011<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />#012#011#011<title>Pastes Archive - Pastebin.com</title>#012#011#011<link rel="shortcut icon" href="/favicon.ico" />#012#011#011<script src="/js/jquery.min.js"></script>#012#011#011<script src="/js/pastebin.min.js"></script>#012#011#011<link href="/i/pastebin.min.css" rel="stylesheet" type="text/css" />#012#011#011<!--[if lt IE 10]>#012#011#011#011<link href="/i/pastebin.ie8.css" rel="stylesheet" type="text/css" />#012#011#011<![endif]-->#012#012 #012#011#011<style>body{-webkit-text-size-adjust:none;}</style>#012#011#011#011#011<meta property="fb:app_id" content="231493360234820" />#012#011#011<meta property="og:title" content="Pastes Archive - Pastebin.com" />#012#011#011<meta property="og:type" content="article" />#012#011#011<meta property="og:url" content="https://pastebin.com/archive" />#012#011#011<meta property="og:image" content="https://pastebin.com/i/facebook.png" />#012#011#011<meta property="og:site_name" content="Pastebin" />#012#011#011<meta name="google-site-verification" content="jkUAIOE8owUXu8UXIhRLB9oHJsWBfOgJbZzncqHoF4A" />#012#011#011<link rel="canonical" href="https://pastebin.com/archive" />#012#011#011#011#011<meta name="viewport" content="width=device-width, initial-scale=0.70, maximum-scale=1.0, user-scalable=yes">#012#011#011#012#011#011<script>#012#011#011#011(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){#012#011#011#011(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),#012#011#011#011m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)#012#011#011#011})(window,document,'script','//www.google-analytics.com/analytics.js','ga');#012#012#011#011#011ga('create', 'UA-58643-34', 
'auto');#012#011#011#011ga('require', 'displayfeatures');#012#011#011#011ga('send', 'pageview');#012#011#011</script>#012#011#011<script type="text/javascript">#012#011#011#011if (top != self)#012#011#011#011#011top.location.href = location.href;#012#011#011</script>#012#011</head>#012#011<body>#012#011<div id="main_frame">#012#011#011<div id="jq-dropdown-1" class="jq-dropdown jq-dropdown-anchor-right jq-dropdown-scroll">#012#011#011#011<ul class="jq-dropdown-menu">#012#011#011#011#011#015#012#011#011#011#011<li class="lih_640">#015#012#011#011#011#011#011<form class="search_form_li" name="search_form_li" method="get" action="/search" id="cse-search-box-li">#015#012#011#011#011#011#011#011<input class="search_input_li" type="text" name="q" size="5" value="" placeholder="search..." />#015#012#011#011#011#011#011</form>#015#012#015#012#011#011#011#011</li>#015#012#011#011#011#011<li class="lih_div"></li>#015#012#011#011#011#011<li onclick="location.href='/signup'" class="dd_su">Sign Up</li>#015#012#011#011#011#011<li onclick="location.href='/login'" class="dd_lo">Login</li>#015#012#011#011#011#011<li class="lih_div"></li>#015#012#011#011#011#011<li onclick="location.href='/api'" class="lih_640">API</li>#015#012#011#011#011#011<li onclick="location.href='/faq'" class="lih_640">FAQ</li>#015#012#011#011#011#011<li onclick="location.href='/tools'" class="lih_640">Tools</li>#015#012#011#011#011#011<li onclick="location.href='/trends'" class="lih_640">Trends</li>#015#012#011#011#011#011<li onclick="location.href='/archive'" class="lih_640">Archive</li>#011#011#011</ul>#012#011#011</div>#012#011#011<div id="header">#012#011#011#011<div id="header_wrap">#012#011#011#011#011<div id="header_top">#012#011#011#011#011#011<div id="header_logo" onclick="location.href='/'">PASTEBIN</div>#012#011#011#011#011#011<div id="header_new_paste" class="new_paste_button" onclick="location.href='/'">new paste</div>#012#011#011#011#011#011<div id="header_links">#012#011#011#011#011#011#011<a 
href="/trends">trends</a>#012#011#011#011#011#011#011<a href="/api" class="mmh">API</a>#012#011#011#011#011#011#011<a href="/tools" class="mmh">tools</a>#012#011#011#011#011#011#011<a href="/faq" class="mmh">faq</a>#012#011#011#011#011#011</div>#012#011#011#011#011#011<div id="header_search">#012#011#011#011#011#011#011<form class="search_form" name="search_form" method="get" action="/search" id="cse-search-box">#012#011#011#011#011#011#011#011<input class="search_input" type="text" name="q" size="5" value="" placeholder="search..." />#012#011#011#011#011#011#011</form>#012#011#011#011#011#011</div>#012#011#011#011#011#011#015#012#011#011#011#011#011<div id="header_members">#015#012#011#011#011#011#011#011<div id="header_dropdown" data-jq-dropdown="#jq-dropdown-1"> </div>#015#012#011#011#011#011#011#011<div id="header_icon"><a href="/login"><img src="/i/guest.png" class="header_icon" alt="" /></a></div>#015#012#011#011#011#011#011#011<div id="header_user_frame">#015#012#011#011#011#011#011#011#011<div id="header_username">Guest User</div>#015#012#011#011#011#011#011#011#011<div id="header_user_status">-</div>#015#012#011#011#011#011#011#011</div>#015#012#011#011#011#011#011#011<div id="header_icons">#015#012#011#011#011#011#011#011#011<a href="/login" title="My Pastebin"><img src="/i/t.gif" class="header_icons hi_mypastebin" alt="" /></a>#015#012#011#011#011#011#011#011#011<a href="/messages" title="My Messages"><img src="/i/t.gif" class="header_icons hi_messages" alt="" /></a>#015#012#011#011#011#011#011#011#011<a href="/alerts" title="My Alerts"><img src="/i/t.gif" class="header_icons hi_alerts" alt="" /></a>#015#012#011#011#011#011#011#011#011<a href="/settings" title="My Settings"><img src="/i/t.gif" class="header_icons hi_settings" alt="" /></a>#015#012#011#011#011#011#011#011</div>#015#012#011#011#011#011#011</div>#011#011#011#011</div>#012#011#011#011</div>#012#011#011</div>#012#011#011<div id="super_frame">#012#011#011#011<div 
id="monster_frame">#012#011#011#011#011<div id="content_frame">#012#011#011#011#011#011<div id="content_right">#011#011#011#011#011#011#012#011#011#011#011#011#011#011#011#011#011#011#011<div class="content_right_menu">#015#012#011#011#011#011#011#011#011#011#011<div class="content_right_title"><a href="/archive">Public Pastes</a></div>#015#012#011#011#011#011#011#011#011#011#011<div id="menu_2">#015#012#011#011#011#011#011#011#011#011#011#011<ul class="right_menu"><li><a href="/aJFbuCy2">Untitled</a><span>T-SQL | 15 sec ago</span></li><li><a href="/0W5mCKcJ">Untitled</a><span>PHP | 15 sec ago</span></li><li><a href="/ETVPpL2C">Untitled</a><span>21 sec ago</span></li><li><a href="/U2c9t6w6">Untitled</a><span>22 sec ago</span></li><li><a href="/a8jZ7dzF">Untitled</a><span>24 sec ago</span></li><li><a href="/EEpjM3LS">Untitled</a><span>31 sec ago</span></li><li><a href="/41RLME91">Untitled</a><span>32 sec ago</span></li><li><a href="/C6bv8q3q">Untitled</a><span>32 sec ago</span></li></ul></div></div>#011#011#011#011#011#011<div id="abrpm2"></div>#012#011#011#011#011#011#011#015#012#011#011#011<div style="padding: 0; width:160px;margin: 10px 0;clear:left;">#015#012#011#011#011#011<script type="text/javascript"><!--#015#012#011#011#011#011#011e9 = new Object();#015#012#011#011#011#011 e9.size = "160x600,120x600";#015#012#011#011#011#011//--></script>#015#012#011#011#011#011<script type="text/javascript" src="https://tags.expo9.exponential.com/tags/Pastebincom/Unsure/tags.js"></script>#015#012#011#011#011</div>#011#011#011#011#011#011<div id="steadfast" title="Pastebin is proudly hosted by Steadfast.net" onclick="location.href='http://steadfast.net/?utm_source=pastebin.com&utm_medium=referral&utm_content=hosting_by_banner&utm_campaign=referral_20140118_x_x_pastebin_partner&source=referral_20140118_x_x_pastebin_partner'"></div>#012#011#011#011#011#011</div>#012#011#011#011#011#011<div id="content_left"><div id="ie_msg"></div>#012#011#011#015#012#011#011#011<div 
id="abrpm"></div>#015#012#011#011#011<div class="banner_728">#015#012#011#011#011#011<script type="text/javascript"><!--#015
Hi,
Thanks for the great tool. However, I have been running it for some time now and I can't seem to get any matches.
It is downloading the pasties, but the alerts folder is still empty and I have not received a single match from any of the websites. The regexes are as simple as searching for the word "function", just to test, and there are still no matches.
Could you please help?
When saving to MongoDB, I got the error: 'bytes' object has no attribute 'encode'.
Pastebin informs the user when you access the site to actively
Implement a general function that matches some keywords in the html like
Actions to implement:
Allow some dynamic sleep time between the download of the pasties.
Related to #15 stats of queues so the user can see if his sleep timings are letting the queue grow indefinitely
Implement yara scanning of the pastie.
Options would thus be: regex OR yara-file
Hi,
Nothing major, but something to include if you do a bug-fix update at some point.
An error is thrown when running pystemon:
ERROR: Cannot import the BeautifulSoup 3 Python library. Are you sure you installed it?
This is due to the import being:
from BeautifulSoup import BeautifulSoup
BS4 uses:
from bs4 import BeautifulSoup
Thanks,
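One way to tolerate both library generations is to try the `bs4` import first and fall back to the BeautifulSoup 3 module name. This is only a minimal sketch: `load_beautifulsoup` is a hypothetical helper, not something taken from pystemon itself.

```python
def load_beautifulsoup():
    """Return the BeautifulSoup class from whichever package is installed, or None."""
    try:
        from bs4 import BeautifulSoup  # BeautifulSoup 4 package name
        return BeautifulSoup
    except ImportError:
        pass
    try:
        from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3 module name
        return BeautifulSoup
    except ImportError:
        return None


soup_class = load_beautifulsoup()
if soup_class is None:
    print("ERROR: neither bs4 nor BeautifulSoup 3 is installed")
```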
Some minutes after launching the feeder it gives this error and everything stops:
Traceback (most recent call last):
File "pystemon-feeder.py", line 64, in
messagedata = open(pystemonpath+paste).read()
IOError: [Errno 2] No such file or directory: '/home/gt/pystemon/archive/codepad.org/2018/01/03/8RsQZOlJ.gz'
The directory really doesn't exist. I ran FLUSHALL on Redis, but the problem persists.
Any idea?
Thanks,
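A possible workaround, assuming the queue can contain entries whose archive file is no longer on disk, is to skip those entries instead of letting the feeder stop. The sketch below mirrors the `pystemonpath + paste` concatenation from the traceback; `read_paste` is a hypothetical helper name.

```python
def read_paste(pystemonpath, paste):
    """Return the raw bytes of an archived paste, or None if the file is missing."""
    path = pystemonpath + paste  # same concatenation as in the traceback above
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except (IOError, OSError):
        # A stale queue entry can point at a file that does not exist (Errno 2).
        print("Skipping missing paste: {}".format(path))
        return None
```

The calling loop would then continue with the next queue entry whenever `read_paste` returns `None`.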
Hello,
CentOS 6.4 64-bit, Python 2.7.3, and the latest PyYAML and BeautifulSoup installed with easy_install.
After launching pystemon I get a whole lot of this:
Found 10 new pasties for site nopaste.me
ThreadPasties for codepad.org crashed unexpectectly, recovering...: string indices must be integers, not str
Found 30 new pasties for site cdv.lt
Found 20 new pasties for site pastie.org
Found 20 new pasties for site snipt.net
ThreadPasties for pastebin.com crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for codepad.org crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for slexy.org crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for pastie.org crashed unexpectectly, recovering...: string indices must be integers, not str
Found 13 new pasties for site pastesite.com
ThreadPasties for pastebin.com crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for codepad.org crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for cdv.lt crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for pastie.org crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for slexy.org crashed unexpectectly, recovering...: string indices must be integers, not str
ThreadPasties for pastebin.com crashed unexpectectly, recovering...: string indices must be integers, not str
Am I to assume things are working and I can ignore the crashed part?
Failed to download the page because of other HTTPlib error proxy error http://pastebin.ru/ trying again.
Add an option to store the pasties in an elasticsearch database
When fetching from Slexy.org, got the error:
ThreadPasties for slexy.org crashed unexpectectly, recovering...: 'str' does not support the buffer interface
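This message is commonly seen on Python 3 when text (`str`) is written to a stream opened in binary mode. Assuming that is the cause here, which the report does not confirm, a minimal sketch of writing paste content to a gzip file safely could look like this. `save_gzip` and `load_gzip` are hypothetical names.

```python
import gzip


def save_gzip(path, content):
    """Write paste content to a gzip file, accepting either str or bytes."""
    if isinstance(content, str):
        content = content.encode("utf-8")  # binary gzip streams need bytes
    with gzip.open(path, "wb") as handle:
        handle.write(content)


def load_gzip(path):
    """Read a gzip file back and decode it as UTF-8 text."""
    with gzip.open(path, "rb") as handle:
        return handle.read().decode("utf-8")
```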
Hi,
I'm getting:
Failed to download the page because of other HTTPlib error proxy error http://pastebin.com/api_scrape_item.php?i=2BgPDRi1 trying again
but if I do:
curl "https://scrape.pastebin.com/api_scraping.php?limit=250"
it works.
Any ideas what could be going wrong?
I've also played with the network option in the YAML file, but I always get:
[2020-01-28 11:51:33,018] Error in configuration file:
[2020-01-28 11:51:33,018] error position: (1:9)
Any ideas?
http://pastebin.mozilla.org/2333873
id is incremental
No last pasties matches for regular expression site:pastebin.ca regex:rel="/preview.php\?id=(\d+). Error in your regex? Dumping htmlPage #012 <?xml version="1.0" encoding="utf-8"?>#012<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">#012<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="" lang="">#012<head>#012 <title>pastebin - Type, paste, share.</title>#012 <meta name="microid" content="ca0462a24e49b118730aa3ba02c4e6cc5a55cd2d"/>#012 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>#012 <script type="text/javascript">#012//<![CDATA[#012try{if (!window.CloudFlare) {var CloudFlare=[{verbose:0,p:0,byc:0,owlid:"cf",bag2:1,mirage2:0,oracle:0,paths:{cloudflare:"/cdn-cgi/nexp/dok3v=1613a3a185/"},atok:"562c088a39c1bb7971cde1dfe7a5cc2a",petok:"015d3265ec133392a1e4c2c915eb50a5492261f3-1491187013-1800",zone:"pastebin.ca",rocket:"m",apps:{}}];document.write('<script type="text/javascript" src="//ajax.cloudflare.com/cdn-cgi/nexp/dok3v=f2befc48d1/cloudflare.min.js"><'+'\/script>');}}catch(e){};#012//]]>#012</script>#012<link rel="stylesheet" href="https://pastebin.ca/pb-g.css" type="text/css"/>#012 <link rel="icon" href="https://pastebin.ca/pastebin.ico" type="image/x-icon"/>#012 <link rel="shortcut icon" href="https://pastebin.ca/pastebin.ico"#012 type="image/x-icon"/>#012 <link rel="alternate" href="http://en.pastebin.ca/"#012 hreflang="en" title="English Translation"/>#012 <link rel="alternate" href="http://fr.pastebin.ca/"#012 hreflang="fr" title="French Translation"/>#012 <link rel="alternate" href="http://de.pastebin.ca/"#012 hreflang="de" title="German Translation"/>#012 <link rel="alternate" href="http://ja.pastebin.ca/"#012 hreflang="ja" title="Japanese Translation"/>#012 <link href="mailto:[email protected]" rev="made"/>#012 <link rel="alternate" type="application/rss+xml" title="Posts" href="/rss/posts.rss"/>#012 <link rel="alternate" type="application/rss+xml" title="News" 
href="/rss/news.rss"/>#012 <link rel="Help" href="/what.php"/>#012 <script type="text/javascript" src="https://code.jquery.com/jquery-1.12.4.min.js"></script>#012 <script type="text/javascript" src="https://code.jquery.com/ui/1.12.0/jquery-ui.min.js"></script>#012 <script type="text/javascript" src="/jquery.cluetip.min.js"></script>#012 <script src="https://pastebin.ca/pb-h.js?2" type="text/javascript"></script>#012<script type="text/javascript" async src="https://www.google.com/recaptcha/api.js"></script>#012<script type="text/javascript">#012 var _paq = _paq || [];#012 _paq.push(["setDomains", ["*.pastebin.ca"]]);#012 _paq.push(['trackPageView']);#012 _paq.push(['enableLinkTracking']);#012 (function() {#012 var u="//pw.vocti.ca/";#012 _paq.push(['setTrackerUrl', u+'piwik.php']);#012 _paq.push(['setSiteId', '3']);#012 var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];#012 g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);#012 })();#012</script>#012</head>#012<body>#012 <div id="header">#012 <h1><span style="color:#003366">paste</span>bin - Type, paste, share.</h1>#012 </div>#012 <div id="grprun">Part of <a href="http://slepp.ca/">Slepp's Projects</a> — <a href="http://pastebin.ca/">Pastebin</a> —#012 <a href="http://turl.ca/">TURL</a> — <a href="http://imagebin.ca/">Imagebin</a> — <a#012 href="http://filebin.ca/">Filebin</a></div>#012 <div id="runner"><a href="/feedback.php">Feedback</a> --#012 <a href="http://en.pastebin.ca/"#012 class="sprite sprite-ca">English</a>#012 <a href="http://fr.pastebin.ca/"#012 class="sprite sprite-fr">French</a>#012 <a href="http://de.pastebin.ca/"#012 class="sprite sprite-de">German</a>#012 <a href="http://ja.pastebin.ca/"#012 class="sprite sprite-jp">Japanese</a>#012 </div>#012 <script type="text/javascript">showRunnerMenu();</script>#012 <form method="get" action="/search.php">#012 <div id="topmenu"><a href="new.php" title="Create a new 
Paste|Follow this link to create a brand new paste."#012 class="jt">Create</a> <a href="upload.php"#012 title="Upload Text, Images or Files|By following this link, you can upload a a text or source file, upload an image, or upload a file!"#012 class="jt">Upload</a> <a href="newest.php">Newest</a> <a#012 href="tools.php">Tools</a> <a href="donate.php">Donate</a> <input type="text" name="q"#012 size="10"/><input type="submit"#012 value="Go"/>#012 </div>#012 </form>#012 <div id="body">#012 <div id="sl"><div class="bl"><div class="br"><div class="tl"><div class="tr"><div class="menu" id="idmenu0"><div class="menutitle"><h2>Stuff to Do</h2></div><div class="items" id="idmenu0-collapse"><div class="link"><a href="/new.php" class="sprite sprite-tab_new">New Post</a>#012</div>#012<div class="link"><a href="/upload.php" class="sprite sprite-top">Upload a Post</a>#012</div>#012<div class="link"><a href="/newest.php" class="sprite sprite-recur">Goto Newest</a>#012</div>#012<div class="link"><a href="/search.php" class="sprite sprite-search">Search</a>#012</div>#012<div class="link"><a href="/tools.php" class="sprite sprite-runprog">Tools / APIs</a>#012</div>#012<div class="link"><a href="/donate.php" class="sprite sprite-emoticon">Donate</a>#012</div>#012</div>#012</div>#012<div class="menu" id="idmenu1"><div class="menutitle"><h2>Information</h2></div><div class="items" id="idmenu1-collapse"><div class="link"><a href="/news.php" class="sprite sprite-comment">Site News</a>#012</div>#012<div class="link"><a href="/what.php" class="sprite sprite-documentinfo">What is This?</a>#012</div>#012</div>#012</div>#012#012 <div class="menu" id="id2243084">#012 <div class="menutitle">#012 <h2>Quick Search</h2>#012 </div>#012#012 <div id="id2243084-collapse">#012 <form method="get" action="/search.php">#012 <fieldset id="searchbar" class="searchbar">#012 <input type="text" name="q" size="15" style="width:10em" class="input-box"/>#012 <br/>#012 <input type="submit" value="Search" 
class="submit-button"#012 onclick="this.value='Searching...'"/>#012 <br/>#012 </fieldset>#012 </form>#012 <form action="http://pastebin.ca/google.php" id="cse-search-box">#012 <fieldset id="googlebar" class="searchbar">#012 <input type="hidden" name="cx" value="partner-pub-0367252804969302:1yimxphzru5"/>#012 <input type="hidden" name="cof" value="FORID:10"/>#012 <input type="hidden" name="ie" value="UTF-8"/>#012 <input type="text" name="q" id="sbi" style="width:10em" class="input-box"/>#012 <input type="submit" name="sa" value="Google Search" id="sbb" class="submit-button"/>#012 </fieldset>#012 </form>#012 </div>#012 </div>#012 <div class="menu" id="idmenurecent">#012 <div class="menutitle"><h2>Recent Posts</h2></div>#012 <div class="items" id="idmenurecent-collapse">#012 </div></div></div></div></div></div></div><div id="content"><div style="text-align:center;width:100%;ba
archive/PastieSite[codepad.org]/2021/08/25/ZavEsjK5.gz
Error: /home/project/pystemon/archive/PastieSite[codepad.org]/2021/08/25/ZavEsjK5.gz, file not found
archive/PastieSite[ideone.com]/2021/08/25/I2G0lF.gz
Error: /home/project/pystemon/archive/PastieSite[ideone.com]/2021/08/25/I2G0lF.gz, file not found
archive/PastieSite[ideone.com]/2021/08/25/oiDxl6.gz
Error: /home/project/pystemon/archive/PastieSite[ideone.com]/2021/08/25/oiDxl6.gz, file not found
archive/PastieSite[paste.org.ru]/2021/08/25/cg38do.gz
Error: /home/project/pystemon/archive/PastieSite[paste.org.ru]/2021/08/25/cg38do.gz, file not found
archive/PastieSite[pastebin.fr]/2021/08/25/94467.gz
Error: /home/project/pystemon/archive/PastieSite[pastebin.fr]/2021/08/25/94467.gz, file not found
archive/PastieSite[ideone.com]/2021/08/25/61Oxkk.gz
Error: /home/project/pystemon/archive/PastieSite[ideone.com]/2021/08/25/61Oxkk.gz, file not found
archive/PastieSite[codepad.org]/2021/08/25/KpsDHlre.gz
Error: /home/project/pystemon/archive/PastieSite[codepad.org]/2021/08/25/KpsDHlre.gz, file not found
archive/PastieSite[codepad.org]/2021/08/25/qmk33o1O.gz
Error: /home/project/pystemon/archive/PastieSite[codepad.org]/2021/08/25/qmk33o1O.gz, file not found
archive/PastieSite[ideone.com]/2021/08/25/P0nYhp.gz
Error: /home/project/pystemon/archive/PastieSite[ideone.com]/2021/08/25/P0nYhp.gz, file not found
archive/PastieSite[gist.github.com]/2021/08/25/sammolk_85fa80406634fac1360f72ce74c79866.gz
Error: /home/project/pystemon/archive/PastieSite[gist.github.com]/2021/08/25/sammolk_85fa80406634fac1360f72ce74c79866.gz, file not found
archive/PastieSite[pastebin.fr]/2021/08/25/94468.gz
Error: /home/project/pystemon/archive/PastieSite[pastebin.fr]/2021/08/25/94468.gz, file not found
archive/PastieSite[codepad.org]/2021/08/25/SsbJC2MY.gz
Error: /home/project/pystemon/archive/PastieSite[codepad.org]/2021/08/25/SsbJC2MY.gz, file not found
archive/PastieSite[codepad.org]/2021/08/25/b3i76O8R.gz
Error: /home/project/pystemon/archive/PastieSite[codepad.org]/2021/08/25/b3i76O8R.gz, file not found
archive/PastieSite[paste.org.ru]/2021/08/25/ic2wbj.gz
Error: /home/project/pystemon/archive/PastieSite[paste.org.ru]/2021/08/25/ic2wbj.gz, file not found
archive/PastieSite[ideone.com]/2021/08/25/0VJcxK.gz
Inside /pystemon/archive/ the directories are named after the site only, without the PastieSite[] wrapper.
Is anyone else seeing this? Is there a fix?
UTF-8 handling seems to fail, for example:
ThreadPasties for pastesite.com crashed unexpectectly, recovering...: 'ascii' codec can't encode character u'\ufffd' in position 29: ordinal not in range(128)
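This kind of error typically appears when text is implicitly converted with the default ASCII codec. A common remedy is to encode explicitly as UTF-8 before writing or sending the text. The sketch below shows the idea; `to_utf8_bytes` is a hypothetical helper and is not taken from pystemon.

```python
def to_utf8_bytes(text):
    """Encode text explicitly as UTF-8 instead of relying on an ASCII default."""
    if isinstance(text, bytes):
        return text  # already encoded, leave untouched
    return text.encode("utf-8", "replace")


sample = u"paste title \ufffd with a replacement character"
encoded = to_utf8_bytes(sample)
```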
Hi,
I may be wrong, but I think I have found an issue in pystemon.py (specifically in the fork in the circl repository, after Python 3 support was added). The problem is on lines 329 and 338: when the variable descriptions is a list, decode cannot be called on it, because decode is meant for binary data. Calling decode on a list raises an error, so I think the solution is this:
replacing: return '[{}]'.format(', '.join(descriptions.decode('utf-8', 'ignore')))
with: return '[{}]'.format(', '.join(descriptions))
The same change applies to line 339.
If I am wrong, sorry; I am just trying to help.
Thank you very much for continuing the development of this project.
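If the list can contain a mix of `bytes` and `str` items, a slightly more defensive variant of the proposed fix would decode only the items that actually are bytes. This is a sketch under that assumption; `format_descriptions` is a hypothetical function name, not the one used in pystemon.py.

```python
def format_descriptions(descriptions):
    """Join a list of descriptions, decoding only the items that are bytes."""
    parts = []
    for item in descriptions:
        if isinstance(item, bytes):
            item = item.decode("utf-8", "ignore")
        parts.append(str(item))
    return "[{}]".format(", ".join(parts))
```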
Is anyone else having issues with their Pastebin Pro account?
Everything worked for a few weeks, and then I noticed AIL was not receiving any pastes from my Pastebin Pro account.
Other pastes are downloading successfully (slexy.org, kpaste.net, codepad.org, gist.github.com).
I have triple checked and my IP is whitelisted on Pastebin's site.
My pastebin pro configuration in pystemon.yaml:
pastebin.com_pro:
archive-url: 'https://scrape.pastebin.com/api_scraping.php?limit=250'
archive-regex: '"key": "(.+)",'
download-url: 'https://scrape.pastebin.com/api_scrape_item.php?i={id}'
public-url: 'https://pastebin.com/raw/{id}'
update-max: 50
update-min: 40
The following errors repeat until the retry count reaches 100, and then the thread crashes. It eventually recovers on its own but crashes again after another 100 tries.
[2018-10-23 21:15:08,671] Failed to download the page because of other HTTPlib error proxy error https://scrape.pastebin.com/api_scraping.php?limit=250 trying again.
[2018-10-23 21:15:08,671] Retry 99/100 for https://scrape.pastebin.com/api_scraping.php?limit=250
[2018-10-23 21:15:08,718] Failed to download the page because of other HTTPlib error proxy error https://scrape.pastebin.com/api_scraping.php?limit=250 trying again.
[2018-10-23 21:15:08,719] Retry 100/100 for https://scrape.pastebin.com/api_scraping.php?limit=250
[2018-10-23 21:15:08,875] Thread for pastebin.com_pro crashed unexpectectly, recovering...: 'NoneType' object has no attribute 'text'
Here is the error when running "./pystemon.py -v":
[2018-10-23 21:34:46,930] Retry 99/100 for https://scrape.pastebin.com/api_scraping.php?limit=250
[2018-10-23 21:34:46,930] Downloading url: https://scrape.pastebin.com/api_scraping.php?limit=250 with proxy: None and user-agent: None
[2018-10-23 21:34:47,039] Failed to download the page because of other HTTPlib error proxy error https://scrape.pastebin.com/api_scraping.php?limit=250 trying again.
[2018-10-23 21:34:47,039] Retry 100/100 for https://scrape.pastebin.com/api_scraping.php?limit=250
[2018-10-23 21:34:47,453] Thread for pastebin.com_pro crashed unexpectectly, recovering...: 'NoneType' object has no attribute 'text'
[2018-10-23 21:34:47,464] Traceback (most recent call last):
File "./pystemon.py", line 127, in run
last_pasties = self.get_last_pasties()
File "./pystemon.py", line 147, in get_last_pasties
htmlPage = response.text
AttributeError: 'NoneType' object has no attribute 'text'
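The traceback suggests that the download returned `None` after all retries failed and that `.text` was then read from it. A guard along the following lines would avoid the crash. It is only a sketch: this standalone `get_last_pasties` and the `FakeResponse` stand-in are illustrative and do not reproduce the real method.

```python
import re


def get_last_pasties(response, archive_regex):
    """Extract paste ids from an archive response, tolerating a failed download."""
    if response is None:
        # All retries failed, so there is no page to parse.
        return []
    return re.findall(archive_regex, response.text)


class FakeResponse(object):
    """Stand-in for an HTTP response object; only the .text attribute is used."""

    def __init__(self, text):
        self.text = text
```

Returning an empty list lets the thread sleep and try again on the next cycle instead of raising `AttributeError`.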
Hello,
Any help with the following error would be appreciated:
ThreadPasties for pastebin.com_pro crashed unexpectectly, recovering...: string indices must be integers, not str
My IP address is already whitelisted on pastebin.
Thank you.
Instead of:
Downloading pasties from cdv.lt. Next download scheduled in 17 seconds
Downloading pasties from slexy.org. Next download scheduled in 21 seconds
You get:
[2013-04-24 15:55:00] Downloading pasties from cdv.lt. Next download scheduled in 17 seconds
[2013-04-24 15:55:02] Downloading pasties from slexy.org. Next download scheduled in 21 seconds
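With Python's standard `logging` module, a timestamp prefix like the one above can be produced with a formatter. A minimal sketch follows; the logger name `pystemon-example` is made up for illustration.

```python
import logging

# Prefix every message with a timestamp such as [2013-04-24 15:55:00]
formatter = logging.Formatter("[%(asctime)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("pystemon-example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Downloading pasties from cdv.lt. Next download scheduled in 17 seconds")
```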
Hello,
It would be nice to be able to choose your logging level in the configuration file.
I am only interested in error logs, and I am spammed with info ones.
I'll propose a PR for this.
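As a rough idea of what such an option could look like, the sketch below translates a level name read from the configuration into a `logging` constant. The helper name `parse_log_level` is hypothetical, as is the idea that the value arrives as a plain string.

```python
import logging


def parse_log_level(name, default=logging.INFO):
    """Translate a config string such as 'error' into a logging level constant."""
    level = getattr(logging, str(name).upper(), None)
    # Fall back to the default when the name is not a known numeric level.
    return level if isinstance(level, int) else default
```

The result could then be passed to `logger.setLevel(...)` at start-up.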
AIL version 1.6
Ubuntu 16.04
I'm trying to use a proxy for pystemon. The question is how do you specify the proxy settings? I see at the bottom of the pystemon.yaml file there is the following proxy configuration:
proxy:
random: no
file: 'proxies.txt'
I added my proxy address in the proxies.txt file but this did not help.
The archive link is:
https://pastebin.pl/lists
I'll open a PR soon after with a working implementation, just opening the Issue for easier trackability.
AttributeError: 'list' object has no attribute 'add'
https://github.com/cvandeplas/pystemon/blob/master/pystemon.py#L881
I changed add to append and it works. I'm not sure whether the error is specific to me or only visible when proxies are enabled. I'll submit a PR if that's preferred.
proxies_list.append(line)
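For context, `add()` exists on Python sets while lists use `append()`, which explains the AttributeError. A small sketch of reading proxy entries into a list is shown here; `load_proxies` is a hypothetical helper, not the code at the linked line.

```python
def load_proxies(lines):
    """Collect non-empty proxy entries from an iterable of lines."""
    proxies_list = []
    for line in lines:
        line = line.strip()
        if line:
            proxies_list.append(line)  # lists use append(); add() exists only on sets
    return proxies_list
```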
How can I fix this error: "Failed to alert through telegram: 'Pastie' object has no attribute 'pastie_id'"?
Hello,
I wonder why RedisStorage doesn't store paste content, unlike the other storage systems.
Is there a limitation in Redis, or a design reason why RedisStorage should only store paths and depend on FileStorage?
Many thanks.
This way the configuration file can be altered with new search patterns and you don't need to kill/restart the script each time.
I got an email from pystemon with a match on a huge paste and my email client crashed opening it.
Maybe we could send the content of the paste as a text attachment if its size exceeds a limit defined in the config file.
I'll work on a pull request for this.
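To illustrate the idea, here is a sketch using the standard library `email.message.EmailMessage`. The function name `build_alert`, the `max_inline_size` parameter and its default value are all assumptions made for this example.

```python
from email.message import EmailMessage


def build_alert(subject, paste_content, max_inline_size=100000):
    """Build an alert email, attaching the paste when it exceeds max_inline_size."""
    message = EmailMessage()
    message["Subject"] = subject
    if len(paste_content) <= max_inline_size:
        message.set_content(paste_content)
    else:
        # Keep the body short and move the large paste into a text attachment.
        message.set_content("The paste is too large to show inline; see the attachment.")
        message.add_attachment(paste_content, filename="paste.txt")
    return message
```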
ERROR: HTTP Error ############################# http://pastebin.com/archive
No HTML content for page http://pastebin.com/archive
Whenever a trigger successfully hits, I receive this error. I know it was mentioned before that this was a throttling issue, but I was wondering whether it has been resolved. Is the throttling on the pystemon side, where you need to change update-max and update-min, or is it throttling by Pastebin itself? I apologize for reposting.
The current URLs of the scraping API of pastebin will be discontinued on 2018-04-27. Details about the new address(es) are available at:
https://pastebin.com/doc_scraping_api
The file pystemon.yaml needs to be modified accordingly. My understanding is that all they did was add the host name 'scrape' to the API URLs.
pastebin.com_pro:
archive-url: 'https://scrape.pastebin.com/api_scraping.php?limit=100'
archive-regex: '"key": "(.+)",'
download-url: 'https://scrape.pastebin.com/api_scrape_item.php?i={id}'
public-url: 'https://pastebin.com/raw/{id}'
update-max: 20
update-min: 10
Pystemon keeps a list of seen pasties in memory for performance reasons.
When pystemon is stopped and immediately started again, it fetches all the data again.
It would be great if pystemon could save its state in a file and reuse that state when starting up. That way, pasties already seen would not be downloaded again.
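A simple way to picture this is to dump the set of seen ids to a JSON file on shutdown and load it back on start-up. The following is only a sketch: `save_seen`, `load_seen` and the JSON format are assumptions, not an existing pystemon feature.

```python
import json
import os


def save_seen(path, seen_ids):
    """Persist the ids of already-seen pasties as a JSON list."""
    with open(path, "w") as handle:
        json.dump(sorted(seen_ids), handle)


def load_seen(path):
    """Load previously seen ids, returning an empty set on first start."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return set(json.load(handle))
```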
from logs:
pystemon[19056]: No HTML content for page http://pastebin.com/archive
From their TOS: https://slexy.org/tos
It seems the service is discontinued.
How about separating the pystemon.yaml config file into 2 files?
For instance, sources.yaml for the site sources maintenance and pystemon.yaml for other stuff like log, network etc.
In production, it addresses 2 kinds of contributors:
What do you think about it ?
periodically display the stats of the queues
or perhaps also display when receiving a specific signal
While running pystemon I get this output for pastebin.com
Downloading pasties from pastebin.com. Next download scheduled in 38 seconds
ERROR: HTTP Error ############################# http://pastebin.com/archive
No HTML content for page http://pastebin.com/archive
I signed up for a Pastebin Pro account and get the error below after adding the Developer API Key to the configuration.
The feed dumps the pastes to the screen and does not add them to AIL.
Slexy and codepad feeds are working successfully. The issue is only with the pastebin pro config.
"No last pasties matches for regular expression site:pastebin.com_pro regex:"123456790ABCDEFGHIJ": "(.+)",. Error in your regex? Dumping htmlPage"
My pastebin.com pro account configuration in pystemon.yaml:
pastebin.com_pro:
archive-url: 'https://scrape.pastebin.com/api_scraping.php?limit=500'
archive-regex: '"123456790ABCDEFGHIJ": "(.+)",'
download-url: 'https://scrape.pastebin.com/api_scrape_item.php?i={id}'
public-url: 'https://pastebin.com/raw/{id}'
update-max: 50
update-min: 40
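In this configuration the API key appears to have been substituted for the word `key` inside `archive-regex`. In the other pastebin.com_pro configurations quoted on this page, `"key"` is matched literally, presumably as the name of a field in the listing returned by the archive URL. The sketch below compares the two patterns against a made-up sample listing; the sample data is illustrative only and is not real API output.

```python
import re

# Illustrative sample only: a JSON-like listing with a "key" field per paste.
sample_listing = '''[
    {
        "key": "AAAA1111",
        "title": "Untitled"
    },
    {
        "key": "BBBB2222",
        "title": "Untitled"
    }
]'''

working_regex = r'"key": "(.+)",'
broken_regex = r'"123456790ABCDEFGHIJ": "(.+)",'

working_matches = re.findall(working_regex, sample_listing)
broken_matches = re.findall(broken_regex, sample_listing)
print(working_matches)
print(broken_matches)
```

With the literal `"key"` pattern the two ids are found, while the pattern containing the API key matches nothing, which would be consistent with the "No last pasties matches" message.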
Hi, I'm trying to use pystemon. I downloaded it and ran it with the default .yaml config, but I get this error. Any suggestions?
[2020-10-23 11:05:00,603] Retry client=0/5, server=33/100 for http://pastebin.gr/paste.php?download&id=1
[2020-10-23 11:05:00,695] Failed to download the page because of other HTTPlib error proxy error: http://pastebin.gr/paste.php?download&id=1
[2020-10-23 11:05:00,696] Traceback (most recent call last):
File "./pystemon.py", line 993, in __download_url__
res = __parse_http__(url, session, random_proxy)
File "./pystemon.py", line 956, in __parse_http__
response.raise_for_status()
File "/home/parallels/Develop/pystemon/venv/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: http://pastebin.gr/paste.php?download&id=1
Thank you in advance!
There seems to be a bug with regex pattern matching for email addresses when the regex pattern is set to search only for the domain, as in the following example:
If a paste contains "[email protected]", no match is triggered.
Tested on different paste websites.
Can you replicate this issue?
When running under Python 3:
pystemon.py:1391: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
yamlconfig = yaml.load(open(configfile))
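The warning goes away when a loader is chosen explicitly. For a plain configuration file, `yaml.safe_load` is the usual choice. A minimal sketch, assuming the config contains only standard YAML types; `load_config` is a hypothetical wrapper name.

```python
try:
    import yaml  # PyYAML, a third-party package
except ImportError:
    yaml = None


def load_config(configfile):
    """Load a YAML configuration file with the safe loader."""
    if yaml is None:
        raise RuntimeError("PyYAML is required to read the configuration")
    with open(configfile) as handle:
        return yaml.safe_load(handle)
```

Using a `with` block also closes the file handle, which the original one-liner leaves open.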
If, for whatever reason, the download threads cannot keep up with the new pasties arriving, the queues will grow and end up eating lots of memory.
Add a new feature threads: auto
that automatically manages the creation of additional download threads for that specific website.
The user can then choose to let pystemon decide what's good (auto) or hard-configure a number of threads per paste-site (like today)
Hi. Your tool is just great, but I am encountering a problem:
If I try to search for some specific patterns, I get errors from slexy.org which I do not get while using other patterns:
[2015-11-21 20:17:34,760] Found 21 new pasties for site snipt.net. There are now 20 pasties to be downloaded.
[2015-11-21 20:17:36,885] Found hit for ['--------'] in pastie http://slexy.org/raw/s2SNyQD6FM
[2015-11-21 20:17:36,887] ThreadPasties for slexy.org crashed unexpectectly, recovering...: 'ascii' codec can't encode characters in position 1171-1172: ordinal not in range(128)
[2015-11-21 20:17:38,810] Found hit for ['---------'] in pastie http://slexy.org/raw/s21C3nxB2v
[2015-11-21 20:17:38,811] ThreadPasties for slexy.org crashed unexpectectly, recovering...: 'ascii' codec can't encode characters in position 1171-1172: ordinal not in range(128)
And, most important, I get absolutely no results from pastebin.com, no matter what the pattern is.
I am behind Tor through the "delegated" daemon.
Thanks
Configuration file:
db:
sqlite3:
enable: no
file: 'db.sqlite3'
Even when this is set to 'no', pystemon fails at launch while attempting to import the sqlite library. Probably not a big deal: I just recompiled Python with sqlite support and everything worked fine.
It appears the rest of the scraping is working just fine but I noticed Pastebin was having some problems today.
ERROR: URL Error ############################# http://pastebin.com/archive
Thread for pastebin.com crashed unexpectectly, recovering...: 'NoneType' object is not iterable
Traceback (most recent call last):
File "pystemon.py", line 92, in run
last_pasties = self.getLastPasties()
File "pystemon.py", line 106, in getLastPasties
htmlPage, headers = downloadUrl(self.archive_url)
TypeError: 'NoneType' object is not iterable
and
Downloading pasties from pastebin.com. Next download scheduled in 34 seconds
Downloading url: http://pastebin.com/archive with proxy: None and user-agent: None
ERROR: HTTP Error ############################# http://pastebin.com/archive
No HTML content for page http://pastebin.com/archive