sylvainde / comicbookmaker
Script to fetch webcomics and use them to create ebooks.
License: MIT License
Since 12 April 2016, the naive strategy used to retrieve Garfield comics no longer works.
The comic URL looks right but the image URL does not. For the record, here are the last working case and the first failing case:
{
"comic": "Garfield",
"day": 11,
"img": [
"http://garfield.com/uploads/strips/2016-04-11.jpg"
],
"local_img": [
"garfield/2016-04-11.jpg"
],
"month": 4,
"url": "http://garfield.com/comic/2016-04-11",
"year": 2016
},
{
"comic": "Garfield",
"day": 12,
"img": [
"http://garfield.com/uploads/strips/2016-04-12.jpg"
],
"local_img": [
null
],
"month": 4,
"url": "http://garfield.com/comic/2016-04-12",
"year": 2016
},
http://threewordphrase.com/ gives a Server not found error at the moment.
As I was about to add a comic, I noticed that I was about to write code I had already written in the past. Indeed, retrieving comics from websites using WordPress is pretty much always done the same way.
Detecting the meta generator gives:
asp : about to update
< meta content="WordPress 3.5.2" name="generator"/>
asp : nothing new
berkeley : about to update
< meta content="WordPress 2.9" name="generator"/>
berkeley : nothing new
blues : about to update
< meta content="WordPress 3.5.1" name="generator"/>
blues : nothing new
boulet : about to update
< meta content="WordPress 3.5" name="generator"/>
boulet : nothing new
boulet_en : about to update
< meta content="WordPress 3.5" name="generator"/>
boulet_en : nothing new
butter : about to update
< meta content="WordPress 4.3.1" name="generator"/>
butter : nothing new
chuckleaduck : about to update
< meta content="WordPress 4.3.1" name="generator"/>
chuckleaduck : nothing new
completelyserious : about to update
< meta content="WordPress 3.4.1" name="generator"/>
completelyserious : nothing new
efc : about to update
< meta content="WordPress 4.3.1" name="generator"/>
efc : nothing new
fatawesome : about to update
< meta content="WordPress 4.2.5" name="generator"/>
fatawesome : nothing new
gentlemanarmchair : about to update
< meta content="WordPress 4.3.1" name="generator"/>
gentlemanarmchair : nothing new
happletea : about to update
< meta content="WordPress 4.0" name="generator"/>
happletea : nothing new
penmen : about to update
< meta content="WordPress 4.3.1" name="generator"/>
penmen : nothing new
picturesinboxes : about to update
< meta content="WordPress 4.3.1" name="generator"/>
picturesinboxes : nothing new
poorlydrawn : about to update
< meta content="WordPress 4.3.1" name="generator"/>
poorlydrawn : nothing new
rapture : about to update
< meta content="WordPress 4.3.1" name="generator"/>
rapture : nothing new
thor : about to update
< meta content="The Grawlix CMS — the CMS for comics" name="generator"/>
thor : nothing new
toonhole : about to update
< meta content="WordPress 4.3.1" name="generator"/>
toonhole : nothing new
warehouse : about to update
< meta content="WordPress 4.3.1" name="generator"/>
warehouse : nothing new
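The detection above could be reproduced roughly as follows. This is a minimal sketch using only the standard library; the project itself parses pages with BeautifulSoup, and the names `GeneratorFinder` and `find_generators` are hypothetical, not taken from the repository.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "generator":
            self.generators.append(attr_map.get("content"))


def find_generators(html):
    """Return the list of generator strings found in an HTML document."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generators


sample = '<html><head><meta content="WordPress 4.3.1" name="generator"/></head></html>'
print(find_generators(sample))  # prints ['WordPress 4.3.1']
```

A shared WordPress base class could then be chosen whenever the generator string starts with "WordPress", although whether the sites share enough markup for that is still to be checked.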
http://fatawesome.com/the-incredible-antman/
and
http://lastplacecomics.com/comic/this-guy/
are currently giving the following error: "Error establishing a database connection".
This will probably be solved at some point...
As per http://www.safelyendangered.com/
"Safely Endangered is down for maintenance! "
For a few comics, nothing is retrieved at all.
A few other comics have saved their image to the same file ('geek/.jpeg') because the url of the file is a bit unusual.
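One possible way to avoid such collisions is to fall back to a hash of the URL when it carries no usable file name. The sketch below is an assumption about how this could be done, not the project's actual naming code; `local_image_name` and its `default_ext` parameter are hypothetical.

```python
import hashlib
import posixpath
from urllib.parse import unquote, urlparse


def local_image_name(url, default_ext=".jpeg"):
    """Build a local file name from an image URL (hypothetical helper)."""
    # Keep only the path: the query string and fragment are dropped
    path = unquote(urlparse(url).path)
    stem, ext = posixpath.splitext(posixpath.basename(path))
    if not stem:
        # No file name in the URL: use a short hash so that different
        # URLs do not all end up in the same '.jpeg' file
        stem = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return stem + (ext or default_ext)


print(local_image_name("http://example.com/uploads/strip-42.png"))  # prints strip-42.png
print(local_image_name("http://example.com/image/?id=1"))
print(local_image_name("http://example.com/image/?id=2"))
```

Note that two URLs differing only by their query string but sharing a real file name still map to the same local name with this approach, so the caller has to decide whether that counts as a duplicate.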
I don't know what changed in my config but I now get:
my_env/lib/python3.4/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
This seems to correspond to https://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser .
Links like https://garfield.com/comic/2016/12/15 lead to a form asking for more information.
The code does not crash but no comic gets retrieved, and the navigation_navigation method detects that something is wrong:
garfield : about to update
From https://garfield.com/comic/2016/12/15 : no previous nor next
garfield : nothing new
There should be a simple way to do something like:
It should be simple to save information about the comics we retrieve (in a JSON file for instance), to re-use this information on demand and to clean it. I am not quite sure what the most elegant and intuitive way to make this work is.
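As a starting point for discussion, here is a minimal sketch of what saving, reloading and cleaning could look like. The function signatures and the cleaning rule (dropping entries whose images were not all saved locally) are assumptions made for illustration, not the project's existing behaviour.

```python
import json
import os
import tempfile


def load_db(path):
    """Return the list of saved comics, or an empty list if there is no file."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_db(path, comics):
    """Write the list of comics to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(comics, f, indent=4, sort_keys=True)


def clean_db(comics):
    """Keep only the entries whose images were all saved locally."""
    return [c for c in comics if c.get("local_img") and all(c["local_img"])]


comics = [
    {"comic": "Garfield", "day": 11, "local_img": ["garfield/2016-04-11.jpg"]},
    {"comic": "Garfield", "day": 12, "local_img": [None]},
]

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "garfield.json")
    save_db(db_path, clean_db(comics))
    reloaded = load_db(db_path)

print([c["day"] for c in reloaded])  # prints [11]
```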
$ python comicbookmaker.py -c zenpencils
zenpencils : about to update
[<img alt="196. The Can-do girl" src="http://cdn.zenpencils.com/wp-content/uploads/can_do_girl2.jpg" title="196. The Can-do girl"/>]
zenpencils : nothing new
Traceback (most recent call last):
File "comicbookmaker.py", line 88, in <module>
main()
File "comicbookmaker.py", line 63, in main
com.update()
File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
for comic in cls.get_next_comic(comics[-1] if comics else None):
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 104, in get_next_comic
comic = cls.get_comic_info(soup, next_comic)
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 528, in get_comic_info
assert all(i['alt'] == i['title'] == "" for i in imgs)
AssertionError
http://www.octopuns.net/ seems to be dead now :(
http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html redirects to a generic Webhosting website leading to the following exception:
DEBUG:root:load_db octopuns start
DEBUG:root:update octopuns last comic is http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html
DEBUG:root:get_content (url : http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html)
DEBUG:root:urlopen_wrapper (url : http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html)
octopuns : nothing new
Traceback (most recent call last):
File "comicbookmaker.py", line 98, in <module>
main()
File "comicbookmaker.py", line 73, in main
com.update()
File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
for comic in cls.get_next_comic(last_comic):
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 121, in get_next_comic
if url else \
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 108, in get_next_link
return cls.get_navi_link(last_soup, True)
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1470, in get_navi_link
link = last_soup.find('img', src=re.compile('.*/Next.png' if next_ else '.*/Back.png')).parent
AttributeError: 'NoneType' object has no attribute 'parent'
http://toonhole.com/ seems to be unavailable
https://tapastic.com/series/HappyMondayComics does not exist anymore. I don't know why...
I think it has moved. Maybe to http://fatawesomecomedy.tumblr.com/
Page http://www.bouletcorp.com/2016/03/07/la-roue-du-karma/ contains an image with no src
<img id="control-img" class="alignnone size-full wp-image-6225" alt="Titre" />
which is not handled properly.
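One way to cope with this would be to skip images that have no src attribute when collecting image URLs. The sketch below uses the standard library parser rather than the BeautifulSoup calls used in the project, and `ImageSrcCollector` and `image_sources` are hypothetical names.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src of every <img> tag, skipping tags without one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # ignore a missing or empty src
                self.sources.append(src)


def image_sources(html):
    """Return the image URLs found in an HTML fragment."""
    collector = ImageSrcCollector()
    collector.feed(html)
    return collector.sources


page = (
    '<img id="control-img" class="alignnone size-full wp-image-6225" alt="Titre" />'
    '<img src="http://example.com/strip.png" alt="strip" />'
)
print(image_sources(page))  # prints ['http://example.com/strip.png']
```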
:'(
http://lastplacecomics.com seems to be down: redirection to http://p3nlhclust404.shr.prod.phx3.secureserver.net/SharedContent/redirect_0.html and message "Site suspended".
The archive format has changed.
respawn : about to update
<a class="navi comic-nav-next navi-next" href="http://respawncomic.com/comic/c0063/" title="Next >">Next ></a>
respawn : nothing new
Traceback (most recent call last):
File "comicbookmaker.py", line 88, in <module>
main()
File "comicbookmaker.py", line 63, in main
com.update()
File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
for comic in cls.get_next_comic(comics[-1] if comics else None):
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 104, in get_next_comic
comic = cls.get_comic_info(soup, next_comic)
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1713, in get_comic_info
title = soup.find('h2', class_='post-title').string
AttributeError: 'NoneType' object has no attribute 'string'
Format has changed. To be fixed asap.
http://tapastic.com/ seems to be temporarily unavailable.
I'll have to check when it is back if the format hasn't changed.
Traceback (most recent call last):
File "comicbookmaker.py", line 98, in <module>
main()
File "comicbookmaker.py", line 73, in main
com.update()
File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
for comic in cls.get_next_comic(last_comic):
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 132, in get_next_comic
comic = cls.get_comic_info(soup, next_comic)
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1023, in get_comic_info
author = soup.find('meta', attrs={'name': 'shareaholic:article_author_name'})['content']
TypeError: 'NoneType' object is not subscriptable
Everything has changed: both the navigation and the information available.
5e39c80 misses a @classmethod on the get_first_comic_link definition.
Related to #20
Sometimes, the comic information is available but the comic image is not available yet, which leads to many images being missing (and eventually makes the retrieval useless).
It would be nice to have something to expand them to maximum size. Unfortunately, I suck at HTML and I do not know what is supported by kindlegen.
For instance:
{
"img": [
"http://infinitemonkeybusiness.net/wp-content/uploads/2015/11/Snake-Charmer.jpg",
"http://i2.wp.com/infinitemonkeybusiness.net/wp-content/uploads/2015/11/Snake-Charmer.jpg?resize=750%2C767"
],
"title": "Snake Charmer - Infinite Monkey Business",
"url": "http://infinitemonkeybusiness.net/comic/snake-charmer/",
"month": 12,
"year": 2015,
"local_img": [
"monkey/Snake-Charmer.jpg",
"monkey/Snake-Charmer.jpg?resize=750,767.jpeg"
],
"comic": "Infinite Monkey Business",
"day": 11
},
Mostly default behaviour : path for output, list of comics, etc.
http://www.goneintorapture.com/ seems to be down. The corresponding tumblr is still working fine.
The URL format has changed.
One can either tweak the latest URL in the JSON file or just (re)move the existing folder and retrigger the download.
Probably an interesting read: https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/ . Might be related to #36 too.
It seems like the "next" button on the comics doesn't work anymore.
http://linscomics.tumblr.com/post/154205434064/httpwarandpeastumblrcom
You were looking for the comic blog L.I.N.S. but found this empty mess.
The thing is: we changed our name in late 2016 to “War and Peas”.
Please visit our new tumblr: http://warandpeas.tumblr.com/
New comics do not seem to be updated.
$ python comicbookmaker.py -c squares
squares : about to update
squares : nothing new
Traceback (most recent call last):
File "comicbookmaker.py", line 88, in <module>
main()
File "comicbookmaker.py", line 63, in main
com.update()
File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
for comic in cls.get_next_comic(comics[-1] if comics else None):
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 187, in get_next_comic
for archive_elt in cls.get_archive_elements():
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 2072, in get_archive_elements
return reversed(get_soup_at_url(archive_url).find('tbody').find_all('tr'))
AttributeError: 'NoneType' object has no attribute 'find_all'
I guess format has changed...
The site has changed :(
I have a Kindle but some might prefer a different format.
On http://www.anythingcomic.com/comics/2365881/hoppy-christmas/ :
$ python comicbookmaker.py -c anythingcomic
anythingcomic : about to update
anythingcomic : nothing new
Traceback (most recent call last):
File "comicbookmaker.py", line 98, in <module>
main()
File "comicbookmaker.py", line 73, in main
com.update()
File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
for comic in cls.get_next_comic(last_comic):
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 238, in get_next_comic
comic = cls.get_comic_info(soup, archive_elt)
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 2235, in get_comic_info
day = string_to_date(td_date.string, '%d %b %Y %I:%M %p')
File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 4906, in string_to_date
ret = datetime.datetime.strptime(string, date_format).date()
File "/usr/lib/python3.4/_strptime.py", line 500, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/usr/lib/python3.4/_strptime.py", line 337, in _strptime
(data_string, format))
ValueError: time data 'December 10th, 2016, 4:00 am' does not match format '%d %b %Y %I:%M %p'
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
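The date string in the error uses a full month name and an ordinal suffix ("10th"), neither of which the '%d %b %Y %I:%M %p' format accepts. A possible fix, sketched here with a hypothetical `parse_comic_date` helper rather than the project's `string_to_date`, is to strip the suffix and parse with a format matching the new layout:

```python
import datetime
import re

# Matches 'st', 'nd', 'rd' or 'th' right after a digit, as in '10th'
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_comic_date(text, date_format="%B %d, %Y, %I:%M %p"):
    """Parse a date such as 'December 10th, 2016, 4:00 am'."""
    cleaned = ORDINAL_SUFFIX.sub("", text)
    return datetime.datetime.strptime(cleaned, date_format).date()


print(parse_comic_date("December 10th, 2016, 4:00 am"))  # prints 2016-12-10
```

This assumes that every date on the site now follows the new layout and that month names are parsed with an English locale.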
Nothing about leleoz seems to exist anymore...