sylvainde / comicbookmaker

Script to fetch webcomics and use them to create ebooks.

License: MIT License

Python 100.00%

beautiful-soup beautifulsoup comic comic-downloader comics download-comic ebook kindle mobi python web-crawler webcomic

comicbookmaker's People

Contributors

cclauss jasonpy99 aenawi andrekis audunvn

comicbookmaker's Issues

Problem with retrieval of Garfield images

Since 12 April 2016, the naive strategy used to retrieve Garfield comics no longer works.

The comic URL seems right but the image URL is not. For the record, here are the last working case and the first failing case:

{
    "comic": "Garfield",
    "day": 11,
    "img": [
        "http://garfield.com/uploads/strips/2016-04-11.jpg"
    ],
    "local_img": [
        "garfield/2016-04-11.jpg"
    ],
    "month": 4,
    "url": "http://garfield.com/comic/2016-04-11",
    "year": 2016
},
{
    "comic": "Garfield",
    "day": 12,
    "img": [
        "http://garfield.com/uploads/strips/2016-04-12.jpg"
    ],
    "local_img": [
        null
    ],
    "month": 4,
    "url": "http://garfield.com/comic/2016-04-12",
    "year": 2016
},

ThreeWordPhrase does not seem to exist anymore

http://threewordphrase.com/ gives a Server not found error at the moment.

Refactoring - detect comics using the same CMS and reuse code

As I was about to add a comic, I noticed that I was about to write code I had already written in the past. Indeed, retrieving comics from websites using WordPress is pretty much always done the same way.

Detecting the meta generator gives:

asp : about to update
<meta content="WordPress 3.5.2" name="generator"/>
asp : nothing new

berkeley : about to update
<meta content="WordPress 2.9" name="generator"/>
berkeley : nothing new

blues : about to update
<meta content="WordPress 3.5.1" name="generator"/>
blues : nothing new

boulet : about to update
<meta content="WordPress 3.5" name="generator"/>
boulet : nothing new

boulet_en : about to update
<meta content="WordPress 3.5" name="generator"/>
boulet_en : nothing new

butter : about to update
<meta content="WordPress 4.3.1" name="generator"/>
butter : nothing new

chuckleaduck : about to update
<meta content="WordPress 4.3.1" name="generator"/>
chuckleaduck : nothing new

completelyserious : about to update
<meta content="WordPress 3.4.1" name="generator"/>
completelyserious : nothing new

efc : about to update
<meta content="WordPress 4.3.1" name="generator"/>
efc : nothing new

fatawesome : about to update
<meta content="WordPress 4.2.5" name="generator"/>
fatawesome : nothing new

gentlemanarmchair : about to update
<meta content="WordPress 4.3.1" name="generator"/>
gentlemanarmchair : nothing new

happletea : about to update
<meta content="WordPress 4.0" name="generator"/>
happletea : nothing new

penmen : about to update
<meta content="WordPress 4.3.1" name="generator"/>
penmen : nothing new

picturesinboxes : about to update
<meta content="WordPress 4.3.1" name="generator"/>
picturesinboxes : nothing new

poorlydrawn : about to update
<meta content="WordPress 4.3.1" name="generator"/>
poorlydrawn : nothing new

rapture : about to update
<meta content="WordPress 4.3.1" name="generator"/>
rapture : nothing new

thor : about to update
<meta content="The Grawlix CMS — the CMS for comics" name="generator"/>
thor : nothing new

toonhole : about to update
<meta content="WordPress 4.3.1" name="generator"/>
toonhole : nothing new

warehouse : about to update
<meta content="WordPress 4.3.1" name="generator"/>
warehouse : nothing new
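
The generator detection shown in the log above could be reproduced roughly as follows. This is only a sketch using the standard library's html.parser rather than whatever the project actually does; the names GeneratorFinder, find_generators and is_wordpress are made up for illustration.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the content of every <meta name="generator"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> also end up here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def find_generators(html):
    """Return the list of generator strings declared in the page."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generators


def is_wordpress(html):
    """Guess whether a page is served by WordPress from its generator tag."""
    return any(g.lower().startswith("wordpress") for g in find_generators(html))
```

Comics for which is_wordpress returns True would be candidates for sharing a common base class, while the others (Grawlix, for instance) would keep their own code.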

Problem with retrieval of a couple of comics (fatawesome / lastplace)

http://fatawesome.com/the-incredible-antman/
and
http://lastplacecomics.com/comic/this-guy/
are currently giving the following error: "Error establishing a database connection".

This will probably be solved at some point...

Problem with retrieval of Safely Endangered

As per http://www.safelyendangered.com/
"Safely Endangered is down for maintenance! "

Image retrieval of Geek and Poke seems wrong

For a few comics, no image is retrieved at all.
A few others have had their images saved to the same file ('geek/.jpeg') because the image URL is a bit unusual.
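
One possible way to avoid such collisions is to derive the local file name from the URL path only and to fall back to a hash of the full URL when the path has no usable base name. The function name local_image_name, the default extension and the example URLs used to exercise it are assumptions, not the project's actual code.

```python
import hashlib
import posixpath
from urllib.parse import unquote, urlparse


def local_image_name(img_url, default_ext=".jpeg"):
    """Build a file name for an image URL (hypothetical helper).

    The query string is ignored, and a short hash of the whole URL is used
    when the path ends with '/' or has no real base name.
    """
    path = unquote(urlparse(img_url).path)
    base = posixpath.basename(path)
    stem, ext = posixpath.splitext(base)
    if not stem or stem.startswith("."):
        # No usable name in the URL: use a stable, URL-specific stem instead.
        stem = hashlib.sha1(img_url.encode("utf-8")).hexdigest()[:12]
        ext = ""
    return stem + (ext or default_ext)
```

With this approach, two different URLs whose paths both end with a slash get two different file names instead of both being written to '.jpeg'.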

BeautifulSoup warning

I don't know what changed in my config but I now get:

my_env/lib/python3.4/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

This seems to correspond to https://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser .
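
As the warning itself suggests, naming the parser when building the soup should make it go away. A minimal sketch, where make_soup is a hypothetical wrapper and "html.parser" is the parser named in the warning:

```python
from bs4 import BeautifulSoup


def make_soup(html):
    """Parse HTML with an explicitly named parser (hypothetical wrapper)."""
    # Passing the parser name as the second argument avoids the UserWarning
    # and keeps the behaviour identical from one environment to another.
    return BeautifulSoup(html, "html.parser")


soup = make_soup("<html><body><img src='strip.png' alt='A strip'/></body></html>")
print(soup.find("img")["src"])
```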

Problem in the retrieval of Garfield comics

Links like https://garfield.com/comic/2016/12/15 lead to a form asking for more information.

The code does not crash, but no comic gets retrieved and the navigation_navigation method detects that something is wrong:

garfield : about to update
From https://garfield.com/comic/2016/12/15 : no previous nor next
garfield : nothing new

Save "new" comics

There should be a simple way to do something like:

update comics
update comics
update comics
create book with "new" comics
reset "new" comics

It should be simple to save information about the comics we retrieve (in a JSON file for instance), to reuse this information on demand and to clean it up. I am not quite sure what the most elegant and intuitive way to make this work is.
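
One way this could look, as a rough sketch only: keep the "new" comics in a separate JSON file that each update appends to, that the book creation reads, and that a reset deletes. The function names and the file layout are assumptions; each comic is assumed to be a dict with a "url" key, as in the JSON excerpts of other issues.

```python
import json
import os


def load_new_comics(path):
    """Return the list of comics recorded as new, or [] if nothing is recorded."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def record_new_comics(path, comics):
    """Append comics to the file, skipping URLs that are already recorded."""
    known = load_new_comics(path)
    seen = {c["url"] for c in known}
    for comic in comics:
        if comic["url"] not in seen:
            known.append(comic)
            seen.add(comic["url"])
    with open(path, "w", encoding="utf-8") as f:
        json.dump(known, f, indent=4, sort_keys=True)
    return known


def reset_new_comics(path):
    """Forget every comic recorded as new."""
    if os.path.exists(path):
        os.remove(path)
```

Each update would call record_new_comics with whatever it just retrieved, the book would be built from load_new_comics, and reset_new_comics would be called once the book has been created.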

Problem with retrieval of ZenPencils

$ python comicbookmaker.py -c zenpencils
zenpencils : about to update
[<img alt="196. The Can-do girl" src="http://cdn.zenpencils.com/wp-content/uploads/can_do_girl2.jpg" title="196. The Can-do girl"/>]
zenpencils : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 88, in <module>
    main()
  File "comicbookmaker.py", line 63, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
    for comic in cls.get_next_comic(comics[-1] if comics else None):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 104, in get_next_comic
    comic = cls.get_comic_info(soup, next_comic)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 528, in get_comic_info
    assert all(i['alt'] == i['title'] == "" for i in imgs)
AssertionError

Problem with retrieval of Octopuns

http://www.octopuns.net/ seems to be dead now :(

Problem with retrieval of Octopuns

http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html redirects to a generic Webhosting website leading to the following exception:

DEBUG:root:load_db octopuns start
DEBUG:root:update octopuns last comic is http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html
DEBUG:root:get_content (url : http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html)
DEBUG:root:urlopen_wrapper (url : http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html)
octopuns : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 98, in <module>
    main()
  File "comicbookmaker.py", line 73, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
    for comic in cls.get_next_comic(last_comic):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 121, in get_next_comic
    if url else \
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 108, in get_next_link
    return cls.get_navi_link(last_soup, True)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1470, in get_navi_link
    link = last_soup.find('img', src=re.compile('.*/Next.png' if next_ else '.*/Back.png')).parent
AttributeError: 'NoneType' object has no attribute 'parent'
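
The crash comes from calling .parent on the result of find, which is None when the page has no matching image. A defensive variant could look like the sketch below; get_navi_href is a hypothetical stand-in for the project's get_navi_link, and returning None to mean "no link found" is an assumption about what the caller expects.

```python
import re

from bs4 import BeautifulSoup


def get_navi_href(soup, next_=True):
    """Return the href of the next/back link, or None when the page has none."""
    pattern = re.compile(".*/Next.png" if next_ else ".*/Back.png")
    img = soup.find("img", src=pattern)
    if img is None or img.parent is None:
        # Typically a redirect to a parked-domain page: no navigation image.
        return None
    return img.parent.get("href")


page = BeautifulSoup(
    '<a href="/2014/03/next.html"><img src="/img/Next.png"/></a>', "html.parser"
)
print(get_navi_href(page))
```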

Problem with retrieval of Toon hole

http://toonhole.com/ seems to be unavailable

Problem with retrieval of Happy Monday from Tapastic

https://tapastic.com/series/HappyMondayComics does not exist anymore. I don't know why...

Problem with retrieval of FatAwesome Tumblr

I think it has moved. Maybe to http://fatawesomecomedy.tumblr.com/

Script hangs during retrieval of images (from Tumblr)

Problem with retrieval of Boulet Corp

Page http://www.bouletcorp.com/2016/03/07/la-roue-du-karma/ contains an image with no src
<img id="control-img" class="alignnone size-full wp-image-6225" alt="Titre" /> which is not handled properly.
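
A possible way to handle this is to ignore images that have no src attribute when collecting URLs. The sketch below relies on BeautifulSoup's attribute filter (src=True keeps only tags that define the attribute); image_urls is a hypothetical helper and the second image in the example is made up.

```python
from bs4 import BeautifulSoup


def image_urls(soup):
    """Return the src of every <img> tag that actually has a non-empty src."""
    return [
        img["src"]
        for img in soup.find_all("img", src=True)
        if img["src"].strip()
    ]


html = (
    '<div><img id="control-img" class="alignnone size-full wp-image-6225" alt="Titre" />'
    '<img src="http://example.com/strip.png" alt="Strip"/></div>'
)
print(image_urls(BeautifulSoup(html, "html.parser")))
```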

Problem with retrieval of DeadlyPanels

Moar comics

Meta list :
http://www.reddit.com/r/webcomics
http://en.m.wikipedia.org/wiki/List_of_webcomic_awards

List :

http://invisiblebread.com/

Something of That Ilk does not seem to exist anymore

:'(

Problem with retrieval of TubeyToons

Problem with retrieval of LastPlace comics

http://lastplacecomics.com seems to be down: redirection to http://p3nlhclust404.shr.prod.phx3.secureserver.net/SharedContent/redirect_0.html and message "Site suspended".

Problem with retrieval of SMBC

The archive format has changed.

Problem with retrieval of Respawn

respawn : about to update
<a class="navi comic-nav-next navi-next" href="http://respawncomic.com/comic/c0063/" title="Next &gt;">Next &gt;</a>
respawn : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 88, in <module>
    main()
  File "comicbookmaker.py", line 63, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
    for comic in cls.get_next_comic(comics[-1] if comics else None):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 104, in get_next_comic
    comic = cls.get_comic_info(soup, next_comic)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1713, in get_comic_info
    title = soup.find('h2', class_='post-title').string
AttributeError: 'NoneType' object has no attribute 'string'

Problem with retrieval of Chuckle A Duck

Format has changed. To be fixed asap.

Problem with retrieval of Dilbert

Problem with retrieval of Tapastic comics

http://tapastic.com/ seems to be temporarily unavailable.
Once it is back, I'll have to check whether the format has changed.

Use urljoin to remove dirty path handling

As per http://stackoverflow.com/questions/476511/resolving-a-relative-url-path-to-its-absolute-path .
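
For illustration, urllib.parse.urljoin resolves absolute paths, relative paths, protocol-relative links and already absolute URLs against the page URL, so no manual string handling is needed. The helper resolve_links and the example.com URLs are made up for this sketch.

```python
from urllib.parse import urljoin


def resolve_links(page_url, links):
    """Resolve every link found on a page against the URL of that page."""
    return [urljoin(page_url, link) for link in links]


# An absolute path is resolved against the host of the page.
print(urljoin("http://garfield.com/comic/2016-04-12", "/uploads/strips/2016-04-12.jpg"))
# A relative path is resolved against the directory of the page.
print(urljoin("http://example.com/comics/page/3.html", "../img/strip.png"))
```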

Retrieval of Mercwork triggers TypeError - page format changed

Traceback (most recent call last):
  File "comicbookmaker.py", line 98, in <module>
    main()
  File "comicbookmaker.py", line 73, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
    for comic in cls.get_next_comic(last_comic):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 132, in get_next_comic
    comic = cls.get_comic_info(soup, next_comic)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1023, in get_comic_info
    author = soup.find('meta', attrs={'name': 'shareaholic:article_author_name'})['content']
TypeError: 'NoneType' object is not subscriptable

    {
        "img": [
            "http://infinitemonkeybusiness.net/wp-content/uploads/2015/11/Snake-Charmer.jpg", 
            "http://i2.wp.com/infinitemonkeybusiness.net/wp-content/uploads/2015/11/Snake-Charmer.jpg?resize=750%2C767"
        ], 
        "title": "Snake Charmer - Infinite Monkey Business", 
        "url": "http://infinitemonkeybusiness.net/comic/snake-charmer/", 
        "month": 12, 
        "year": 2015, 
        "local_img": [
            "monkey/Snake-Charmer.jpg", 
            "monkey/Snake-Charmer.jpg?resize=750,767.jpeg"
        ], 
        "comic": "Infinite Monkey Business", 
        "day": 11
    },

You were looking for the comic blog L.I.N.S. but found this empty mess.
The thing is: we changed our name in late 2016 to “War and Peas”.
Please visit our new tumblr: http://warandpeas.tumblr.com/

Problem with retrieval of PHD Comics

New comics do not seem to be updated.

Problem with retrieval of Things in Square

$ python comicbookmaker.py -c squares
squares : about to update
squares : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 88, in <module>
    main()
  File "comicbookmaker.py", line 63, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
    for comic in cls.get_next_comic(comics[-1] if comics else None):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 187, in get_next_comic
    for archive_elt in cls.get_archive_elements():
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 2072, in get_archive_elements
    return reversed(get_soup_at_url(archive_url).find('tbody').find_all('tr'))
AttributeError: 'NoneType' object has no attribute 'find_all'

I guess the format has changed...

$ python comicbookmaker.py -c anythingcomic
anythingcomic : about to update
anythingcomic : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 98, in <module>
    main()
  File "comicbookmaker.py", line 73, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
    for comic in cls.get_next_comic(last_comic):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 238, in get_next_comic
    comic = cls.get_comic_info(soup, archive_elt)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 2235, in get_comic_info
    day = string_to_date(td_date.string, '%d %b %Y %I:%M %p')
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 4906, in string_to_date
    ret = datetime.datetime.strptime(string, date_format).date()
  File "/usr/lib/python3.4/_strptime.py", line 500, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/lib/python3.4/_strptime.py", line 337, in _strptime
    (data_string, format))
ValueError: time data 'December 10th, 2016, 4:00 am' does not match format '%d %b %Y %I:%M %p'

https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
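
The date in the error message carries an ordinal suffix ("10th"), a full month name and a comma before the time, none of which match '%d %b %Y %I:%M %p'. A sketch of one way to parse the new form: strip the ordinal suffix, then use a format built from the directives documented at the link above. The name parse_archive_date is hypothetical, and English month names (the default C locale) are assumed.

```python
import datetime
import re

# Matches "st", "nd", "rd" or "th" only when it directly follows a digit.
ORDINAL_RE = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_archive_date(text, date_format="%B %d, %Y, %I:%M %p"):
    """Parse strings such as 'December 10th, 2016, 4:00 am' into a date."""
    cleaned = ORDINAL_RE.sub("", text.strip())
    return datetime.datetime.strptime(cleaned, date_format).date()


print(parse_archive_date("December 10th, 2016, 4:00 am"))
```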