comicbookmaker's People

Contributors

sylvainde

comicbookmaker's Issues

Problem with retrieval of Garfield images

Since 12 April 2016, the naive strategy used to retrieve Garfield comics no longer works.

The comic URL seems just right but the image URL is not. For the record, the last working case and the first failing case:

{
    "comic": "Garfield",
    "day": 11,
    "img": [
        "http://garfield.com/uploads/strips/2016-04-11.jpg"
    ],
    "local_img": [
        "garfield/2016-04-11.jpg"
    ],
    "month": 4,
    "url": "http://garfield.com/comic/2016-04-11",
    "year": 2016
},
{
    "comic": "Garfield",
    "day": 12,
    "img": [
        "http://garfield.com/uploads/strips/2016-04-12.jpg"
    ],
    "local_img": [
        null
    ],
    "month": 4,
    "url": "http://garfield.com/comic/2016-04-12",
    "year": 2016
},
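
The two records above suggest that both URLs are derived directly from the publication date. As a minimal sketch of that date-based guess (the function name `garfield_urls` is hypothetical and this is not necessarily how the project builds them):

```python
import datetime


def garfield_urls(date):
    """Guess the comic page URL and image URL for a given date.

    Hypothetical helper: it only reproduces the URL pattern visible
    in the two records above and does not check that the URLs exist.
    """
    day = date.strftime("%Y-%m-%d")
    return {
        "url": "http://garfield.com/comic/" + day,
        "img": "http://garfield.com/uploads/strips/" + day + ".jpg",
    }


guess = garfield_urls(datetime.date(2016, 4, 12))
print(guess["url"])
print(guess["img"])
```

Because the image URL is only guessed, a failed download (the `null` in `local_img`) is the only sign that the site no longer follows the pattern.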

Refactoring - detect comics using the same CMS and reuse code

As I was about to add a comic, I noticed that I was about to write code I had already written in the past. Indeed, retrieving comics from websites using WordPress is pretty much always done the same way.

Detecting the meta generator gives:

asp : about to update
< meta content="WordPress 3.5.2" name="generator"/>
asp : nothing new

berkeley : about to update
< meta content="WordPress 2.9" name="generator"/>
berkeley : nothing new

blues : about to update
< meta content="WordPress 3.5.1" name="generator"/>
blues : nothing new

boulet : about to update
< meta content="WordPress 3.5" name="generator"/>
boulet : nothing new

boulet_en : about to update
< meta content="WordPress 3.5" name="generator"/>
boulet_en : nothing new

butter : about to update
< meta content="WordPress 4.3.1" name="generator"/>
butter : nothing new

chuckleaduck : about to update
< meta content="WordPress 4.3.1" name="generator"/>
chuckleaduck : nothing new

completelyserious : about to update
< meta content="WordPress 3.4.1" name="generator"/>
completelyserious : nothing new

efc : about to update
< meta content="WordPress 4.3.1" name="generator"/>
efc : nothing new

fatawesome : about to update
< meta content="WordPress 4.2.5" name="generator"/>
fatawesome : nothing new

gentlemanarmchair : about to update
< meta content="WordPress 4.3.1" name="generator"/>
gentlemanarmchair : nothing new

happletea : about to update
< meta content="WordPress 4.0" name="generator"/>
happletea : nothing new

penmen : about to update
< meta content="WordPress 4.3.1" name="generator"/>
penmen : nothing new

picturesinboxes : about to update
< meta content="WordPress 4.3.1" name="generator"/>
picturesinboxes : nothing new

poorlydrawn : about to update
< meta content="WordPress 4.3.1" name="generator"/>
poorlydrawn : nothing new

rapture : about to update
< meta content="WordPress 4.3.1" name="generator"/>
rapture : nothing new

thor : about to update
< meta content="The Grawlix CMS — the CMS for comics" name="generator"/>
thor : nothing new

toonhole : about to update
< meta content="WordPress 4.3.1" name="generator"/>
toonhole : nothing new

warehouse : about to update
< meta content="WordPress 4.3.1" name="generator"/>
warehouse : nothing new
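
For reference, a minimal sketch of how the generator could be detected. It uses only the standard library `html.parser` rather than BeautifulSoup, and the names `GeneratorFinder`, `detect_generator` and `uses_wordpress` are hypothetical, not part of the project:

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def detect_generator(html):
    """Return the first generator string found, or None."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generators[0] if finder.generators else None


def uses_wordpress(html):
    """True if the page declares WordPress as its generator."""
    generator = detect_generator(html)
    return generator is not None and generator.lower().startswith("wordpress")
```

A comic class could then pick a shared WordPress retrieval routine when `uses_wordpress` returns True and fall back to site-specific code otherwise. Sites that hide the generator tag would not be detected this way.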

BeautifulSoup warning

I don't know what changed in my config but I now get:

my_env/lib/python3.4/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

This seems to correspond to https://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser .

Save "new" comics

There should be a simple way to do something like:

  • update comics
  • update comics
  • update comics
  • create book with "new" comics
  • reset "new" comics

It should be simple to save information about the comics we retrieve (in a JSON file, for instance), to reuse this information on demand and to clean it. I am not sure what the cleanest and most intuitive way to make this work is.
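
One possible shape for this, as a rough sketch only: a small JSON-backed log that accumulates comics across several updates and can be emptied once the book is created. The class name `NewComicsLog` and the choice of the `url` field as the unique key are assumptions:

```python
import json
import os


class NewComicsLog:
    """Keep track of comics retrieved since the last reset (hypothetical)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        """Return the saved comics, or an empty list if nothing is saved."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def _save(self, comics):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(comics, f, indent=4, sort_keys=True)

    def add(self, comics):
        """Append comics not already recorded (assumes 'url' is unique)."""
        known = self.load()
        seen = {comic["url"] for comic in known}
        for comic in comics:
            if comic["url"] not in seen:
                known.append(comic)
                seen.add(comic["url"])
        self._save(known)

    def reset(self):
        """Forget all recorded comics."""
        self._save([])
```

With this, each update would call `add()` with whatever it retrieved, the book would be built from `load()`, and `reset()` would clear the list afterwards.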

Problem with retrieval of ZenPencils

$ python comicbookmaker.py -c zenpencils
zenpencils : about to update
[<img alt="196. The Can-do girl" src="http://cdn.zenpencils.com/wp-content/uploads/can_do_girl2.jpg" title="196. The Can-do girl"/>]
zenpencils : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 88, in <module>
    main()
  File "comicbookmaker.py", line 63, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
    for comic in cls.get_next_comic(comics[-1] if comics else None):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 104, in get_next_comic
    comic = cls.get_comic_info(soup, next_comic)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 528, in get_comic_info
    assert all(i['alt'] == i['title'] == "" for i in imgs)
AssertionError

Problem with retrieval of Octopuns

http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html redirects to a generic web hosting website, which leads to the following exception:

DEBUG:root:load_db octopuns start
DEBUG:root:update octopuns last comic is http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html
DEBUG:root:get_content (url : http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html)
DEBUG:root:urlopen_wrapper (url : http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html)
octopuns : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 98, in <module>
    main()
  File "comicbookmaker.py", line 73, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
    for comic in cls.get_next_comic(last_comic):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 121, in get_next_comic
    if url else \
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 108, in get_next_link
    return cls.get_navi_link(last_soup, True)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1470, in get_navi_link
    link = last_soup.find('img', src=re.compile('.*/Next.png' if next_ else '.*/Back.png')).parent
AttributeError: 'NoneType' object has no attribute 'parent'
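
A redirect like this could be caught before parsing by comparing the host that was requested with the host that actually answered (with `urllib.request.urlopen`, the final URL is available through `response.geturl()`). A minimal sketch, where `left_site` is a hypothetical helper and the parking URL in the usage lines is made up:

```python
from urllib.parse import urlsplit


def _host(url):
    """Lower-cased host name without a leading 'www.'."""
    name = (urlsplit(url).hostname or "").lower()
    return name[4:] if name.startswith("www.") else name


def left_site(requested_url, final_url):
    """True if following redirects ended up on a different host."""
    return _host(requested_url) != _host(final_url)


requested = "http://www.octopuns.net/2014/03/im-sorry-im-still-here-guys.html"
# Made-up destination standing in for the generic hosting page.
print(left_site(requested, "http://parking.example.com/"))
```

When the check is true, the update could stop with a clear message instead of failing later on a missing navigation image.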

Problem with retrieval of Respawn

respawn : about to update
<a class="navi comic-nav-next navi-next" href="http://respawncomic.com/comic/c0063/" title="Next &gt;">Next &gt;</a>
respawn : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 88, in <module>
    main()
  File "comicbookmaker.py", line 63, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
    for comic in cls.get_next_comic(comics[-1] if comics else None):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 104, in get_next_comic
    comic = cls.get_comic_info(soup, next_comic)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1713, in get_comic_info
    title = soup.find('h2', class_='post-title').string
AttributeError: 'NoneType' object has no attribute 'string'

Retrieval of Mercwork triggers TypeError - page format changed

Traceback (most recent call last):
  File "comicbookmaker.py", line 98, in <module>
    main()
  File "comicbookmaker.py", line 73, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
    for comic in cls.get_next_comic(last_comic):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 132, in get_next_comic
    comic = cls.get_comic_info(soup, next_comic)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 1023, in get_comic_info
    author = soup.find('meta', attrs={'name': 'shareaholic:article_author_name'})['content']
TypeError: 'NoneType' object is not subscriptable

Images are retrieved twice for Infinite Monkey Business

For instance:

    {
        "img": [
            "http://infinitemonkeybusiness.net/wp-content/uploads/2015/11/Snake-Charmer.jpg", 
            "http://i2.wp.com/infinitemonkeybusiness.net/wp-content/uploads/2015/11/Snake-Charmer.jpg?resize=750%2C767"
        ], 
        "title": "Snake Charmer - Infinite Monkey Business", 
        "url": "http://infinitemonkeybusiness.net/comic/snake-charmer/", 
        "month": 12, 
        "year": 2015, 
        "local_img": [
            "monkey/Snake-Charmer.jpg", 
            "monkey/Snake-Charmer.jpg?resize=750,767.jpeg"
        ], 
        "comic": "Infinite Monkey Business", 
        "day": 11
    }, 
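
The second URL looks like a resized copy of the first one served through a `wp.com` image host. Assuming that is the case, a sketch of how such duplicates could be filtered out (the helper names are hypothetical, and the `iN.wp.com/<original host>/<original path>` layout is inferred from the example above):

```python
import re
from urllib.parse import urlsplit

# Hosts such as i0.wp.com, i1.wp.com, i2.wp.com.
_WP_CDN_HOST = re.compile(r"^i\d\.wp\.com$")


def canonical_image_url(url):
    """Return 'host/path' of the original image, ignoring CDN and query."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path
    if _WP_CDN_HOST.match(host):
        # Path is assumed to be "/<original host>/<original path>".
        host, _, rest = path.lstrip("/").partition("/")
        host = host.lower()
        path = "/" + rest
    return host + path


def dedupe_images(urls):
    """Keep the first URL for each distinct underlying image."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_image_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Applied to the `img` list above, only the first URL would be kept, so the image would be downloaded once.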

Problem with retrieval of Things in Square

$ python comicbookmaker.py -c squares
squares : about to update
squares : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 88, in <module>
    main()
  File "comicbookmaker.py", line 63, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 179, in update
    for comic in cls.get_next_comic(comics[-1] if comics else None):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 187, in get_next_comic
    for archive_elt in cls.get_archive_elements():
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 2072, in get_archive_elements
    return reversed(get_soup_at_url(archive_url).find('tbody').find_all('tr'))
AttributeError: 'NoneType' object has no attribute 'find_all'

I guess the format has changed...

New date format for Anything Comic

On http://www.anythingcomic.com/comics/2365881/hoppy-christmas/ :

$ python comicbookmaker.py -c anythingcomic
anythingcomic : about to update
anythingcomic : nothing new
Traceback (most recent call last):
  File "comicbookmaker.py", line 98, in <module>
    main()
  File "comicbookmaker.py", line 73, in main
    com.update()
  File "/home/josay/Geekage/Comics/ComicBookMaker/comic_abstract.py", line 200, in update
    for comic in cls.get_next_comic(last_comic):
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 238, in get_next_comic
    comic = cls.get_comic_info(soup, archive_elt)
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 2235, in get_comic_info
    day = string_to_date(td_date.string, '%d %b %Y %I:%M %p')
  File "/home/josay/Geekage/Comics/ComicBookMaker/comics.py", line 4906, in string_to_date
    ret = datetime.datetime.strptime(string, date_format).date()
  File "/usr/lib/python3.4/_strptime.py", line 500, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/lib/python3.4/_strptime.py", line 337, in _strptime
    (data_string, format))
ValueError: time data 'December 10th, 2016, 4:00 am' does not match format '%d %b %Y %I:%M %p'

https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
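
The new string uses a full month name and an ordinal day ("10th"), and `strptime` has no directive for ordinal suffixes. A possible workaround, sketched under the assumption that all dates now look like the one in the error message and that month names are parsed in an English locale (`parse_anything_comic_date` is a hypothetical name):

```python
import datetime
import re

# Matches "st", "nd", "rd" or "th" directly after a digit, as in "10th".
_ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_anything_comic_date(text):
    """Parse a string such as 'December 10th, 2016, 4:00 am' into a date."""
    cleaned = _ORDINAL_SUFFIX.sub("", text.strip())
    parsed = datetime.datetime.strptime(cleaned, "%B %d, %Y, %I:%M %p")
    return parsed.date()


print(parse_anything_comic_date("December 10th, 2016, 4:00 am"))
```

The look-behind keeps month names such as "August" intact, since the suffix is only removed when it follows a digit.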
