wandering_inn's Introduction

wandering_inn

NOTE: It looks like wanderinginn.com has changed firewall settings to prevent/limit scraping; attempting to scrape an entire volume--much less multiple volumes--will likely result in a ban on your IP.

I've re-checked the site for any prohibition on scraping and don't see any. I'm not sure what the actual threshold to trigger a ban is (best guess: a sustained high rate of access over a short period of time; I was banned with ~100 page accesses in a day, but the last 90 or so of those were in rapid succession via script).

I recommend against using any options other than "--chapter" to select content.

Download and convert The Wandering Inn to epub and mobi (kindle) format

I have no affiliation with and no rights to The Wandering Inn; I'm just a fan who likes to read on my kindle and on my phone even when I don't have internet access.

The created ebook sometimes has some rough patches to it; I'd encourage you to buy the official releases as they happen on Amazon to get a polished copy and support the author. I only created this project so I can catch/keep up with the web publications.

This script relies on ebookmaker and its dependencies, and runs on python3. Converting to mobi format also relies on calibre, and the conversion script (wanderinginn2mobi.sh) is written for bash.

Usage

  1. Clone this repository:
git clone --recurse-submodules https://github.com/Patrick-Hogan/wandering_inn.git
cd wandering_inn
  2. Install requirements

    Recommended: install requirements in a virtual environment.

    # in a virtual environment:
    pip install -r requirements.txt

    # Alternatively, use the --user flag to avoid needing sudo/admin:
    pip install --user -r requirements.txt
  3. Run the script:

    Options can be displayed by passing -h or --help:

    ./wanderinginn2epub.py --help

    Generate a single epub for all available public chapters:

    ./wanderinginn2epub.py

    Pretty print the chapters that would be included:

    ./wanderinginn2epub.py --output-print-index

    Generate one epub per volume for volumes 1-7:

    ./wanderinginn2epub.py --volume 1 2 3 4 5 6 7 --output-by-volume

    Generate one epub per chapter for volume 8, stripping color so light fonts are readable on black-and-white screens (e.g., winter sprites' conversations):

    ./wanderinginn2epub.py --volume 8 --output-by-chapter --strip-color

    Generate an epub for the latest published chapter only:

    ./wanderinginn2epub.py --chapter latest --output-by-chapter

Automated Mailing

I have never set up automated mailing from a Windows box and have no interest in doing so; configuration on Linux varies, but the following setup works for me on Ubuntu 18.10+.

To automatically email to Amazon devices, I use msmtp and mutt, configured to send mail through my Gmail account; any backend mail configuration can be used, though, as long as the sending account is added to the approved addresses in your Amazon account. Once your configuration allows mutt to send mail from the command line, you can specify the list of recipients (e.g., the device addresses from your Amazon account) in an environment variable, on the command line, or in a recipients.txt file (space or newline separated).
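As a rough illustration of the three recipient sources described above, here is a minimal Python sketch. The mailing step itself is handled by the bash script, so this is not the project's code: the function names, the `RECIPIENTS` variable name, and the precedence order (command line, then environment, then file) are all assumptions made for the example.

```python
import os


def split_addresses(text):
    """Split a space- or newline-separated address list into a Python list."""
    # str.split() with no argument splits on any run of whitespace,
    # so it covers both separators mentioned in the README.
    return text.split()


def load_recipients(cli_args=(), env=None, env_var="RECIPIENTS",
                    path="recipients.txt"):
    """Return recipient addresses from the first source that provides any.

    Assumed precedence (not confirmed by the project): command line,
    then the environment variable, then the recipients file.
    """
    env = os.environ if env is None else env
    if cli_args:
        return list(cli_args)
    if env.get(env_var):
        return split_addresses(env[env_var])
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return split_addresses(fh.read())
    return []
```

Passing `env` explicitly keeps the function easy to try out without touching the real environment.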

wandering_inn's People

Contributors

jjtech0130, lkempf, not-na, patrick-hogan

wandering_inn's Issues

Error when getting image

When scraping Volume 7, this error happens regularly - sometimes in early chapters, sometimes in a bit later ones, but it never reaches 100%.

I believe this might be related to the new formats on the original website?

Traceback (most recent call last):
  File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 382, in <module>
    main()
  File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 363, in main
    get_book(volume_data,
  File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 304, in get_book
    ch.save(stream=fh, strip_color=strip_color, image_path=image_path)
  File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 124, in save
    with open(os.path.join(image_path, img_filename), 'wb') as fo:
OSError: [Errno 22] Invalid argument: './build\\html\\images\\hs_deceiver_final_lowres.jpg?fit=2908%2C2821&ssl=1'
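The file name in the error still carries the URL's query string (`?fit=...&ssl=1`), and Windows rejects `?` in file names. One possible fix, sketched below, is to derive the file name from the URL path only. The helper name and the example host are hypothetical; this is not the script's current code.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def image_filename_from_url(url):
    """Return a file name for an image URL, without its query string.

    Hypothetical helper, not part of wanderinginn2epub.py.
    """
    # urlsplit separates the path from the "?fit=...&ssl=1" query string.
    path = urlsplit(url).path
    # The last path component is the file name; unquote decodes %xx escapes.
    name = unquote(posixpath.basename(path))
    # Replace characters that Windows does not allow in file names.
    for ch in '<>:"/\\|?*':
        name = name.replace(ch, "_")
    # Fall back to a placeholder if the URL has no file component.
    return name or "image"


# example.com stands in for the real image host, which the traceback omits.
print(image_filename_from_url(
    "https://example.com/images/hs_deceiver_final_lowres.jpg?fit=2908%2C2821&ssl=1"
))
```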

error importing BeautifulSoup from bs4

Hi,
I am new to using python scripts. I followed the installation instructions, but I always get this error when running the script:
from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'
I tried both using venv and installing deps globally, also tried using codespace, they all yield the same results.
Any help will be appreciated.
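A common cause of this error (an assumption here, since the report does not say which commands were used) is that `pip` installed the requirements into a different Python interpreter than the one running the script. The small check below prints which interpreter is running and whether it can see the modules; `module_available` is a name made up for this sketch.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this with the same command used to start the script.
    print("interpreter:", sys.executable)
    for name in ("bs4", "tqdm"):
        print(name, "available" if module_available(name) else "MISSING")
```

If a module is reported as missing, installing with `python3 -m pip install -r requirements.txt` ties the install to the interpreter that `python3` resolves to.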

Wordpress Rate Limiter

Using this project on a large number of chapters causes the user's IP address to be banned by wanderinginn.com with the following error:


WordPress Backup & Security Plugin - BlogVault Firewall

Blocked because of Malicious Activities

Reference ID: <removed>

Presumably, the script hits some rate limit that trips a ban. Running for a few chapters at a time seems okay for now.

Workarounds:

  • Only use the script to download a few chapters at a time
  • (potential) Update the script to limit the rate at which wanderinginn.com is queried (this will result in significantly longer run times, but may prevent the ban). Caching pages more heavily and in a more organized way would make the delay less of a problem.
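The second workaround could look roughly like the sketch below: a wrapper that spaces requests out and never fetches the same URL twice. This is not the project's code. The class name, the `fetch` callable, and the 10-second default are assumptions, and the actual threshold that triggers the ban is unknown. The in-memory dictionary stands in for whatever on-disk cache a real implementation would use.

```python
import time


class ThrottledFetcher:
    """Fetch pages no faster than one request per `min_interval` seconds,
    and remember pages already fetched so they are never requested twice.

    Hypothetical helper; `fetch` is any callable that takes a URL and
    returns the page body (for example a thin wrapper around urlopen).
    """

    def __init__(self, fetch, min_interval=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        self._cache = {}

    def get(self, url):
        # Cached pages cost no request, so they incur no delay.
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_interval has passed since the last request.
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

The `clock` and `sleep` parameters exist only so the delay logic can be exercised without actually waiting.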

Can't figure out how to use repository - Send epub through email?

This is a gross misuse of the issue feature, but I was wondering if you could send an up to date epub to an attached email.

I fiddled with this for an hour, but I have no experience with git or python. I really just want to read the series...

Email is [email protected] if you are willing, if not delete. I'm not sure if you can attach files directly to an issue response, but that would also be great. Thanks for the help.

Heading structure in multivolume epubs

Currently the beginnings of chapters are all level-1 headings, likely because that's what the original pages use. I wonder if there's any way these could be changed to h2 for multivolume epubs, with a single level-1 heading at the beginning of each volume. I'm not sure if there's an easy way to modify this, and to truly get it right you'd need to shift all headings down by one level in case there's any existing subheadings in the chapters. However, it does come with some benefits: Any file conversions into flat formats (HTML, DOCX, etc) will have a proper heading structure, and ebook readers with heading shortcuts will be able to jump by volume instead of just by chapter. I personally use a screen-reader, and when in a webpage or other document with headings, I can press the numbers 1 through 6 to find the next heading at the given level. Thanks for your consideration and for such a great tool.
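A minimal sketch of the heading shift described above, assuming the chapter markup is available as a string. It uses a regular expression to keep the example dependency-free; since the script already parses pages with BeautifulSoup, a real change would more likely rename the tags there. Both function names are hypothetical.

```python
import re
from html import escape

# Matches the start of an opening or closing h1..h6 tag, e.g. "<h1" or "</h3".
_HEADING_TAG = re.compile(r"<(/?)h([1-6])\b", re.IGNORECASE)


def demote_headings(markup, levels=1):
    """Shift every <h1>..<h6> tag down by `levels`, capping at <h6>."""
    def shift(match):
        new_level = min(int(match.group(2)) + levels, 6)
        return "<{}h{}".format(match.group(1), new_level)
    return _HEADING_TAG.sub(shift, markup)


def wrap_volume(title, chapters_markup):
    """Put one level-1 heading in front of a volume's demoted chapters."""
    return "<h1>{}</h1>\n{}".format(escape(title),
                                    demote_headings(chapters_markup))
```

Capping at `<h6>` means a chapter that already uses all six levels would lose one level of distinction, which seems an acceptable trade-off for a sketch.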

FEATURE REQUEST: Download by "book"

Volumes are now sub-sectioned into "books", and now that there is an aggressive rate limit it would be useful to be able to download just a book.

missing html folder

Running ubuntu 20.04.

It would not work until I created a folder called html in the_wandering_inn folder.
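One way to avoid the manual step would be for the script to create its output directories before writing to them. The sketch below shows the idea; `ensure_dir` is a hypothetical helper and the exact directory the script expects is not confirmed here.

```python
import os


def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    # exist_ok=True makes the call a no-op when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return path
```

For example, `ensure_dir(os.path.join("build", "html", "images"))` would create the whole chain in one call.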

SSL Verification Issue

Running this command:
C:\...\wandering_inn>python wanderinginn2epub.py --volume 9 --output-by-volume

I get the following output:

Traceback (most recent call last):
  File "C:\Users\steph\wandering_inn\wanderinginn2epub.py", line 382, in <module>
    main()
  File "C:\Users\steph\wandering_inn\wanderinginn2epub.py", line 316, in main
    full_index = get_index()
  File "C:\Users\steph\wandering_inn\wanderinginn2epub.py", line 216, in get_index
    page = urlopen(toc_url)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 517, in open
    response = self._open(req, data)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1385, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1345, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1122)>

I'm pretty new to using github and have been using ChatGPT to get me started so I could definitely (most likely) be missing something obvious.

Unable to get volume from text

$ ./wanderinginn2epub.py --output-print-index
Unable to get volume from text: <p class="announcement-lowkey-item-category responsive-small"></p>

Perhaps the table-of-contents/ page was changed?

TOC is now using relative URLs

The new TOC is now using relative URLs, which leads to errors like this:

  File "/Users/jjtech/wandering_inn/./wanderinginn2epub.py", line 79, in get_page
    return open(self.url,'r')
           ^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/2023/04/30/9-41-pt-3/'

This can be fixed by replacing that line with something like this:

            return urlopen('https://wanderinginn.com/' + self.url, timeout=5)
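Note that plain concatenation produces a doubled slash (`https://wanderinginn.com//2023/...`) because the relative URL already starts with `/`. A slightly more robust variant, sketched here with a hypothetical helper name, resolves the link with `urllib.parse.urljoin`, which also leaves absolute URLs untouched:

```python
from urllib.parse import urljoin

BASE_URL = "https://wanderinginn.com/"


def absolute_url(href, base=BASE_URL):
    """Resolve a TOC link against the site root.

    Relative links gain the scheme and host; absolute links pass through.
    """
    return urljoin(base, href)


print(absolute_url("/2023/04/30/9-41-pt-3/"))
```

The result could then be passed to `urlopen(..., timeout=5)` as in the suggestion above.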

2 possible problems

Hi there,
I have never used GitHub before, so I don't know if I am doing any of this correctly...
I found this project a couple of days ago.
I am a beginner programmer and I tried creating something similar to this with Selenium before.
I found one problem with your implementation. In some chapters the "Author Notes" were not declared as such, which is why your implementation would sometimes include the pictures and non-chapter information posted by pirateaba. I think I found a fix for that. Please note I am a beginner, so the function could probably be written a lot better, but it was the best I could do. The code would go in at line 71 of your current code.

        # Drop the author's note: the paragraph that starts with "Author" and has
        # "Note" within its first 13 characters, plus everything after it.
        author_note = [p for p in contents.find_all('p')
                       if p.getText()[0:6] == 'Author' and 'Note' in p.getText()[0:13]]
        if author_note:
            end_content = author_note[0]
            for s in end_content.find_next_siblings():
                s.decompose()
            end_content.decompose()
        else:
            # Handle the rare chapters where "Author's Note:" was not used to mark
            # the community additions: scan backwards for a run of blank paragraphs.
            old_index = None
            count = 0
            for index, a in enumerate(reversed(list(contents.find_all('p')))):
                # 60 is a guess that seemed to work; an author's note is very
                # unlikely to start more than 60 paragraphs from the end.
                if index >= 60:
                    break
                if a.contents == ['\xa0']:
                    if old_index is None:
                        old_index = index
                    if (index - old_index) == 1:
                        count += 1
                    else:
                        count = 0
                    old_index = index
                    if count == 3:
                        for s in a.find_next_siblings():
                            s.decompose()
                        break

I also wrote a second function to work with the color codes found in your strip_color function. It adds the color name as text at the beginning and end of the tag containing the color. That way you would still know that a passage was color coded when reading on an ebook reader. It needs the webcolors library from PyPI, so I don't know if this is of interest to you...
In the end it would look like this: |GOLD| Nevermore |GOLD|
The code I wrote is:

import webcolors
import re

def closest_colour(requested_colour):
    """finds the closest color that returns a name from Webcolors"""
    min_colours = {}
    for key, name in webcolors.CSS2_HEX_TO_NAMES.items():
        r_c, g_c, b_c = webcolors.hex_to_rgb(key)
        rd = (r_c - requested_colour[0]) ** 2
        gd = (g_c - requested_colour[1]) ** 2
        bd = (b_c - requested_colour[2]) ** 2
        min_colours[(rd + gd + bd)] = name
    return min_colours[min(min_colours.keys())]

def get_colour_name(requested_colour):
    """Checks if the name of a color can be found by the RGB color code.
    If it can't be found it uses the closest_colour method to return the most likely candidate for a colorname."""
    try:
        closest_name = webcolors.rgb_to_name(requested_colour)
    except ValueError:
        closest_name = closest_colour(requested_colour)
    return " |" + closest_name.upper()+ "| "

and then at the strip_color check:

        if strip_color:
            # Strip color from text that can make it hard to read on a paperwhite:
            for span in contents.find_all('span'):
                if 'color' in str(span):
                    # Look up the name of the color and add it around the text, so that
                    # black-and-white Kindle devices still show what was originally color coded.
                    color = re.findall(r'(?<=color\:\#).{6}', str(span))[0]
                    color = (int(color[0:2], 16), int(color[2:4], 16), int(color[4:6], 16))
                    color = get_colour_name(color)
                    span.insert(0, color)
                    span.append(color)
                    span.unwrap() # bs4 method for replaceWithChildren

I know that none of this is an issue, but I didn't really understand how the push and pull request stuff works, or whether there is a better way to send someone a message on here. I hope some of this is useful.

Issue on OSX (Python 3.10)

Issue: All options generate empty epub books. I'm running OSX Ventura 13.2.1.

Might be multiple causes, starting with dependencies. I'm running Python3.10 as I only saw Python3 as a requirement, if you specifically suggest another version please let me know. When running pip install -r requirements.txt in the cloned directory, the following error is given:

ERROR: Could not install packages due to an OSError: Cannot move the non-empty directory '/opt/homebrew/lib/python3.10/site-packages/tqdm-4.64.1-py3.10.egg': Lacking write permission to '/opt/homebrew/lib/python3.10/site-packages/tqdm-4.64.1-py3.10.egg'.

Seeing as this is a permissions-related error, I resolved it by running the command as sudo. It may be worth noting that for OSX this is a necessary step. The problem still persists after this, though, and the same output is generated. Here is a sample command and the output:

Test Case #1
Input: python3.10 ./wanderinginn2epub.py
Output: Chapter: 0it [00:00, ?it/s] Generating ePub file for eBook "The Wandering Inn". ePub file "./build/The Wandering Inn.epub" successfully generated.

Test Case #2
Input: python3.10 ./wanderinginn2epub.py --volume 1 --output-by-chapter
Output: truncated all of the html output.. <a href="https://wanderinginn.com/2022/04/26/8-82-pt-3/">8.82 (Pt. 3)</a><br/> <a href="https://wanderinginn.com/2022/05/03/8-83/">8.83</a><br/> <a href="https://wanderinginn.com/2022/05/03/8-84/">8.84</a><br/> <a href="https://wanderinginn.com/2022/05/03/8-85/">8.85</a><br/> <a href="https://wanderinginn.com/2022/05/03/epilogue/">Epilogue</a></p> Chapter: 0it [00:00, ?it/s] Generating ePub file for eBook "The Wandering Inn".

Test case 1 generates an empty epub, while test case 2 does nothing but exit the program. I apologize if this is an error on my end or during installation, as Python is not something I am very familiar with.

can't get it to work

installed Python 3.7
created a virtual environment using conda
pip install -r requirements.txt

When I try to run ./wanderinginn2mobi.sh --help in bash I get:
sed: can't read recipients.txt: No such file or directory
./wanderinginn2mobi.sh: line 19: ebook-convert: command not found
I created an empty recipients.txt (whatever it is for), so now it only says
./wanderinginn2mobi.sh: line 19: ebook-convert: command not found

The submodule installation seems to have gone fine (the folder 'ebookmaker' is not empty and I can import ebookmaker in ipython).

I'd love to get a hint on how I could get this to work. I'm supporting the author on Patreon but I reeeealy don't want to read in a browser
cheers

Tool no longer works with the redesign

When trying after the redesign, on a new IP that is not banned, I get the following result when running the script:

┌─[switchblade@Switchblade] - [~/wandering_inn] - [2188]
└─[$] python3 wanderinginn2epub.py                                                                    [8:51:27]
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn".
ePub file "./build/The Wandering Inn.epub" successfully generated.
┌─[switchblade@Switchblade] - [~/wandering_inn] - [2189]
└─[$] cat cmd                                                                                         [8:51:38]
python3 wanderinginn2epub.py --volume 1 2 3 4 5 6 7 8 9 --output-by-volume
┌─[switchblade@Switchblade] - [~/wandering_inn] - [2190]
└─[$] $(cat cmd)                                                                                      [8:53:22]
Chapter: 0it [00:00, ?it/s]                                                               | 0/9 [00:00<?, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 1".
ePub file "./build/The Wandering Inn - Volume 1.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 2".
ePub file "./build/The Wandering Inn - Volume 2.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]█▎                                                     | 2/9 [00:00<00:00, 10.64it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 3".
ePub file "./build/The Wandering Inn - Volume 3.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 4".
ePub file "./build/The Wandering Inn - Volume 4.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]████████████████▋                                      | 4/9 [00:00<00:00, 10.73it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 5".
ePub file "./build/The Wandering Inn - Volume 5.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 6".
ePub file "./build/The Wandering Inn - Volume 6.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]████████████████████████████████                       | 6/9 [00:00<00:00, 10.67it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 7".
ePub file "./build/The Wandering Inn - Volume 7.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 8".
ePub file "./build/The Wandering Inn - Volume 8.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]███████████████████████████████████████████████▎       | 8/9 [00:00<00:00, 10.74it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 9".
ePub file "./build/The Wandering Inn - Volume 9.epub" successfully generated.
Volume: 100%|█████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 10.74it/s]


The ebooks are empty and do not contain any info, even when doing it by volume. I assume there is some issue finding the content or URLs. I have confirmed that I can hit the site, and other scrapers can see content.

What logs can I give you to help you fix?
