patrick-hogan / wandering_inn
Download and convert The Wandering Inn to epub and mobi (kindle) format
When trying after the redesign, on a new IP that is not banned, I get the following result when running the script:
┌─[switchblade@Switchblade] - [~/wandering_inn] - [2188]
└─[$] python3 wanderinginn2epub.py [8:51:27]
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn".
ePub file "./build/The Wandering Inn.epub" successfully generated.
┌─[switchblade@Switchblade] - [~/wandering_inn] - [2189]
└─[$] cat cmd [8:51:38]
python3 wanderinginn2epub.py --volume 1 2 3 4 5 6 7 8 9 --output-by-volume
┌─[switchblade@Switchblade] - [~/wandering_inn] - [2190]
└─[$] $(cat cmd) [8:53:22]
Chapter: 0it [00:00, ?it/s] | 0/9 [00:00<?, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 1".
ePub file "./build/The Wandering Inn - Volume 1.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 2".
ePub file "./build/The Wandering Inn - Volume 2.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]█▎ | 2/9 [00:00<00:00, 10.64it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 3".
ePub file "./build/The Wandering Inn - Volume 3.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 4".
ePub file "./build/The Wandering Inn - Volume 4.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]████████████████▋ | 4/9 [00:00<00:00, 10.73it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 5".
ePub file "./build/The Wandering Inn - Volume 5.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 6".
ePub file "./build/The Wandering Inn - Volume 6.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]████████████████████████████████ | 6/9 [00:00<00:00, 10.67it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 7".
ePub file "./build/The Wandering Inn - Volume 7.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 8".
ePub file "./build/The Wandering Inn - Volume 8.epub" successfully generated.
Chapter: 0it [00:00, ?it/s]███████████████████████████████████████████████▎ | 8/9 [00:00<00:00, 10.74it/s]
Generating ePub file for eBook "The Wandering Inn - Volume 9".
ePub file "./build/The Wandering Inn - Volume 9.epub" successfully generated.
Volume: 100%|█████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 10.74it/s]
The ebooks are empty and do not contain any content, even when generating them by volume. I assume there is some issue finding the content or URLs. I have confirmed that I can reach the site, and other scrapers can see the content.
What logs can I give you to help you fix this?
Running Ubuntu 20.04.
It would not work until I created a folder called html in the_wandering_inn folder.
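A minimal sketch of how the script could create its output folders itself instead of relying on them already existing. The function name `ensure_output_dirs` and the `html/images` layout are assumptions for illustration (the layout is inferred from the `./build\html\images\` path in another report), not the project's actual code.

```python
import os


def ensure_output_dirs(build_dir):
    """Create <build_dir>/html/images if it is missing (hypothetical helper)."""
    html_dir = os.path.join(build_dir, "html")
    image_dir = os.path.join(html_dir, "images")
    # exist_ok=True makes the call safe to repeat on every run.
    os.makedirs(image_dir, exist_ok=True)
    return html_dir, image_dir
```

Calling something like this once before any chapter is written should remove the need to create the folder by hand.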
Volumes are now sub-sectioned into "books", and now that there is an aggressive rate limit it would be useful to be able to download just a single book.
When scraping Volume 7, this error happens regularly, sometimes in early chapters and sometimes in later ones, but the run never reaches 100%.
I believe this might be related to the new formats on the original website?
Traceback (most recent call last):
File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 382, in <module>
main()
File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 363, in main
get_book(volume_data,
File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 304, in get_book
ch.save(stream=fh, strip_color=strip_color, image_path=image_path)
File "C:\Users\User\wandering_inn\wanderinginn2epub.py", line 124, in save
with open(os.path.join(image_path, img_filename), 'wb') as fo:
OSError: [Errno 22] Invalid argument: './build\\html\\images\\hs_deceiver_final_lowres.jpg?fit=2908%2C2821&ssl=1'
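The failing filename in the traceback still carries the URL query string (`?fit=...&ssl=1`), and `?` is not allowed in Windows filenames. A possible fix, sketched under the assumption that the image filename is derived from the image URL, is to keep only the path component. The helper name `image_filename_from_url` and the `example.com` host are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def image_filename_from_url(url):
    """Return the last path segment of a URL, without query or fragment."""
    path = urlsplit(url).path  # drops everything from '?' or '#' onward
    return posixpath.basename(unquote(path))


# Hypothetical URL using the filename from the traceback above.
name = image_filename_from_url(
    "https://example.com/uploads/hs_deceiver_final_lowres.jpg?fit=2908%2C2821&ssl=1"
)
```

The result could then be passed to `os.path.join(image_path, ...)` in place of the raw name.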
The new TOC is now using relative URLs, which leads to errors like this:
File "/Users/jjtech/wandering_inn/./wanderinginn2epub.py", line 79, in get_page
return open(self.url,'r')
^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/2023/04/30/9-41-pt-3/'
This can be fixed by replacing that line with something like this:
return urlopen('https://wanderinginn.com/' + self.url, timeout=5)
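A slightly more defensive variant of the same idea, as a sketch: `urllib.parse.urljoin` resolves relative links against a base and leaves absolute links untouched, which also avoids the doubled slash that plain string concatenation produces when the relative URL already starts with `/`. The names `BASE_URL` and `absolute_url` are assumptions, not identifiers from the script.

```python
from urllib.parse import urljoin

BASE_URL = "https://wanderinginn.com/"  # assumed site root


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly-relative TOC link against the site root."""
    return urljoin(base, href)


resolved = absolute_url("/2023/04/30/9-41-pt-3/")
```

The resolved value could then be handed to `urlopen(..., timeout=5)` as in the fix above.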
Using this project on a large number of chapters causes the user's IP address to be banned by wanderinginn.com with the following error:
WordPress Backup & Security Plugin - BlogVault Firewall
Blocked because of Malicious Activities
Reference ID: <removed>
Presumably, the script hits some rate limit that trips a ban. Running for a few chapters at a time seems okay for now.
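One way to stay under such a limit is to enforce a minimum delay between requests. The sketch below is only an illustration: the `Throttle` class is hypothetical, and the interval that the firewall tolerates is unknown, so any value passed in is a guess.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would create one instance, for example `throttle = Throttle(5.0)`, and call `throttle.wait()` immediately before each page download.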
Workarounds:
Hi there,
I have never used GitHub before, so I don't know if I am doing any of this correctly...
I found this project a couple of days ago.
I am a beginner programmer and I tried creating something similar to this with Selenium before.
I found one problem with your implementation. In some chapters the "Author Notes" were not declared as such, which is why your implementation would sometimes include the pictures and non-chapter information posted by pirateaba. I think I found a fix for that. Please note I am a beginner, so the function can probably be written a lot better, but it was the best I could do. Currently the code would start at line 71 of your code.
author_note = [p for p in contents.find_all('p')
               if p.getText()[0:6] == 'Author' and 'Note' in p.getText()[0:13]]
if author_note:
    # Remove the author's note paragraph and everything after it.
    end_content = author_note[0]
    for s in end_content.find_next_siblings():
        s.decompose()
    end_content.decompose()
else:
    # Handle the rare cases where "Author's Note:" was not used at the end of
    # a chapter to designate the community additions.
    old_index = False
    count = 0
    for index, a in enumerate(reversed(list(contents.find_all('p')))):
        # 60 is just a guess that seemed to work; anything more than 60
        # paragraphs from the end is VERY unlikely to still be an author's note.
        if index >= 60:
            break
        if a.contents == ['\xa0']:
            if not old_index:
                old_index = index
            # Count runs of consecutive blank (non-breaking space) paragraphs.
            if (index - old_index) == 1:
                count += 1
            else:
                count = 0
            old_index = index
            if count == 3:
                for s in a.find_next_siblings():
                    s.decompose()
                break
I also wrote a second function to work with the color codes that you handle in your strip_color function. It adds the color name as text at the beginning and end of the tag containing the color. That way you would still know that a passage was color-coded when reading on an ebook reader. It would need the webcolors library from PyPI, so I don't know if this is of any interest to you...
In the end it would look like this: |GOLD| Nevermore |GOLD|
The code I wrote is:
import webcolors
import re

def closest_colour(requested_colour):
    """finds the closest color that returns a name from Webcolors"""
    min_colours = {}
    for key, name in webcolors.CSS2_HEX_TO_NAMES.items():
        r_c, g_c, b_c = webcolors.hex_to_rgb(key)
        rd = (r_c - requested_colour[0]) ** 2
        gd = (g_c - requested_colour[1]) ** 2
        bd = (b_c - requested_colour[2]) ** 2
        min_colours[(rd + gd + bd)] = name
    return min_colours[min(min_colours.keys())]

def get_colour_name(requested_colour):
    """Checks if the name of a color can be found by the RGB color code.
    If it can't be found it uses the closest_colour method to return the most likely candidate for a colorname."""
    try:
        closest_name = webcolors.rgb_to_name(requested_colour)
    except ValueError:
        closest_name = closest_colour(requested_colour)
    return " |" + closest_name.upper() + "| "
and then at the strip_color check:
if strip_color:
    # Strip color from text that can make it hard to read on a paperwhite:
    for span in contents.find_all('span'):
        if 'color' in str(span):
            # Look up the name of the color and add it around the text, so that
            # on black-and-white Kindle devices you can still see which text
            # was originally color-coded.
            color = re.findall(r'(?<=color\:\#).{6}', str(span))[0]
            color = (int(color[0:2], 16), int(color[2:4], 16), int(color[4:6], 16))
            color = get_colour_name(color)
            span.insert(0, color)
            span.append(color)
            span.unwrap()  # bs4 method for replaceWithChildren
I know that none of this is an issue, but I didn't really understand how the push and pull request workflow works, or whether there is a better way to send someone a message on here. I hope some of this is useful in some way.
Hi,
I am new to using python scripts. I followed the installation instructions, but I always get this error when running the script:
from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'
I tried both using venv and installing the dependencies globally, and also tried using a codespace; they all yield the same result.
Any help will be appreciated.
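This error often means that the packages were installed for a different Python interpreter than the one running the script. A small diagnostic sketch (the helper name `module_available` is hypothetical) prints which interpreter is in use and whether `bs4` is importable from it:

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate a top-level module."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("bs4 importable:", module_available("bs4"))
```

If this reports that `bs4` is not importable, installing with the same interpreter, for example `python3 -m pip install beautifulsoup4` (the PyPI package that provides `bs4`), may help.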
Installed Python 3.7
Created a virtual environment using conda
pip install -r requirements.txt
When I try to run ./wanderinginn2mobi.sh --help in bash I get:
sed: can't read recipients.txt: No such file or directory
./wanderinginn2mobi.sh: line 19: ebook-convert: command not found
I created an empty recipients.txt (whatever it is for), so now it only says:
./wanderinginn2mobi.sh: line 19: ebook-convert: command not found
The submodule installation seems to have gone fine (the folder 'ebookmaker' is not empty and I can import ebookmaker in ipython).
I'd love to get a hint on how I could get this to work. I'm supporting the author on Patreon but I reeeeally don't want to read in a browser.
cheers
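`ebook-convert` is a command-line tool that ships with Calibre, so the "command not found" message suggests Calibre is either not installed or not on the PATH. A quick check, sketched with a hypothetical helper name:

```python
import shutil


def find_ebook_convert(tool="ebook-convert"):
    """Return the full path of the tool if it is on PATH, else None."""
    return shutil.which(tool)


if __name__ == "__main__":
    path = find_ebook_convert()
    print(path if path else "ebook-convert not found on PATH")
```

If this prints that the tool is missing, installing Calibre and adding its directory to PATH is the likely next step.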
Currently the beginnings of chapters are all level-1 headings, likely because that's what the original pages use. I wonder if there's any way these could be changed to h2 for multivolume epubs, with a single level-1 heading at the beginning of each volume. I'm not sure if there's an easy way to modify this, and to truly get it right you'd need to shift all headings down by one level in case there are any existing subheadings in the chapters. However, it does come with some benefits: any file conversions into flat formats (HTML, DOCX, etc.) will have a proper heading structure, and ebook readers with heading shortcuts will be able to jump by volume instead of just by chapter. I personally use a screen reader, and when in a webpage or other document with headings, I can press the numbers 1 through 6 to find the next heading at the given level. Thanks for your consideration and for such a great tool.
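The heading shift described above could be sketched roughly as follows. This is a standard-library illustration that rewrites heading tags in an HTML string with a regular expression; the function name `demote_headings` is hypothetical, and the script itself might prefer to do the same thing on its parsed BeautifulSoup tree.

```python
import re

# Matches the start of an opening or closing h1..h6 tag, e.g. "<h1" or "</h3".
_HEADING_TAG = re.compile(r"<(/?)h([1-6])\b", re.IGNORECASE)


def demote_headings(html, levels=1):
    """Shift every heading down by `levels`, capping at h6."""

    def shift(match):
        new_level = min(int(match.group(2)) + levels, 6)
        return f"<{match.group(1)}h{new_level}"

    return _HEADING_TAG.sub(shift, html)


demoted = demote_headings('<h1 class="t">1.00</h1><h2>Sub</h2>')
```

After demoting each chapter this way, a single h1 with the volume title could be emitted at the start of each volume.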
I could not simply run python -m pip install ebookmaker. I had to add the ebookmaker repository linked in the readme as a submodule relative to the root of this repo.
Running this command:
C:\...\wandering_inn>python wanderinginn2epub.py --volume 9 --output-by-volume
I get the following output:
Traceback (most recent call last):
File "C:\Users\steph\wandering_inn\wanderinginn2epub.py", line 382, in <module>
main()
File "C:\Users\steph\wandering_inn\wanderinginn2epub.py", line 316, in main
full_index = get_index()
File "C:\Users\steph\wandering_inn\wanderinginn2epub.py", line 216, in get_index
page = urlopen(toc_url)
File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 517, in open
response = self._open(req, data)
File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 534, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1385, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "C:\Users\steph\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1345, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1122)>
I'm pretty new to using github and have been using ChatGPT to get me started so I could definitely (most likely) be missing something obvious.
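A possible thing to try for the certificate error, offered as a sketch and not a confirmed fix: build an explicit SSL context and, if the third-party `certifi` package happens to be installed, use its CA bundle instead of the system store. The helper names `make_ssl_context` and `fetch` are hypothetical. Whether this helps depends on whether the expired certificate is in the local trust store or on the server side, which the traceback alone does not show.

```python
import ssl
from urllib.request import urlopen


def make_ssl_context():
    """Return a verifying SSL context, preferring certifi's CA bundle if present."""
    try:
        import certifi  # optional third-party package (assumption: may be absent)
    except ImportError:
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())


def fetch(url, timeout=10):
    """Download a URL using the context above; verification stays enabled."""
    with urlopen(url, timeout=timeout, context=make_ssl_context()) as resp:
        return resp.read()
```

Certificate verification is deliberately left on here; disabling it would hide the error but also remove the protection it provides.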
Issue: All options generate empty epub books. I'm running OSX Ventura 13.2.1.
There might be multiple causes, starting with dependencies. I'm running Python 3.10, as I only saw Python 3 listed as a requirement; if you specifically suggest another version, please let me know. When running pip install -r requirements.txt in the cloned directory, the following error is given:
ERROR: Could not install packages due to an OSError: Cannot move the non-empty directory '/opt/homebrew/lib/python3.10/site-packages/tqdm-4.64.1-py3.10.egg': Lacking write permission to '/opt/homebrew/lib/python3.10/site-packages/tqdm-4.64.1-py3.10.egg'.
Seeing as this is a permissions-related error, I resolved it by running the command with sudo. It may be worth noting that on my OSX setup this was a necessary step. The problem still persists after this, though, and the same output is generated. Here are sample commands and their output:
Test Case #1
Input: python3.10 ./wanderinginn2epub.py
Output:
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn".
ePub file "./build/The Wandering Inn.epub" successfully generated.
Test Case #2
Input: python3.10 ./wanderinginn2epub.py --volume 1 --output-by-chapter
Output (the earlier HTML output is truncated):
<a href="https://wanderinginn.com/2022/04/26/8-82-pt-3/">8.82 (Pt. 3)</a><br/>
<a href="https://wanderinginn.com/2022/05/03/8-83/">8.83</a><br/>
<a href="https://wanderinginn.com/2022/05/03/8-84/">8.84</a><br/>
<a href="https://wanderinginn.com/2022/05/03/8-85/">8.85</a><br/>
<a href="https://wanderinginn.com/2022/05/03/epilogue/">Epilogue</a></p>
Chapter: 0it [00:00, ?it/s]
Generating ePub file for eBook "The Wandering Inn".
Test case 1 generates an empty epub, while test case 2 does nothing and then exits the program. I apologize if this is an error on my end or during installation, as Python is not something I am very familiar with.
Hi, I am incredibly new to any of this and I was met with these two lines. Is there a fix or quick solution? Thanks
This is a gross misuse of the issue feature, but I was wondering if you could send an up to date epub to an attached email.
I fiddled with this for an hour, but I have no experience with git or python. I really just want to read the series...
Email is [email protected] if you are willing, if not delete. I'm not sure if you can attach files directly to an issue response, but that would also be great. Thanks for the help.
$ ./wanderinginn2epub.py --output-print-index
Unable to get volume from text: <p class="announcement-lowkey-item-category responsive-small"></p>
Perhaps the table-of-contents/ page was changed?