ricsinaruto / gutenberg-dialog
Build a dialog dataset from online books in many languages
Home Page: https://arxiv.org/abs/2004.12752
License: MIT License
Hi @ricsinaruto! I absolutely love this project and the idea of extracting dialog datasets from books.
I wanted to share a small issue I ran into when running the pipeline. By default, the gutenberg package attempts to use the mirror http://aleph.gutenberg.org for downloading the books when gutenberg.acquire.load_etext(...) is called from download.py. For some reason this mirror was down, and I got the message "Could not download book" for every book in the list.
A little digging revealed that the gutenberg library allows an alternate mirror to be specified through an environment variable: you can pick any mirror URL from https://www.gutenberg.org/MIRRORS.ALL and assign it to the environment variable GUTENBERG_MIRROR before running the code. For example,
export GUTENBERG_MIRROR="https://gutenberg.pglaf.org"
python code/main.py ...
I thought it might be helpful to mention this somewhere in the readme to save others from scratching their heads over this! What do you think? I can submit a PR for it if that is OK.
Best regards and thanks again for this amazing work!
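For anyone who would rather set the mirror from Python than from the shell, here is a minimal sketch. It relies only on the GUTENBERG_MIRROR variable described above; `configure_mirror` and `FALLBACK_MIRROR` are hypothetical names, not part of gutenberg-dialog or the gutenberg package. I am assuming the variable has to be set before the gutenberg package is imported, so the helper is meant to run first.

```python
import os

# Mirror URL taken from the example above; any entry from MIRRORS.ALL should work.
FALLBACK_MIRROR = "https://gutenberg.pglaf.org"


def configure_mirror(fallback=FALLBACK_MIRROR):
    """Set GUTENBERG_MIRROR unless the caller already exported one.

    Hypothetical helper, not part of gutenberg-dialog or the gutenberg package.
    """
    # setdefault keeps an existing value, so a shell-level export still wins.
    return os.environ.setdefault("GUTENBERG_MIRROR", fallback)


if __name__ == "__main__":
    mirror = configure_mirror()
    print(f"Using Gutenberg mirror: {mirror}")
    # Assumption: import the gutenberg package only after this point, in case
    # it reads the environment variable at import time.
    # from gutenberg.acquire import load_etext
```

Using `setdefault` rather than a plain assignment means a mirror exported in the shell is left untouched.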
I'm just dumping this here so I don't forget, really - not pushing for a fix. I'm totally grateful for your work :)
...
Filtering old books based on vocab for nl language.
Filtered 0 books.
Filtering old books based on vocab for es language.
Filtered 0 books.
Filtering old books based on vocab for pt language.
Filtered 0 books.
Filtering old books based on vocab for it language.
Filtered 0 books.
Filtering old books based on vocab for hu language.
Filtered 0 books.
Extracting dialogs for en language.
Traceback (most recent call last):
  File "C:\Training\training\training\gutenberg-dialog\code\main.py", line 77, in <module>
    main()
  File "C:\Training\training\training\gutenberg-dialog\code\main.py", line 73, in main
    p.run()
  File "C:\Training\training\training\gutenberg-dialog\code\pipeline\pipeline.py", line 30, in run
    extract(self.config)
  File "C:\Training\training\training\gutenberg-dialog\code\pipeline\dialog_extractor.py", line 79, in extract
    dialogs, file_stats = extract_(cfg, directory, lang)
  File "C:\Training\training\training\gutenberg-dialog\code\pipeline\dialog_extractor.py", line 49, in extract_
    if num_chars / num_words * 10000 > cfg.min_delimiters:
ZeroDivisionError: division by zero
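The last frame divides by num_words, so the crash presumably happens whenever that counter is zero for a file (for example an empty or failed download, though I have not confirmed the cause). A possible guard could look like the sketch below. The helper names and the `min_delimiters` argument are hypothetical stand-ins for the expression in the traceback, and treating a file with zero words as failing the check is my assumption, not necessarily what the author intends.

```python
def delimiter_density(num_chars, num_words):
    """Return num_chars / num_words * 10000, or 0.0 when there are no words.

    Hypothetical helper mirroring the expression from the traceback.
    """
    if num_words == 0:
        # Assumption: a file with no words cannot meet the threshold.
        return 0.0
    return num_chars / num_words * 10000


def passes_delimiter_threshold(num_chars, num_words, min_delimiters):
    """Same comparison as in the traceback, without the division by zero."""
    return delimiter_density(num_chars, num_words) > min_delimiters
```

With this in place, a call such as `passes_delimiter_threshold(3, 0, 150)` evaluates to False instead of raising.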