openzim / kolibri
Convert a Kolibri channel in ZIM file(s)
License: GNU General Public License v3.0
This is a subtask of #42
Provide an enhanced UI of HTML5 pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
HTML5 assets should be placed in a ./html5s subfolder in the ZIM (and not only when deduplication is used).
There is no image content per se in Kolibri, but most nodes (topic or content) can have a thumbnail.
Should we optimize those images (i.e. convert them to webp), or is this of secondary importance?
Notes:
Support for an optimization cache would be necessary once we have #3 fixed
This is a subtask of #42
Provide an enhanced UI of audio pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
Audio assets should be placed in a ./audios subfolder in the ZIM.
When available, we should display the author and license information. The lang= attribute should also be set.
This issue serves as a checklist for the release event.
__about__.py to x.x.x
vx.x.x
latest
## [Unreleased] section in Changelog (placeholder for future entries)
__about__.py to `x.y.z-dev0`
See openzim/python-scraperlib#110 once implemented
Currently, files are downloaded and then written to the ZIM file on the fly using scraperlib. However, libzim fails (with a segmentation fault) once we remove files after calling the add_binary method from scraperlib. This might be an issue with scraperlib itself, but either way it needs to be fixed.
Update - This is actually due to openzim/python-scraperlib#69
Tested with https://download.kiwix.org/zim/videos/khan-academy-videos_ar_khws-l-dd_2021-12.zim
Every page has the following webp-polyfill related inline code :
<script>$(document).ready(function() { trigger_webp_polyfill(); });</script>
It is blocked when a Content Security Policy bans inline JavaScript. This is the case in particular for the kiwix-js browser extensions.
Moving this line of code into a JavaScript file should be enough to fix it in this case.
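A minimal sketch of that move, written in Python on the assumption that the pages can be post-processed as strings before being added to the ZIM. The function name and the `assets/webp-trigger.js` path are hypothetical; only the inline statement itself comes from the report above.

```python
import re

# Content for the external file: the same statement that was inline.
POLYFILL_JS = "$(document).ready(function() { trigger_webp_polyfill(); });\n"

# Matches the inline <script> quoted in the report, tolerating whitespace changes.
INLINE_POLYFILL = re.compile(
    r"<script>\s*\$\(document\)\.ready\(function\(\)\s*\{\s*"
    r"trigger_webp_polyfill\(\);\s*\}\);\s*</script>"
)


def externalize_polyfill(html: str, script_src: str = "assets/webp-trigger.js") -> str:
    """Replace the inline trigger with a reference to an external script.

    script_src is a hypothetical path; the file it points to would hold POLYFILL_JS.
    """
    tag = f'<script src="{script_src}"></script>'
    # A callable replacement avoids any backslash processing of the tag.
    return INLINE_POLYFILL.sub(lambda match: tag, html)
```

The external file must be loaded after jQuery and the polyfill itself, since it still relies on `$` and `trigger_webp_polyfill` being defined.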
In this run, kolibri2zim crashed on the full khan-academy channel in English with:
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
size[489062] == provider->getSize()[1226905]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_Z15_on_assert_failImmEvPKcS1_S1_T_T0_S1_i+0x1a9) [0x7f29e10d6c69]
/usr/local/lib/python3.8/site-packages/libzim.so.7(+0x197a44) [0x7f29e1103a44]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster13write_contentESt8functionIFvRKNS_4BlobEEE+0xde) [0x7f29e1103b2e]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster5writeEi+0xec) [0x7f29e110430c]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZN3zim6writer13clusterWriterEPv+0x111) [0x7f29e1106141]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f29e0ea3b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f29e558cfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f29e532eeff]
terminate called after throwing an instance of 'std::runtime_error'
what():
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
size[489062] == provider->getSize()[1226905]
This is due to this assert inside libzim's writer:
void Cluster::write_data(writer_t writer) const
{
  for (auto& provider: m_providers)
  {
    ASSERT(provider->getSize(), !=, 0U);
    zim::size_type size = 0;
    while(true) {
      auto blob = provider->feed();
      if(blob.size() == 0) {
        break;
      }
      size += blob.size();
      writer(blob);
    }
    ASSERT(size, ==, provider->getSize());
  }
}
Code has been modified since (see https://github.com/openzim/libzim/blob/3a9f574d1aa2f722257f195fcdd6874e3517b8c6/src/writer/cluster.cpp#L246) and would generate a RuntimeError exception instead, but the problem is the same: the size written to the ZIM is different from the size returned by the Provider's get_size().
Given kolibri2zim only prints debug after addition to the creator, we don't know which Entry caused the issue.
My investigations point to a funneled file, as other types of content are added via strings and their size is calculated automatically.
Funneled ones, on the other hand, are files that we download directly from the Studio into the ZIM using scraperlib's URLItem.
Looking at the KA DB, I found a single file reported to have the expected size: c142275210f3f6dec3dfbdb1d9836e7b.mp4.
It works as expected when tested individually, so my guess is that a network/server error caused the downloaded content to differ in size. Note that we make an initial tiny request to find the size and decide whether we need to download to disk or not.
We could re-run this and hope it fixes itself, but this sounds like it could happen again given the large size of the content.
Fixing this would be difficult though: the failure happens on a separate libzim-handled thread long after we've added the entry, so we can't catch the (libzim8+ only) exception and retry.
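One way to at least identify the offending entry is to compare the announced size with the bytes actually received before the content reaches libzim. The sketch below is not how kolibri2zim or scraperlib work today: it assumes the download is read through a wrapper first (for instance while spooling to disk), and `checked_chunks` and `SizeMismatchError` are hypothetical names.

```python
import io
from typing import BinaryIO, Iterator


class SizeMismatchError(Exception):
    """Raised when a download delivers a different size than announced."""


def checked_chunks(
    path: str, stream: BinaryIO, expected_size: int, chunk_size: int = 65536
) -> Iterator[bytes]:
    """Yield the stream in chunks, then verify the total against expected_size."""
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        received += len(chunk)
        yield chunk
    if received != expected_size:
        # The entry path is part of the message so the failing file is known.
        raise SizeMismatchError(
            f"{path}: expected {expected_size} bytes, received {received}"
        )
```

Because the check runs in the scraper's own thread, the error can be caught there and the download retried before anything is handed to the creator.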
Videos and images are currently downloaded and we shall have support for optimization of these as they contribute a lot to the ZIM size.
Because every piece of content is self-contained in Kolibri Studio, any JS library included in HTML content (MathJax for libretext, for instance) ends up in the ZIM once per content node.
We could keep a list of all individual entries' hash and only include the first encounter in the ZIM as entries and subsequent ones would be ZIM redirects.
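The hash-based approach can be sketched as follows. This is an illustration only: the `Deduplicator` class is hypothetical, SHA-256 is an arbitrary choice of hash, and turning the returned decision into an actual ZIM entry or redirect is left to the caller.

```python
import hashlib
from typing import Dict, Tuple


class Deduplicator:
    """Remember the first path seen for each distinct content hash."""

    def __init__(self) -> None:
        self._first_path_by_hash: Dict[str, str] = {}

    def register(self, path: str, content: bytes) -> Tuple[str, str]:
        """Return ("entry", path) for new content, ("redirect", target) otherwise."""
        digest = hashlib.sha256(content).hexdigest()
        first_path = self._first_path_by_hash.setdefault(digest, path)
        if first_path == path:
            return ("entry", path)
        return ("redirect", first_path)
```

For large files the hash would have to be computed incrementally rather than from a single bytes object.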
HTML articles are independent, self-contained HTML content on Kolibri which are mostly accessed by traversing the topics tree up to those.
Currently, we link to those HTML articles and display them directly, meaning only the content is present on the page.
There are alternatives:
iframe on a page that contains the node's details and navigation.
Pros for both are easy navigation back to other points and access to metadata.
Cons are:
It could be an option to toggle for sites like Libretexts where we know we'll only get single page HTML nodes.
Some *.js files get added to the ZIM but with a wrong mimetype. One such example is as follows -
The script from “http://localhost:5100/test/-/assets/h5p_standalone/main.bundle.js” was loaded even though its MIME type (“text/plain”) is not a valid JavaScript MIME type.
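A minimal sketch of a mimetype lookup that never reports JavaScript as text/plain. It is an assumption about where a fix could live, not the scraper's actual code; `guess_mimetype` and the override table are hypothetical.

```python
import mimetypes
import pathlib

# Explicit overrides take precedence over the platform's mimetype database,
# which can differ between systems and Python versions.
OVERRIDES = {".js": "text/javascript", ".mjs": "text/javascript"}


def guess_mimetype(path: str, default: str = "application/octet-stream") -> str:
    """Return a mimetype for path, preferring the explicit overrides."""
    suffix = pathlib.PurePosixPath(path).suffix.lower()
    if suffix in OVERRIDES:
        return OVERRIDES[suffix]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default
```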
Language should be customizable via a param
So we can adapt navigation a bit to the source website
Slideshows and exercises are special kinds of nodes created using Kolibri Studio. These need to be supported.
Here's example content for slideshow and exercise nodes -
Exercise nodes have several kinds of exercises like short answer, multiple choice and single choice.
This is a subtask of #42
Provide an enhanced UI of HTML document pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
Should be tackled together with #28
HTML document assets should be placed in a ./htmls subfolder in the ZIM.
Currently, a script needs to be in place to get the JS dependencies from the repository like the other scrapers
This is a subtask of #42
Provide an enhanced UI of exercise pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
Exercise assets should be moved to a ./exercices subfolder.
It is currently a cryptic string (a hash probably). This is not user friendly. It should be based on the page title (slug?) and collision risk should be managed.
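A possible shape for title-based identifiers with collision handling, as a sketch only. Both function names are hypothetical, and the numeric suffix is just one way to manage collisions.

```python
import re
import unicodedata
from typing import Set


def slugify(title: str) -> str:
    """Lowercase ASCII slug: accents stripped, other characters become dashes."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return slug or "untitled"


def unique_slug(title: str, used: Set[str]) -> str:
    """Return a slug not yet in used, appending -2, -3, ... on collision."""
    base = slugify(title)
    candidate = base
    counter = 2
    while candidate in used:
        candidate = f"{base}-{counter}"
        counter += 1
    used.add(candidate)
    return candidate
```

Note that this ASCII-only version reduces titles in non-Latin scripts to "untitled", so channels in such languages would need a different normalization or a fallback to the existing identifier.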
The scraper must be adapted to match our new Python rules.
Attention points:
Recently, I installed kolibri2zim.
Process followed:
Error:
Traceback (most recent call last):
File "/home/apricot/Desktop/kiwix-org/bin/kolibri2zim", line 11, in <module>
load_entry_point('kolibri2zim==1.0.0.dev0', 'console_scripts', 'kolibri2zim')()
File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/main.py", line 15, in main
entry()
File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/entrypoint.py", line 181, in main
from .scraper import Kolibri2Zim
File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/scraper.py", line 22, in <module>
from zimscraperlib.zim.items import URLItem, StaticItem
ModuleNotFoundError: No module named 'zimscraperlib.zim.items'
Document nodes contain PDF and ePUB files and though they are downloaded in the ZIM, support for their display is not present at this moment. We shall support them as they are integral to the content that is being shared on Kolibri
When downloading content (compressed videos) from the S3 cache, we are currently downloading those in memory and once downloaded, adding them to the ZIM file then eventually releasing all that.
This was done to not write anything on disk.
We should ideally create an S3ContentProvider that would stream content from S3 directly into the ZIM but that would depend on openzim/python-storagelib#6.
In the meantime, we may consider saving those large files to disk instead of keeping them in memory, as disk is cheaper than memory in our scenario.
Example: Once we have all our video files in the cache, we're just moving stuff around through the network. It's thus pertinent to have a high number of threads doing this. But if you have many large video files downloaded concurrently, you might exhaust your memory without much benefit from skipping the disk.
This is a theoretical question of course, as we don't have such a scenario in practice. I just wanted to document the behavior.
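The memory-versus-disk trade-off can be sketched with the standard library's spooled temporary file, which stays in memory up to a threshold and moves to disk beyond it. `spool_download` and the 8 MiB default are assumptions for illustration; the chunks would come from whatever client reads the S3 cache.

```python
import tempfile
from typing import BinaryIO, Iterable


def spool_download(
    chunks: Iterable[bytes], max_memory: int = 8 * 1024 * 1024
) -> BinaryIO:
    """Collect chunks in memory, rolling over to a temp file past max_memory."""
    spool = tempfile.SpooledTemporaryFile(max_size=max_memory)
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool
```

With many download threads, the worst-case memory use is then bounded by roughly the threshold multiplied by the number of threads, regardless of video size.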
This is a subtask of #42
Set up "tag filtering" as in Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
The dockerfile needs to be revamped in order to support all the scraper features once we have a beta version ready.
We shall have a readme to view quick instructions on running the project
This is a subtask of #42
Provide an enhanced UI of video pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
Video assets should be placed in a ./videos subfolder in the ZIM.
This is a subtask of #42
Set up a minimal first step toward a better default UI, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
This first step will enhance only navigation of topics.
Videos, documents, audios and exercises will be kept as-is.
This is a subtask of #42
Set up "favorite pages" as in Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
So there's a place to add per-zim non-navigational information
One of the advantages of Kolibri is the cherry-picking of topics from channels. While we probably won't support that any time soon, we should at least support creating a ZIM from a single topic of a channel so that we can create smaller, more focused ZIMs.
Exercise nodes are composed of a single perseus file. A perseus file is a ZIP containing an exercise.json entrypoint and other files.
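Reading the entrypoint out of such an archive can be sketched with the standard library alone. `read_exercise` is a hypothetical helper, and nothing is assumed about the structure of exercise.json beyond it being JSON.

```python
import io
import json
import zipfile
from typing import Any, BinaryIO, Union


def read_exercise(perseus: Union[str, BinaryIO]) -> Any:
    """Open a perseus file (path or file object) and parse its exercise.json."""
    with zipfile.ZipFile(perseus) as archive:
        with archive.open("exercise.json") as entrypoint:
            return json.load(entrypoint)
```

`archive.namelist()` would list the other files in the ZIP that the entrypoint refers to.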
Requires:
Initial difficulty/step would be to create that standalone version. It should be one or multiple files that work in the browser and don't require any backend.
Might be interesting to look at https://github.com/learningequality/kolibri-exercise-perseus-plugin/ although this can't be reused directly obviously.
Once we have a working offline HTML/CSS/JS reader that can be passed a perseus or JSON file and render it, we can move on and integrate it in kolibri2zim.
This is a pylibzim related issue but the root cause hasn't been identified yet (and it depends on usage). favicon is added and the favicon_entry is set but kiwixlib's meta can't find it
Responsive cards adjust their width to the screen at some sizes, distorting the thumbnails. That's unexpected and unpleasant.
The books are allegedly available in several dozen languages, so it would be nice to have a separate ZIM for each. Storybooks Canada has 40 of these already sorted in a few languages and a git repo (with this list here) that might be an easier starting point (source descriptions here). Likewise Global Storybooks (same books but sorted by country instead of language, which is kind of odd as two countries might share a language and therefore display the exact same content).
__about__.py to x.x.x
vx.x.x
latest
## [Unreleased] section in Changelog (placeholder for future entries)
__about__.py to `x.y.z-dev0`
Current UI is minimalist (or non-existent) and we should import / reproduce the existing UI.
Here is for instance the layout for Algebra 1:
And how it comes out on the zim:
Most of the work happens inside separate threads and processes. Exceptions raised there are logged and visible but no further action is taken.
We should stop the scraper on exception and return an error code.
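A minimal sketch of that behavior for thread-based work, assuming jobs are plain callables submitted to a pool. `run_all` is a hypothetical helper and does not reflect how the scraper currently organizes its threads.

```python
import concurrent.futures
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def run_all(jobs: Iterable[Callable[[], None]], max_workers: int = 4) -> int:
    """Run jobs in threads; return 1 as soon as one fails, 0 if all succeed."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(job) for job in jobs]
        for future in concurrent.futures.as_completed(futures):
            error = future.exception()
            if error is not None:
                logger.error("job failed: %r", error)
                # Jobs that have not started yet are cancelled; jobs already
                # running are left to finish when the pool shuts down.
                for pending in futures:
                    pending.cancel()
                return 1
    return 0
```

The entrypoint could then end with `sys.exit(run_all(jobs))` so that the failure is reflected in the process exit code.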
self-explanatory
This is a subtask of #42
Provide an enhanced UI of ePub and PDF pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).
Assets should be placed in ./epubs and ./pdfs subfolders in the ZIM.
Currently the upload of re-encoded videos is triggered through the re-encode callback and thus executed in the main thread.
We should defer that to separate threads.
Probably pertinent to refactor the process threads around a thread executor and submit those requests to it.
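The hand-off from callback to executor might look like the sketch below. `DeferredUploader`, `on_reencoded` and the shape of the upload function are all hypothetical; the real callback signature is not shown in this issue.

```python
import concurrent.futures
from typing import Callable, List


class DeferredUploader:
    """Queue uploads on a thread pool instead of running them in the callback."""

    def __init__(self, upload: Callable[[str], None], max_workers: int = 2) -> None:
        self._upload = upload
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self._futures: List[concurrent.futures.Future] = []

    def on_reencoded(self, path: str) -> None:
        """Callback body: submit the upload and return immediately."""
        self._futures.append(self._executor.submit(self._upload, path))

    def wait(self) -> None:
        """Block until all uploads finish, re-raising the first failure."""
        for future in self._futures:
            future.result()
        self._executor.shutdown()
```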
Add in-zim JS epub reader using https://github.com/futurepress/epub.js/
Using videojs and ogvjs
Add support to generate the task_progress.json file, so that it can be reported by Zimfarm workers and displayed in the Zimfarm UI.
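A sketch of writing such a file atomically so that a worker never reads a half-written document. The `done` and `total` keys are an assumption about the expected format and should be checked against the Zimfarm documentation; `write_progress` is a hypothetical helper.

```python
import json
import os
import pathlib
import tempfile


def write_progress(path, done: int, total: int) -> None:
    """Write the progress file via a temp file and an atomic rename."""
    target = pathlib.Path(path)
    # Key names are assumed, not taken from the issue.
    payload = {"done": done, "total": total}
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    # os.replace swaps the file in one step on the same filesystem.
    os.replace(tmp_name, target)
```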
Currently kolibri2zim doesn't support thumbnails in topic cards (has a space for them though), and they are easy to download from the Kolibri studio. We shall support these.
H5P support is currently incomplete: there are path issues to fix because h5p-standalone requires a pre-extracted h5p file, which conflicts with the namespaces in the ZIM spec.
Using videojs.
Should we convert audio files to ogg? Automatically? As an option?