The kolibri from openzim

Better UI: HTML5 pages

This is a subtask of #42

Provide an enhanced UI of HTML5 pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

HTML5 assets should be placed in a ./html5s subfolder in the ZIM (and not only when deduplication is used)

Optimize thumbnails

There is no image content per se in Kolibri, but most nodes (topic or content) can have a thumbnail.

Should we optimize those images (i.e. convert them to WebP), or is this secondary?

Notes:

I don't know if the studio itself optimizes those thumbnails at all.
Studio has an option to generate thumbnails for nodes if there's none (walking down the tree until it finds one)
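
The fallback described in the notes above can be sketched as a tree walk. This is only an illustration: the `Node` class is a hypothetical stand-in for a topic or content node, and the depth-first order is an assumption, since the notes do not say how Studio orders its search.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical stand-in for a Kolibri topic or content node.
    title: str
    thumbnail: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def first_thumbnail(node: Node) -> Optional[str]:
    """Return the node's own thumbnail, else the first one found depth-first."""
    if node.thumbnail:
        return node.thumbnail
    for child in node.children:
        found = first_thumbnail(child)
        if found:
            return found
    return None
```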

Add support for S3 based optimization cache

Support for an optimization cache would be necessary once we have #3 fixed

Better UI : Audio pages

This is a subtask of #42

Provide an enhanced UI of audio pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

Audio assets should be placed in a ./audios subfolder in the ZIM

Display author and license on topic nodes

When available, we should display the author and license information. The lang= attribute should also be set.

Release 1.2.0

This issue serves as a checklist for the release event.

Use python-scraperlib shared logic to handle description and long_description

See openzim/python-scraperlib#110 once implemented

Properly remove files from disk after downloading and writing to ZIM

Currently, files are downloaded and then written to the ZIM file on the fly using scraperlib. However, libzim fails (with a segmentation fault) once we remove files after calling the add_binary method from scraperlib. This might be an issue with scraperlib itself, but it needs to be fixed either way.

Update - This is actually due to openzim/python-scraperlib#69

Remove inline javascript to comply with some CSP

Tested with https://download.kiwix.org/zim/videos/khan-academy-videos_ar_khws-l-dd_2021-12.zim

Every page has the following webp-polyfill related inline code :

<script>$(document).ready(function() { trigger_webp_polyfill(); });</script>

It is blocked when a Content Security Policy bans inline JavaScript. This is the case in the kiwix-js browser extensions in particular.

Moving this line of code into a JavaScript file should be enough to fix it in this case.
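
As a rough illustration of that change, the sketch below swaps the inline snippet for a reference to an external file. The file name `assets/webp-trigger.js` is hypothetical, and in practice the fix would more likely be made once in the page template rather than by rewriting generated HTML.

```python
# The inline snippet quoted in the issue.
INLINE_SNIPPET = (
    "<script>$(document).ready(function() { trigger_webp_polyfill(); });</script>"
)
# Hypothetical path for the external file; the real asset layout may differ.
EXTERNAL_TAG = '<script src="assets/webp-trigger.js"></script>'
# Content that would go into the external file.
EXTERNAL_JS = "$(document).ready(function() { trigger_webp_polyfill(); });\n"


def externalize_inline_script(html: str) -> str:
    """Replace the inline polyfill trigger with a reference to an external file."""
    return html.replace(INLINE_SNIPPET, EXTERNAL_TAG)
```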

Incorrect Item Size in KA-en

In this run, a kolibri2zim over the full khan-academy in English crashed with

Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
 size[489062] == provider->getSize()[1226905]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_Z15_on_assert_failImmEvPKcS1_S1_T_T0_S1_i+0x1a9) [0x7f29e10d6c69]
/usr/local/lib/python3.8/site-packages/libzim.so.7(+0x197a44) [0x7f29e1103a44]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster13write_contentESt8functionIFvRKNS_4BlobEEE+0xde) [0x7f29e1103b2e]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster5writeEi+0xec) [0x7f29e110430c]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZN3zim6writer13clusterWriterEPv+0x111) [0x7f29e1106141]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f29e0ea3b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f29e558cfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f29e532eeff]
terminate called after throwing an instance of 'std::runtime_error'
  what():  
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
 size[489062] == provider->getSize()[1226905]

This is due to this assert inside libzim's writer

void Cluster::write_data(writer_t writer) const
{
  for (auto& provider: m_providers)
  {
    ASSERT(provider->getSize(), !=, 0U);
    zim::size_type size = 0;
    while(true) {
      auto blob = provider->feed();
      if(blob.size() == 0) {
        break;
      }
      size += blob.size();
      writer(blob);
    }
    ASSERT(size, ==, provider->getSize());
  }
}

Code has been modified since (see https://github.com/openzim/libzim/blob/3a9f574d1aa2f722257f195fcdd6874e3517b8c6/src/writer/cluster.cpp#L246) and would generate a RuntimeError exception instead but the problem is the same: the size written to the ZIM is different from the size returned by the Provider's get_size().

Given kolibri2zim only prints debug after addition to the creator, we don't know which Entry caused the issue.

My investigations would point to a funneled file as other types of content are added via string and the size is automatically calculated.

Funneled ones on the other hand are files that we download directly from the Studio into the ZIM using scraperlib's URLItem.

Looking at the KA DB, I found a single file reported to have the expected size: c142275210f3f6dec3dfbdb1d9836e7b.mp4.

It works as expected when tested individually, so my guess is that a network/server error caused the downloaded content to be different. Note that we make an initial tiny request to find the size and decide whether we need to download to disk or not.

We could re-run this and hope it was fixed on its own, but this sounds like it could happen again given the large size of the content.

Fixing this would be difficult though; this issue happens on a different libzim-handled thread long after we've added it so we can't catch the (libzim8+ only) exception and retry.
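
One possible mitigation, not something this issue settles on, would be to check the byte count on the scraper's own thread before the content reaches the creator, where a retry is still possible. The sketch below assumes the content is first copied to a local destination (which gives up direct funneling); `copy_checked` and `SizeMismatchError` are hypothetical names.

```python
import io
from typing import BinaryIO, Iterable


class SizeMismatchError(Exception):
    """Raised when the received byte count differs from the announced size."""


def copy_checked(chunks: Iterable[bytes], dest: BinaryIO, expected_size: int) -> int:
    """Write chunks to dest and fail if the total differs from expected_size."""
    written = 0
    for chunk in chunks:
        dest.write(chunk)
        written += len(chunk)
    if written != expected_size:
        raise SizeMismatchError(f"expected {expected_size} bytes, got {written}")
    return written
```

A caller could catch `SizeMismatchError`, discard the partial copy and download again before adding the item.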

Add support for optimization of videos and images

Videos and images are currently downloaded and we shall have support for optimization of these as they contribute a lot to the ZIM size.

Prevent content duplication

Because every piece of content is self-contained in Kolibri Studio, if HTML content includes JS libraries (MathJax for libretext, for instance), they are included in the ZIM again for each piece of content.

We could keep a list of all individual entries' hash and only include the first encounter in the ZIM as entries and subsequent ones would be ZIM redirects.
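
A minimal sketch of that bookkeeping, assuming SHA-256 as the hash and in-memory content; the `Deduplicator` class is hypothetical, and turning a reported duplicate into an actual ZIM redirect is left to the caller.

```python
import hashlib
from typing import Dict, Optional


class Deduplicator:
    """Tracks content digests and reports where a duplicate should redirect."""

    def __init__(self) -> None:
        self._first_path_by_digest: Dict[str, str] = {}

    def register(self, path: str, content: bytes) -> Optional[str]:
        """Return None for new content, else the path of the first identical entry."""
        digest = hashlib.sha256(content).hexdigest()
        first = self._first_path_by_digest.get(digest)
        if first is None:
            # First encounter: remember it so later copies can point here.
            self._first_path_by_digest[digest] = path
            return None
        return first
```

When `register` returns a path, the entry would be added as a redirect to that path instead of being stored again.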

Should we integrate nav in HTML articles?

HTML articles are independent, self-contained HTML content on Kolibri which are mostly accessed by traversing the topics tree up to those.

Currently, we link to those HTML articles and display them directly, meaning only the content is present on the page.

There are alternatives:

Display the content in an iframe on a page that contains the node's details and navigation.
Merge the HTML content into a page that includes those details and navigation.

The advantage of both is easy navigation back to other points and access to metadata.

Cons are:

Integration of the iframe on page, wasting scrolling space
Breaking that UI should the HTML content contain multiple pages.

It could be an option to toggle for sites like Libretexts where we know we'll only get single page HTML nodes.

Fix mimetype of JS files

Some *.js files get added to the ZIM but with a wrong mimetype. One such example is as follows -

The script from “http://localhost:5100/test/-/assets/h5p_standalone/main.bundle.js” was loaded even though its MIME type (“text/plain”) is not a valid JavaScript MIME type.
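
A possible way to avoid this, sketched under the assumption that the mimetype is chosen from the file name: an explicit override table is consulted before the platform's `mimetypes` database. `guess_mimetype` and `MIME_OVERRIDES` are hypothetical names, not part of the scraper.

```python
import mimetypes
from pathlib import PurePosixPath

# Explicit overrides take priority over the platform's mimetypes database,
# which can vary between systems and Python versions.
MIME_OVERRIDES = {
    ".js": "text/javascript",
    ".mjs": "text/javascript",
}


def guess_mimetype(path: str, default: str = "application/octet-stream") -> str:
    """Pick a mimetype from the file extension, preferring explicit overrides."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in MIME_OVERRIDES:
        return MIME_OVERRIDES[suffix]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default
```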

Language can't be set

Language should be customizable via a param

Add custom CSS support

So we can adapt navigation a bit to the source website

Support for special type of content nodes like slideshow and exercises

Slideshows and exercises are special kinds of nodes created using Kolibri Studio. These need to be supported.

Here's example content for slideshow and exercise nodes -

Exercise node - https://studio.learningequality.org/content/storage/5/c/5c60b911362056b82ef62a054c8dfad0.perseus
Slideshow node manifest - https://studio.learningequality.org/content/storage/3/2/32e28dcd69bdf4cf8c71356c2ed1c485.json
Apart from the manifest(s), a slideshow node also contains several images that need to be shown on the slideshow

Exercise nodes have several kinds of exercises like short answer, multiple choice and single choice.

Better UI : HTML document pages

This is a subtask of #42

Provide an enhanced UI of HTML document pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

Should be tackled together with #28

HTML document assets should be placed in a ./htmls subfolder in the ZIM

Remove JS dependencies from the repository

JS dependencies are currently stored in the repository. A script should be put in place to fetch them instead, as the other scrapers do.

Better UI: exercises pages

This is a subtask of #42

Provide an enhanced UI of exercises pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

Exercise assets should be moved to a ./exercices subfolder

URLs should be meaningful

It is currently a cryptic string (a hash probably). This is not user friendly. It should be based on the page title (slug?) and collision risk should be managed.
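
A minimal sketch of title-based URLs with collision handling. Both helper names are hypothetical, and the ASCII-only transliteration is an assumption: titles in non-Latin scripts would all fall back to "untitled" and would need a different strategy.

```python
import re
import unicodedata
from typing import Set


def slugify(title: str) -> str:
    """Lowercase ASCII slug: 'Algebra 1: Basics' -> 'algebra-1-basics'."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return slug or "untitled"


def unique_slug(title: str, used: Set[str]) -> str:
    """Return a slug not yet in used, appending -2, -3, ... on collision."""
    base = slugify(title)
    candidate, counter = base, 1
    while candidate in used:
        counter += 1
        candidate = f"{base}-{counter}"
    used.add(candidate)
    return candidate
```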

Adapt the scraper to new rules

The scraper must be adapted to match our new Python rules.

Attention points:

ruff / black / pyright
CHANGELOG formatting

zimscraperlib.zim.items not found, no module named as such

Recently, I installed kolibri2zim.
Process followed:

created a virtualenv
git cloned the repo
pip3 install -r requirements.txt
python3 setup.py install
kolibri2zim --name="School Of Thought" --channel-id="305b12ea5ea84fa18f933705c23f5ee0" --description="All about fallacies" --low-quality --title="School Of Thhhought"

Error:

Traceback (most recent call last):
File "/home/apricot/Desktop/kiwix-org/bin/kolibri2zim", line 11, in
load_entry_point('kolibri2zim==1.0.0.dev0', 'console_scripts', 'kolibri2zim')()
File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/main.py", line 15, in main
entry()
File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/entrypoint.py", line 181, in main
from .scraper import Kolibri2Zim
File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/scraper.py", line 22, in
from zimscraperlib.zim.items import URLItem, StaticItem
ModuleNotFoundError: No module named 'zimscraperlib.zim.items'

Support for document nodes

Document nodes contain PDF and EPUB files. Although they are downloaded into the ZIM, there is no support for displaying them at the moment. We should support them as they are integral to the content that is being shared on Kolibri.

Optimize content retrieval from cache

When downloading content (compressed videos) from the S3 cache, we are currently downloading those in memory and once downloaded, adding them to the ZIM file then eventually releasing all that.

This was done to not write anything on disk.

We should ideally create an S3ContentProvider that would stream content from S3 directly into the ZIM but that would depend on openzim/python-storagelib#6.

In the meantime, we may consider saving those large files to disk instead of keeping them in memory, as disk is cheaper than memory in our scenario.
Example: Once we have all our video files in the cache, we're just moving stuff around through the network. It's thus pertinent to have a high number of threads doing this. But if you have many large video files downloaded concurrently, you might exhaust your memory without much benefit from skipping the disk.
This is a theoretical question, of course, as we don't have such a scenario in practice. I just wanted to document the behavior.
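
As an illustration of the disk-instead-of-memory idea, the standard library's `tempfile.SpooledTemporaryFile` keeps small payloads in memory and rolls over to a temporary file on disk past a threshold. This is only a sketch: `buffer_download` and the 8 MiB threshold are hypothetical, and the chunks stand in for whatever the cache download yields.

```python
import tempfile
from typing import BinaryIO, Iterable

# Hypothetical threshold: payloads above this size spill from memory to disk.
MAX_IN_MEMORY = 8 * 1024 * 1024


def buffer_download(
    chunks: Iterable[bytes], max_in_memory: int = MAX_IN_MEMORY
) -> BinaryIO:
    """Collect chunks into a file object that rolls over to disk when large."""
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool
```

With many download threads, peak memory would then be bounded by roughly the thread count times the threshold rather than by the size of the largest videos.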

Better UI: tag filtering

This is a subtask of #42

Set up "tag filtering" as in Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

Redo Dockerfile

The Dockerfile needs to be revamped in order to support all the scraper features once we have a beta version ready.

Add instructions in README.md

We shall have a readme to view quick instructions on running the project

Better UI : videos pages

This is a subtask of #42

Provide an enhanced UI of videos pages, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

Video assets should be placed in a ./videos subfolder in the ZIM

Better UI: minimal first step

This is a subtask of #42

Set up a minimal first step toward a better default UI, based on Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

This first step will enhance only navigation of topics.

Videos, documents, audios and exercises will be kept as-is.

Better UI: Favorite pages

This is a subtask of #42

Set up "favorite pages" as in Endless UI (e.g. https://key.endlessos.org/en/explore/#/topics/c9d7f950ab6b5a1199e3d6c10d7f0103).

Add custom about page support

So there's a place to add per-zim non-navigational information

Support building for a topic instead of channel root

One of the advantages of Kolibri is the cherry-picking of topics from channels. While we probably won't support that any time soon, we should at least support creating a ZIM from a topic of a channel so that we can create smaller, more focused ZIMs.

Support exercise nodes

Exercise nodes are composed of a single perseus file. A perseus file is a ZIP containing an exercise.json entrypoint and other files.
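
The structure described above can be inspected with the standard `zipfile` module. The sketch below relies only on what is stated here (a ZIP with an exercise.json entrypoint); `read_perseus` is a hypothetical helper and nothing is assumed about the JSON's schema.

```python
import io
import json
import zipfile
from typing import Any, List, Tuple


def read_perseus(data: bytes) -> Tuple[Any, List[str]]:
    """Return the parsed exercise.json and the names of the other archive members."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        entrypoint = json.loads(archive.read("exercise.json").decode("utf-8"))
        others = [name for name in archive.namelist() if name != "exercise.json"]
    return entrypoint, others
```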

Requires:

Creating a standalone version of https://github.com/Khan/perseus
Adding that version to assets of ZIM
Adding the extracted (or not?) perseus file to the ZIM
Create a reader for the exercise: an HTML file calling the perseus code and referencing the perseus file or JSON.

Initial difficulty/step would be to create that standalone version. It should be one or multiple files that work in the browser and don't require any backend.
Might be interesting to look at https://github.com/learningequality/kolibri-exercise-perseus-plugin/ although this can't be reused directly obviously.

Once we have a working offline HTML/CSS/JS reader that can be passed a perseus or JSON file and render it, we can move on and integrate it in kolibri2zim.

Missing favicon on ZIM

This is a pylibzim-related issue but the root cause hasn't been identified yet (and it depends on usage). The favicon is added and the favicon_entry is set, but kiwixlib's meta can't find it.

Keep image ratios for Cards on various screen sizes

Responsive cards adjust their width to the screen on some sizes, distorting the thumbnails. That's unexpected and unpleasant.

African Storybooks

Website URL: https://africanstorybook.org/
License: CC-by (sometimes -NC)
Desired ZIM Title: African Storybooks
Desired ZIM Description: Picture storybooks for children’s literacy, enjoyment and imagination

The books are allegedly available in several dozen languages, so it would be nice to have a separate ZIM for each. Storybooks Canada has 40 of these already sorted in a few languages and a git repo (with this list here) that might be an easier starting point (source descriptions here). Likewise Global Storybooks (same books but sorted by country instead of language, which is kind of odd as two countries might share a language and therefore display the exact same content).

Release 1.1.0

Better default UI

Current UI is minimalist (or non-existent) and we should import / reproduce the existing UI.

Here is for instance the layout for Algebra 1:

And how it comes out on the zim:

And then going further into Algebra foundations:

and on the zim

Improve errors management

Most of the work happens inside separate threads and processes. Exceptions raised there are logged and visible but no further action is taken.
We should stop the scraper on exception and return an error code.
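
One way this could look with the standard `concurrent.futures` module, as a sketch only: `run_all` is a hypothetical helper and the tasks are plain callables standing in for the scraper's real jobs. The first exception is re-raised in the main thread, pending tasks are cancelled, and a non-zero code is returned.

```python
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def run_all(tasks: Iterable[Callable[[], None]], max_workers: int = 4) -> int:
    """Run tasks in threads; return 0 on success, 1 once any task raises."""
    executor = ThreadPoolExecutor(max_workers=max_workers)
    futures = [executor.submit(task) for task in tasks]
    exit_code = 0
    try:
        for future in as_completed(futures):
            future.result()  # re-raises the worker's exception here
    except Exception as exc:
        print(f"task failed: {exc!r}", file=sys.stderr)
        # Drop tasks that have not started yet; running ones finish on their own.
        for future in futures:
            future.cancel()
        exit_code = 1
    finally:
        executor.shutdown(wait=True)
    return exit_code
```

The entrypoint could then end with `sys.exit(run_all(tasks))` so that the failure is visible to whatever launched the scraper.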