kolibri's People

Contributors

benoit74, btrain01, code-factor, imnitishng, kelson42, rafaelcestti, rgaudin, satyamtg

kolibri's Issues

Optimize thumbnails

There is no image content per se in Kolibri, but most nodes (topic or content) can have a thumbnail.

Should we optimize those images (i.e. convert them to WebP), or is this secondary?

Notes:

  • I don't know if the studio itself optimizes those thumbnails at all.
  • Studio has an option to generate thumbnails for nodes if there's none (walking down the tree until it finds one)
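
The "walking down the tree" fallback mentioned in the second note can be pictured as a depth-first search. This is only a sketch of the idea, not Studio's actual implementation; the `thumbnail` and `children` keys are hypothetical names chosen for illustration.

```python
from typing import Optional


def find_thumbnail(node: dict) -> Optional[str]:
    """Return the node's own thumbnail, else the first one found below it."""
    if node.get("thumbnail"):
        return node["thumbnail"]
    # Depth-first walk: the first descendant with a thumbnail wins.
    for child in node.get("children", []):
        found = find_thumbnail(child)
        if found:
            return found
    return None


# Hypothetical tree: the topic has no thumbnail, one of its children does.
tree = {
    "title": "Topic",
    "children": [
        {"title": "Empty subtopic", "children": []},
        {"title": "Video", "thumbnail": "video.png"},
    ],
}
print(find_thumbnail(tree))
```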

Release 1.2.0

This issue serves as a checklist for the release event.

  • Check that dependencies have been updated to latest version (especially python-scraper lib)
  • Adjust version in __about__.py to x.x.x
  • Update Github milestone to match the issues that will be released
  • Close Github milestone
  • Update the Changelog so that it is in line with the content of Github milestone
  • Push a tag on Github named vx.x.x
  • Create a Github release, pointing to the tag, with the Changelog of this release
  • Publish the Github release. This will trigger the CI. If the CI fails and you have to push a minor fix that does not justify creating a new version, you will have to delete the release and re-create it from scratch.
  • Check that version is published as a Github release at https://github.com/openzim/kolibri/releases
  • Check that version is published on Github Container Registry at https://ghcr.io/openzim/kolibri and tagged latest
  • Check that version is published on Pypi at https://pypi.org/project/kolibri2zim/
  • Create a new Github milestone with the next minor version, incrementally
  • Create a new Github issue attached to this milestone with this checklist inside
  • Create new ## [Unreleased] section in Changelog (placeholder for future entries)
  • Adjust version in __about__.py to x.y.z-dev0
  • Inform rgaudin that a new release is ready to use so that he will update Zimfarm recipes
  • If needed, open a PR on Zimfarm to add support for new CLI parameters of interest

Remove inline javascript to comply with some CSP

Tested with https://download.kiwix.org/zim/videos/khan-academy-videos_ar_khws-l-dd_2021-12.zim

Every page has the following webp-polyfill related inline code:

<script>$(document).ready(function() { trigger_webp_polyfill(); });</script>

It is blocked when some Content Security Policies ban inline javascript. It is in particular the case in kiwix-js browser extensions.

Moving this line of code into a JavaScript file should be enough to fix it in this case.
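
As a rough illustration of that fix on the scraper side, the sketch below rewrites a page so the inline call becomes a reference to an external file. The file name `assets/webp-trigger.js` and the function `externalize_polyfill` are hypothetical; the real change might simply be made in the page template instead.

```python
import re

# Matches the inline snippet quoted above, tolerating whitespace differences.
INLINE_POLYFILL = re.compile(
    r"<script>\s*\$\(document\)\.ready\(function\(\)\s*\{\s*"
    r"trigger_webp_polyfill\(\);\s*\}\);\s*</script>"
)


def externalize_polyfill(html: str, script_src: str = "assets/webp-trigger.js") -> str:
    """Replace the inline polyfill trigger with an external script tag.

    The external file (hypothetical name) is expected to contain the same
    statement: $(document).ready(function() { trigger_webp_polyfill(); });
    """
    return INLINE_POLYFILL.sub(lambda _match: f'<script src="{script_src}"></script>', html)


page = (
    "<body><script>$(document).ready(function() "
    "{ trigger_webp_polyfill(); });</script></body>"
)
print(externalize_polyfill(page))
```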

Incorrect Item Size in KA-en

In this run, a kolibri2zim over the full khan-academy in English crashed with

Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
 size[489062] == provider->getSize()[1226905]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_Z15_on_assert_failImmEvPKcS1_S1_T_T0_S1_i+0x1a9) [0x7f29e10d6c69]
/usr/local/lib/python3.8/site-packages/libzim.so.7(+0x197a44) [0x7f29e1103a44]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster13write_contentESt8functionIFvRKNS_4BlobEEE+0xde) [0x7f29e1103b2e]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster5writeEi+0xec) [0x7f29e110430c]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZN3zim6writer13clusterWriterEPv+0x111) [0x7f29e1106141]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f29e0ea3b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f29e558cfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f29e532eeff]
terminate called after throwing an instance of 'std::runtime_error'
  what():  
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
 size[489062] == provider->getSize()[1226905]

This is due to this assert inside libzim's writer

void Cluster::write_data(writer_t writer) const
{
  for (auto& provider: m_providers)
  {
    ASSERT(provider->getSize(), !=, 0U);
    zim::size_type size = 0;
    while(true) {
      auto blob = provider->feed();
      if(blob.size() == 0) {
        break;
      }
      size += blob.size();
      writer(blob);
    }
    ASSERT(size, ==, provider->getSize());
  }
}

Code has been modified since (see https://github.com/openzim/libzim/blob/3a9f574d1aa2f722257f195fcdd6874e3517b8c6/src/writer/cluster.cpp#L246) and would generate a RuntimeError exception instead but the problem is the same: the size written to the ZIM is different from the size returned by the Provider's get_size().

Given kolibri2zim only prints debug after addition to the creator, we don't know which Entry caused the issue.

My investigations would point to a funneled file as other types of content are added via string and the size is automatically calculated.

Funneled ones on the other hand are files that we download directly from the Studio into the ZIM using scraperlib's URLItem.

Looking at the KA DB, I found a single file reported to have the expected size: c142275210f3f6dec3dfbdb1d9836e7b.mp4.

It works as expected when tested individually, so my guess would be that a network/server error caused the downloaded content to be different. Note that we make an initial tiny request to find the size and decide whether we need to download to disk or not.

We could re-run this and hope it was fixed on its own, but this sounds like it could happen again given the large size of the content.

Fixing this would be difficult though: the issue happens on a different, libzim-handled thread long after we've added the entry, so we can't catch the (libzim8+ only) exception and retry.
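
One possible mitigation, sketched below under assumptions, is to stop funneling and instead download the payload first, compare its length with the announced size, and retry before the entry is ever handed to the creator. `fetch_verified` and `SizeMismatch` are hypothetical names, and `fetch` stands in for whatever performs the actual download; this trades streaming for the ability to retry.

```python
from typing import Callable


class SizeMismatch(Exception):
    """Raised when the downloaded payload never matches the announced size."""


def fetch_verified(fetch: Callable[[], bytes], expected_size: int, attempts: int = 3) -> bytes:
    """Call `fetch` until it returns exactly `expected_size` bytes."""
    last_size = -1
    for _ in range(attempts):
        payload = fetch()
        last_size = len(payload)
        if last_size == expected_size:
            return payload
    raise SizeMismatch(
        f"expected {expected_size} bytes, got {last_size} after {attempts} attempts"
    )


# Simulated flaky download: the first response is truncated, the second is complete.
responses = iter([b"abc", b"abcdef"])
print(fetch_verified(lambda: next(responses), expected_size=6))
```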

Prevent content duplication

Because every piece of content is self-contained in Kolibri Studio, if HTML content includes JS libraries (MathJax for libretext, for instance), those libraries are included in the ZIM once for each piece of content.

We could keep a list of all individual entries' hash and only include the first encounter in the ZIM as entries and subsequent ones would be ZIM redirects.
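
A minimal sketch of that bookkeeping, assuming the content of each entry is available as bytes when it is registered. `Deduplicator` is a hypothetical helper; turning its decision into an actual ZIM entry or redirect is left to the caller.

```python
import hashlib


class Deduplicator:
    """Remember the first path seen for each distinct content hash."""

    def __init__(self):
        self._first_path_by_digest = {}

    def register(self, path: str, content: bytes):
        """Return ("entry", path) for new content, ("redirect", first_path) for a duplicate."""
        digest = hashlib.sha256(content).hexdigest()
        first = self._first_path_by_digest.setdefault(digest, path)
        if first == path:
            return ("entry", path)
        return ("redirect", first)


dedup = Deduplicator()
print(dedup.register("node-a/mathjax.js", b"library code"))
print(dedup.register("node-b/mathjax.js", b"library code"))
```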

Should we integrate nav in HTML articles?

HTML articles are independent, self-contained HTML content on Kolibri which are mostly accessed by traversing the topics tree up to those.

Currently, we link to those HTML articles and display them directly, meaning only the content is present on the page.

There are alternatives:

  • Display the content in an iframe on a page that contains the node's details and navigation.
  • Merge the HTML content into a page that includes those details and navigation.

The advantage of both is easy navigation back to other points and access to metadata.

Cons are:

  • Integration of the iframe on page, wasting scrolling space
  • Breaking that UI should the HTML content contain multiple pages.

It could be an option to toggle for sites like Libretexts where we know we'll only get single page HTML nodes.

Fix mimetype of JS files

Some *.js files get added to the ZIM but with a wrong mimetype. One such example is as follows -

The script from “http://localhost:5100/test/-/assets/h5p_standalone/main.bundle.js” was loaded even though its MIME type (“text/plain”) is not a valid JavaScript MIME type.
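
One way to guard against this, sketched here as an assumption rather than as the scraper's actual fix, is to apply an explicit override for script extensions before falling back to the standard library's guess. `MIME_OVERRIDES` and `guess_mimetype` are hypothetical names.

```python
import mimetypes
import pathlib

# Explicit overrides win over whatever the platform's MIME database reports.
MIME_OVERRIDES = {".js": "text/javascript", ".mjs": "text/javascript"}


def guess_mimetype(filename: str, default: str = "application/octet-stream") -> str:
    """Return a MIME type for `filename`, forcing a JavaScript type for scripts."""
    suffix = pathlib.PurePosixPath(filename).suffix.lower()
    if suffix in MIME_OVERRIDES:
        return MIME_OVERRIDES[suffix]
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


print(guess_mimetype("assets/h5p_standalone/main.bundle.js"))
```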

Support for special type of content nodes like slideshow and exercises

Slideshows and exercises are special kinds of nodes created using Kolibri Studio. These need to be supported.

Here's example content for slideshow and exercise nodes -

Exercise nodes have several kinds of exercises like short answer, multiple choice and single choice.

URLs should be meaningful

It is currently a cryptic string (a hash probably). This is not user friendly. It should be based on the page title (slug?) and collision risk should be managed.
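
A possible shape for this, as a sketch only: derive an ASCII slug from the title and append a counter when the slug is already taken. `slugify` and `unique_slug` are hypothetical helpers, and the numbering scheme is just one way to manage collisions.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase ASCII slug made of letters, digits and single hyphens."""
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return slug or "untitled"


def unique_slug(title: str, used: set) -> str:
    """Return a slug not yet in `used`, adding -2, -3, ... on collision."""
    base = slugify(title)
    candidate, counter = base, 2
    while candidate in used:
        candidate = f"{base}-{counter}"
        counter += 1
    used.add(candidate)
    return candidate


used_slugs = set()
print(unique_slug("Algebra 1", used_slugs))
print(unique_slug("Algebra 1", used_slugs))
```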

zimscraperlib.zim.items not found, no module named as such

Recently, I installed kolibri2zim.
Process followed:

  1. created a virtualenv
  2. git cloned the repo
  3. pip3 install -r requirements.txt
  4. python3 setup.py install
  5. kolibri2zim --name="School Of Thought" --channel-id="305b12ea5ea84fa18f933705c23f5ee0" --description="All about fallacies" --low-quality --title="School Of Thhhought"

Error:

Traceback (most recent call last):
  File "/home/apricot/Desktop/kiwix-org/bin/kolibri2zim", line 11, in <module>
    load_entry_point('kolibri2zim==1.0.0.dev0', 'console_scripts', 'kolibri2zim')()
  File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/main.py", line 15, in main
    entry()
  File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/entrypoint.py", line 181, in main
    from .scraper import Kolibri2Zim
  File "/home/apricot/Desktop/kiwix-org/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/scraper.py", line 22, in <module>
    from zimscraperlib.zim.items import URLItem, StaticItem
ModuleNotFoundError: No module named 'zimscraperlib.zim.items'

Support for document nodes

Document nodes contain PDF and ePUB files. Although these files are downloaded into the ZIM, there is currently no support for displaying them. We should support them, as they are integral to the content being shared on Kolibri.

Optimize content retrieval from cache

When downloading content (compressed videos) from the S3 cache, we are currently downloading those in memory and once downloaded, adding them to the ZIM file then eventually releasing all that.

This was done to not write anything on disk.

We should ideally create an S3ContentProvider that would stream content from S3 directly into the ZIM but that would depend on openzim/python-storagelib#6.

In the meantime, we may consider saving those large files to disk instead of keeping them in memory, as disk is cheaper than memory in our scenario.
Example: once we have all our video files in the cache, we're just moving stuff around through the network. It's thus pertinent to have a high number of threads doing this. But if many large video files are downloaded concurrently, you might exhaust your memory without much benefit from skipping the disk.
This is a theoretical question of course, as we don't have such a scenario in practice. I just wanted to document the behavior.
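
A middle ground between the two approaches, sketched with the standard library only: `tempfile.SpooledTemporaryFile` keeps data in memory until it exceeds `max_size`, then transparently moves it to a file on disk. The threshold value and the `buffer_download` helper are hypothetical, and `chunks` stands in for the stream coming from the S3 cache.

```python
import tempfile

MAX_IN_MEMORY = 8 * 1024 * 1024  # hypothetical threshold: 8 MiB


def buffer_download(chunks, max_in_memory: int = MAX_IN_MEMORY):
    """Collect `chunks` into a buffer that spills to disk past `max_in_memory` bytes."""
    buffer = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the caller can read from the start
    return buffer


small = buffer_download([b"ab", b"cd"], max_in_memory=1024)
print(small.read())
small.close()
```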

Redo dockerfile

The dockerfile needs to be revamped in order to support all the scraper features once we have a beta version ready.

Support building for a topic instead of channel root

One of the advantages of Kolibri is the cherry-picking of topics from channels. While we probably won't support that any time soon, we should at least support creating a ZIM from a topic of a channel so that we can create smaller, more focused ZIMs.

Support exercise nodes

Exercise nodes are composed of a single perseus file. A perseus file is a ZIP containing an exercise.json entrypoint and other files.
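
To make the container format concrete, here is a small sketch that opens such an archive from bytes and parses its exercise.json entrypoint. Only the entry name comes from the description above; `read_exercise` is a hypothetical helper and the JSON content in the demo is a placeholder, not the real perseus schema.

```python
import io
import json
import zipfile


def read_exercise(perseus: bytes) -> dict:
    """Parse the exercise.json entrypoint of a perseus (ZIP) file."""
    with zipfile.ZipFile(io.BytesIO(perseus)) as archive:
        with archive.open("exercise.json") as entry:
            return json.load(entry)


# Build a stand-in archive in memory; real files carry actual exercise data.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as archive:
    archive.writestr("exercise.json", json.dumps({"placeholder": "value"}))

print(read_exercise(raw.getvalue()))
```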

Requires:

  • Creating a standalone version of https://github.com/Khan/perseus
  • Adding that version to assets of ZIM
  • Adding the extracted (or not?) perseus file to the ZIM
  • Create a reader for the exercise: an HTML file calling the perseus code and referencing the perseus file or JSON.

Initial difficulty/step would be to create that standalone version. It should be one or multiple files that work in the browser and don't require any backend.
Might be interesting to look at https://github.com/learningequality/kolibri-exercise-perseus-plugin/ although this can't be reused directly obviously.

Once we have a working offline HTML/CSS/JS reader that can be passed a perseus or JSON file and render it, we can move on and integrate it in kolibri2zim.

Missing favicon on ZIM

This is a pylibzim related issue but the root cause hasn't been identified yet (and it depends on usage). favicon is added and the favicon_entry is set but kiwixlib's meta can't find it

African Storybooks

  • Website URL: https://africanstorybook.org/
  • License: CC-by (sometimes -NC)
  • Desired ZIM Title: African Storybooks
  • Desired ZIM Description: Picture storybooks for children’s literacy, enjoyment and imagination

The books are allegedly available in several dozen languages, so it would be nice to have a separate ZIM for each. Storybooks Canada has 40 of these already sorted in a few languages, along with a git repo and a list that might be an easier starting point (source descriptions are also available). Likewise Global Storybooks offers the same books, sorted by country instead of language, which is kind of odd as two countries might share a language and therefore display the exact same content.

Release 1.1.0

  • Check that dependencies have been updated to latest version (especially python-scraper lib)
  • Adjust version in __about__.py to x.x.x
  • Update Github milestone to match the issues that will be released
  • Close Github milestone
  • Update the Changelog so that it is in line with the content of Github milestone
  • Push a tag on Github named vx.x.x
  • Create a Github release, pointing to the tag, with the Changelog of this release
  • Publish the Github release. This will trigger the CI. If the CI fails and you have to push a minor fix that does not justify creating a new version, you will have to delete the release and re-create it from scratch.
  • Check that version is published as a Github release at https://github.com/openzim/kolibri/releases
  • Check that version is published on Github Container Registry at https://ghcr.io/openzim/kolibri and tagged latest
  • Check that version is published on Pypi at https://pypi.org/project/kolibri2zim/
  • Create a new Github milestone with the next minor version, incrementally
  • Create a new Github issue attached to this milestone with this checklist inside
  • Create new ## [Unreleased] section in Changelog (placeholder for future entries)
  • Adjust version in __about__.py to x.y.z-dev0
  • Inform rgaudin that a new release is ready to use so that he will update Zimfarm recipes
  • If needed, open a PR on Zimfarm to add support for new CLI parameters of interest

Better default UI

Current UI is minimalist (or non-existent) and we should import / reproduce the existing UI.

Here is, for instance, the layout for Algebra 1:
[Screenshot, 2023-05-17 at 15:44:07]

And how it comes out in the ZIM:
[Screenshot, 2023-05-17 at 15:44:18]

And then going further into Algebra foundations:
[Screenshot, 2023-05-17 at 15:45:35]

And in the ZIM:
[Screenshot, 2023-05-17 at 15:46:24]

Improve errors management

Most of the work happens inside separate threads and processes. Exceptions raised there are logged and visible, but no further action is taken.
We should stop the scraper on exception and return an error code.
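
A minimal sketch of that behavior for thread-based work, assuming jobs are plain callables. `run_all` is a hypothetical helper: it stops scheduling on the first failure and returns a non-zero code that the entrypoint could pass to `sys.exit`. Jobs that are already running still finish, and separate processes would need their own handling.

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)


def run_all(jobs, max_workers: int = 4) -> int:
    """Run callables in threads; return 0 if all succeed, 1 at the first failure."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(job) for job in jobs]
        for future in concurrent.futures.as_completed(futures):
            error = future.exception()
            if error is not None:
                logger.error("job failed: %r", error)
                # Jobs that have not started yet are dropped.
                for pending in futures:
                    pending.cancel()
                return 1
    return 0


print(run_all([lambda: None, lambda: None]))
```

In an entrypoint this could be used as `sys.exit(run_all(jobs))`.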

move s3 upload to IO-bound threads

Currently, the upload of re-encoded videos is called through the re-encode callback and thus executed in the main thread.

We should defer that to separate threads.

Probably pertinent to refactor the process threads around a thread executor and submit those requests to it.
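
The refactoring could look roughly like the sketch below, where the re-encode callback only schedules the upload on a thread executor and returns immediately. `UploadQueue` and its method names are hypothetical, and `upload` stands in for the real S3 upload function.

```python
from concurrent.futures import ThreadPoolExecutor


class UploadQueue:
    """Defer uploads to IO-bound worker threads."""

    def __init__(self, upload, max_workers: int = 4):
        self._upload = upload
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []

    def on_reencoded(self, path):
        """Callback for the re-encoder: schedule the upload and return at once."""
        self._futures.append(self._executor.submit(self._upload, path))

    def wait(self):
        """Block until all uploads finish; re-raise the first failure, if any."""
        self._executor.shutdown(wait=True)
        return [future.result() for future in self._futures]


uploaded = []
queue = UploadQueue(uploaded.append)  # stand-in for the real upload function
queue.on_reencoded("video-a.mp4")
queue.on_reencoded("video-b.mp4")
queue.wait()
print(sorted(uploaded))
```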

Report scraper progress

Add support for generating the task_progress.json file, so that progress can be reported by Zimfarm workers and displayed in the Zimfarm UI.
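
A sketch of what writing that file might look like. The `done` and `total` keys are an assumption about what Zimfarm expects and should be checked against its documentation; `write_progress` is a hypothetical helper. Writing to a temporary file and renaming it avoids a worker reading a half-written file.

```python
import json
import os
import pathlib
import tempfile


def write_progress(path, done: int, total: int) -> None:
    """Atomically write a progress file with (assumed) `done` and `total` keys."""
    path = pathlib.Path(path)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"done": done, "total": total}))
    os.replace(tmp, path)  # atomic rename on the same filesystem


with tempfile.TemporaryDirectory() as folder:
    target = pathlib.Path(folder) / "task_progress.json"
    write_progress(target, done=3, total=10)
    print(target.read_text())
```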

Add support for thumbnails in topic cards

Currently kolibri2zim doesn't support thumbnails in topic cards (has a space for them though), and they are easy to download from the Kolibri studio. We shall support these.

Add support for h5p nodes

H5P support is currently incomplete. There are some path issues to fix: h5p-standalone requires a pre-extracted h5p file, and conflicts arise due to the presence of namespaces in the ZIM spec.
