almet / zimit

Create zim packages out of regular websites

Home Page: https://zimit.notmyidea.org

Python 31.36% HTML 8.56% CSS 60.08%

zimit's Introduction

Create ZIM files out of HTTP websites

This project provides an API and a user interface for converting any website into a ZIM file.

Exposed API

All endpoints speak JSON over HTTP. All parameters must therefore be sent as stringified JSON, and the Content-Type header must be set to "application/json".

POST /website-zim

Posting to this endpoint asks the system to start downloading a website and converting it into a ZIM file.

Required parameters

url: URL of the website to be crawled
title: Title that will be used in the created Zim file
email: Email address that will get notified when the creation of the file is over

Optional parameters

language: An ISO 639-3 code representing the language
welcome: the page that will be first shown in the Zim file
description: The description that will be embedded in the Zim file
author: The author of the content

Return values

job_id: The job id is returned in JSON format. It can be used to know the status of the process.

Status codes

400 Bad Request will be returned if the request does not match the expected inputs. In that case, look at the body of the response: it describes what is missing.
201 Created will be returned if the process started.

Example

$ http POST http://0.0.0.0:6543/website-zim url="https://refugeeinfo.eu/" title="Refugee Info" email="[email protected]"
HTTP/1.1 201 Created

{
    "job": "5012abe3-bee2-4dd7-be87-39a88d76035d"
}
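The same request can also be built from Python. The following is a minimal sketch using only the standard library; the helper name `build_zim_request` and the early check on required parameters are assumptions made for illustration, and the path follows the `POST /website-zim` heading above.

```python
import json
import urllib.request

# Parameters documented as required for POST /website-zim.
REQUIRED = ("url", "title", "email")


def build_zim_request(base_url, **params):
    """Build (but do not send) a JSON POST request for /website-zim."""
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        # The server would answer 400 Bad Request; fail early instead.
        raise ValueError("missing required parameters: " + ", ".join(missing))
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/website-zim",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it to a running instance. Keeping the construction separate from the network call makes the request easy to inspect before anything is submitted.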

GET /status/{jobid}

Retrieves the status of a job and displays the associated logs.

Return values

status: The status of the job. It is one of 'queued', 'finished', 'failed', 'started' and 'deferred'.
log: The logs of the job.

Status codes

404 Not Found will be returned in case the requested job does not exist.
200 OK will be returned in any other case.

Example

$ http GET http://0.0.0.0:6543/status/5012abe3-bee2-4dd7-be87-39a88d76035d
HTTP/1.1 200 OK

{
    "log": "<snip>",
    "status": "finished"
}
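A client will usually poll this endpoint until the job is done. The sketch below shows one way to do that with the standard library. The function names, the polling interval and the choice to treat 'finished' and 'failed' as the only final statuses are assumptions, not part of the API.

```python
import json
import time
import urllib.request

# Assumption: these two statuses mean the job will not change any more.
TERMINAL_STATUSES = {"finished", "failed"}


def fetch_status(base_url, job_id):
    """GET /status/{jobid} and return the decoded JSON body."""
    url = "{}/status/{}".format(base_url.rstrip("/"), job_id)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job(job_id, fetch, interval=5.0, max_polls=120):
    """Call fetch(job_id) repeatedly until the job is finished or failed."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        if payload["status"] in TERMINAL_STATUSES:
            return payload
        time.sleep(interval)
    raise TimeoutError("job %s not done after %d polls" % (job_id, max_polls))
```

For example, `wait_for_job(job_id, lambda j: fetch_status("http://localhost:6543", j))` blocks until the server reports a final status, then returns the body with its `status` and `log` keys.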

Okay, so how do I install it on my server?

Currently, the best way to install it is to retrieve the sources from GitHub:

$ git clone https://github.com/almet/zimit.git
$ cd zimit

Create a virtual environment and install the project in it:

$ virtualenv venv
$ venv/bin/pip install -e .
$ venv/bin/pip install -e git+https://github.com/nvie/rq.git@master#egg=rq

Then, run it how you want, for instance with pserve:

$ venv/bin/pserve zimit.ini

In a separate process, you also need to run the worker:

$ venv/bin/rqworker

And you're ready to go. To test it:

$ http POST http://0.0.0.0:6543/website-zim url="https://refugeeinfo.eu/" title="Refugee Info" email="[email protected]"

Debian dependencies

Installing the dependencies

sudo apt-get install httrack libzim-dev libmagic-dev liblzma-dev libz-dev build-essential libtool libgumbo-dev redis-server automake pkg-config

Installing zimwriterfs

git clone https://github.com/wikimedia/openzim.git
cd openzim/zimwriterfs
./autogen.sh
./configure
make

Then update the path to the zimwriterfs executable in zimit.ini, and start the worker and the server:

$ rqworker & pserve zimit.ini

How to deploy?

There are multiple ways to deploy such a service, so I'll describe how I do it, following my own best practices.

First of all, get all the dependencies and the code. I like to have everything available in /home/www, so let's consider this will be the case here:

$ mkdir /home/www/zimit.notmyidea.org
$ cd /home/www/zimit.notmyidea.org
$ git clone https://github.com/almet/zimit.git

Then, change the configuration by creating a new file based on the default one:

$ cd zimit
$ cp zimit.ini local.ini

From there, you need to update the configuration to point to the correct binaries and locations.
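Before starting the service, it can help to check that the paths in the new configuration actually exist. The sketch below is one possible way to do so, not something zimit provides. The function name `missing_paths` is hypothetical, and it assumes the relevant settings live in the `[app:main]` section and hold absolute paths.

```python
import configparser
import os


def missing_paths(ini_text, section="app:main"):
    """Return {setting: path} for absolute paths in `section` that do not exist."""
    # Interpolation is disabled so that "%(here)s"-style values are read verbatim.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(ini_text)
    missing = {}
    for key, value in config.items(section):
        # Only values that look like absolute paths are checked.
        if value.startswith("/") and not os.path.exists(value):
            missing[key] = value
    return missing
```

Calling `missing_paths(open("local.ini").read())` returns an empty dictionary when every absolute path in the section is present on disk.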

Nginx configuration

# the upstream component nginx needs to connect to
upstream zimit_upstream {
    server unix:///tmp/zimit.sock;
}

# configuration of the server
server {
    listen      80;
    listen      [::]:80;
    server_name zimit.ideascube.org;
    charset     utf-8;

    client_max_body_size 200M;

    location /zims {
        alias /home/ideascube/zimit.ideascube.org/zims/;
        autoindex on;
    }

    # Finally, send all non-media requests to the Pyramid server.
    location / {
        uwsgi_pass  zimit_upstream;
        include     /var/ideascube/uwsgi_params;
    }
}

UWSGI configuration

[uwsgi]
uid = ideascube
gid = ideascube
chdir           = /home/ideascube/zimit.ideascube.org/zimit/
ini             = /home/ideascube/zimit.ideascube.org/zimit/local.ini
# the virtualenv (full path)
home            = /home/ideascube/zimit.ideascube.org/venv/

# process-related settings
# master
master          = true
# maximum number of worker processes
processes       = 4
# the socket (use the full path to be safe)
socket          = /tmp/zimit.sock
# ... with appropriate permissions - may be needed
chmod-socket    = 666
# stats           = /tmp/ideascube.stats.sock
# clear environment on exit
vacuum          = true
plugins         = python

supervisord configuration

[program:zimit-worker]
command=/home/ideascube/zimit.ideascube.org/venv/bin/rqworker
directory=/home/ideascube/zimit.ideascube.org/zimit/
user=www-data
autostart=true
autorestart=true
redirect_stderr=true

That's it!

zimit's People

Forkers

natim

zimit's Issues

Remove protocol from ZIM file name

When downloading https://refugeeinfo.eu/, the ZIM file is named https-refugeeinfo-eu.zim. It should be refugeeinfo-eu.zim.
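The naming requested in this issue could be sketched as follows. This is an illustration of the expected result only, not zimit's actual implementation; the function name `zim_name_from_url` is hypothetical.

```python
import re
from urllib.parse import urlparse


def zim_name_from_url(url):
    """Derive a ZIM file name from the host and path, ignoring the scheme."""
    parsed = urlparse(url)
    # Keep host and path; the scheme ("https") is deliberately left out.
    raw = parsed.netloc + parsed.path
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return slug + ".zim"
```

With this sketch, `zim_name_from_url("https://refugeeinfo.eu/")` gives `refugeeinfo-eu.zim`.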

Make it work with a small list of websites

Crawl all level 1 links

Some websites have a lot of external resources. It might be interesting to follow these links (but just once).

Service endpoint

Consider renaming the endpoint to something less generic, like www-zim or website-zim.

Detail installation instructions in python.

Add an advanced option to overwrite/write a file

Use case: write CSS to remove the search form or other elements that are useless offline.
For example: add "#search{ display:none;}" to the file static/style.css

Add some internal documentation

Currently, the internal documentation is lacking.

We could solve this by adding some docstrings here and there, where needed.

cc @tim-moody

merge zimit.openzim.org live updates

diff --git a/app.wsgi b/app.wsgi
index f66d9b8..cff45a2 100644
--- a/app.wsgi
+++ b/app.wsgi
@@ -1,11 +1,5 @@
-try:
-    import ConfigParser as configparser
-except ImportError:
-    import configparser
-import logging.config
-import os
-
 from zimit import main
+from paste.evalexception.middleware import EvalException

 here = os.path.dirname(__file__)

@@ -20,5 +14,6 @@ logging.config.fileConfig(ini_path)
 config = configparser.ConfigParser()
 config.read(ini_path)

-application = main(config.items('DEFAULT'), **dict(config.items('app:main'
-)))
+application = EvalException(main(config.items('DEFAULT'), **dict(config.items('app:main'))))
+
+#application = main(config.items('DEFAULT'), **dict(config.items('app:main')))
diff --git a/zimit/creator.py b/zimit/creator.py
index 9646892..b2b24e6 100644
--- a/zimit/creator.py
+++ b/zimit/creator.py
@@ -71,7 +71,7 @@ class ZimCreator(object):
         if not os.path.isdir(website_folder):
             message = "Unable to find the website folder! %s" % website_folder
             raise Exception(message)
-        shutil.copy('./favicon.ico', website_folder)
+        shutil.copy('/var/www/zimit.openzim.org/favicon.ico', website_folder)
         return website_folder

     def create_zim(self, input_location, output_name, zim_options):

Release ZimIt 1.0

It works, so a version should be released, which means:

Create a CHANGELOG
Put a tag on master
Make an announcement

Index the generated Zims

The kiwix-index command might be helpful here.

It should be possible to specify the ZIM description

This is important to be able to put a custom description for the ZIM file.

Consider switching to Celery

The goal here would be to ease the log capture for #5. https://github.com/sontek/pyramid_celery seems to be a good wrapper for Pyramid apps.

Home page is not correct

I have downloaded the following file:
http://zimit.ideascube.org/zims/https-refugeeinfo-eu.zim

and read it with kiwix-serve.

If I click on home, it routes me to:
http://zimfarm.kiwix.org:8081/https-refugeeinfo-eu/refugeeinfo.eu/index.html

which does not exist. It should be:
http://zimfarm.kiwix.org:8081/https-refugeeinfo-eu/A/refugeeinfo.eu/index.html

Email says it's ready, but the link returns 404

On this instance: https://zimit.notmyidea.org

- ask for https://projecteuler.net/
- wait

You get the mail "your zimfile is ready", but opening the URL returns 404 (https://static.notmyidea.org/zims/https-projecteuler-net.zim).
According to the log, I think zimwriterfs may have crashed.

It works well with perdu.com.

Split the worker in 3 jobs

download
zim
email
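One way to express such a split with rq is to chain jobs through the `depends_on` argument of `Queue.enqueue`, so each step starts only once the previous one has succeeded. The sketch below assumes that approach. The three step functions and their arguments are hypothetical placeholders, not zimit's code.

```python
def download_website(url, folder):
    """Placeholder for the crawl step (hypothetical)."""


def create_zim(folder, zim_path, title):
    """Placeholder for the ZIM creation step (hypothetical)."""


def send_email(email, zim_path):
    """Placeholder for the notification step (hypothetical)."""


def enqueue_pipeline(queue, url, title, email, folder, zim_path):
    """Enqueue three chained jobs: download, then zim, then email."""
    download_job = queue.enqueue(download_website, url, folder)
    # Each later job waits for the previous one via depends_on.
    zim_job = queue.enqueue(
        create_zim, folder, zim_path, title, depends_on=download_job
    )
    email_job = queue.enqueue(send_email, email, zim_path, depends_on=zim_job)
    return download_job, zim_job, email_job
```

Here `queue` would be an `rq.Queue` connected to Redis. Since a dependent job does not receive its parent's return value, the folder and output path are passed explicitly to every step.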

The list of top level files / folder in the zim
Some packaging information

Explain better in the README that it's an API + something to consume it.

cc @tim-moody
