
zimit's Introduction

Create ZIM files out of HTTP websites

This project provides an API and a user interface to convert any website into a Zim file.

Exposed API

All endpoints speak JSON over HTTP. All parameters must therefore be sent as stringified JSON, and the Content-Type header must be set to "application/json".

POST /website-zim

Posting to this endpoint asks the system to start downloading a website and converting it into a Zim file.

Required parameters

  • url: URL of the website to be crawled
  • title: Title that will be used in the created Zim file
  • email: Email address to notify once the file has been created

Optional parameters

  • language: An ISO 639-3 code representing the language
  • welcome: The page shown first when the Zim file is opened
  • description: The description that will be embedded in the Zim file
  • author: The author of the content
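
The parameter lists above can be checked on the client side before any request is sent. What follows is a minimal sketch, not part of zimit itself: the helper name `build_payload` is hypothetical, and only the parameter names come from the lists above.

```python
import json

# Parameter names taken from the lists above.
REQUIRED = ("url", "title", "email")
OPTIONAL = ("language", "welcome", "description", "author")


def build_payload(**params):
    """Return the JSON body for a conversion request (hypothetical helper).

    Raises ValueError when a required parameter is missing or an unknown
    parameter is passed, so mistakes surface before the HTTP call.
    """
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    unknown = sorted(set(params) - set(REQUIRED) - set(OPTIONAL))
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))
    return json.dumps(params)
```

The server remains the authority on what is valid; this only avoids an obvious 400 round trip.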

Return values

  • job_id: The job ID, returned in JSON format. Use it to query the status of the process.

Status codes

  • 400 Bad Request is returned when the request does not match the expected input. In that case, look at the body of the response: it describes what is missing.
  • 201 Created is returned if the process started.

Example

$ http POST http://0.0.0.0:6543/website-url url="https://refugeeinfo.eu/" title="Refugee Info" email="[email protected]"
HTTP/1.1 201 Created

{
    "job": "5012abe3-bee2-4dd7-be87-39a88d76035d"
}
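
The same call can be made from Python with the standard library. This is a sketch under assumptions: the function name `make_conversion_request` is hypothetical, and the `path` default is taken from the endpoint heading, so pass another value if your instance exposes a different path. The request is only built here, not sent.

```python
import json
import urllib.request


def make_conversion_request(base_url, url, title, email,
                            path="/website-zim", **optional):
    """Build (but do not send) the POST request described above."""
    payload = {"url": url, "title": title, "email": email}
    payload.update(optional)  # e.g. language, welcome, description, author
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it requires a running zimit instance:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` on a 400 response; the error body can be read with `error.read()`.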

GET /status/{jobid}

Retrieves the status of a job along with its associated logs.

Return values

  • status: The status of the job. It is one of 'queued', 'finished', 'failed', 'started' and 'deferred'.
  • log: The logs of the job.

Status codes

  • 404 Not Found will be returned in case the requested job does not exist.
  • 200 OK will be returned in any other case.

Example

$ http GET http://0.0.0.0:6543/status/5012abe3-bee2-4dd7-be87-39a88d76035d
HTTP/1.1 200 OK

{
    "log": "<snip>",
    "status": "finished"
}
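
A client usually wants to wait until a job is done. The sketch below polls the status endpoint; it assumes that 'finished' and 'failed' are the only terminal statuses, and the names `wait_for_job` and `fetch_status` are hypothetical. The HTTP call is passed in as a callable so the loop can be tried without a server.

```python
import time

# Assumption: these two statuses mean the job will not change any more.
TERMINAL_STATUSES = {"finished", "failed"}


def wait_for_job(fetch_status, job_id, interval=5.0, max_checks=120,
                 sleep=time.sleep):
    """Poll until the job is finished or failed, or give up.

    fetch_status is any callable taking a job id and returning the decoded
    JSON body of GET /status/{jobid} (a dict with "status" and "log").
    """
    for _ in range(max_checks):
        result = fetch_status(job_id)
        if result["status"] in TERMINAL_STATUSES:
            return result
        sleep(interval)
    raise TimeoutError(
        "job %s still not done after %d checks" % (job_id, max_checks)
    )
```

In real use, `fetch_status` would perform the GET request and decode the JSON body.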

Okay, so how do I install it on my server?

Currently, the best way to install it is to retrieve the sources from GitHub:

$ git clone https://github.com/almet/zimit.git
$ cd zimit

Create a virtual environment and install the project in it:

$ virtualenv venv
$ venv/bin/pip install -e .
$ venv/bin/pip install -e [email protected]:nvie/rq.git@master#egg=rq

Then run it however you like, for instance with pserve:

$ venv/bin/pserve zimit.ini

In a separate process, you also need to run the worker:

$ venv/bin/rqworker

And you're ready to go. To test it:

$ http POST http://0.0.0.0:6543/website-url url="https://refugeeinfo.eu/" title="Refugee Info" email="[email protected]"

Debian dependencies

Installing the dependencies

sudo apt-get install httrack libzim-dev libmagic-dev liblzma-dev libz-dev build-essential libtool libgumbo-dev redis-server automake pkg-config

Installing zimwriterfs

git clone https://github.com/wikimedia/openzim.git
cd openzim/zimwriterfs
./autogen.sh
./configure
make

Then update the path to the zimwriterfs executable in zimit.ini, and start the worker and the server:

$ rqworker & pserve zimit.ini

How to deploy?

There are multiple ways to deploy such a service, so I'll describe how I do it, following my own best practices.

First of all, get all the dependencies and the code. I like to have everything available in /home/www, so let's assume that is the case here:

$ mkdir /home/www/zimit.notmyidea.org
$ cd /home/www/zimit.notmyidea.org
$ git clone https://github.com/almet/zimit.git

Then, you can change the configuration file, by creating a new one:

$ cd zimit
$ cp zimit.ini local.ini

From there, you need to update the configuration to point to the correct binaries and locations.
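
Editing the file by hand works fine. If you would rather script it, here is a minimal sketch using Python's configparser. The helper `set_option` and the option name used in the usage comment are hypothetical: check zimit.ini for the real key names. The `app:main` section name is the usual one for a Pyramid ini file.

```python
import configparser
import io


def set_option(ini_text, section, key, value):
    """Return ini_text with `key` set to `value` in `section`."""
    # interpolation=None keeps values such as %(here)s untouched.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(ini_text)
    # The DEFAULT section always exists and cannot be added explicitly.
    if section != config.default_section and not config.has_section(section):
        config.add_section(section)
    config.set(section, key, value)
    out = io.StringIO()
    config.write(out)
    return out.getvalue()


# Usage (hypothetical option name):
# with open("local.ini") as f:
#     updated = set_option(f.read(), "app:main",
#                          "zimit.zimwriterfs_bin", "/usr/local/bin/zimwriterfs")
```

Be aware that configparser drops comments when it rewrites a file, so keep a copy of the original if the comments matter to you.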

Nginx configuration

# the upstream component nginx needs to connect to
upstream zimit_upstream {
    server unix:///tmp/zimit.sock;
}

# configuration of the server
server {
    listen      80;
    listen      [::]:80;
    server_name zimit.ideascube.org;
    charset     utf-8;

    client_max_body_size 200M;

    location /zims {
        alias /home/ideascube/zimit.ideascube.org/zims/;
        autoindex on;
    }

    # Finally, send all non-media requests to the Pyramid server.
    location / {
        uwsgi_pass  zimit_upstream;
        include     /var/ideascube/uwsgi_params;
    }
}

UWSGI configuration

[uwsgi]
uid = ideascube
gid = ideascube
chdir           = /home/ideascube/zimit.ideascube.org/zimit/
ini             = /home/ideascube/zimit.ideascube.org/zimit/local.ini
# the virtualenv (full path)
home            = /home/ideascube/zimit.ideascube.org/venv/

# process-related settings
# master
master          = true
# maximum number of worker processes
processes       = 4
# the socket (use the full path to be safe)
socket          = /tmp/zimit.sock
# ... with appropriate permissions - may be needed
chmod-socket    = 666
# stats           = /tmp/ideascube.stats.sock
# clear environment on exit
vacuum          = true
plugins         = python

supervisord configuration

[program:zimit-worker]
command=/home/ideascube/zimit.ideascube.org/venv/bin/rqworker
directory=/home/ideascube/zimit.ideascube.org/zimit/
user=www-data
autostart=true
autorestart=true
redirect_stderr=true

That's it!

zimit's Issues

Crawl all level 1 links

Some websites have a lot of external resources. It might be interesting to follow these links (but just once).
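
To make the idea concrete, here is a rough sketch of how the level 1 external links of a page could be collected. It uses only the Python standard library and is independent of the crawler zimit actually relies on; the names `external_links` and `_LinkCollector` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html, page_url):
    """Return the set of absolute http(s) links that leave page_url's host."""
    collector = _LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    found = set()
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc != own_host:
            found.add(absolute)
    return found
```

Fetching each returned URL once, without looking at the links it contains, gives the "just once" behaviour described above.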

Service endpoint

Consider renaming the endpoint to something less generic, such as www-zim or website-zim.

merge zimit.openzim.org live updates

diff --git a/app.wsgi b/app.wsgi
index f66d9b8..cff45a2 100644
--- a/app.wsgi
+++ b/app.wsgi
@@ -1,11 +1,5 @@
-try:
-    import ConfigParser as configparser
-except ImportError:
-    import configparser
-import logging.config
-import os
-
 from zimit import main
+from paste.evalexception.middleware import EvalException

 here = os.path.dirname(__file__)

@@ -20,5 +14,6 @@ logging.config.fileConfig(ini_path)
 config = configparser.ConfigParser()
 config.read(ini_path)

-application = main(config.items('DEFAULT'), **dict(config.items('app:main'
-)))
+application = EvalException(main(config.items('DEFAULT'), **dict(config.items('app:main'))))
+
+#application = main(config.items('DEFAULT'), **dict(config.items('app:main')))
diff --git a/zimit/creator.py b/zimit/creator.py
index 9646892..b2b24e6 100644
--- a/zimit/creator.py
+++ b/zimit/creator.py
@@ -71,7 +71,7 @@ class ZimCreator(object):
         if not os.path.isdir(website_folder):
             message = "Unable to find the website folder! %s" % website_folder
             raise Exception(message)
-        shutil.copy('./favicon.ico', website_folder)
+        shutil.copy('/var/www/zimit.openzim.org/favicon.ico', website_folder)
         return website_folder

     def create_zim(self, input_location, output_name, zim_options):

Release ZimIt 1.0

It works, so a version should be released, which means:

  • Create CHANGELOG
  • Put a tag on master
  • Make an announcement

Feature request: set size limit

Allow an optional parameter to limit the size of the download:

  • Monitor the size and stop if the limit is reached
  • A status request returns stopped if the size limit was reached
  • Open question: an API to continue later?
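
The monitoring part could look like the sketch below, assuming the download lands in a single folder on disk. The function names `directory_size` and `over_limit` are hypothetical, and nothing here is wired into zimit.

```python
import os


def directory_size(path):
    """Total size in bytes of the regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def over_limit(path, max_bytes):
    """True once the download folder has reached the size limit."""
    return directory_size(path) >= max_bytes
```

A worker could call `over_limit` periodically while the crawl runs and stop the download when it returns True.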

Change the signature of the ZimCreator class

Currently, it takes a settings dictionary, which isn't really self-explanatory. Instead, passing the settings explicitly and having a separate function convert them to the expected dictionary would make it easier for external clients to use (cc @tim-moody)

Add a quick zim reader view

It could list some interesting information, such as:

  • The list of top-level files and folders in the zim
  • Some packaging information
