FastCCI

Fast Commons Category Inspection (FastCCI) is an in-memory database for fast Wikimedia Commons category operations such as:

  • Loop detection
  • Deep traversal
  • Category intersection
  • Category subtraction

FastCCI can operate without depth limits on categories.
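
The operations listed above can be pictured with a small sketch. This is not FastCCI's implementation (the real tools work on the binary database files); it is a minimal illustration, assuming a toy in-memory graph with hypothetical ids, of how a deep traversal with a visited set tolerates category loops and how intersection and subtraction reduce to set operations on the traversal results.

```python
from collections import deque

def files_below(subcats, files, root, max_depth=None):
    """Collect the files in and below `root`, visiting each category once."""
    seen = {root}
    queue = deque([(root, 0)])
    found = set()
    while queue:
        cat, depth = queue.popleft()
        found.update(files.get(cat, ()))
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached, do not descend further
        for child in subcats.get(cat, ()):
            if child not in seen:  # the visited set makes loops harmless
                seen.add(child)
                queue.append((child, depth + 1))
    return found

# Hypothetical toy data: category id -> subcategory ids / file ids.
subcats = {1: [2, 3], 2: [4], 3: [4], 4: [1]}  # 4 -> 1 closes a loop
files = {2: [100, 101], 3: [101, 102], 4: [103]}

everything = files_below(subcats, files, 2)  # terminates despite the loop
in_both = files_below(subcats, files, 2, 0) & files_below(subcats, files, 3, 0)
only_in_2 = files_below(subcats, files, 2, 0) - files_below(subcats, files, 3, 0)
```

With `max_depth=None` the traversal has no depth limit, which mirrors the default described for the server's depth parameters.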

fastcci_build_db builds the binary database files from an SQL dump of the categorylinks database.

fastcci_server is the database server backend that can be queried through HTTP.

Where is FastCCI used?

An instance of the FastCCI backend is running on Wikimedia Labs at http://fastcci1.wmflabs.org/. A frontend is available on Wikimedia Commons as an installable gadget.

Preparing database

The database is generated from a simple parent-child pageid table produced by a short SQL query. On Wikimedia Tool Labs, this query can be launched with the following command. The text output is streamed into the fastcci_build_db command, which parses it and generates a binary database image consisting of the fastcci.cat index file and the fastcci.tree data file. Both files are saved to the current directory.

mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.labsdb commonswiki_p -e 'select /* SLOW_OK */ cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id;' --quick --batch --silent | ./fastcci_build_db
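
To inspect the query output before feeding it to fastcci_build_db, the rows can be grouped by parent category. The sketch below is only an illustration and makes two assumptions that are not stated above: that the mysql client emits one tab-separated row per line, and that cl_type is either "subcat" or "file". The function name `read_links` is hypothetical and is not part of FastCCI.

```python
import io
from collections import defaultdict

def read_links(stream):
    """Group (cl_from, page_id, cl_type) rows by parent category pageid."""
    subcats = defaultdict(list)
    files = defaultdict(list)
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # skip blank, malformed, or header lines
        child, parent, cl_type = int(fields[0]), int(fields[1]), fields[2]
        # cl_from is the member page, page_id is the category that contains it.
        target = subcats if cl_type == "subcat" else files
        target[parent].append(child)
    return subcats, files

# Hypothetical sample rows standing in for the real query output.
sample = io.StringIO("20\t10\tsubcat\n300\t10\tfile\n301\t20\tfile\n")
subcats, files = read_links(sample)
```

In practice the stream would be `sys.stdin` at the end of the same pipe that otherwise ends in `./fastcci_build_db`.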

Query syntax

Start the server with ./fastcci_server PORT DATADIR, where PORT is the TCP port the server listens on, and DATADIR is the path to the directory containing the fastcci.cat and fastcci.tree files.

The server can be queried through HTTP or WebSockets. The URLs are the same in both cases (except for the protocol part). The request string looks like an ordinary HTTP GET URL. Assuming the server was started on port 8080, you can query it using curl like this:

curl 'http://localhost:8080/?c1=9986&c2=26398707&a=path'

For backwards compatibility with browsers that do not properly support cross-domain requests, a JavaScript callback mode exists that wraps the result data in a call to a fastcci_callback function. This mode is activated by adding the query parameter t=js.

Query parameters

  • c1 The primary category pageid (an integer). This must always be specified; otherwise the server returns an HTTP 500 error.
  • c2 The secondary category (or file) pageid
  • d1 The primary search depth (defaults to infinity)
  • d2 The secondary search depth (defaults to infinity)
  • a The query action. Values can be:
    • and Perform the intersection between category c1 and category c2 (default action)
    • not Fetch files that are in category c1 but not in category c2
    • list List all files in and below category c1
    • fqv List all FPs, QIs, and VIs files (in that order) in and below category c1
    • path Find the subcategory path from category c1 to file or category c2

The server performs some sanity checking on the query parameters to make sure that the supplied pageids point to categories (or, where allowed, to files).
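
The query parameters can also be assembled programmatically. The helper below is a hypothetical sketch (the name `build_query_url` and its keyword arguments are not part of FastCCI); it only encodes the parameters documented above and does not contact a server.

```python
from urllib.parse import urlencode

ACTIONS = {"and", "not", "list", "fqv", "path"}

def build_query_url(base, c1, c2=None, d1=None, d2=None, action="and", js_callback=False):
    """Build a FastCCI query URL from the documented parameters."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    params = {"c1": int(c1)}  # c1 is mandatory
    if c2 is not None:
        params["c2"] = int(c2)
    if d1 is not None:
        params["d1"] = int(d1)  # omitted depth means no limit
    if d2 is not None:
        params["d2"] = int(d2)
    params["a"] = action
    if js_callback:
        params["t"] = "js"  # wrap the result in a fastcci_callback call
    return base.rstrip("/") + "/?" + urlencode(params)

url = build_query_url("http://localhost:8080", 9986, 26398707, action="path")
```

Here `url` is the same URL as in the curl example above, so it can be passed to curl or to any HTTP client.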

Response format

The response is delivered in a simple text format with multiple lines. Each line starts with a keyword and may be followed by data. The keywords are:

  • RESULT followed by a | separated list of up to 50 integer triplets of the form pageId,depth,tag. Each triplet stands for one image or category.
  • NOPATH indicates that no path from c1 to c2 was found in an a=path request.
  • OUTOF followed by an integer that is the number of total items in the calculated result (rather than the number of returned items). This can be either an exact number (for a=list) or an estimate (for a=and and a=not).
  • QUEUED is the immediate acknowledgement that the server has queued the current request.
  • WAITING is sent to the client with one integer value representing the number of requests that are ahead in the queue and will be processed before the current request.
  • WORKING followed by two integers representing the current number of items found in c1 and c2. This response item is sent to the client every 0.2s and shows the current state of the ongoing category traversal.
  • DONE indicates the end of the server transmission.
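
A client can process this format line by line. The parser below is a minimal sketch, assuming that the keyword and its data are separated by a single space (the exact separator is not specified above) and that the sample response is purely hypothetical. The function name `parse_response` is illustrative only.

```python
def parse_response(text):
    """Split a FastCCI text response into results, totals, and status lines."""
    parsed = {"results": [], "outof": None, "nopath": False, "done": False, "status": []}
    for line in text.splitlines():
        keyword, _, data = line.strip().partition(" ")
        if not keyword:
            continue  # ignore empty lines
        if keyword == "RESULT":
            for triplet in data.split("|"):
                if triplet:
                    page_id, depth, tag = (int(v) for v in triplet.split(","))
                    parsed["results"].append((page_id, depth, tag))
        elif keyword == "OUTOF":
            parsed["outof"] = int(data)
        elif keyword == "NOPATH":
            parsed["nopath"] = True
        elif keyword == "DONE":
            parsed["done"] = True
        else:
            # QUEUED, WAITING, WORKING: keep the raw progress messages.
            parsed["status"].append((keyword, data))
    return parsed

# Hypothetical response text, made up for illustration.
sample = "QUEUED\nWAITING 1\nWORKING 10 20\nRESULT 5,0,0|6,1,0\nOUTOF 2\nDONE\n"
parsed = parse_response(sample)
```

Keeping the progress keywords as raw pairs avoids guessing at their exact layout while still letting a frontend display them.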

Command line tools

  • fastcci_tarjan uses Tarjan's algorithm to find strongly connected components in the category graph. Those are essentially connected clusters of loops.
  • fastcci_circulartest uses a custom algorithm to find individual category loops. Unlike fastcci_tarjan, this also catches self-referencing categories. It may, however, omit loops that share nodes with other loops.
  • fastcci_subcats cat_id outputs the direct subcategories of the category specified by cat_id (this is mostly for debugging).
  • fastcci_pfs_search P F S finds all categories with P parent categories, F files, and S subcategories.
  • fastcci_diamond finds all category diamonds (i.e. B,C categories with a common parent A and a common subcategory D).
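
For readers unfamiliar with the technique behind fastcci_tarjan, the following is a textbook sketch of Tarjan's strongly connected components algorithm on a hypothetical toy graph. It is not the tool's source code. In this sketch a self-referencing category forms a component of size one, so filtering for components with more than one member would not report it; whether fastcci_tarjan misses self-references for the same reason is an assumption.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; `graph` maps a node to its list of child nodes."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for child in graph.get(node, ()):
            if child not in index:
                visit(child)
                lowlink[node] = min(lowlink[node], lowlink[child])
            elif child in on_stack:
                lowlink[node] = min(lowlink[node], index[child])
        if lowlink[node] == index[node]:  # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components

# Hypothetical category graph: two loop clusters and one self-reference (6).
graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [6]}
components = strongly_connected_components(graph)
loops = sorted(sorted(c) for c in components if len(c) > 1)
```

The recursive form is kept for readability; on a graph as deep as the full Commons category tree, an iterative variant would be needed to stay within Python's recursion limit.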

Server setup

Systemd

Copy the service file assets/fastcci.service to /lib/systemd/system/fastcci.service and symlink it to /etc/systemd/system. Register the service with

systemctl daemon-reload
systemctl start fastcci.service

Inspect the FastCCI logs with: sudo journalctl -e -u fastcci.service

Upstart

Copy the service file assets/fastcci-server.conf to /etc/init/fastcci-server.conf.

fastcci's People

Contributors

dschwen, jeanfred, ricordisamoa

fastcci's Issues

Updates

The tool seems to have stopped updating a few months ago. No recent photos are visible.

Re-enable WebSockets

Currently the websocket transport is disabled due to a bug (either in libonion or the way I use it). While everything works fine with HTTP transport I'd like to get websockets up again for improved UX (live reporting of searched images etc.).

Scale to avoid "Waiting in line. X ahead of us."

I often use the tool, but most of the time I get:

Waiting in line. 2 ahead of us.

... or similar.
When that happens, the wait lasts for a very long time, so I always give up and search manually.

That's a shame, it is a wonderful tool when it works. It is essential if Commons really wants to become the place to go when looking for a quality image.

I know it is easier said than done, but how about making fastcci more scalable?

  • Run several requests in parallel threads?
  • Install the whole thing on several servers and do some load balancing?
  • Set up a time limit? (for instance only return the results found within 1 minute)

max cat needs to be bigger

The max cat in fastcci_build_db.cc needs to be at least 50000000, as the program now fails with a segmentation fault.
