kontext's Introduction

Introduction

KonText is an advanced corpus query interface and corpus data integration platform built around the corpus search engine Manatee-open. It is written in Python 3 and TypeScript and runs on any major Linux distribution. Development is maintained by the Institute of the Czech National Corpus.

Features

  • fully editable query chain
    • any operation from a user defined sequence (e.g. query -> filter -> sample -> sorting) can be changed and the whole sequence is then re-executed.
  • multiple search modes:
    • concordance
    • paradigmatic query
    • word list
    • keyword analysis
  • simple and advanced query types
    • advanced CQL editor with syntax highlighting and attribute recognition
    • interactive PoS tag composing tool for positional and key-value tagsets
    • customizable query suggestions and simple type query refinement (e.g. for homonym disambiguation)
  • support for spoken corpora
    • defined text segments can be played back as audio
    • KWIC detail with easily distinguishable speeches
  • rich concordance view options and tools
    • any positional attribute can be set as primary
    • multiple ways to display other attributes
    • user-defined line groups - filtering, reviewing group ratios
    • tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
  • rich subcorpus-related functionality
    • any subcorpus is accessible to other users (provided they obtain its URL; otherwise the subcorpus is not discoverable by default)
      • once a public description is set, the subcorpus can be discovered on the "public subcorpora" page
    • text types metadata can be gradually refined to a specific subcorpus ("which publishers are there in case only fiction is selected?")
    • a custom text types ratio can be defined ("give me 20% fiction and 80% journalism")
    • unused subcorpora can be archived (URLs with the subcorpus are still valid) or completely removed (URLs will become invalid)
    • searching within a subcorpus can be further refined with ad-hoc text type selection
    • a subcorpus can be created with respect to aligned corpora ("give me fiction in Czech but only if there is an English translation for it")
  • frequency distribution
    • univariate
      • positional attributes (including tuples of multiple attributes per token)
      • structural attributes
    • multivariate distribution (2 dimensions) for both positional and structural attributes
  • collocation analysis
  • persistent URLs - any result page can be easily shared even if the original query is megabytes long
  • access to previous queries, named queries
  • convenient corpus access
    • finding a corpus by keyword (tag), size, or description
    • adding a corpus to favorites (incl. subcorpora, aligned corpora)
  • saving result to Excel, CSV, XML, JSONL, TXT
  • HTTP API access
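
The "fully editable query chain" from the top of the list can be pictured as an ordered list of operations that is replayed from the start whenever one step changes. The following is a minimal sketch in plain Python, not KonText's actual implementation: the `Operation` type, the `run_chain` helper and the toy data are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Lines = List[str]


@dataclass
class Operation:
    """One step of a hypothetical query chain (the name is for display only)."""
    name: str
    apply: Callable[[Lines], Lines]


def run_chain(data: Lines, chain: List[Operation]) -> Lines:
    """Execute all operations in order, feeding each result to the next step."""
    result = list(data)
    for op in chain:
        result = op.apply(result)
    return result


# Toy "corpus": each string stands in for one concordance line.
corpus = ["a cat sat", "the dog ran", "a cat ran", "the cat slept"]

chain = [
    Operation("query", lambda lines: [ln for ln in lines if "cat" in ln]),
    Operation("filter", lambda lines: [ln for ln in lines if "ran" not in ln]),
    Operation("sorting", lambda lines: sorted(lines)),
]
first = run_chain(corpus, chain)

# Editing one step (here the filter) means the whole chain is executed again.
chain[1] = Operation("filter", lambda lines: [ln for ln in lines if "sat" not in ln])
second = run_chain(corpus, chain)
```

Keeping the chain as data rather than as a single stored result is what makes an edit of any intermediate step possible in this sketch.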

Internal features

  • modern client-side application (written in TypeScript, event stream architecture, React components, extensible)
  • server-side written using the Sanic framework with fully decoupled background concordance/frequency/collocation calculation (using an integrated Rq worker server)
  • modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
    • integrability with existing information systems
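
To illustrate what "dynamically loadable plug-ins" can mean in Python, the sketch below imports a module by its configured name and asks it for an instance through a factory function. It is an assumption-laden illustration, not KonText's loader: the `load_plugin` helper, the `demo_auth_plugin` module and the `create_instance` factory name as used here are hypothetical.

```python
import importlib
import sys
import types


def load_plugin(module_name: str, factory_name: str = "create_instance", **conf):
    """Import a module by name and call its factory function with the config."""
    module = importlib.import_module(module_name)
    factory = getattr(module, factory_name, None)
    if not callable(factory):
        raise ImportError(f"{module_name} does not provide {factory_name}()")
    return factory(**conf)


class DemoAuth:
    """Hypothetical authentication plug-in used only for this example."""

    def __init__(self, realm: str):
        self.realm = realm


def create_instance(realm: str = "default") -> DemoAuth:
    return DemoAuth(realm)


# Stand-in for a plug-in module that would normally live in its own file:
# register an in-memory module so importlib can find it by name.
demo = types.ModuleType("demo_auth_plugin")
demo.create_instance = create_instance
sys.modules["demo_auth_plugin"] = demo

auth = load_plugin("demo_auth_plugin", realm="corpus-users")
```

Because the module name comes from configuration, a different implementation (e.g. another database adapter) could be swapped in without touching the calling code.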

Installation

Docker

Running KonText as a set of Docker containers is the most convenient and flexible way. Docker Compose V2 is required. To run a basic configuration instance (i.e. no MySQL/MariaDB server, no WebSocket server) use:

docker compose up

To run a production grade instance:

docker compose -f docker-compose.yml -f docker-compose.mysql.yml --env-file .env.mysql up

(the .env.mysql file allows configuring custom MySQL/MariaDB credentials and the KonText configuration file)

Manual installation

Key requirements

  • Python 3.6 (or newer)
  • Manatee corpus search engine - version 2.167.8 and onwards (for KonText v0.17, Manatee v2.2xx is recommended)
  • a key-value storage
    • Redis (recommended), SQLite (supported), custom implementations possible
  • a task queue - Rq
  • HTTP proxy server

For Ubuntu OS users, it is recommended to use the install script which should perform most of the actions necessary to install and run KonText. For other Linux distributions we recommend running KonText within a container or a virtual machine. Please refer to the doc/INSTALL.md file for details.

Customization and contribution

Please refer to our Wiki.

Notable users

How to cite KonText

Tomáš Machálek (2020) - KonText: Advanced and Flexible Corpus Query Interface

@inproceedings{machalek-2020-kontext,
    title = "{K}on{T}ext: Advanced and Flexible Corpus Query Interface",
    author = "Mach{\'a}lek, Tom{\'a}{\v{s}}",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.865",
    pages = "7003--7008",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

kontext's People

Contributors

dependabot[bot], dlukes, kira-d, kosarko, lenoch, michkren, msklvsk, mzimandl, petrduda, tomachalek, tomazerjavec, vidiecan

kontext's Issues

implement KonText set-up validator

What should be checked:

  1. XML validity
  2. references (i.e. a string value referring to an element with specific attribute)
  3. basic plug-ins configuration (especially whether there are required modules present)
  4. paths - whether they exist and whether the web server has proper access rights
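
A minimal sketch of items 1 and 4, assuming the configuration is a single XML document: it checks well-formedness (not validity against a schema) and whether selected elements contain existing, readable paths. The function name `validate_setup` and the element names `log_path` and `cache_dir` are hypothetical. Note that `os.access` tests the rights of the user running the script, so checking the web server's rights would require running it under that account.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from typing import List


def validate_setup(xml_source: str, path_tags: List[str]) -> List[str]:
    """Return a list of human-readable problems found in a configuration string."""
    try:
        root = ET.fromstring(xml_source)
    except ET.ParseError as ex:
        # Nothing else can be checked if the document cannot be parsed.
        return [f"XML is not well-formed: {ex}"]
    problems = []
    for tag in path_tags:
        for elm in root.iter(tag):
            path = (elm.text or "").strip()
            if not os.path.exists(path):
                problems.append(f"<{tag}>: path does not exist: {path}")
            elif not os.access(path, os.R_OK):
                problems.append(f"<{tag}>: path is not readable: {path}")
    return problems


# Example: one existing directory and one path that should not exist.
existing = tempfile.gettempdir()
conf = (
    f"<kontext><log_path>{existing}</log_path>"
    "<cache_dir>/no/such/dir</cache_dir></kontext>"
)
report = validate_setup(conf, ["log_path", "cache_dir"])
broken = validate_setup("<a><b></a>", [])
```

Reference checks (item 2) and plug-in checks (item 3) would need knowledge of the actual configuration schema and are left out here.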

sorting of shuffled concordances

Vasek has discovered a strange bug where KonText behaves a bit differently from NoSkE. Search for "ano" (query type "basic") in ORAL2008, shuffle the concordance (if it is not shuffled by default) and then sort by sp.pohlavi. This will get you to the following screens:

NoSkE: https://korpus.cz/corpora/run.cgi/sortx?q=aword%2C[word%3D%22%28%3Fi%29ano%22]&q=f&corpname=oral2008&attrs=word&attr_allpos=kw&ctxattrs=word&structs=&refs=%3Ddoc.id%2C%3Dsp.pohlavi&iquery=ano&sattr=sp.pohlavi&skey=rc&spos=3

Kontext: https://korpus.cz/kontext/run.cgi/sortx?q=aword%2C[word%3D%22%28%3Fi%29ano%22]&q=f&corpname=oral2008&attrs=word&ctxattrs=word&structs=doc%2Csp&refs=%3Ddoc.id%2C%3Dsp.pohlavi&iquery=ano&sattr=sp.pohlavi&skey=rc&spos=3

Both screens look the same, the concordances are in the same order. But: when you use the "Collocation is sorted: Jump to:" feature and jump to "Z", NoSkE seems to work correctly as it takes you to the beginning of Z's, while kontext will show you a screen where one Z is in the middle of M's!

However, the difference seems to be caused by the way "Jump to" is implemented, because the ordering of concordances in both interfaces is the same (at least I didn't notice any difference) and by browsing through the pages you can see one more such "missorted" Z.

Can you please track down the problem and confirm whether this issue is caused by Manatee?

NB: when applying the same procedure on non-shuffled concordances, the output of sort is OK.
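
For tracking this down, a small diagnostic can list the positions where the displayed order is violated. The sketch below is only an illustration with made-up data: `find_missorted` is a hypothetical helper, the key values are invented, and the assumption that a "Jump to" feature lands on the first occurrence of a value is mine, not something confirmed for either interface.

```python
from typing import List, Sequence


def find_missorted(keys: Sequence[str]) -> List[int]:
    """Return indexes of items that are smaller than the item right before them."""
    return [i for i in range(1, len(keys)) if keys[i] < keys[i - 1]]


# Hypothetical sp.pohlavi values in the order the lines are displayed.
displayed = ["M", "M", "Z", "M", "M", "Z", "Z"]

breaks = find_missorted(displayed)   # positions where the order is violated
first_z = displayed.index("Z")       # where a naive "jump to Z" would land
```

With this data the first "Z" sits at index 2, in the middle of the "M" lines, which matches the symptom described above: any jump that relies on the list being fully sorted ends up in the wrong place.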

Manatee cache size

The Manatee cache takes up too much space now: a query that matches every corpus position (e.g. "[]") in SYN needs more than 50 GB, which stays there unless the cache is explicitly deleted.

It would be desirable to evaluate the usefulness of caching in general, and especially in cases when the concordance size is extremely large. It should be possible to come up with a better solution then, or we could also consider turning the caching off altogether. For the time being, more disk space will be reserved for the cache.
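
One possible stop-gap, short of turning caching off, would be to cap the cache directory at a size limit and delete the oldest files first. The sketch below shows that idea under stated assumptions: `prune_cache` is a hypothetical helper, the `.conc` file names are invented, and nothing here reflects how Manatee itself lays out or locks its cache files.

```python
import os
import tempfile
from typing import List


def prune_cache(cache_dir: str, max_bytes: int) -> List[str]:
    """Delete oldest files (by modification time) until the total fits max_bytes."""
    entries = []
    for dirpath, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            entries.append((st.st_mtime, st.st_size, path))
    total = sum(size for _mtime, size, _path in entries)
    removed = []
    for _mtime, size, path in sorted(entries):  # oldest first
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed


# Demonstration on a throw-away directory with three 100-byte files.
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["old.conc", "mid.conc", "new.conc"]):
        path = os.path.join(tmp, name)
        with open(path, "wb") as fw:
            fw.write(b"x" * 100)
        os.utime(path, (1000 + i, 1000 + i))  # explicit, increasing timestamps
    removed = prune_cache(tmp, max_bytes=150)
    remaining = sorted(os.listdir(tmp))
```

A real solution would also have to make sure that no running calculation is still writing to or reading from a file before it is removed.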

Implement searchable keywords for configured corpora

  1. in config.xml, it should be possible to add 0..n keywords (or "tags") to any corpus
  2. corpora archive should be able to filter corpora list using these keywords

This may replace the tree-like structure in the corpus selection widget.
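
The two points above could look roughly like this. The sketch assumes a hypothetical `<keyword>` child element under each `<corpus>` element; the element names, the `ident` attribute, the corpus identifiers and the helper functions are all made up for illustration and do not describe the actual config.xml schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List, Set

# Hypothetical configuration fragment with 0..n keywords per corpus.
CONFIG = """
<corplist>
  <corpus ident="corpus_a"><keyword>written</keyword><keyword>synchronic</keyword></corpus>
  <corpus ident="corpus_b"><keyword>spoken</keyword><keyword>synchronic</keyword></corpus>
  <corpus ident="corpus_c"><keyword>parallel</keyword></corpus>
  <corpus ident="corpus_d" />
</corplist>
"""


def load_keywords(xml_source: str) -> Dict[str, Set[str]]:
    """Map each corpus identifier to the set of keywords attached to it."""
    root = ET.fromstring(xml_source)
    return {
        corp.get("ident"): {kw.text.strip() for kw in corp.iter("keyword") if kw.text}
        for corp in root.iter("corpus")
    }


def filter_corpora(index: Dict[str, Set[str]], required: Iterable[str]) -> List[str]:
    """Return corpora that carry all of the required keywords."""
    wanted = set(required)
    return sorted(ident for ident, kws in index.items() if wanted <= kws)


index = load_keywords(CONFIG)
synchronic = filter_corpora(index, ["synchronic"])
spoken_sync = filter_corpora(index, ["spoken", "synchronic"])
```

Requiring all selected keywords (set inclusion) is one design choice; matching any keyword instead would be an equally plausible reading of the request.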

Background calculation error: Unknown action: ~

2015-02-22 18:37:31,990 [conclib] ERROR: Background calculation error: Unknown action: ~
2015-02-22 18:37:32,093 [controller] ERROR: Traceback (most recent call last):
  File "/opt/noske/share/kontext/public/../lib/controller.py", line 818, in process_method
    self._invoke_action(method, pos_args, named_args, tpl_data))
  File "/opt/noske/share/kontext/public/../lib/controller.py", line 490, in _invoke_action
    ans = apply(action, args[1:], na)
  File "/opt/noske/share/kontext/public/../lib/actions.py", line 184, in view
    conc = self.call_function(conclib.get_conc, (self._corp(),))
  File "/opt/noske/share/kontext/public/../lib/controller.py", line 513, in call_function
    return apply(func, args, na)
  File "/opt/noske/share/kontext/public/../lib/conclib.py", line 415, in get_conc
    samplesize=samplesize, fullsize=fullsize, minsize=minsize)
  File "/opt/noske/share/kontext/public/../lib/conclib.py", line 362, in _get_async_conc
    raise e
Exception: Unknown action: ~

empty result - no message

In the development version, no message is shown when a query returns no result.
This may be a more general problem and needs further investigation.

Conc. persistence - store persistence level flag

Currently, only the initial TTL distinguishes between short and long persistence, which may (and already does) easily confuse the archiving script. In other words, there is no clear sign of when a conc. persistence record should be archived.
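
A minimal sketch of the proposed flag, assuming that "long" persistence is exactly the condition under which a record should be archived. The `Persistence` enum, the `ConcRecord` type and the `should_archive` helper are hypothetical and do not mirror the existing storage format.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Persistence(Enum):
    SHORT = "short"
    LONG = "long"


@dataclass
class ConcRecord:
    """Hypothetical conc. persistence record with an explicit level flag."""
    ident: str
    query: List[str]
    persistence: Persistence = Persistence.SHORT
    created: float = field(default_factory=time.time)


def should_archive(rec: ConcRecord) -> bool:
    """Decide from the explicit flag; the remaining TTL is not consulted."""
    return rec.persistence is Persistence.LONG


temporary = ConcRecord("abc123", ['aword,[word="ano"]'])
shared = ConcRecord("def456", ['aword,[word="ano"]'], persistence=Persistence.LONG)
to_archive = [rec.ident for rec in (temporary, shared) if should_archive(rec)]
```

With the flag stored in the record, the archiving script no longer has to infer the level from a TTL value that changes over time.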

transform "concache" into a plug-in

... and implement a Redis-based solution. This should bring automatic record locking and simplify the related code. The current solution should still be available (though as a plug-in).

KonText doesn't display the last page of a concordance
