
mtas's Introduction

Multi Tier Annotation Search

See textexploration.github.io/mtas/ for more documentation and instructions.


A Docker image providing a Solr-based demonstration scenario, with indexing and querying of some sample documents, is available. To pull and run:

docker pull textexploration/mtas
docker run -t -i -p 8080:80 --name mtas textexploration/mtas

Or, to build and run:

docker build -t mtas https://raw.githubusercontent.com/textexploration/mtas/master/docker/Dockerfile
docker run -t -i -p 8080:80 --name mtas mtas

This will provide a website with more information on port 8080 at the IP address of your Docker host.


This project builds upon the latest commit from April 30, 2018 for meertensinstituut/mtas. See also the related broker project, another continuation of previous work.


One of the primary use cases for Mtas, the Nederlab project, currently [1] provides access, both in terms of metadata and annotated text, to over 74 million items for search and analysis as specified below.

                  Total        Mean       Min   Max
Solr index size   2,715 G      60.3 G     75 k  288 G
Solr documents    74,762,559   1,661,390  119   11,912,415

Collections are added and updated regularly by adding new cores, replacing cores and/or merging new cores with existing ones. Currently, the data is divided over 44 separate cores. For 41,437,881 of these documents, annotated text varying in size from 1 to over 3.5 million words is included:

              Total            Mean    Min   Max
Words         18,494,454,357   446     1     3,537,883
Annotations   95,921,919,849   2,314   4     23,589,831
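
The Mean column for words and annotations can be reproduced from the totals. The sketch below assumes that the mean is taken over the 41,437,881 documents that include annotated text and that the fractional part is dropped; both points are inferred from the published figures rather than stated by the project. The class and method names are hypothetical.

```java
// Sanity check of the Mean column. The truncation rule is an assumption
// inferred from the published figures, not documented behaviour.
class AnnotatedTextMeans {

    // Number of documents that include annotated text, as stated above.
    static final long DOCUMENTS = 41_437_881L;

    // Integer division drops the fractional part of the mean.
    static long truncatedMean(long total) {
        return total / DOCUMENTS;
    }

    public static void main(String[] args) {
        System.out.println(truncatedMean(18_494_454_357L)); // words per document
        System.out.println(truncatedMean(95_921_919_849L)); // annotations per document
    }
}
```

Under these assumptions the two printed values are 446 and 2314, matching the table.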

Mtas is also used on Middelnederlands.nl, including geographical selections and new analysis options [2].

[Figure: example document]

Keyword in context

[Figure: example KWIC]

Group results

[Figure: example group]

Geographic conditions

[Figure: example geographic]

Correlation analysis

[Figure: example correlation]

Geographical analysis

[Figures: two example maps]


[1] Situation as of June 2018

[2] Release of April 2020

mtas's People

Contributors

betoboullosa, hayco, matthijsbrouwer, reckart

mtas's Issues

Question: How to search for unset values

Is it possible to search for features that have no value attached?

It is possible to search for a feature having any value, like so:

<layer.feature=""/>

But how can I search for layers with unset feature values? Such a search could look like one of these:

<layer.feature!=""/>
<!layer.feature=""/>
<layer.feature=none/>
<layer.feature=false/>

ids for s and w in FOLIA

I apologize if this is documented - I couldn't find it:

I am indexing a FOLIA corpus, to be queried via CQL. This works fine as far as "normal" annotations are concerned, i.e. I can query for (e.g.) POS or lemma on the token level, and also for annotations on the sentence level. However, it remains unclear to me how to account for the xml:id attribute on <s> and <w> elements. The XML looks like this:

<s class="line" xml:id="s3">
    <w xml:id="s3.w1">
        <t>are</t>
        <lemma class="be"/>
        <pos class="VBB"/>
    </w>
    <w xml:id="s3.w2">
        <t>you</t>
        <lemma class="you"/>
        <pos class="PNP"/>
    </w>
    <w xml:id="s3.w3">
        <t>ready</t>
        <lemma class="ready"/>
        <pos class="AV0"/>
    </w>
</s>

And I've tried several variants in the indexing configuration file such as:

<!-- id for the <w>-element -->
<token type="string" offset="false" realoffset="false" parent="false">
    <pre>
        <item type="string" value="word.id" />
    </pre>
    <post>
        <item type="attribute" name="#" />
    </post>
</token>

So far, I haven't been able to find or do anything with the xml:ids.

What I'd like to understand/do is:

  1. How to represent xml:id on both sentence and token level in the config file
  2. How to integrate them into a CQL query
  3. How to access the ids programmatically after having done a query

For (3), I currently test my attempts like so:

  // Prefixes to retrieve for each hit: the token text and the attempted word id
  List<String> prefixes = new ArrayList<>();
  prefixes.add("t");
  prefixes.add("word.id");
  List<CodecSearchTree.MtasTreeHit<String>> allHits =
      mtasCodecInfo.getPositionedTermsByPrefixesAndPositionRange(
          "content", index, prefixes, spans.startPosition(), spans.endPosition() - 1);
  // Order the hits by start position before printing them
  allHits.sort((o1, o2) -> Integer.compare(o1.startPosition, o2.startPosition));
  for (CodecSearchTree.MtasTreeHit<String> hit : allHits) {
      System.out.print(CodecUtil.termValue(hit.data) + "(" + hit.startPosition + ") / ");
  }

I'd be grateful if somebody could point me in the right direction. Thanks in advance.

CQL Support

Hi

It looks like MTAS doesn't support the full CQL language as currently specified.

For example, it seems to lack the ability to match a token on multiple attribute-value pairs, e.g. [t="dog" & pos="NN"], as described at https://www.sketchengine.eu/documentation/cql-basics/#boolean

The MTAS documentation also shows an example of doing this in a slightly different way (no square brackets) using something like:

t="dog" & POS="NN"

However, this appears to generate an exception in the CQL parser: "mtas.parser.cql.ParseException: Encountered "" at line 1, column 1.".

The closest I've come to doing this is using fullyalignedwith to repeatedly match each attribute value pair.

Thanks

Tony

Question: Wildcards for layers

Is it possible to use wildcards for layers?

Imagine having several layers with the same features and values:

<layer_1.feature="value">
<layer_2.feature="value">
<layer_3.feature="value">

Is it possible to search for all layers that have feature="value", e.g.

<*.feature="value">

At the moment this works only by chaining all possible layers with or (|):

<layer_1.feature="value"> | <layer_2.feature="value"> | <layer_3.feature="value">
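
Until a wildcard is available, the or-chained query can at least be generated rather than typed by hand. The following is a minimal sketch in plain Java; the class and method names are hypothetical and not part of MTAS, and it assumes that the layer names are known in advance and that the value contains no characters needing escaping.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of MTAS: builds the or-chained query text.
class LayerQueryBuilder {

    // Emits one <layer.feature="value"> condition per layer, joined with " | ".
    // No escaping is applied to the value (assumption: it needs none).
    static String anyLayer(List<String> layers, String feature, String value) {
        return layers.stream()
                .map(layer -> "<" + layer + "." + feature + "=\"" + value + "\">")
                .collect(Collectors.joining(" | "));
    }

    public static void main(String[] args) {
        System.out.println(
                anyLayer(List.of("layer_1", "layer_2", "layer_3"), "feature", "value"));
    }
}
```

This only automates the workaround; the resulting query is still evaluated as a disjunction over every listed layer.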

Help regarding configuration

Hi there,
I found your software after a short search online, and it seems likely to fit my needs. Unfortunately, I am a bit lost with the configuration for indexing, as I have rarely dealt with search engines.

I wrote some XML looking like the main example:

<text xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1">
<seg xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1">
<w xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1#w.1">
<t>Qualis</t>
<pos>PROint</pos>
<lemma>qualis2</lemma>
<morph>Case=Nom|Numb=Sing</morph>
</w>

But it's not clear to me what I should do from there... Can you point me to the relevant example or docs?

Question: limiting the number of results to speed up a query

Is it possible to limit the number of documents that are queried, to potentially speed up query resolution? We are working with large text corpora (more than 1 billion words) and would like to quickly obtain at most N results, preferably but not necessarily in random order. Our goal is to quickly provide some results, which should be enough in most cases. We currently use a list query to get a page of results (with the start: and number: parameters), but it is slow for queries with millions of matches. As far as we understand, this is because in such cases nearly all documents need to be queried, which takes several minutes on a single machine.
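
The behaviour asked for here is early termination: stop pulling hits as soon as N have been collected, so that later documents are never visited. The sketch below illustrates only that idea, with plain Java iterators standing in for lazily evaluated per-segment hit streams. All names are hypothetical, and it makes no claim about whether the MTAS list query can be made to behave this way.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: plain iterators stand in for per-segment hit streams.
class FirstNHits {

    // Collects at most `limit` hits, pulling from each source lazily and
    // returning as soon as the limit is reached.
    static <T> List<T> firstN(List<Iterator<T>> sources, int limit) {
        List<T> hits = new ArrayList<>();
        if (limit <= 0) {
            return hits;
        }
        for (Iterator<T> source : sources) {
            while (source.hasNext()) {
                hits.add(source.next());
                if (hits.size() == limit) {
                    return hits; // remaining hits and sources are never touched
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Iterator<String>> segments = List.of(
                List.of("hit-1", "hit-2").iterator(),
                List.of("hit-3", "hit-4", "hit-5").iterator());
        System.out.println(firstN(segments, 3));
    }
}
```

The trade-off is that the total number of matches stays unknown and the returned hits come from whichever sources are visited first, so they are not a random sample.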

Question: How to query programmatically

The code we are currently using in INCEpTION to perform an MTAS search looks pretty complicated - but I am pretty sure that is the way we were told that querying MTAS would work:

    private static void doQuery(IndexReader indexReader, String field, MtasSpanQuery q,
            List<String> prefixes)
        throws IOException
    {
        ListIterator<LeafReaderContext> iterator = indexReader.leaves().listIterator();
        IndexSearcher searcher = new IndexSearcher(indexReader);
        final float boost = 0;
        SpanWeight spanweight = q.rewrite(indexReader).createWeight(searcher, false, boost);

        while (iterator.hasNext()) {
            LeafReaderContext lrc = iterator.next();
            Spans spans = spanweight.getSpans(lrc, SpanWeight.Postings.POSITIONS);
            SegmentReader segmentReader = (SegmentReader) lrc.reader();
            Terms terms = segmentReader.terms(field);
            CodecInfo mtasCodecInfo = CodecInfo.getCodecInfoFromTerms(terms);
            if (spans != null) {
                while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
                ...

But normally, querying Lucene would use something like searcher.search(q, myCollector), and there are also search signatures which would, e.g., allow for sorting results. So I was wondering (I haven't tried it yet): can the search/collector approach really not be used with MTAS? If not, why? And if it can be used, does anybody have an example of it?

Unable to read index after MTAS upgrade

Lucene is able to work with indexes created with older versions of Lucene.

However, when upgrading MTAS, say from 7.7.1.0 to 8.11.1.0, an exception is generated when trying to open the index:

2022-07-01 20:33:06 [main] ERROR MtasDocumentIndex - Unable to read MTAS index: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
org.apache.lucene.index.CorruptIndexException: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
	at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:208) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:95) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1037) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
...
