textexploration / mtas

Multi Tier Annotation Search

Home Page: https://textexploration.github.io/mtas/

License: Apache License 2.0

CSS 0.03% HTML 1.63% JavaScript 0.13% Java 97.81% Dockerfile 0.40%

solr lucene annotations cql text-analysis distributed text big-data structure search

mtas's Introduction

Multi Tier Annotation Search

See textexploration.github.io/mtas/ for more documentation and instructions.

A Docker image providing a Solr-based demonstration scenario, with indexing and querying of some sample documents, is available. To pull and run:

docker pull textexploration/mtas
docker run -t -i -p 8080:80 --name mtas textexploration/mtas

Or to build and run:

docker build -t mtas https://raw.githubusercontent.com/textexploration/mtas/master/docker/Dockerfile
docker run -t -i -p 8080:80 --name mtas mtas

This serves a website with more information on port 8080 of your Docker host's IP address.

This project builds upon the latest commit from April 30, 2018 for meertensinstituut/mtas. See also the related broker project, another continuation of previous work.

One of the primary use cases for Mtas, the Nederlab project, currently¹ provides access, both in terms of metadata and annotated text, to over 74 million items for search and analysis as specified below.

	Total	Mean	Min	Max
Solr index size	2,715 G	60.3 G	75 k	288 G
Solr documents	74,762,559	1,661,390	119	11,912,415

Collections are added and updated regularly by adding new cores, replacing cores and/or merging new cores with existing ones. Currently, the data is divided over 44 separate cores. For 41,437,881 of these documents, annotated text varying in size from 1 to over 3.5 million words is included:

	Total	Mean	Min	Max
Words	18,494,454,357	446	1	3,537,883
Annotations	95,921,919,849	2,314	4	23,589,831
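As a quick sanity check, the means in the table above can be reproduced by dividing each total by the 41,437,881 documents that include annotated text. The sketch below uses integer division, which truncates; that the table truncates rather than rounds is an inference from the numbers, not something the source states. The class and method names are hypothetical.

```java
// Minimal sketch: recompute the per-document means from the totals in the table.
// NederlabMeans and meanPerDocument are illustrative names, not part of Mtas.
final class NederlabMeans {
    // Number of documents with annotated text, as stated in the text above.
    static final long DOCS_WITH_TEXT = 41_437_881L;

    // Integer division: truncates the fractional part.
    static long meanPerDocument(long total) {
        return total / DOCS_WITH_TEXT;
    }

    public static void main(String[] args) {
        System.out.println("Words per document:       " + meanPerDocument(18_494_454_357L));
        System.out.println("Annotations per document: " + meanPerDocument(95_921_919_849L));
    }
}
```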

Mtas is also used on Middelnederlands.nl, including geographical selections and new analysis options.²

[Screenshots: keyword in context, group results, geographic conditions, correlation analysis, geographical analysis]

¹ situation June 2018

² release April 2020

mtas's People

hayco mwasiluk ycgoodluck zentrum-lexikographie reckart

mtas's Issues

Question: How to search for unset values

Is it possible to search for features that have no value attached?

It is possible to search for a feature having any value, like so:

<layer.feature=""/>

But how can I search for layers with unset feature values? Such a search could look like one of these:

<layer.feature!=""/>
<!layer.feature=""/>
<layer.feature=none/>
<layer.feature=false/>

ids for s and w in FOLIA

I apologize if this is documented - I couldn't find it:

I am indexing a FOLIA corpus, to be queried via CQL. This works fine as far as "normal" annotations are concerned, i.e. I can query for (e.g.) POS or lemma on the token level, and also for annotations on the sentence level. However, it remains unclear to me how to account for the xml:id attribute on <s> and <w> elements. The XML looks like this:

<s class="line" xml:id="s3">
            <w xml:id="s3.w1">
                <t>are</t>
                <lemma class="be"/>
                <pos class="VBB"/>
            </w>
            <w xml:id="s3.w2">
                <t>you</t>
                <lemma class="you"/>
                <pos class="PNP"/>
            </w>
            <w xml:id="s3.w3">
                <t>ready</t>
                <lemma class="ready"/>
                <pos class="AV0"/>
            </w>
</s>

And I've tried several variants in the indexing configuration file such as:

<!-- id for the <w>-element -->
<token type="string" offset="false" realoffset="false" parent="false">
             <pre>
                  <item type="string" value="word.id" />
               </pre>
                <post> 
                    <item type="attribute" name="#" />
                 </post>
</token>

So far, I haven't been able to find or do anything with the xml:ids.

What I'd like to understand/do is:

1. How to represent xml:id on both sentence and token level in the config file
2. How to integrate them into a CQL query
3. How to access the ids programmatically after having done a query

For (3), I currently test my attempts like so:

  List<String> prefixes = new ArrayList<>();
  prefixes.add("t");
  prefixes.add("word.id");
  List<CodecSearchTree.MtasTreeHit<String>> allHits 
          = mtasCodecInfo.getPositionedTermsByPrefixesAndPositionRange("content", index, prefixes, spans.startPosition(), 
              spans.endPosition()-1);
  allHits.sort((MtasTreeHit<String> o1, MtasTreeHit<String> o2) -> Integer.compare(o1.startPosition, o2.startPosition));
  for (CodecSearchTree.MtasTreeHit<String> hit : allHits){
      System.out.print(CodecUtil.termValue(hit.data) + "(" + hit.startPosition + ")" +  " / " );
  }

I'd be grateful if somebody could point me in the right direction. Thanks in advance.

CQL Support

It looks like MTAS doesn't support the full CQL language. This is based on the current specifications.

For example, the ability to search on a span using multiple attribute-value pairs, e.g. [t="dog" & pos="NN"], as per https://www.sketchengine.eu/documentation/cql-basics/#boolean

The MTAS documentation also shows an example of doing this in a slightly different way (no square brackets) using something like:

t="dog" & POS="NN"

However, this appears to cause an exception in the CQL parser: "mtas.parser.cql.ParseException: Encountered "" at line 1, column 1.".

The closest I've come to doing this is using fullyalignedwith to repeatedly match each attribute value pair.

Thanks

Tony

Question: Wildcards for layers

Is it possible to use wildcards for layers?

Imagine having several layers with the same features and values:

<layer_1.feature="value">
<layer_2.feature="value">
<layer_3.feature="value">

Is it possible to search for all layers that have feature="value", e.g.

<*.feature="value">

At the moment this works only by chaining all possible layers with or (|):

<layer_1.feature="value"> | <layer_2.feature="value"> | <layer_3.feature="value">
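Until wildcards for layers are available, the or-chained query can at least be generated rather than written by hand. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, the layer names must be known in advance, and the value is inserted as-is without any escaping. It only builds the query string in the form shown in this issue; it does not call any Mtas API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds one <layer.feature="value"> term per layer
// and joins the terms with the CQL "or" operator (|).
final class LayerQueryBuilder {

    static String anyLayer(List<String> layers, String feature, String value) {
        return layers.stream()
                // Assumption: value contains no characters that need escaping.
                .map(layer -> "<" + layer + "." + feature + "=\"" + value + "\">")
                .collect(Collectors.joining(" | "));
    }

    public static void main(String[] args) {
        List<String> layers = List.of("layer_1", "layer_2", "layer_3");
        System.out.println(anyLayer(layers, "feature", "value"));
    }
}
```

The resulting string can then be passed to the CQL parser in the same way as a handwritten query.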

Help regarding configuration

Hi there,
I found your software after a short search online, and it seems likely to fit my needs. Unfortunately, I have rarely dealt with search engines and I am a bit lost with the configuration for indexing.

I wrote some XML looking like the main example:

<text xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1">
<seg xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1">
<w xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1#w.1">
<t>Qualis</t>
<pos>PROint</pos>
<lemma>qualis2</lemma>
<morph>Case=Nom|Numb=Sing</morph>
</w>

But it's not clear to me what I should do from there... Can you point me to the relevant example or docs?

Multi-position query without postfix

It would be nice if searching for <field/> had the same effect as searching for <field=".*"/>.

Question: limiting the number of results to speed up a query

Is it possible to limit the number of documents that are queried, to potentially speed up query resolution? We are working with large text corpora (more than 1 billion words) and would like to quickly obtain at most N results, preferably but not necessarily in random order. Our goal is to quickly provide some results, which should be enough for most cases. We currently use a list query to get a page of results (with the start: and number: parameters), but it is slow for queries with millions of matches. As far as we understand, this is because in such cases nearly all documents need to be queried, which takes several minutes on a single machine.

Question: How to query programmatically

The code we are currently using in INCEpTION to perform an MTAS search looks pretty complicated - but I am pretty sure that is the way we were told that querying MTAS would work:

    private static void doQuery(IndexReader indexReader, String field, MtasSpanQuery q,
            List<String> prefixes)
        throws IOException
    {
        ListIterator<LeafReaderContext> iterator = indexReader.leaves().listIterator();
        IndexSearcher searcher = new IndexSearcher(indexReader);
        final float boost = 0;
        SpanWeight spanweight = q.rewrite(indexReader).createWeight(searcher, false, boost);

        while (iterator.hasNext()) {
            LeafReaderContext lrc = iterator.next();
            Spans spans = spanweight.getSpans(lrc, SpanWeight.Postings.POSITIONS);
            SegmentReader segmentReader = (SegmentReader) lrc.reader();
            Terms terms = segmentReader.terms(field);
            CodecInfo mtasCodecInfo = CodecInfo.getCodecInfoFromTerms(terms);
            if (spans != null) {
                while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
                ...

But normally, querying Lucene would use something like searcher.search(q, myCollector), and there are also search signatures which would e.g. allow for sorting results. So I was wondering (I haven't tried it yet): can the search/collector approach really not be used with MTAS? If not, why? And if it can be used, does anybody have an example for it?

MTAS for Lucene 9.x

Will there be MTAS versions based on Lucene 9.x?

Unable to read index after MTAS upgrade

Lucene is able to work with indexes created with older versions of Lucene.

However, when upgrading MTAS, say from 7.7.1.0 to 8.11.1.0, an exception is thrown when trying to open the index:

2022-07-01 20:33:06 [main] ERROR MtasDocumentIndex - Unable to read MTAS index: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
org.apache.lucene.index.CorruptIndexException: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
	at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:208) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:95) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1037) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
...
