annotation / text-fabric
File format, model, API, and apps for manipulating text and its annotated features
License: MIT License
It would be nice to see `B.show` results in a randomized order. Though this is doable with an extra `random.shuffle` line, I find myself needing to do this quite often.
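Until such an option exists, the workaround can be wrapped in a small helper. This is only a sketch: `results` stands in for the tuples that `S.search(query)` returns, and the commented `B.show` call is how it would be used in a notebook.

```python
import random

# Hypothetical stand-in for the tuples that S.search(query) returns.
results = [(101, 201), (102, 202), (103, 203), (104, 204)]

def randomized(results, seed=None):
    """Return a shuffled copy of the results, leaving the original order intact."""
    rng = random.Random(seed)
    out = list(results)
    rng.shuffle(out)
    return out

sample = randomized(results, seed=42)
# In a notebook one would then call: B.show(sample, end=10)
```

Using a seed keeps the random sample reproducible across notebook runs.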
TF can import from and export to XML.
It is desirable that TF can import corpora from other formats as easily as possible.
Yet, input formats differ widely.
Make an intermediate format that TF can consume and that is easy to generate from e.g. TEI, other XML and other database formats.
Write an import module for this intermediate format.
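To make the idea concrete, here is a purely illustrative sketch of what a minimal line-based intermediate format could look like and how trivially it parses. Nothing here is an existing TF format; the record layout (`nodeType<TAB>feature=value;...`) is invented for the example.

```python
# Purely illustrative: one record per line, "nodeType<TAB>feature=value;...".
# This is NOT an existing TF format; it only shows that a flat intermediate
# format is easy to generate from TEI/XML exporters and easy to consume.
def parse_intermediate(text):
    records = []
    for line in text.strip().splitlines():
        node_type, _, feats = line.partition("\t")
        features = dict(f.split("=", 1) for f in feats.split(";") if f)
        records.append((node_type, features))
    return records

sample = "word\tlex=DBR;sp=verb\nword\tlex=JHWH;sp=nmpr"
records = parse_intermediate(sample)
```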
I'm getting an out-of-memory error message when I load the bhsa corpus. (I am able to load the syrnt corpus.) I would think that my machine has enough memory. As shown in the screenshot below, I had 4.7 GB available when it failed to run. (The bump you see in the graph is where I tried to open the bhsa corpus.) I don't know if this is a text-fabric error, or if I truly don't have enough memory to display the corpus. If the latter is the case, it would be good if the memory requirements were documented more specifically.
```
PS C:\Users\Adam\Desktop> text-fabric
This is Text-Fabric 6.3.1
specify data source [bhsa/peshitta/cunei/syrnt] > bhsa
Cleaning up remnant processes, if any ...
Loading data for bhsa. Please wait ...
Setting up TF kernel for bhsa
Using bhsa-c r1.4 in ~/text-fabric-data/etcbc/bhsa/tf/c
Using phono-c r1.1 in ~/text-fabric-data/etcbc/phono/tf/c
Using parallels-c r1.1 in ~/text-fabric-data/etcbc/parallels/tf/c
  | 1.02s T number from C:\Users\Adam/text-fabric-data/etcbc/bhsa/tf/c
 4.68s Not all features could be loaded/computed
TF is out of memory!
If this happens and your computer has more than 3GB RAM on board:
    Close all other programs and try again.
```
Suppose you want to look for the first clause after an earlier clause, but not necessarily tightly adjacent. Can we define a spatial operator for that?
This is in response to an ongoing request by Cody Kingham.
Request by Oliver Glanz:
I would like to see how to define the relation between actual node attributes of different nodes.
Something like:

```
word lex=L1
=: word lex=L2
L1=L2
```
For example, how can you express this MQL query?
```
select all objects where
[clause
  [word FOCUS vt IN (perf, impf, wayq, impv)
    [word AS infalex]
  ]
  [word FOCUS vt IN (infa)
    [word lex = infalex.lex]
  ]
]
```
The Cunei data contains image files with names such as `|(HIx1(N57))&(HI+1(N57))|.jpg`. But the `|` in file names is not compatible with Windows.
There is an `A.prettySetup`. I would like an `A.showSetup` as well, because I'm calling `A.show` a lot in my notebook, and constantly have to specify what I want condensed, that I want nodes, etc.
Currently, when writing search templates you cannot express that the value of one feature on one node is equal to the value of another feature on another node.
For example, you might want to look for clauses in which the verb in a Pred phrase has the same number as the noun in a Subj phrase. Let's do it only for 1-word Pred and Subj phrases.
```
clause
  phrase function=Pred
    =: v:word pdp=verb
    :=
  phrase function=Subj
    =: s:word pdp=subs nu=v.nu
    :=
```
The bit `nu=v.nu` is not possible currently.
We also would like to be able to say in the template above `s.gn = v.gn` (`gn` is the gender feature). This is also not supported yet.
It would be nice to have both syntaxes working in TF search.
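Until the template language supports this, the comparison can be done as a post-filter outside the template. A sketch, with a plain dict standing in for TF's `F.nu.v(node)` lookup and made-up node numbers:

```python
# Hypothetical feature lookup standing in for TF's F.nu.v(node).
nu = {11: "sg", 12: "pl", 21: "sg", 22: "pl"}

# Hypothetical results: (clause, predPhrase, v, subjPhrase, s) tuples,
# as a template without the nu=v.nu constraint would deliver them.
results = [
    (1, 5, 11, 6, 21),   # v is sg, s is sg -> agree
    (2, 7, 12, 8, 22),   # pl / pl -> agree
    (3, 9, 12, 10, 21),  # pl / sg -> disagree
]

# Keep only results where the verb and the subject noun agree in number.
agreeing = [r for r in results if nu[r[2]] == nu[r[4]]]
```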
`B.prettyTuple` and `B.show` currently display results in the context of their enclosing verse. I would like to be able to request a different maximum-sized object, such as a sentence or clause.
Perhaps this is difficult to implement, since the maximum object should fully enclose the results. If I request a maximum size of `phrase` on a tuple containing clause nodes, what should happen? One option would be to simply display all contained phrases without sentence or clause boundaries. This would look similar to the Bible Online Learner view when clause/sentence boundaries are toggled off but phrase boundaries are on. So if I toggle `word`, for instance, I get a bunch of word boxes with no phrase boundaries.
Going up could behave similarly. If I request `clause`, then I get enclosing clauses for all attested clauses in the tuple.
Perhaps this could be toggled in `B.prettySetup`.
In the TF-Browser you can export your query results to Excel-friendly tab-separated files.
This should also be possible from the API that runs in a notebook.
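From the API, such an export can be sketched with the standard `csv` module. The column layout below (one node per column, a header row) is an assumption for illustration, not the browser's exact format:

```python
import csv
import io

# Sketch: write result tuples as an Excel-friendly tab-separated file.
# The header names and node numbers are made up for the example.
def export_tsv(results, headers, out):
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(results)

buf = io.StringIO()  # in a notebook this would be open("results.tsv", "w")
export_tsv([(427553, 651573), (427554, 651574)], ["clause", "word"], buf)
```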
I want to revisit the issue of multiple names in quantifiers. There are certain relationships which the present implementation is not able to express.
Here is an example template that is currently not possible. There are two words in a phrase. And we want to specify that a preposition should not intervene between them:
```
phrase
  w1:word
  [...]
  w2:word
/without/
  w1
  < word pdp=prep
  < w2
/-/
```
In the current implementation, these kinds of relation expressions can only be approximated by repeating all of the specifications under w1
under a quantifier:
```
phrase
  w1:word
  [...]
  w2:word
/without/
  phrase
    w1:word
    [...]
    < word pdp=prep
    w2
/-/
```
In cases where `[...]` contains lots of specs, this becomes unwieldy. Furthermore, there is no guarantee in the above pattern that the `w1` in the phrase template will be identical to the `w1` in the quantifier template.
I know that there are major difficulties to implementing multiple names in quantifiers. However, if each name is assigned to a node in an outer level, perhaps it is possible to pass those nodes as sets to the inner levels, so that a `w1` defined in level 0 can be directly referred to in level 1. Or perhaps there is yet a better way.
Text-Fabric Browser should have an option to upload node ID / feature pairings.
For instance, if I have a .csv file where one column is nodes and another is a string feature, I should be able to click an upload button in TF Browser and upload that .csv into the corpus for further exploration and analysis.
Probably those exported files would get pushed to a new directory on disk? A user could then upload to Github when they feel their features are in a finished state.
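The core of such an upload is turning a two-column CSV into a node-to-value mapping, the kind of dict a TF feature is built from. A sketch with a made-up column layout (node number, string value; rows with an empty value skipped):

```python
import csv
import io

# Sketch: parse an uploaded two-column CSV (node, value) into a feature dict.
# The column layout is an assumption about what such an upload could look like.
def feature_from_csv(csv_text):
    reader = csv.reader(io.StringIO(csv_text))
    return {int(node): value for node, value in reader if value}

uploaded = "427553,narrative\n427554,discourse\n427555,\n"
feature = feature_from_csv(uploaded)
```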
Hello @dirkroorda,
At Software Heritage we really liked your idea of putting a SWH badge on the README.
We liked it so much that we created three types of badges that you can find on the permalink tab on the right :-)
The archived repository badge:
Coded with:
[![SWH](https://archive.softwareheritage.org/badge/origin/https://github.com/annotation/text-fabric/)](https://archive.softwareheritage.org/browse/origin/https://github.com/annotation/text-fabric/)
The archived release badge:
Coded with:
[![SWH](https://archive.softwareheritage.org/badge/swh:1:rel:4f61758988dfa0479b69f790f2f857738574e3e8;origin=https://github.com/annotation/text-fabric/)](https://archive.softwareheritage.org/swh:1:rel:4f61758988dfa0479b69f790f2f857738574e3e8;origin=https://github.com/annotation/text-fabric/)
And the badges to a specific artifact identified with the swh-id,
for example a snapshot with all branches and releases captured in April 2019:
Coded with:
[![SWH](https://archive.softwareheritage.org/badge/swh:1:snp:1de172a10590eed2f1e3dcd049f13b4ed7d72bd0;origin=https://github.com/annotation/text-fabric/)](https://archive.softwareheritage.org/swh:1:snp:1de172a10590eed2f1e3dcd049f13b4ed7d72bd0;origin=https://github.com/annotation/text-fabric/)
Hope you like the result.
Cheers,
Morane
We have a node pad and a search pad.
Add a passage pad above the node pad, where you can enter a passage, such as Genesis 3:16 or P30518
A click on the passage label of a result should add the passage to the passage pad.
This way you can start with a passage, expand it, switch node numbers on, click on them to add nodes to the node pad, combine them to make a few tuples. So, on top of your query results, you can easily collect some special cases and examples.
The data for the BHSA is in a GitHub repo. Cloning it means downloading 5 GB of data.
The typical user needs only 25 MB of zipped TF data.
For the Cunei data something similar holds: the repo is 1.5 GB, but the TF data is 1.6 MB.
The Cunei user also might want the image files (derived from CDLI), which are roughly 500 MB.
In notebooks, you can customize the highlight colors.
Give Text-Fabric-Browser users that possibility as well.
NB It is not possible to provide the full flexibility of the highlights parameter in this context. But the colormap parameter can easily be done.
In MQL there is a NOTEXIST operator. We want something like that in TF as well.
Here is a proposal to add Quantifiers in search. They are a generalization of NOTEXIST.
A few examples:

```
clause
all:
  phrase function=Pred
have:
  word sp=verb
```
This searches for clauses in which all Pred-phrases have a verb in them.
The `all: ... have: ...` part is called a quantifier template.
Quantifier templates do not contribute result nodes to a tuple.
There will be one other quantifier: `no: ...`.

```
clause
no:
  phrase function=Pred
```

This results in all clauses which do not have a Pred-phrase in them.
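The intended semantics of both quantifiers can be pinned down with ordinary Python over toy data. This is only an illustration with made-up clause/phrase structures, not TF code. Note that a clause with no Pred-phrase at all satisfies the `all: ... have: ...` condition vacuously:

```python
# Toy data: clause -> list of (function, has_verb) pairs for its phrases.
phrases = {
    1: [("Pred", True), ("Subj", False)],  # every Pred has a verb
    2: [("Pred", False)],                  # a Pred without a verb
    3: [("Subj", False)],                  # no Pred at all (vacuously true)
}

# clause  all: phrase function=Pred  have: word sp=verb
all_have = [c for c, ps in phrases.items()
            if all(verb for fn, verb in ps if fn == "Pred")]

# clause  no: phrase function=Pred
no_pred = [c for c, ps in phrases.items()
           if not any(fn == "Pred" for fn, verb in ps)]
```

The vacuous case (clause 3 appearing in `all_have`) is worth deciding on explicitly when specifying the quantifier.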
You may have multiple quantifiers, combine them with ordinary templates and nest them.
Here is a pretty meaningless example
```
clause
all:
  p1:phrase
    all:
      w1:word
    have:
      w2:word sp=verb
      w1 = w2
have:
  p2:phrase function=Pred
  <: phrase function=Subj
  p1 = p2
phrase function=Obj
```
In words:
Results are tuples (clause, phrase) where the phrase is an Obj within the clause.
But only for clauses that satisfy this condition:
all phrases that consist only of verbs have function=Pred and are immediately followed by a Subj phrase.
Text-Fabric uses a lot of memory. If you want to use TF in a webserver, it is not feasible to load TF for every request.
Rather, you would like to start TF once, with the webserver, let the webserver initialize TF, and pass the TF object to the code that processes requests.
But I am not sure whether this works, and how TF behaves if it is used by multiple requests at the same time.
This whole scenario needs to be explained.
In a notebook you can call prettySetup to add features to the pretty displays. You need this when you have features of your own that you have included in your custom dataset.
Give users of the Text-Fabric browser the possibility to use `prettySetup` as well.
Is it currently possible to indicate that template lines should not appear in results, i.e. are non-matching in the sense of non-matching regular expression groups (`(?:...)`)?
In some cases this can be simulated with quantifiers:
```
verse
  non-matching word lex=KL/
```

would be identical to:

```
verse
/with/
  word lex=KL/
/-/
```
However, I don't see how to express the following:

```
verse
  non-matching clause
    word language=Aramaic lex=HWH[
    word vt=ptca
```

Perhaps it can be done with something like this, but that looks cumbersome:

```
verse
/with/
  clause
    w1:word
    w2:word
    w1=hwy
    w2=ptc
/-/
hwy:word language=Aramaic lex=HWH[
ptc:word vt=ptca
```
TF supports two or three section levels.
The TF browser can work with two or three section levels.
Make the number of levels completely free: 0, 1, 2, 3, 4, ...
Take care that the TF browser honours all section levels.
Define what the TF browser should do if there are no section levels.
When we progressively add new corpora to TF, we'll need this flexibility.
We want to have the possibility of using custom sets in search, like so:

```python
query = '''
clause
  gapphrase
    word sp=verb
'''
```

Here `gapphrase` is not a node type in the data set, but refers to a node set that you have constructed yourself.
You can then search with this template by passing a dictionary of sets to `search()`:

```python
gappedPhrases = ... code for finding the gapped phrases ...
results = S.search(query, sets=dict(gapphrase=gappedPhrases))
```

Sets like `gappedPhrases` may or may not be constructed with `search()`.
If you use `search()`, it is handy to get the set of top-level nodes instead of the full list of tuples.
E.g. we want the set of clauses that consist exactly of a Pred and a Subj phrase:
```python
query = '''
clause
  =: phrase function=Pred
  <: phrase function=Subj
  :=
'''
```

Then you can say:

```python
simpleClauses = S.search(query, shallow=True)
```

and now `simpleClauses` is a node set, ready for use as a custom set in a query.
When the TF data is loaded by the TF browser for the first time, and there is too little RAM, a MemoryError is raised. After that, the generated binary data of TF is corrupt.
For situations where the user can free more RAM, they need a convenient way to move the corrupt data out of the way.
In a Jupyter notebook, one can say `TF.clearCache()`, but we need a method that is easier to invoke, and that picks exactly the right datasets.
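The "pick exactly the right datasets" part boils down to locating the compiled-data directories. A minimal sketch, under the assumption that the compiled binary data lives in directories named `.tf` under the data root (the directory name is an assumption here):

```python
from pathlib import Path

# ASSUMPTION: compiled binary data lives in directories named ".tf"
# somewhere under the data root. Find them so they can be cleared.
def find_compiled_dirs(base):
    return sorted(p for p in Path(base).rglob(".tf") if p.is_dir())
```

A convenience command could then list these directories and offer to rename or delete them, instead of requiring the user to hunt them down by hand.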
I would like a silence argument for `Bhsa` to keep things clean in my notebook opening statements.
Suppose you have a feature with values like `a.1|b.2`, and you want to search for exactly this value.
Then you enter "escape hell", because your search string will be parsed by Python, TF, and the regular expression engine. So how many backslashes do you need for escaping the `|` and the `.`?
It turned out: one for `.` and three for `|`. There was also a bug. So this is a very unsatisfactory situation.
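Within the Python layer at least, `re.escape` removes the need to count backslashes by hand, since it escapes every regex metacharacter in one step. The TF template layer adds its own parsing on top, which is exactly the layering complained about above:

```python
import re

value = "a.1|b.2"
pattern = re.escape(value)  # escapes both . and | in one step

exact = re.fullmatch(pattern, "a.1|b.2")
near_miss = re.fullmatch(pattern, "aX1|b.2")  # '.' now matches only literally
```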
Today it occurred that I changed two TF apps for some reason, and made them rely on a new feature in the TF applib, the library in TF that helps the apps to work.
I updated the TF apps, but waited a bit with pushing the main TF.
A user of that app fell victim to the app auto-updating itself and then failing because the new version of TF was not yet online.
In order to mitigate scenarios like this, the TF applib should get a separate version number, and apps should state which version they need to run.
In this case, because of the new feature, I would have bumped the applib version, and made the two TF apps require the new version at least.
When TF loads an app, it can see the applib version requirement of the TF app, inspect its own applib version, and if there is a mismatch, it can issue a friendly warning and hint.
Make a left sidebar, expandable, like in Jupyter Lab. Use it for:
• opening a previous query
• exporting and filling out metadata
• help sections
Apps are part of the TF code base.
When the number of apps grows, they must be weaned from the TF core code.
The logical separation is already good, but it should be extended to infrastructural separation, so that apps can be developed in several repos by several people, in the ideal case people that are intimate with the corpus in question.
Is it possible to extend the search template language to allow `#` for features?
For example, I want to find sentences where אמר is followed by some ל+infinitive construct, but want to exclude the infinitive construct of אמר. I would then like to use the query below (or is there an easier way to do this with the current search template syntax?):

```
sentence
  clause
    word lex=>MR[
  < clause
    word lex=L
    <: word vt=infc lex#>MR[
```
Now you can also pass a dictionary of highlights to the plain() display. But you can only highlight slots within the text, not higher concepts; for that you need pretty().
Yet, it is not too difficult to highlight the slots of, say, phrases within sentences.
And, right now, all highlighting in plain() comes out as yellow. It would be nice if the colours you put in the highlights dict were respected in the display.
The display of the Uruk corpus could be better. It has 6000 tablets; each tablet has just a few columns, and each column a few lines. This section structure is not pleasant to browse.
When converting data to TF, you have to create an `oslots` feature that maps non-slot nodes to subsets of slots. All non-slot nodes should be mapped, and no slot nodes should be mapped.
If this requirement is violated, you get weird errors when TF tries to load the new dataset.
It is near impossible to find out what is wrong from the error messages.
When oslots is read and written, a clear sanity check should be performed.
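Such a sanity check is a simple pass over the node range. A sketch, under the assumptions that nodes are numbered `1..max_node`, slots are `1..max_slot`, and `oslots` maps node to a set of slots:

```python
# Sketch of the requested sanity check, not TF's actual implementation.
# ASSUMPTIONS: nodes are 1..max_node, slots are 1..max_slot,
# and oslots maps node -> set of slot numbers.
def check_oslots(oslots, max_slot, max_node):
    problems = []
    for n in range(1, max_node + 1):
        mapped = oslots.get(n)
        if n <= max_slot and mapped is not None:
            problems.append((n, "slot node has an oslots entry"))
        elif n > max_slot and not mapped:
            problems.append((n, "non-slot node is not mapped to any slots"))
    return problems

# Node 3 is a slot that is wrongly mapped; node 5 is a non-slot without slots.
issues = check_oslots({3: {3}, 4: {1, 2}}, max_slot=3, max_node=5)
```

Reporting the offending node numbers directly would make the errors described above much easier to diagnose.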
Add an input field for node tuples that you want displayed.
So in addition to searching (which delivers node tuples), you can also paste in your own tuples.
If you have run a search, make a button that copies the currently displayed results as tuples into an input box. Every tuple appears as a comma separated list of numbers on its own line.
The interface will display the combined results of the given tuples and the search results.
The resulting set can still be condensed by the user ad libitum.
Add a description field to the TF browser, so that when you save results as a pdf, you can enter an explanation.
I can also add some metadata (time of running the query) and some statistics (the number of nodes in the total results, split out per node type, compared with the totals in the whole corpus).
That way the PDF becomes a "proof of work".
The current browser search icons are a bit confusing (recycle sign and gear). In demo'ing TF to several people this has caused some confusion. A google search for "search icon" gives results for magnifying glasses instead. I think this would be a better fit.
BHSA app load with silent=True now produces a readout. Before the latest update it was quiet.
It would be nice if `end=` also produced a statement similar to pandas DataFrames when a df extends beyond the limit, e.g. "22 more results not shown" or something like this. But this may be too minor to warrant addition. Feel free to ignore.
Perhaps I'm overlooking something, but it seems there is no easy way to get the next n words using the locality API. You could chain `L.n`, but then you need to check continuously whether you have reached the last word or not. Would it be useful to add functionality to get a Python generator with `yield` that continues yielding next items until none are left?
I'm happy to implement it myself and create a PR, but wanted to check whether you think this is something useful to have in text-fabric, and whether it would be better to extend the current `L.u` and `L.p` or create new methods.
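The proposed generator can be sketched independently of TF. Here `nxt` stands in for whatever "give me the next word" primitive is used (the issue mentions chaining `L.n`); it is a plain function that returns `None` at the end of the corpus:

```python
# Sketch of the requested generator; `nxt` is a stand-in for the locality
# primitive (e.g. a wrapper around L.n) that returns None past the last word.
def following(node, nxt, limit=None):
    """Yield successive next nodes until none are left (or `limit` is reached)."""
    count = 0
    cur = nxt(node)
    while cur is not None and (limit is None or count < limit):
        yield cur
        count += 1
        cur = nxt(cur)

# Toy corpus: words 1..5 in a row.
nxt = lambda n: n + 1 if n < 5 else None
```

Because it is lazy, the caller can take as many words as needed (`itertools.islice`) without the end-of-corpus bookkeeping.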
There are various relaxed spatial relations, see nearness relations.
• `A <k: B` meaning A adjacent before B
• `A =k: B` meaning A and B start at the same slot
• `A :k= B` meaning A and B end at the same slot
• `A :k: B` meaning A and B have the same boundary slots
but up to a fuzziness of k slots.
The first relation is asymmetric, the others are symmetrical.
The k-fuzziness leads to unexpected cases:
If `A = [1, 2, 3]` and `B = [5, 6, 7]`, it holds that `A <1: B`.
And if `C = [2, 4, 5]`, it also holds that `A <2: C`; now the fuzziness works in the other direction.
It even holds that `A <3: A`, strangely enough.
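All three surprising cases are consistent with reading `A <k: B` as "the gap between A's last slot and B's first slot deviates from perfect adjacency by at most k, in either direction". This is a reconstruction from the examples above, not the official definition:

```python
# Reconstructed from the examples in this issue; NOT the official definition.
# Perfect adjacency means B starts exactly one slot after A ends.
def adjacent_before(A, B, k):
    """A <k: B with fuzziness k in either direction."""
    return abs(min(B) - max(A) - 1) <= k

A, B, C = {1, 2, 3}, {5, 6, 7}, {2, 4, 5}
```

Under this reading the self-relation `A <3: A` follows mechanically, which suggests the fuzziness may want to be one-directional.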
Take this example of a query with quantifiers:

```
phrase
all:
  ^ w:word
have:
  ww:word pdp=verb
  w = ww
end:
```

This looks clumsy, and is caused by the fact that names (like `w`) cannot be used in places where an atom is expected. So I propose to lift that restriction, so that you can say:

```
phrase
all:
  ^ w:word
have:
  w pdp=verb
end:
```

This should also work outside quantifiers.
I intend to implement this as mere syntactic sugar: during tokenization of the query the second form will be translated to the first form.
Related to this: all quantifiers are relative to a parent atom. But there is no standard name to refer to the parent atom. So if you want to say of a phrase that, if it contains a verb, its function should be Pred, you have to work like this:

```
p:phrase
all:
  ^ w:word pdp=verb
have:
  q:phrase function=Pred
  q = p
end:
```

but if `..` were the name of the parent, you could say:

```
phrase
all:
  ^ w:word pdp=verb
have:
  .. function=Pred
end:
```

This uses both proposed new devices: extended name usage, and `..` for the parent node of a quantifier.
`prettyTuple` currently requires the `seqNumber` argument. However, I often look at tuples in isolation, without a loop. This argument should be optional.
Add a config option pointing to a directory containing spec files for data sources. These are little Python files, defining a few parameters, and containing TF init and TF.load statements. You can control which additional modules and which features will be loaded.
The Text-Fabric browser should save the input fields in a file, so that you can resume working where you left off.
TF search currently relates quantifiers to the immediately preceding atom. This happens even when that atom is indented under another element:

```
clause
  phrase function=Pred
no:
  ^ word pdp=prep
end:
```

The quantifier here relates to the phrase and not the clause, despite its indentation being even with clause.
Proposal: use the indentation preceding the quantifier to make the link with the atom. The above query would then look like this:

```
clause
  phrase function=Pred
  no:
    ^ word pdp=prep
  end:
```
Currently, you can only quantify atoms, i.e. single-lines of a template.
But we can generalize this to quantifiying pieces of template, with a natural semantics.
Also, the implementation does not require new techniques.
If somebody can provide good examples using this, I'm willing to implement this extension.
This is what it looks like:

```
/those/
templateR
/without/
templateN
/-/
```

Meaning: tuple RR is a result of this template if RR is a result of templateR, and RR cannot be extended to a tuple (RR, RN) that is a result of

```
templateR
templateN
```

```
/those/
templateR
/where/
templateA
/have/
templateH
/-/
```

Meaning: node RR is a result of this template if RR is a result of templateR, and all extensions (RR, RA) that are results of

```
templateR
templateA
```

can be extended to a tuple (RR, RA, RH) that is a result of

```
templateR
templateA
templateH
```

```
/those/
templateR
/with/
templateO1
/or/
templateO2
/or/
templateO3
/-/
```

Meaning: node RR is a result of this template if RR is a result of templateR, and there is an extension (RR, R1) that is a result of

```
templateR
templateO1
```

or there is an extension (RR, R2) that is a result of

```
templateR
templateO2
```

or there is an extension (RR, R3) that is a result of

```
templateR
templateO3
```
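The `/without/` semantics above can be stated directly in set terms. A tiny sketch with made-up result sets, where `templateR` yields one-element tuples and the combined `templateR + templateN` yields two-element extensions of them:

```python
# Set-based sketch of /those/ templateR /without/ templateN /-/ semantics.
# results_R: tuples matching templateR; results_RN: tuples matching the
# combined template, each one extending some templateR tuple.
results_R = {(1,), (2,), (3,)}
results_RN = {(1, 10), (3, 30)}

those_without = {
    rr for rr in results_R
    if not any(ext[: len(rr)] == rr for ext in results_RN)
}
```

Phrasing the other quantifiers (`/where/ ... /have/`, `/with/ ... /or/`) the same way would give the extension a precise specification to test against.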
When using `B.show` with `condenseType=clause`, the verse label no longer appears. This is undesirable, since the reference is an important coordinate, regardless of whether the whole verse is displayed.
After just a few days it turns out that the indentation in the current syntax of quantifiers is very error-prone and confusing. Users have to do a lot of mental juggling to get the indents right, and the rules, while explicit, are difficult to internalize.
I'm going to change that drastically, and the old syntax will no longer work.
In the new syntax, quantifiers do not introduce any level of indentation of their own, they just follow the indentation of the atom they are quantifying. No hassle with carets anymore.
A consequence of this is that nested quantifiers do not always appear indented with respect to each other. So the quantifier keywords have to stand out more.
I have chosen a syntax that is conspicuous, but not too loud, and not too unclean.
The list of features after the incantation in a notebook:
Features are given as underlined links. Underlining should be suppressed.
Features are not properly linked to their docs.
This is a bug in _featuresPerModule, easy to fix.
The wrong line is `if mLoc == baseLoc else`, line 782.
It should be `if mId == (app.org, app.repo, app.relative) else`.
Various typos in the docs and the new share tutorial.
Add to the description in mkdocs.yml that TF is also an API and a set of apps.
I know the documentation says that "you cannot comment out parts of lines, only whole lines", and understand that allowing `%` after text could lead to ambiguity, but would it be possible to allow whitespace before `%`? This would allow you to comment on parts of a query without obscuring the indentation structure. Essentially I'd propose to use the regular expression `^\s*(%|$)` to strip comments (this also checks for whitespace-only lines).
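The proposed rule is a one-liner to apply. A sketch of the stripping pass, using exactly the regular expression suggested above:

```python
import re

# A line is a comment (or blank) when it contains only whitespace,
# or whitespace followed by %.
COMMENT = re.compile(r"^\s*(%|$)")

def strip_comments(template):
    return [line for line in template.splitlines() if not COMMENT.match(line)]

template = """clause
  % only verbal predicates
  phrase function=Pred
    word sp=verb
"""
kept = strip_comments(template)
```

Because the comment line may carry its own indentation, it can sit visually at the level of the template line it annotates without affecting parsing.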
Central to the TF-App functionality is the auto-download feature.
But this requires the data to be open.
Yet a corpus cannot always be open, e.g. when it contains sensitive data (privacy, commercial interest, copyrighted material).
Also in those cases we want to be able to run the TF browser, which requires a TF app.
So: make the auto-download part optional for TF-apps.