annotation / text-fabric
File format, model, API, and apps for manipulating text and its annotated features
License: MIT License
It would be nice to see `B.show` results in a randomized order. Though this is doable with an extra `random.shuffle` line, I find myself needing to do this quite often.
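Until such an option exists, the workaround can be wrapped in a small helper. This is only a sketch: `results` stands in for the tuples that `S.search(query)` returns, and the commented `B.show` call is how it would be used in a notebook.

```python
import random

# Hypothetical stand-in for the tuples that S.search(query) returns.
results = [(101, 201), (102, 202), (103, 203), (104, 204)]

def randomized(results, seed=None):
    """Return a shuffled copy of the results, leaving the original order intact."""
    rng = random.Random(seed)
    out = list(results)
    rng.shuffle(out)
    return out

sample = randomized(results, seed=42)
# In a notebook one would then call: B.show(sample, end=10)
```

Using a seed keeps the random sample reproducible across notebook runs.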
TF can import from and export to XML.
It is desirable that TF can import corpora from other formats as easily as possible.
Yet, input formats differ widely.
Make an intermediate format that TF can consume and that is easy to generate from e.g. TEI, other XML and other database formats.
Write an import module for this intermediate format.
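To make the idea concrete, here is a purely illustrative sketch of what a minimal line-based intermediate format could look like and how trivially it parses. Nothing here is an existing TF format; the record layout (`nodeType<TAB>feature=value;...`) is invented for the example.

```python
# Purely illustrative: one record per line, "nodeType<TAB>feature=value;...".
# This is NOT an existing TF format; it only shows that a flat intermediate
# format is easy to generate from TEI/XML exporters and easy to consume.
def parse_intermediate(text):
    records = []
    for line in text.strip().splitlines():
        node_type, _, feats = line.partition("\t")
        features = dict(f.split("=", 1) for f in feats.split(";") if f)
        records.append((node_type, features))
    return records

sample = "word\tlex=DBR;sp=verb\nword\tlex=JHWH;sp=nmpr"
records = parse_intermediate(sample)
```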
I'm getting an out-of-memory error message when I load the bhsa corpus. (I am able to load the syrnt corpus.) I would think that my machine has enough memory. As shown in the screenshot below, I had 4.7 GB available when it failed to run. (The bump you see in the graph is where I tried to open the bhsa corpus.) I don't know if this is a text-fabric error, or if I truly don't have enough memory to display the corpus. If the latter is the case, it would be good if the memory requirements were documented more specifically.
```
PS C:\Users\Adam\Desktop> text-fabric
This is Text-Fabric 6.3.1
specify data source [bhsa/peshitta/cunei/syrnt] > bhsa
Cleaning up remnant processes, if any ...
Loading data for bhsa. Please wait ...
Setting up TF kernel for bhsa
Using bhsa-c r1.4 in ~/text-fabric-data/etcbc/bhsa/tf/c
Using phono-c r1.1 in ~/text-fabric-data/etcbc/phono/tf/c
Using parallels-c r1.1 in ~/text-fabric-data/etcbc/parallels/tf/c
  | 1.02s T number from C:\Users\Adam/text-fabric-data/etcbc/bhsa/tf/c
 4.68s Not all features could be loaded/computed
TF is out of memory!
If this happens and your computer has more than 3GB RAM on board:
    Close all other programs and try again.
```
Suppose you want to look for the first clause after an earlier clause, but not necessarily tightly adjacent. Can we define a spatial operator for that?
This is in response to an ongoing request by Cody Kingham.
Request by Oliver Glanz:
I would like to see how to define the relation between actual node attributes of different nodes.
Something like:

```
word lex=L1
=: word lex=L2
L1=L2
```
For example, how can you express this MQL query?
```
select all objects where
[clause
  [word FOCUS vt IN (perf, impf, wayq, impv)
    [word AS infalex]
  ]
  [word FOCUS vt IN (infa)
    [word lex = infalex.lex]
  ]
]
```
The Cunei data contains image files with names such as `|(HIx1(N57))&(HI+1(N57))|.jpg`. But the `|` in file names is not compatible with Windows.
There is an `A.prettySetup`. I would like an `A.showSetup` as well, because I'm calling `A.show` a lot in my notebook, and constantly have to specify what I want condensed, that I want nodes, etc.
Currently, when writing search templates you cannot express that the value of one feature on one node is equal to the value of another feature on another node.
For example, you might want to look for clauses in which the verb in a Pred phrase has the same number as the noun in a Subj phrase. Let's do it only for 1-word Pred and Subj phrases.
```
clause
  phrase function=Pred
    =: v:word pdp=verb
    :=
  phrase function=Subj
    =: s:word pdp=subs nu=v.nu
    :=
```
The bit `nu=v.nu` is not possible currently.
We also would like to be able to say in the template above `s.gn = v.gn` (`gn` is the gender feature). This is also not supported yet.
It would be nice to have both syntaxes working in TF search.
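Until the template language supports this, the comparison can be done as a post-filter outside the template. A sketch, with a plain dict standing in for TF's `F.nu.v(node)` lookup and made-up node numbers:

```python
# Hypothetical feature lookup standing in for TF's F.nu.v(node).
nu = {11: "sg", 12: "pl", 21: "sg", 22: "pl"}

# Hypothetical results: (clause, predPhrase, v, subjPhrase, s) tuples,
# as a template without the nu=v.nu constraint would deliver them.
results = [
    (1, 5, 11, 6, 21),   # v is sg, s is sg -> agree
    (2, 7, 12, 8, 22),   # pl / pl -> agree
    (3, 9, 12, 10, 21),  # pl / sg -> disagree
]

# Keep only results where the verb and the subject noun agree in number.
agreeing = [r for r in results if nu[r[2]] == nu[r[4]]]
```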
`B.prettyTuple` and `B.show` currently display results in the context of their enclosing verse. I would like to be able to request a different maximum-sized object, such as a sentence or clause.
Perhaps this is difficult to implement, since the maximum object should fully enclose the results. If I request a maximum size of `phrase` on a tuple containing clause nodes, what should happen? One option would be to simply display all contained phrases without sentence or clause boundaries. This would look similar to the Bible Online Learner view when clause/sentence boundaries are toggled off but phrase boundaries are on. So if I toggle `word`, for instance, I get a bunch of word boxes with no phrase boundaries.
Going up could behave similarly. If I request `clause`, then I get enclosing clauses for all attested clauses in the tuple.
Perhaps this could be toggled in `B.prettySetup`.
In the TF-Browser you can export your query results to Excel-friendly tab-separated files.
This should also be possible from the API that runs in a notebook.
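From the API, such an export can be sketched with the standard `csv` module. The column layout below (one node per column, a header row) is an assumption for illustration, not the browser's exact format:

```python
import csv
import io

# Sketch: write result tuples as an Excel-friendly tab-separated file.
# The header names and node numbers are made up for the example.
def export_tsv(results, headers, out):
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(results)

buf = io.StringIO()  # in a notebook this would be open("results.tsv", "w")
export_tsv([(427553, 651573), (427554, 651574)], ["clause", "word"], buf)
```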
I want to revisit the issue of multiple names in quantifiers. There are certain relationships which the present implementation is not able to express.
Here is an example template that is currently not possible. There are two words in a phrase. And we want to specify that a preposition should not intervene between them:
```
phrase
  w1:word
  [...]
  w2:word
/without/
  w1
  < word pdp=prep
  < w2
/-/
```
In the current implementation, these kinds of relation expressions can only be approximated by repeating all of the specifications under w1
under a quantifier:
```
phrase
  w1:word
  [...]
  w2:word
/without/
  phrase
    w1:word
    [...]
    < word pdp=prep
    w2
/-/
```
In cases where `[...]` contains lots of specs, this becomes unwieldy. Furthermore, there is no guarantee in the above pattern that the `w1` in the phrase template will be identical to the `w1` in the quantifier template.
I know that there are major difficulties to implementing multiple names in quantifiers. However, if each name is assigned to a node in an outer level, perhaps it is possible to pass those nodes as sets to the inner levels, so that a `w1` defined in level 0 can be directly referred to in level 1. Or perhaps there is yet a better way.
Text-Fabric Browser should have an option to upload node ID / feature pairings.
For instance, if I have a .csv file where one column is nodes and another is a string feature, I should be able to click an upload button in TF Browser and upload that .csv into the corpus for further exploration and analysis.
Probably those exported files would get pushed to a new directory on disk? A user could then upload to Github when they feel their features are in a finished state.
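The core of such an upload is turning a two-column CSV into a node-to-value mapping, the kind of dict a TF feature is built from. A sketch with a made-up column layout (node number, string value; rows with an empty value skipped):

```python
import csv
import io

# Sketch: parse an uploaded two-column CSV (node, value) into a feature dict.
# The column layout is an assumption about what such an upload could look like.
def feature_from_csv(csv_text):
    reader = csv.reader(io.StringIO(csv_text))
    return {int(node): value for node, value in reader if value}

uploaded = "427553,narrative\n427554,discourse\n427555,\n"
feature = feature_from_csv(uploaded)
```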
Hello @dirkroorda,
At Software Heritage we really liked your idea of putting a SWH badge on the README.
We liked it so much that we created three types of badges that you can find on the permalink tab on the right :-)
The archived repository badge:
Coded with:
[![SWH](https://archive.softwareheritage.org/badge/origin/https://github.com/annotation/text-fabric/)](https://archive.softwareheritage.org/browse/origin/https://github.com/annotation/text-fabric/)
The archived release badge:
Coded with:
[![SWH](https://archive.softwareheritage.org/badge/swh:1:rel:4f61758988dfa0479b69f790f2f857738574e3e8;origin=https://github.com/annotation/text-fabric/)](https://archive.softwareheritage.org/swh:1:rel:4f61758988dfa0479b69f790f2f857738574e3e8;origin=https://github.com/annotation/text-fabric/)
And the badges to a specific artifact identified with the swh-id,
for example a snapshot with all branches and releases captured in April 2019:
Coded with:
[![SWH](https://archive.softwareheritage.org/badge/swh:1:snp:1de172a10590eed2f1e3dcd049f13b4ed7d72bd0;origin=https://github.com/annotation/text-fabric/)](https://archive.softwareheritage.org/swh:1:snp:1de172a10590eed2f1e3dcd049f13b4ed7d72bd0;origin=https://github.com/annotation/text-fabric/)
Hope you like the result.
Cheers,
Morane
We have a node pad and a search pad.
Add a passage pad above the node pad, where you can enter a passage, such as Genesis 3:16 or P30518
A click on the passage label of a result should add the passage to the passage pad.
This way you can start with a passage, expand it, switch node numbers on, click on them to add nodes to the node pad, combine them to make a few tuples. So, on top of your query results, you can easily collect some special cases and examples.
The data for the BHSA is in a GitHub repo. Cloning it means downloading 5 GB of data.
The typical user needs only 25 MB of zipped TF data.
For the Cunei data something similar holds: the repo is 1.5 GB, but the TF data is 1.6 MB.
The Cunei user also might want the image files (derived from CDLI), which are roughly 500 MB.
In notebooks, you can customize the highlight colors.
Give Text-Fabric-Browser users that possibility as well.
NB It is not possible to provide the full flexibility of the highlights parameter in this context. But the colormap parameter can easily be done.
In MQL there is a NOTEXIST operator. We want something like that in TF as well.
Here is a proposal to add Quantifiers in search. They are a generalization of NOTEXIST.
A few examples:

```
clause
all:
  phrase function=Pred
have:
  word sp=verb
```
This searches for clauses in which all Pred-phrases have a verb in them.
The `all: ... have: ...` part is called a quantifier template.
Quantifier templates do not contribute result nodes to a tuple.
There will be one other quantifier: `no: ...`.

```
clause
no:
  phrase function=Pred
```

This results in all clauses which do not have a Pred-phrase in them.
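The intended semantics of both quantifiers can be pinned down with ordinary Python over toy data. This is only an illustration with made-up clause/phrase structures, not TF code. Note that a clause with no Pred-phrase at all satisfies the `all: ... have: ...` condition vacuously:

```python
# Toy data: clause -> list of (function, has_verb) pairs for its phrases.
phrases = {
    1: [("Pred", True), ("Subj", False)],  # every Pred has a verb
    2: [("Pred", False)],                  # a Pred without a verb
    3: [("Subj", False)],                  # no Pred at all (vacuously true)
}

# clause  all: phrase function=Pred  have: word sp=verb
all_have = [c for c, ps in phrases.items()
            if all(verb for fn, verb in ps if fn == "Pred")]

# clause  no: phrase function=Pred
no_pred = [c for c, ps in phrases.items()
           if not any(fn == "Pred" for fn, verb in ps)]
```

The vacuous case (clause 3 appearing in `all_have`) is worth deciding on explicitly when specifying the quantifier.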
You may have multiple quantifiers, combine them with ordinary templates and nest them.
Here is a pretty meaningless example
```
clause
all:
  p1:phrase
    all:
      w1:word
    have:
      w2:word sp=verb
      w1 = w2
have:
  p2:phrase function=Pred
  <: phrase function=Subj
  p1 = p2
phrase function=Obj
```
In words:
Results are tuples (clause, phrase) where the phrase is an Obj within the clause.
But only for clauses that satisfy this condition:
all phrases that consist only of verbs have function=Pred and are immediately followed by a Subj phrase.
Text-Fabric uses a lot of memory. If you want to use TF in a webserver, it is not feasible to load TF for every request.
Rather, you would like to start TF once, with the webserver, let the webserver initialize TF, and pass the TF object to the code that processes requests.
But I am not sure whether this works, and how TF behaves if it is used by multiple requests at the same time.
This whole scenario needs to be explained.
In a notebook you can call prettySetup to add features to the pretty displays. You need this when you have features of your own that you have included in your custom dataset.
Give users of the Text-Fabric browser the possibility to use `prettySetup` as well.
Is it currently possible to indicate that template lines should not appear in results, i.e. are non-matching in the sense of non-matching regular expression groups (`(?:...)`)?
In some cases this can be simulated with quantifiers:
```
verse
  non-matching word lex=KL/
```

would be identical to:

```
verse
/with/
  word lex=KL/
/-/
```
However, I don't see how to express the following:

```
verse
  non-matching clause
    word language=Aramaic lex=HWH[
    word vt=ptca
```

Perhaps it can be done with something like this, but that looks cumbersome:

```
verse
/with/
  clause
    w1:word
    w2:word
    w1=hwy
    w2=ptc
/-/
hwy:word language=Aramaic lex=HWH[
ptc:word vt=ptca
```
TF supports two or three section levels.
The TF browser can work with two or three section levels.
Make the number of levels completely free: 0, 1, 2, 3, 4, ...
Take care that the TF browser honours all section levels.
Define what the TF browser should do if there are no section levels.
When we progressively add new corpora to TF, we'll need this flexibility.
We want to have the possibility of using custom sets in search, like so:

```python
query = '''
clause
  gapphrase
    word sp=verb
'''
```

Here `gapphrase` is not a node type in the data set, but refers to a node set that you have constructed yourself.
You can then search with this template by passing a dictionary of sets to `search()`:

```python
gappedPhrases = ... code for finding the gapped phrases ...
results = S.search(query, sets=dict(gapphrase=gappedPhrases))
```

Sets like `gappedPhrases` may or may not be constructed with `search()`.
If you use `search()`, it is handy to get the set of top-level nodes instead of the full list of tuples.
E.g. we want the set of clauses that consist exactly of a Pred and a Subj phrase:
```python
query = '''
clause
  =: phrase function=Pred
  <: phrase function=Subj
  :=
'''
```

Then you can say:

```python
simpleClauses = S.search(query, shallow=True)
```

and now `simpleClauses` is a node set, ready for use as a custom set in a query.
When the TF data is loaded by the TF browser for the first time, and there is too little RAM, a MemoryError is raised. After that, the generated binary data of TF is corrupt.
For situations where the user can free more RAM, they need a convenient way to move the corrupt data out of the way.
In a Jupyter notebook, one can say `TF.clearCache()`, but we need a method that is easier to invoke, and that picks exactly the right datasets.
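The "pick exactly the right datasets" part boils down to locating the compiled-data directories. A minimal sketch, under the assumption that the compiled binary data lives in directories named `.tf` under the data root (the directory name is an assumption here):

```python
from pathlib import Path

# ASSUMPTION: compiled binary data lives in directories named ".tf"
# somewhere under the data root. Find them so they can be cleared.
def find_compiled_dirs(base):
    return sorted(p for p in Path(base).rglob(".tf") if p.is_dir())
```

A convenience command could then list these directories and offer to rename or delete them, instead of requiring the user to hunt them down by hand.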
I would like a silence argument for `Bhsa` to keep things clean in my notebook opening statements.
Suppose you have a feature with values like `a.1|b.2`, and you want to search for exactly this value.
Then you enter "escape hell", because your search string will be parsed by Python, TF, and the regular expression engine. So how many backslashes do you need for escaping the `|` and the `.`?
It turned out: one for `.` and three for `|`. There was also a bug. So this is a very unsatisfactory situation.
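Within the Python layer at least, `re.escape` removes the need to count backslashes by hand, since it escapes every regex metacharacter in one step. The TF template layer adds its own parsing on top, which is exactly the layering complained about above:

```python
import re

value = "a.1|b.2"
pattern = re.escape(value)  # escapes both . and | in one step

exact = re.fullmatch(pattern, "a.1|b.2")
near_miss = re.fullmatch(pattern, "aX1|b.2")  # '.' now matches only literally
```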
Today it occurred that I changed two TF apps for some reason, and made them rely on a new feature in the TF applib, the library in TF that helps the apps to work.
I updated the TF apps, but waited a bit with pushing the main TF.
A user of that app fell victim to the app auto-updating itself and then failing because the new version of TF was not yet online.
In order to mitigate scenarios like this, the TF applib should get a separate version number, and apps should state which version they need to run.
In this case, because of the new feature, I would have bumped the applib version, and made the two TF apps require the new version at least.
When TF loads an app, it can see the applib version requirement of the TF app, inspect its own applib version, and if there is a mismatch, it can issue a friendly warning and hint.
Make a left sidebar, expandable, like in Jupyter Lab. Use it for:
• opening a previous query
• exporting and filling out metadata
• help sections
Apps are part of the TF code base.
When the number of apps grows, they must be weaned from the TF core code.
The logical separation is already good, but it should be extended to infrastructural separation, so that apps can be developed in several repos by several people, in the ideal case people that are intimate with the corpus in question.
Is it possible to extend the search template language to allow `#` for features?
For example, I want to find sentences where אמר is followed by some ל+infinitive construct, but want to exclude the infinitive construct of אמר. I would then like to use the query below (or is there an easier way to do this with the current search template syntax?):

```
sentence
  clause
    word lex=>MR[
  < clause
    word lex=L
    <: word vt=infc lex#>MR[
```
Now you can also pass a dictionary of highlights to the plain() display. But you can only highlight slots within the text, not higher concepts; for that you need pretty().
Yet, it is not too difficult to highlight the slots of, say, phrases within sentences.
And, right now, all highlighting in plain() comes out as yellow. It would be nice if the colours you put in the highlights dict were respected in the display.
The display of the Uruk corpus could be better. It has 6000 tablets; each tablet has just a few columns, and each column a few lines. This section structure is not pleasant to browse.
When converting data to TF, you have to create an `oslots` feature that maps non-slot nodes to subsets of slots. All non-slot nodes should be mapped, and no slot nodes should be mapped.
If this requirement is violated, you get weird errors when TF tries to load the new dataset.
It is near impossible to find out what is wrong from the error messages.
When oslots is read and written, a clear sanity check should be performed.
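Such a sanity check is a simple pass over the node range. A sketch, under the assumptions that nodes are numbered `1..max_node`, slots are `1..max_slot`, and `oslots` maps node to a set of slots:

```python
# Sketch of the requested sanity check, not TF's actual implementation.
# ASSUMPTIONS: nodes are 1..max_node, slots are 1..max_slot,
# and oslots maps node -> set of slot numbers.
def check_oslots(oslots, max_slot, max_node):
    problems = []
    for n in range(1, max_node + 1):
        mapped = oslots.get(n)
        if n <= max_slot and mapped is not None:
            problems.append((n, "slot node has an oslots entry"))
        elif n > max_slot and not mapped:
            problems.append((n, "non-slot node is not mapped to any slots"))
    return problems

# Node 3 is a slot that is wrongly mapped; node 5 is a non-slot without slots.
issues = check_oslots({3: {3}, 4: {1, 2}}, max_slot=3, max_node=5)
```

Reporting the offending node numbers directly would make the errors described above much easier to diagnose.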
Add an input field for node tuples that you want displayed.
So in addition to searching (which delivers node tuples), you can also paste in your own tuples.
If you have run a search, make a button that copies the currently displayed results as tuples into an input box. Every tuple appears as a comma separated list of numbers on its own line.
The interface will display the combined results of the given tuples and the search results.
The resulting set can still be condensed by the user ad libitum.
Add a description field to the TF browser, so that when you save results as a pdf, you can enter an explanation.
I can also add some metadata (time of running the query) and some statistics (the number of nodes in the total results, split out per node type, compared with the totals in the whole corpus).
That way the PDF becomes a "proof of work".
The current browser search icons are a bit confusing (recycle sign and gear). In demo'ing TF to several people this has caused some confusion. A google search for "search icon" gives results for magnifying glasses instead. I think this would be a better fit.
BHSA app load with silent=True now produces a readout. Before the latest update it was quiet.
It would be nice if `end=` also produced a statement similar to pandas DataFrames when a df extends beyond the limit, e.g. "22 more results not shown" or something like this. But this may be too minor to warrant addition. Feel free to ignore.
Perhaps I'm overlooking something, but it seems there is no easy way to get the next n words using the locality API. You could chain `L.n`, but then you need to check continuously whether you have reached the last word or not. Would it be useful to add functionality to get a Python generator with `yield` that continues yielding next items until none are left?
I'm happy to implement it myself and create a PR, but wanted to check whether you think this is something useful to have in text-fabric, and whether it would be better to extend the current `L.u` and `L.p` or create new methods.
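The proposed generator can be sketched independently of TF. Here `nxt` stands in for whatever "give me the next word" primitive is used (the issue mentions chaining `L.n`); it is a plain function that returns `None` at the end of the corpus:

```python
# Sketch of the requested generator; `nxt` is a stand-in for the locality
# primitive (e.g. a wrapper around L.n) that returns None past the last word.
def following(node, nxt, limit=None):
    """Yield successive next nodes until none are left (or `limit` is reached)."""
    count = 0
    cur = nxt(node)
    while cur is not None and (limit is None or count < limit):
        yield cur
        count += 1
        cur = nxt(cur)

# Toy corpus: words 1..5 in a row.
nxt = lambda n: n + 1 if n < 5 else None
```

Because it is lazy, the caller can take as many words as needed (`itertools.islice`) without the end-of-corpus bookkeeping.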
There are various relaxed spatial relations, see nearness relations.
• `A <k: B` meaning A adjacent before B
• `A =k: B` meaning A and B start at the same slot
• `A :k= B` meaning A and B end at the same slot
• `A :k: B` meaning A and B have the same boundary slots
but up to a fuzziness of k slots.
The first relation is asymmetric, the others are symmetrical.
The k-fuzziness leads to unexpected cases:
If `A = [1, 2, 3]` and `B = [5, 6, 7]`, it holds that `A <1: B`.
And if `C = [2, 4, 5]`, it also holds that `A <2: C`; now the fuzziness works in the other direction.
It even holds that `A <3: A`, strangely enough.
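All three surprising cases are consistent with reading `A <k: B` as "the gap between A's last slot and B's first slot deviates from perfect adjacency by at most k, in either direction". This is a reconstruction from the examples above, not the official definition:

```python
# Reconstructed from the examples in this issue; NOT the official definition.
# Perfect adjacency means B starts exactly one slot after A ends.
def adjacent_before(A, B, k):
    """A <k: B with fuzziness k in either direction."""
    return abs(min(B) - max(A) - 1) <= k

A, B, C = {1, 2, 3}, {5, 6, 7}, {2, 4, 5}
```

Under this reading the self-relation `A <3: A` follows mechanically, which suggests the fuzziness may want to be one-directional.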
Take this example of a query with quantifiers:

```
phrase
all:
  ^ w:word
have:
  ww:word pdp=verb
  w = ww
end:
```

This looks clumsy, and is caused by the fact that names (like `w`) cannot be used in places where an atom is expected. So I propose to lift that restriction, so that you can say:

```
phrase
all:
  ^ w:word
have:
  w pdp=verb
end:
```

This should also work outside quantifiers.
I intend to implement this as mere syntactic sugar: during tokenization of the query the second form will be translated to the first form.
Related to this: all quantifiers are relative to a parent atom. But there is no standard name to refer to the parent atom. So if you want to say of a phrase that, if it contains a verb, its function should be Pred, you have to work like this:

```
p:phrase
all:
  ^ w:word pdp=verb
have:
  q:phrase function=Pred
  q = p
end:
```

but if `..` were the name of the parent, you could say:

```
phrase
all:
  ^ w:word pdp=verb
have:
  .. function=Pred
end:
```

This uses both proposed new devices: extended name usage, and `..` for the parent node of a quantifier.
`prettyTuple` currently requires the `seqNumber` argument. However, I often look at tuples in isolation, without a loop. This argument should be optional.
Add a config option pointing to a directory containing spec files for data sources. These are little Python files, defining a few parameters, and containing TF init and TF.load statements. You can control which additional modules and which features will be loaded.
The Text-Fabric browser should save the input fields in a file, so that you can resume working where you left off.
TF search currently relates quantifiers to the immediately preceding atom. This happens even when that atom is indented under another element:

```
clause
  phrase function=Pred
no:
  ^ word pdp=prep
end:
```

The quantifier here relates to the phrase and not the clause, despite its indentation being even with clause.
Proposal: use the indentation preceding the quantifier to make the link with the atom. The above query would then look like this:

```
clause
  phrase function=Pred
  no:
    ^ word pdp=prep
  end:
```
Currently, you can only quantify atoms, i.e. single-lines of a template.
But we can generalize this to quantifiying pieces of template, with a natural semantics.
Also, the implementation does not require new techniques.
If somebody can provide good examples using this, I'm willing to implement this extension.
This is what it looks like:

```
/those/
templateR
/without/
templateN
/-/
```

Meaning: tuple RR is a result of this template if RR is a result of templateR, and RR cannot be extended to a tuple (RR, RN) that is a result of

```
templateR
templateN
```

```
/those/
templateR
/where/
templateA
/have/
templateH
/-/
```

Meaning: node RR is a result of this template if RR is a result of templateR, and all extensions (RR, RA) that are results of

```
templateR
templateA
```

can be extended to a tuple (RR, RA, RH) that is a result of

```
templateR
templateA
templateH
```

```
/those/
templateR
/with/
templateO1
/or/
templateO2
/or/
templateO3
/-/
```

Meaning: node RR is a result of this template if RR is a result of templateR, and there is an extension (RR, R1) that is a result of

```
templateR
templateO1
```

or there is an extension (RR, R2) that is a result of

```
templateR
templateO2
```

or there is an extension (RR, R3) that is a result of

```
templateR
templateO3
```
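The `/without/` semantics above can be stated directly in set terms. A tiny sketch with made-up result sets, where `templateR` yields one-element tuples and the combined `templateR + templateN` yields two-element extensions of them:

```python
# Set-based sketch of /those/ templateR /without/ templateN /-/ semantics.
# results_R: tuples matching templateR; results_RN: tuples matching the
# combined template, each one extending some templateR tuple.
results_R = {(1,), (2,), (3,)}
results_RN = {(1, 10), (3, 30)}

those_without = {
    rr for rr in results_R
    if not any(ext[: len(rr)] == rr for ext in results_RN)
}
```

Phrasing the other quantifiers (`/where/ ... /have/`, `/with/ ... /or/`) the same way would give the extension a precise specification to test against.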
When using `B.show` with `condenseType=clause`, the verse label no longer appears. This is undesirable, since the reference is an important coordinate, regardless of whether the whole verse is displayed.
After just a few days it turns out that the indentation in the current syntax of quantifiers is very error-prone and confusing. Users have to do a lot of mental juggling to get the indents right, and the rules, while explicit, are difficult to internalize.
I'm going to change that drastically, and the old syntax will no longer work.
In the new syntax, quantifiers do not introduce any level of indentation of their own, they just follow the indentation of the atom they are quantifying. No hassle with carets anymore.
A consequence of this is that nested quantifiers do not always appear indented with respect to each other. So the quantifier keywords have to stand out more.
I have chosen a syntax that is conspicuous, but not too loud, and not too unclean.
The list of features after the incantation in a notebook:
Features are given as underlined links. Underlining should be suppressed.
Features are not properly linked to their docs.
This is a bug in _featuresPerModule, easy to fix.
The wrong line is `if mLoc == baseLoc else`, line 782.
It should be `if mId == (app.org, app.repo, app.relative) else`.
Various typos in the docs and the new share tutorial.
Add to the description in mkdocs.yml that TF is also an API and a set of apps.
I know the documentation says that "you cannot comment out parts of lines, only whole lines", and understand that allowing `%` after text could lead to ambiguity, but would it be possible to allow whitespace before `%`? This would allow you to comment on parts of a query without obscuring the indentation structure. Essentially I'd propose to use the regular expression `^\s*(%|$)` to strip comments (this also checks for whitespace-only lines).
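The proposed rule is a one-liner to apply. A sketch of the stripping pass, using exactly the regular expression suggested above:

```python
import re

# A line is a comment (or blank) when it contains only whitespace,
# or whitespace followed by %.
COMMENT = re.compile(r"^\s*(%|$)")

def strip_comments(template):
    return [line for line in template.splitlines() if not COMMENT.match(line)]

template = """clause
  % only verbal predicates
  phrase function=Pred
    word sp=verb
"""
kept = strip_comments(template)
```

Because the comment line may carry its own indentation, it can sit visually at the level of the template line it annotates without affecting parsing.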
Central to the TF-App functionality is the auto-download feature.
But this requires the data to be open.
Yet a corpus cannot always be open, e.g. when it contains sensitive data (privacy, commercial interest, copyrighted material).
Also in those cases we want to be able to run the TF browser, which requires a TF app.
So: make the auto-download part optional for TF-apps.