The stemmatology from tla

make_tradition.pl script shouldn't create empty traditions

The make_tradition script flags an error if the given input type doesn't exist, but a tradition is still created (and put in the database, if applicable). This may also happen when an error is thrown. Either way, the empty tradition should not be created.

TEI parser can break if run twice in the same Perl invocation

If the TEI parser has to parse a file with namespaces more than once (or, presumably, a file without namespaces after a file with), it will break with an error like this:

XPath error : Invalid expression
//tei:tei:listWit/tei:tei:witness
         ^ at /opt/local/lib/perl5/site_perl/5.16.1/Text/Tradition/Parser/TEI.pm line 128.

Overhaul CTE to cope with intermixed witStart/witEnd and multiply-specified witness readings

Text direction should be configurable

so that graph export for RTL texts goes from R to L.

Finish and test method for (re-)rooting stemma graph

Integration with a phylogeny or other stemma generation package means that we will sometimes have unrooted stemmas. We need to give the user some way to assign a root (archetype) to an unrooted stemma.
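
One way to picture the rooting step, as a minimal Python sketch rather than the Text::Tradition implementation: treat the unrooted stemma as an undirected edge list and orient every edge away from the node the user picks as archetype. The function name `root_stemma` and the edge-list representation are assumptions made for illustration, and the sketch assumes the stemma is a tree (no contamination cycles).

```python
from collections import deque


def root_stemma(edges, root):
    """Orient an undirected edge list away from the chosen root (archetype)."""
    # Build an adjacency map from the undirected edges.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if root not in neighbours:
        raise ValueError(f"{root!r} is not a node in the stemma")

    # Breadth-first walk from the root; each newly reached node becomes
    # the child of the node it was reached from.
    directed = []
    seen = {root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in sorted(neighbours[parent]):
            if child not in seen:
                seen.add(child)
                directed.append((parent, child))
                queue.append(child)
    return directed
```

Calling `root_stemma([("A", "B"), ("B", "C"), ("B", "D")], "B")` yields edges pointing from B to each of its neighbours; choosing "A" instead makes B a child of A.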

Need to deal with Moose exceptions being objects

The exception-raising mechanism in Text::Tradition::Error tends to assume that whatever error it is asked to raise is a string. Moose now throws its exceptions as objects, so treating one as a string while throwing our own exception itself raises an exception (confused yet?). We need to add a check for this.

Collation merge_readings can be unpredictable if propagation hasn't happened

Depending on the order returned by $collation->readings(), if you try to merge/collapse readings by relationship type and the relationship type has not been applied transitively (i.e. propagated) then the merge may or may not be complete. We need to do this recursively to be sure.
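
The order dependence goes away if the merge works on whole connected groups of related readings instead of on individual pairs. The following Python sketch illustrates that idea with a small union-find; `merge_groups` and its inputs are hypothetical names, not part of the Collation API, and the actual fix may well be a recursive walk instead.

```python
def merge_groups(readings, relations):
    """Group readings into connected components of the relationship graph.

    `relations` is a list of (reading, reading) pairs of the relationship
    type being collapsed; the pairs need not be transitively closed.
    """
    parent = {r: r for r in readings}

    def find(r):
        # Follow parent links to the representative, shortening the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in relations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in readings:
        groups.setdefault(find(r), set()).add(r)
    # Sort so the result does not depend on input order.
    return sorted(sorted(group) for group in groups.values())
```

With relationships r1~r2 and r2~r3 but no explicit r1~r3, all three readings still land in one group, whatever order the readings or pairs arrive in.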

TEI parallel segmentation parsing should recognise 'lem' tags

A <lem> tag in a TEI_PS file should result in readings with is_lemma set to true.

Need ability to read & save undirected stemma graphs

The Stemma object needs to be able to correctly parse a dot file with an undirected graph description, as generated from a Newick specification.

Failing Collation.pm test on Text::Tradition

t/text_tradition_collation.t .................... 9/? 
#   Failed test 'Reading r7.5 correctly removed'
#   at t/text_tradition_collation.t line 89.
#          got: ''
#     expected: '1'

#   Failed test 'Reading r7.6 correctly retained'
#   at t/text_tradition_collation.t line 89.
#          got: '1'
#     expected: ''
t/text_tradition_collation.t .................... 159/? # Looks like you failed 2 tests of 171.
t/text_tradition_collation.t .................... Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/171 subtests

Support TEI double-endpoint-attachment output

This needs to conform to what Classical Text Editor expects.

Add facility to parse Stemweb results into one or more stemmata for a tradition

Need to implement the conversion of a Stemweb calculation into one or more Text::Tradition::Stemma objects tied to a particular tradition, as described in the API given here:
http://treeoftexts.arts.kuleuven.be/?p=58

CTE parser not recognizing witStart and witEnd tags

A use case was submitted that makes use of the apparatus codicum, so that witStart tags and witEnd tags are present in the XML. These need to be parsed correctly.

Write tests for CSV and TSV export

In fixing #7 I found that the CSV export was breaking on a spurious decode_utf8() call. That should have been in a test.

Need to know what relationships disappear when reading is duplicated.

The call to duplicate_readings can cause certain relationships to no longer be valid, and it will remove them. Right now this is done silently, but for UI purposes the relationship removal needs to be propagated.

Add facility to export tab-separated "CSV"

The CSV export should allow tabs as well as commas for separation purposes.
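
As a language-neutral illustration (Python here, not the Perl export code), the separator can simply be a parameter of a single export routine. The function name `export_table` and the row layout are assumptions for this sketch.

```python
import csv
import io


def export_table(rows, delimiter=","):
    """Serialise a collation table, using commas or tabs as the separator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Same table, two separators.
table = [["A", "B"], ["the", "teh"]]
as_csv = export_table(table)
as_tsv = export_table(table, delimiter="\t")
```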

UTF-8 bug in mysql storage

Need finally to trace and zap the UTF-8 encoding bug in tradition names in the MySQL tables.

Perl 5.18 doesn't like open file handles on (char) strings

Tests start failing when we open a file handle to read from a UTF-8 character string. We need to stop doing that, by converting them to byte strings first.

Restriction on merge_readings throws up bug in equivalence graph

It turns out that, when a check is made to prevent merge of readings that shouldn't be merged (i.e. as in 27e161b), a tight cycle is found in the equivalence graph. This breaks things.

Failing POD test on Text::Tradition

#   Failed test 'POD test for blib/lib/Text/Tradition/Collation.pm'
#   at /opt/perl-5.21.5/lib/site_perl/5.21.5/Test/Pod.pm line 186.
# blib/lib/Text/Tradition/Collation.pm (653): Non-ASCII character seen before =encoding in ''Mü11475''. Assuming UTF-8
# Looks like you failed 1 test of 18.
t/02pod.t ....................................... 
Dubious, test returned 1 (wstat 256, 0x100)

Failing POD coverage test on Text::Tradition

not ok 2 - Pod coverage on Text::Tradition::Parser::CTE
#   Failed test 'Pod coverage on Text::Tradition::Parser::CTE'
#   at t/03podcoverage.t line 29.
# Coverage for Text::Tradition::Parser::CTE is 50.0%, with 1 naked subroutine:
#   do_warn

Collect IDP utility functions into a proper library

All the IDP scripts need to be refactored around a central library.

JSON parsing gets the ranks wrong

Looks like a straightforward off-by-one error.

Make decent workaround for Graph::Reader::Dot

The Graph::Reader::Dot module fails to parse name tokens with characters outside the ASCII \w range unless the tokens are wrapped in double-quotes. Various hacks exist but this needs a real workaround.
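
One possible shape for such a workaround, sketched in Python for illustration only: quote every node name that is not a plain ASCII identifier before the dot text reaches the parser. The helper name `dot_id` is hypothetical; the escaping follows the DOT rule that a double quote inside a quoted string is written as `\"`.

```python
import re

# A bare DOT identifier restricted to ASCII letters, digits and underscore.
_BARE_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def dot_id(name):
    """Return `name` unchanged if it is a safe bare ID, else double-quoted."""
    if _BARE_ID.match(name):
        return name
    escaped = name.replace('"', '\\"')
    return f'"{escaped}"'
```

A sigil such as `Mü11475` comes out wrapped in double quotes, while a plain `A` passes through untouched.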

Add logic concerning a reading's normal form when there is a lemma set

The term 'lemma' is annoyingly overloaded, but...

When a reading is chosen as a lemma, its spelling and orthographic (and arguably punctuation) variants should have the same normal form as the lemma.

Make a test file for CTE parsing

The CTE parser pretty badly needs some tests added to it...

CollateX JSON output format has changed

...and our JSON parser ought to reflect this.

Make IDP solver URL configurable

Native XML export should include witness information

At the moment, any witness information is thrown away. This is unfortunate.

Need to be able to restore traditions with active Stemweb job IDs.

JSON parser doesn't work with a.c. witnesses

...because ' (a.c.)' fails XML Name validation.

CSV/TSV formats need option to exclude a.c. wits

At the moment it makes little sense to include a.c. witnesses in stemmatological reckoning. So when generating the collation table for those, we should exclude them. Related to tla/stemmaweb#29

compress_readings fails when there are readings with join_prior or join_next

The compress_readings method is failing when it shouldn't be. This is because of a naive means of assembling the original text in the sanity checking code.

Analysis: Transposition symmetry check evidently relying on arbitrary order of witnesses

It seems that when we check for transposition symmetry in the Analysis module, we've been comparing stringified versions of witness sets without sorting them first. Not sure how this ever reliably worked.
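
The fix amounts to building the comparison string from a sorted witness list. A minimal Python sketch of that idea, with `witness_key` as a hypothetical helper name:

```python
def witness_key(witnesses):
    """Build an order-independent string key for a set of witness sigla."""
    return ",".join(sorted(witnesses))


# Two orderings of the same witness set now compare equal.
same = witness_key(["B", "A", "C"]) == witness_key(["C", "B", "A"])
```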

Consider making CTE parser cope with 'post X transp.' notation in the apparatus.

It is very common for scholars using CTE to note a transposition via 'post X transp.' in the apparatus. This has all kinds of pitfalls, but it might be nice if we make a first pass at solving it.

Support TEI parallel-segmentation output

Repetition relationship can't be set on witness + a.c. witness

It should be possible to set a repetition relationship between a reading from X and a later reading from X (a.c.) - at the moment this is excluded.

CTE parsing does not handle witStart and witEnd tags correctly.

Need to revamp handling of witStart and witEnd, to accord with what the XML is actually trying to represent. Current implementation is a misinterpretation.

Pull graph / dot manipulation functions into StemmaUtil

We should not be calling ::Stemma for graph manipulation that doesn't concern an actual Stemma object. Thus need to refactor in order to avoid circular dependency on ::Stemma <-> ::StemmaUtil.

CollateX format

I realise that there hasn't been much development on this in the last decade, but I'm still interested, as a Perl user of 25 years who is working on the text editing and analysis tool at menotag.ku.dk

I'm putting in CollateX JSON output but running into two problems: CollateX may have changed its format since this tool was updated, because (when using the JSON string):

Can't use an undefined value as an ARRAY reference at /usr/local/share/perl5/Text/Tradition/Parser/JSON.pm line 160, line 41.

Additionally, I'm using string input because Text::Tradition won't read the JSON file:

malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/local/share/perl5/Text/Tradition/Parser/JSON.pm line 117.

Where the JSON file is perfectly fine.

Reading duplication can cause invalid graphs

It seems that the duplicate_reading function can, if used unwisely, lead to a bad graph. There needs to be a check to ensure that neither the new reading nor the old reading will get dissociated from the graph, which is to say, they both need to have at least one witness afterward.
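
The check described here could look roughly like the following Python sketch, which is an illustration and not the duplicate_reading code itself. `check_duplication` and its arguments are assumed names: the first argument is the set of witnesses currently on the reading, the second the witnesses that would move to the duplicate.

```python
def check_duplication(reading_witnesses, moved_witnesses):
    """Validate a duplication; return (remaining, moved) witness sets."""
    current = set(reading_witnesses)
    moved = set(moved_witnesses)

    unknown = moved - current
    if unknown:
        raise ValueError(f"witnesses not on this reading: {sorted(unknown)}")
    if not moved:
        raise ValueError("new reading would have no witnesses")

    remaining = current - moved
    if not remaining:
        raise ValueError("original reading would have no witnesses")
    return remaining, moved
```

Moving every witness to the duplicate, or none at all, is rejected before the graph is touched.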

CollateX input parser should account for a.c. witnesses

Some users might want to collate a.c. readings in their manuscripts. If a witness is marked a.c., then the CollateX parser should treat it as a variant of the base witness and not as a new witness in its own right.
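
A sketch of the sigil handling this implies, in Python for illustration: strip the ' (a.c.)' suffix to find the base witness and record that the layer is ante correctionem. The suffix matches the form quoted in the JSON parser issue above; the function name `split_sigil` is an assumption.

```python
AC_SUFFIX = " (a.c.)"


def split_sigil(sigil):
    """Return (base_sigil, is_ac) for a witness name from the collation."""
    if sigil.endswith(AC_SUFFIX):
        return sigil[: -len(AC_SUFFIX)], True
    return sigil, False
```

With this, `X (a.c.)` is attached to witness `X` as a layer instead of being registered as a separate witness.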
