psl-chartes-htr-students / hn2021-boccace

This repository hosts all the documents (transcriptions, bibliographical references and introduction) produced by the Boccace team for the validation of the course "Bonnes pratiques du développement collaboratif : initiation à Git" (prof. Thibault Clérice), first semester of the Master Humanités Numériques, ENC-PSL 2021-2022.

License: MIT License

Languages: TeX 100.00%

hn2021-boccace's People

Contributors: apogonoe, malamatenia, ponteineptique

hn2021-boccace's Issues

Accidentally deleted Noé's version of the text while merging mine

While trying to merge my .md file of the transcription with Noé's version already on the main branch, in order to find and resolve conflicts, his file was deleted and replaced by mine. At the same time, when opening the pull request, no conflict was reported between the two branches.

My file name was identical to Noé's. A new merge is needed in any case to resolve the conflicts between the two transcriptions.
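When a merge silently replaces a file like this, the earlier version is still reachable in the branch's history and can be restored with git checkout. A self-contained sketch (file names and commit messages here are hypothetical, not the repository's actual ones):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo

# First commit: Noé's version of the transcription.
echo "Noé's transcription" > transcription.md
git add transcription.md
git commit -q -m "Noé's version"

# Second commit overwrites it with an identically named file.
echo "my transcription" > transcription.md
git add transcription.md
git commit -q -m "overwrote it"

# Restore the earlier version from the previous commit into the
# working tree and the index:
git checkout -q HEAD~1 -- transcription.md
cat transcription.md   # → Noé's transcription
```

The same `git checkout <commit> -- <path>` works with any commit reachable on main, so the deleted version can be recovered before redoing the merge.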

How about the HTR-United framework?

I have an idea regarding the framework of the project.

Since this project is (in theory) already part of the ENC and Marco's biannual project funded by the school, it would be interesting (and practical) to try to link our work with the HTR-United initiative (https://htr-united.github.io), which "aims to pool HTR/OCR transcriptions of texts of all periods and styles, mainly in French but without restriction. It was born from the simple need, for projects, to have potential ground truths to train models quickly on smaller corpora" (translated from the French). More information can be found in their recent article: Alix Chagué, Thibault Clérice, Laurent Romary. "HTR-United : Mutualisons la vérité de terrain !". 2021. https://hal.archives-ouvertes.fr/hal-03398740/document

They have already produced detailed guidelines and workflows that ensure quality control of the data sets and ground truths. This facilitates the interoperability of the data, which can be shared, verified and improved in the long term, guaranteeing its sustainability.

This entails several things for our project:
a) transcriptions aligned with images, in a standard format such as PAGE XML or ALTO XML;
b) a repository structure in which each folder contains two separate subfolders, holding respectively the training corpus and the ground truth, together with the sources, i.e. the PDF images of the incunabula (a IIIF reference alone is acceptable, but it is easier if the images are uploaded directly to the folders);
c) a README.md file describing the repository, along with any important information about our procedure;
d) a YAML file named (strictly) htr-united.yml containing all the metadata about the ground truths produced in the repository (see https://htr-united.github.io/document-your-data.html);
e) a CITATION.cff file to cite the repository.
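For the citation file, a minimal sketch of what a CITATION.cff could look like; the title, author and date below are placeholders to be filled in by the team (see https://citation-file-format.github.io/ for the full schema):

```yaml
# CITATION.cff — placeholder values, to be replaced by the team's own.
cff-version: 1.2.0
message: "If you use this repository, please cite it as below."
title: "hn2021-boccace: HTR transcriptions of Boccaccio incunabula"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "0.1.0"
date-released: "2021-12-12"
```

GitHub recognises a CITATION.cff at the repository root and shows a "Cite this repository" button on the main page.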

Let me know what you think.

Restructuring the repository to better document the individual work and collaborate.

After a briefing (12/12/2021) with Professor Clérice, we decided to reorganize the repository in a way that would allow two things:

  • Better documentation of the entries and of each step of the work, so that the procedure is clear for anyone who looks at the repository and wishes to be guided through it. This will be achieved by creating an issue for every upload/step/enhancement/documentation procedure, and for every problem encountered, before the corresponding push/pull request.
  • Better structuring of the folders, to facilitate external verification of the models, the training corpus and its initial transcription, and the verification corpus (ground truth). This will also allow easier collaboration later on, if necessary, for the Boccace project.

Implementation of chocomufin software for GitHub Actions

Following the practices of HTR-United so far, I found it interesting to implement the chocomufin software, developed by Professors Clérice and Pinche, in order to ensure the quality of our transcription. All relevant details can be found at https://github.com/PonteIneptique/choco-mufin, but I will briefly explain the commands.

Note: the command did not run as choco-mufin but as chocomufin, without a hyphen, so I follow that spelling.

1. The software can be installed from the terminal, inside the cloned repository, with pip install chocomufin.
2. Creation of a special characters table. Two cases are possible:

  • No special characters table (table.csv) exists in the repository yet, so we create it with chocomufin generate table.csv nameofyourfolder/**/*.xml. This pattern targets a folder containing subfolders (/**/) that hold any .xml file (*.xml). The table is created and printed in the terminal.
  • Someone has already created a table.csv, and all we need is to enhance it (as owners) or propose an enhancement (as external collaborators) with the special characters contained in other .xml documents. In this case the cloned repository already contains a table.csv, so we proceed as follows:
  1. After creating a branch (since the document will be pushed to the repository), the command chocomufin control table.csv nameoffolder/**/*.xml reads the XML files and reports the characters missing from the existing table.
  2. Then, to add the missing characters and convert the table, we type chocomufin convert table.csv nameoffolder/**/*.xml.
  3. Lastly, we push the table to the repository so that it holds the union of the characters of both documents (for the moment).
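The `**` in these commands is a recursive glob: depending on your shell's settings it is expanded either by the shell or by the tool itself, and it matches files at any depth below the folder. A small Python sketch of the matching behaviour, using a throwaway folder tree (folder and file names are illustrative):

```python
import glob
import os
import tempfile

# Build a small tree mimicking the repository layout:
# data/incunable-1/page1.xml and data/incunable-2/page2.xml
root = tempfile.mkdtemp()
for sub, name in [("incunable-1", "page1.xml"), ("incunable-2", "page2.xml")]:
    os.makedirs(os.path.join(root, "data", sub))
    with open(os.path.join(root, "data", sub, name), "w") as f:
        f.write("<alto/>")

# "data/**/*.xml" with recursive=True matches .xml files at any depth,
# which is what chocomufin generate/control/convert rely on.
matches = sorted(glob.glob(os.path.join(root, "data", "**", "*.xml"),
                           recursive=True))
print([os.path.relpath(m, root) for m in matches])
```

Both XML files are found, regardless of which subfolder they sit in.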

The workflow: on each push and pull request, the software checks whether the new .xml documents comply with the character table. If we find a new character, we can add it by following the same process.
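A sketch of what the corresponding GitHub Actions workflow could look like; the folder name data/ and the action versions are assumptions, and the authoritative example is the HTR-United one (https://github.com/HTR-United/cremma-medieval/blob/main/.github/workflows/chocomufin.yml):

```yaml
# .github/workflows/chocomufin.yml — hypothetical sketch, to be adapted.
name: Chocomufin
on: [push, pull_request]
jobs:
  control:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - run: pip install chocomufin
      # Fail the build if any XML file uses a character absent from table.csv
      - run: chocomufin control table.csv data/**/*.xml
```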

🎉

Well, it still does not run smoothly for the moment but I am on it 👍

Computer issue

My computer is dead and I can't access my virtual machine anymore... Could you please push the last .txt file to the Bnf folder? Thanks!

LaTeX document

Matenia, would it be possible to push the LaTeX document to the first page of the repository, so that it is easier to find and modify? Thanks!

Transcription choices/guidelines and normalisation process

Following issue #22, our transcription norms should conform to the HTR-United framework.

This way we will eventually be able to install the ChocoMufin quality control workflow (https://github.com/HTR-United/cremma-medieval/blob/main/.github/workflows/chocomufin.yml) to ensure that special characters are in accordance with the norms.

LaTeX document

I faced a problem while trying to insert a figure in the LaTeX document. My figure was moved automatically to another page, due to its dimensions I guess, and I cannot position it properly... If you have any idea ^^
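One common fix is to constrain the float's placement options and scale the image down so LaTeX can keep it near its source position. A minimal sketch (the file name and caption are placeholders):

```latex
\usepackage{graphicx} % \includegraphics
\usepackage{float}    % provides the [H] "exactly here" specifier

% ...

\begin{figure}[htbp]  % allow: here, top of page, bottom, float page
  \centering
  % scaling the image down often lets LaTeX keep it on the same page
  \includegraphics[width=0.8\textwidth]{my-figure.png}
  \caption{A placeholder caption.}
\end{figure}
```

If the figure still drifts, \begin{figure}[H] forces it in place, at the cost of possibly leaving a gap at the bottom of the previous page.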

Can't push my documents on the repository

A problem appeared: as I was trying to push my first folder, an error message, "The requested URL returned error 403", prevented me from doing so. Since 403 is a permissions error, this is probably an authentication or access-rights problem. It makes me temporarily unable to share my work on the repository.

XML ALTO first page error

During verification of the XML files for Inc 59 against the TXT file, both produced by eScriptorium, I realized that the initial "G" of the first page (and only this initial) was never included in the XML file, no matter how many times I re-exported the documents.

We should stay vigilant about these minor (or not so minor) errors and manually check that everything is there, in order to ensure the quality of the ground truth.

Eventually, we should fix the documents that lack information directly in an XML editor to resolve the issue.

Adding the file for the BnF Rés. J-845 document (Old French)

Adding the first subfolder presenting the models of the project.

You can find the following elements in this folder:

  • A .jpg showing the accuracy of each model made on eScriptorium.
  • A segmentation model.
  • A first transcription model, accurate to about 94%, trained on the entire initial sample of twenty pages.
  • A second transcription model, more accurate, trained on only 16 pages of the initial sample.

Be careful when you rename files

In the ALTO XML there are links to the image file names. If you rename the image files, they are no longer linked (you need to change the XML as well).
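A minimal Python sketch of keeping the XML in sync when an image is renamed, assuming the standard ALTO `sourceImageInformation/fileName` element and the v4 namespace (adjust to the namespace your export actually uses):

```python
import xml.etree.ElementTree as ET

# Minimal ALTO fragment; real eScriptorium exports are larger but
# carry the same Description/sourceImageInformation/fileName element.
ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Description>
    <sourceImageInformation>
      <fileName>old-name.jpg</fileName>
    </sourceImageInformation>
  </Description>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

def rename_in_alto(xml_text: str, old: str, new: str) -> str:
    """Replace the linked image file name so the ALTO still points
    at the renamed image."""
    root = ET.fromstring(xml_text)
    for el in root.iterfind(".//alto:fileName", NS):
        if el.text == old:
            el.text = new
    return ET.tostring(root, encoding="unicode")

updated = rename_in_alto(ALTO, "old-name.jpg", "new-name.jpg")
print("new-name.jpg" in updated)  # True
```

The same function can be run over every .xml file in a folder after a batch rename, instead of editing each file by hand.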
