I have an idea regarding the framework of the project.
Since this project is (in theory) already part of the ENC and Marco's biannual project funded by the school, it would be interesting (not to mention practical) to link our work with the HTR-United initiative (https://htr-united.github.io), which, in its own words (translated from the French), "aims to pool HTR/OCR transcriptions of texts from all periods and in all styles, mainly in French but not exclusively. It was born from the simple need, for projects, to have potential ground truths for quickly training models on smaller corpora." More information can be found in their recent article: Alix Chagué, Thibault Clérice, Laurent Romary. HTR-United : Mutualisons la vérité de terrain ! 2021. https://hal.archives-ouvertes.fr/hal-03398740/document
They have already produced detailed guidelines and workflows for controlling the quality of the datasets and the ground truths. This facilitates the interoperability of the data, which can be shared, verified, and improved over the long term, guaranteeing its sustainability.
For our project, this entails several things:
a) Transcripts aligned with images, in a standard format such as PAGE XML or ALTO XML;
b) A repository structure with two separate subfolders in each folder, containing the training corpus and the ground truth respectively, together with their sources, namely the PDF images of the incunabula (a IIIF reference alone is also acceptable, but it is easier if the images are uploaded directly to the folders);
c) A README.md file that describes the repository and records any important information about our procedure;
d) A YAML document named (strictly) htr-united.yml, containing all the metadata about the ground truths produced by the repository (see https://htr-united.github.io/document-your-data.html);
e) A CITATION.cff file so that the repository can be cited.
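To make the list above easier to keep track of, here is a minimal sketch of a script we could run at the root of the repository. It only checks that the three named files exist and that at least one XML transcript is present. The function name `check_repository` and the assumption that transcripts carry an `.xml` extension are mine, not something prescribed by HTR-United, and the script does not validate the contents of any file.

```python
from pathlib import Path

# File names taken from the list above; everything else is illustrative.
REQUIRED_FILES = ("README.md", "htr-united.yml", "CITATION.cff")
# Assumption: transcripts (ALTO or PAGE) are stored with an .xml extension.
TRANSCRIPT_SUFFIX = ".xml"


def check_repository(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    # Each of the three files is expected directly at the repository root.
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    # Transcripts may live in any subfolder, so search recursively.
    if not any(root.rglob(f"*{TRANSCRIPT_SUFFIX}")):
        problems.append("no XML transcripts found")
    return problems


if __name__ == "__main__":
    for problem in check_repository("."):
        print(problem)
```

An empty result only means the expected files are in place; whether htr-united.yml and CITATION.cff are filled in correctly would still need to be checked by hand or with the tools those formats provide.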
Let me know what you think.