Abbreviation expansion with regular expressions.

Expands early modern Latin and Spanish abbreviations based on their word structure. It is a complementary automatic correction that prepares the text for a subsequent list-based abbreviation expansion and for manual scholarly correction and editing.

For instance, most words ending in ẽdo should be expanded to endo:

pudiẽdo ➡ pudiendo

Context: Early Modern Spanish in The School of Salamanca. A Digital Collection of Sources.

For more details about our digital edition see:

Some preliminary challenges https://github.com/CindyRicoCarmona/Name_Entity_Annotation#preliminary-challenges
Our text workflow https://blog.salamanca.school/en/2022/04/27/the-school-of-salamanca-text-workflow-from-the-early-modern-print-to-tei-all/
Our edition guidelines 3.2.4. Abbreviations and Printing Errors https://www.salamanca.school/en/guidelines.html#abbreviationsprinterrors

Sample Works:

Early Modern Spanish: León Pinelo, Confirmaciones Reales de Encomiendas (2021 [1630]), in: The School of Salamanca. A Digital Collection of Sources https://id.salamanca.school/texts/W0061
Early Modern Latin: Díaz de Luco, Practica criminalis canonica (2021 [1554]), in: The School of Salamanca. A Digital Collection of Sources https://id.salamanca.school/texts/W0041

Requirements

Input: XML file in TEI-tite format with no special character annotation (<g> elements). It can be adapted to TEI-All texts in similar conditions.
Missing or double white spaces in the input text should be reviewed and silently resolved in order to avoid false positives.

XSL:Style-sheet Details

Every template handles one specific word-structure case and has its own mode, so many searches can run in the same xsl:style-sheet:
For Spanish, every xsl:template includes not(ancestor::*[@xml:lang = ('la','grc','gr','he','fr','pt','it')]) to exclude text marked as being in another language.
For Latin, the corresponding predicate is not(ancestor::*[@xml:lang = ('es','grc','gr','he','fr','pt','it')])
Output: Copy of input text plus abbreviations tagged as:

<abbr rend="choice" resp="#auto"><abbr rend="abbr">[abbreviation]</abbr><abbr rend="expan" resp="#CR #auto">[expansion]</abbr></abbr>
Tilde and macron characters are taken into account. This means that every case covers several possible occurrences of combining and precomposed characters, e.g. ẽ|ẽ|ē
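The difference between precomposed and combining characters can be sketched as follows. This is a Python illustration rather than part of the style-sheets, and the variable names are hypothetical. It assumes the three variants are U+1EBD, e followed by U+0303, and U+0113.

```python
import re
import unicodedata

precomposed = "\u1ebd"   # e with tilde as a single code point
decomposed = "e\u0303"   # e followed by U+0303 COMBINING TILDE
macron = "\u0113"        # e with macron as a single code point

# The first two render identically but are different strings.
same_string = precomposed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Listing every variant explicitly in the pattern matches all three spellings.
E_VARIANTS = re.compile("(?:\u1ebd|e\u0303|\u0113)do")
words = ["pudi" + v + "do" for v in (precomposed, decomposed, macron)]
all_match = all(E_VARIANTS.search(w) for w in words)

print(same_string, same_after_nfc, all_match)
```

Listing the variants in the pattern leaves the source text untouched, which may be preferable to normalizing the transcription before matching.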

Case and Mode Example endo - `pudiẽdo` ➡ `pudiendo`

<xsl:variable name="endo">
    <xsl:apply-templates select="/" mode="endo"/>
</xsl:variable>
    
<!-- Copy of the original text - identity transforms -->
<xsl:template match="@*|node()" mode="endo">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()" mode="endo"/>
    </xsl:copy>
</xsl:template>

<!-- xsl:template with the regular expression of the specific case and mode. -->
<xsl:template match="text()[not(ancestor::tei:abbr or ancestor::*[@xml:lang = ('la','grc','gr','he','fr','pt','it')])]" mode="endo">
    <xsl:analyze-string select="." regex="{'(\s)([a-zA-Zñſç]+)(ẽ|ẽ|ē)(do)([ .,;\(\)])'}">
        <xsl:matching-substring>
            <xsl:value-of select="regex-group(1)"/>
            <xsl:element name="abbr">
                <xsl:attribute name="rend" select="'choice'"/>
                <xsl:attribute name="resp" select="'#auto'"/>
                <xsl:element name="abbr">
                    <xsl:attribute name="rend" select="'abbr'"/>
                    <xsl:value-of select="concat(regex-group(2),regex-group(3),regex-group(4))"/>
                </xsl:element>
                <xsl:element name="abbr">
                    <xsl:attribute name="rend" select="'expan'"/>
                    <xsl:attribute name="resp" select="'#CR #auto'"/>
                    <xsl:value-of select="concat(regex-group(2),'en',regex-group(4))"/>
                </xsl:element>
            </xsl:element>
            <xsl:value-of select="regex-group(5)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>
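For readers who want to try the pattern outside Saxon, the matching logic of the template can be approximated in Python. This is only a sketch: the function name tag_endo is hypothetical, the letter class is written as a-zA-Z, and the language and ancestor checks of the XSLT match pattern are not reproduced.

```python
import re

# Same five groups as in the template:
# leading boundary, stem, nasalised e, "do", trailing boundary.
ENDO = re.compile(r"(\s)([a-zA-Zñſç]+)(\u1ebd|e\u0303|\u0113)(do)([ .,;()])")


def tag_endo(text: str) -> str:
    """Wrap every match in the nested abbr markup described above."""
    def repl(m: re.Match) -> str:
        abbr = m.group(2) + m.group(3) + m.group(4)
        expan = m.group(2) + "en" + m.group(4)
        return (
            f'{m.group(1)}<abbr rend="choice" resp="#auto">'
            f'<abbr rend="abbr">{abbr}</abbr>'
            f'<abbr rend="expan" resp="#CR #auto">{expan}</abbr>'
            f'</abbr>{m.group(5)}'
        )
    return ENDO.sub(repl, text)


print(tag_endo(" pudi\u1ebddo "))
```

The boundary groups 1 and 5 are written back unchanged, so the surrounding text keeps its spacing and punctuation.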

The output text can then be used for a TEI-tite to TEI-All transformation that automatically converts the abbreviations into:

<choice resp="#auto"><abbr>[abbreviation]</abbr><expan resp="#CR #auto">[expansion]</expan></choice>
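The renaming involved in that conversion can be sketched in Python with the standard library. This is an illustration under simplifying assumptions, not the project's TEI-tite to TEI-All transformation: the function name to_choice is hypothetical and the TEI namespace is ignored.

```python
import xml.etree.ElementTree as ET


def to_choice(wrapper: ET.Element) -> None:
    """Rewrite one <abbr rend="choice"> wrapper in place as <choice>."""
    wrapper.tag = "choice"
    del wrapper.attrib["rend"]
    for child in wrapper:
        rend = child.attrib.pop("rend")
        child.tag = "expan" if rend == "expan" else "abbr"


root = ET.fromstring(
    '<p>a <abbr rend="choice" resp="#auto">'
    '<abbr rend="abbr">pudi\u1ebddo</abbr>'
    '<abbr rend="expan" resp="#CR #auto">pudiendo</abbr>'
    '</abbr> b</p>'
)

# Collect the elements first so that renaming does not disturb the iteration.
for el in list(root.iter("abbr")):
    if el.get("rend") == "choice":
        to_choice(el)

result = ET.tostring(root, encoding="unicode")
print(result)
```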

How to use it

With any XML editor that supports Saxon.

Or ...

With Ant and a small pipeline build.xml

See Spanish_W0061\build.xml for Spanish and Latin_W0041\build.xml for Latin.

They show manual and automatic steps to edit the text. For instance:

<target name="patch-000"> manual step. File W0061_001.xml with a basic structural annotation.
<target name="xslt-000"> automatic step. Input file W0061_001.xml is transformed by style-sheet W0061_001.xsl, producing the output file W0061_002.xml annotated with abbreviations and expansions.
<target name="finalize" depends="xslt-001"> finalizes the process with the last step W0061_001.xsl

This transformation is performed twice in the pipeline, as sometimes two abbreviations on the same line are not resolved in the first go. Example from file W0061:

<lb/>ay una peticiõ cõ eſta reſpueſta en aquellas Cortes:

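peticiõ and cõ are two different abbreviations, but they share a word boundary, namely the white space between them. The match for peticiõ consumes that white space, so it is no longer available as the leading boundary of cõ. Therefore, only the first abbreviation is annotated in the first execution.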
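The effect of the shared boundary can be reproduced with a small Python sketch. The pattern FINAL_ON and the bracket notation for expansions are simplifications invented for this example. They stand in for the real template and its abbr markup.

```python
import re

# Simplified stand-in for a "final õ" case: boundary, stem, õ, boundary.
FINAL_ON = re.compile(r"(\s)([a-zA-Zñſç]*)(\u00f5)([ .,;])")


def expand_once(text: str) -> str:
    """One pass over the text; expansions are shown in square brackets."""
    return FINAL_ON.sub(
        lambda m: f"{m.group(1)}[{m.group(2)}on]{m.group(4)}", text
    )


line = "ay una petici\u00f5 c\u00f5 e\u017fta"
first = expand_once(line)    # the space after the first match is consumed
second = expand_once(first)  # the second pass picks up the neighbouring word

print(first)
print(second)
```

XPath 2.0 regular expressions have no lookahead, so the trailing boundary cannot be tested without being consumed. That is presumably why the pipeline runs the transformation twice instead.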

Cases

One case and mode for every template. See details in the style-sheets:

Latin_W0041\xsl\W0041_001.xsl
Spanish_W0061\xsl\W0061_001.xsl

Spanish

"endo - pudiẽdo", mode="endo"
"ando - dudãdo", mode="ando"
"ente|tes - gẽte", mode="ente"
"ende - entiẽde", mode="ende"
"on - Purificaciõ", mode="cion"
"ento - mandamiẽto", mode="ento"
"encia|cias - differẽcia", mode="encia"
"ancia - ignorãcia", mode="ancia"
"ẽ and er - entẽder", mode="ener"
"ẽ and ar - encomẽdar", mode="enar"
"ã and ar" - mãdar, mode="anar"
"ẽ - en - puedẽ, quiẽ, deuẽ", mode="final-en"
"ā - an - haziā, siruā, podrā", mode="final-an"
"ũ - un - pregũtar, renũciar", mode="unar"
"ũ - before (b|p) costũbre, cũplido", mode="umbp"
"õ - before (b|p) hōbres, Cōprar", mode="ombp"
"ō and dad|dades, bōdad, cōformidad", mode="ondad"
"õ and dido|dida|didas|didos, cōcedido, cōcedida" mode="ondido"
"ā and dad|dades, trāquilidad, Hermādad" mode="andad"
"ā and ça|ças, ordenāça, templāça", mode="ança"
"ẽ and ça|çan|ças, verguẽça, comiẽçan" mode="ença"
"đ at the beginning + \w, đspues, đllas (no further special characters like "ā|ẽ|õ|ũ")" mode="dewords"
"q̃ - q with tilde inside a word, flaq̃za, riq̃zas (flaqueza, riquezas)" mode="wquew"
"ꝓ pro at the beginning + \w, ꝓhibir, ꝓcesso (prohibir, processo)" mode="prow"

The following cases depend on the text and on the mixture of Latin and Spanish:

"final ũ", algũ - algun mode="final-un"
"q̃" - que, mode="only-que"
"ẽ - en", mode="only-en"
"q̃ - q with tilde at the end of a word, porq̃ - porque" mode="wque"
"đ| (char0111|charf159)- de" mode="only-de"
⁊ (char204a) ➡ y , mode="only-y"

Latin

Final ũ|ū - um, legũ, appellatũ, mode="final-um"
Final ā|ã - am, primā, verā, mode="final-am"
ā + final di|dum|t|ti|tibus|tis|tur, mode="antur"
Beginning pro (chara753), ꝓbari probari, mode="pro1"
Final - us (chara770), legitimꝰ - legitimus, mode="final-us"
õ + c|d|f|s|t ==> on, cōsensu consensu, mode="on-cdfst"
õ + final e|es, petitiōe petitione, mode="ones"
ũ + t|tur, deducũtur deducuntur, mode="untur"
ẽ + da|dam|di|dis|dus|sis|t|te|tia|tiam|tias|tur, legẽdam legendam, mode="entur"
ẽ + b|m|p, exẽplo exemplo, mode="em-pmb"
ĩ ==> in, only white spaces as boundaries.
đ ==> de, only white spaces as boundaries, mode="de"

Names are tagged literally:

Clemẽ - Clemen + ., mode="Clemen"
Innocẽ - Innocen + ., mode="Innocen"
Alexā - Alexan + ., mode="Alexan"
Alexād - Alexand + ., mode="Alexand"
Ioā - Ioan + . , mode="Ioan"
q + ´ + ; ==> que, leuisq́;, mode="qac"
q3 + ´ (chare8bf0301) ==> que, Exemplum́, mode="q3accent"
q3 (chare8bf), ==> que, mode="q3"
⁊ (char204a) ==> et. , mode="only-et"

What is not covered?

Some words might appear separated by <pb/>, <cb/>, <lb/>, <note> or <milestone/>. These cases are not automatically covered yet, and are only manually expanded.

mã-<pb n="[21]v" facs="W0061-0078"/><lb type="nb"/>dando

Encomiẽ<lb type="nb"/>das

To avoid false positives at the end of <lb/>(s), new lines \n and tabs \t are not included as word boundaries. This also means that words at the end of a line are not annotated, even though they might follow the pattern, e.g.

eſtãdo\n
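A short Python sketch shows the consequence of leaving \n out of the boundary class. The pattern ANDO is an assumed approximation of the "ando" case, and the sample strings are invented.

```python
import re

# Assumed approximation of the "ando" case; the closing class has no \n or \t.
ANDO = re.compile(r"(\s)([a-zA-Zñſç]+)(\u00e3|a\u0303|\u0101)(do)([ .,;()])")

mid_line = "y e\u017ft\u00e3do en la Corte"
end_of_line = "y e\u017ft\u00e3do\n"

found_mid = ANDO.search(mid_line) is not None
found_end = ANDO.search(end_of_line) is not None

print(found_mid, found_end)
```

The same word is matched in the middle of a line but skipped when a line break follows it directly.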

Not found, maybe for future works?

Spanish:

Words with ĩ and ar, er, ir
Words with ã and er, ir
Words with õ + ir
Words with ũ + er
Words with ũ + ir
A few cases of "ꝓ" inside a word, e.g. "aꝓuechar" (aprouechar):

(\s)([a-zA-Zñſç]+)(ꝓ)([a-zA-Zñſç]+)([ \.,;])
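How such a candidate pattern would behave can be tried out in Python before adding it to a style-sheet. This is a sketch only: the name expand_pro is hypothetical and the letter class is written as a-zA-Z.

```python
import re

# Candidate pattern: ꝓ (U+A753) with at least one letter on each side.
PRO_INSIDE = re.compile(r"(\s)([a-zA-Zñſç]+)(\ua753)([a-zA-Zñſç]+)([ .,;])")


def expand_pro(text: str) -> str:
    """Replace word-internal ꝓ by "pro", keeping both boundaries."""
    return PRO_INSIDE.sub(
        lambda m: m.group(1) + m.group(2) + "pro" + m.group(4) + m.group(5),
        text,
    )


inside = expand_pro(" a\ua753uechar ")
initial = expand_pro(" \ua753hibir ")

print(inside)
print(initial)
```

Because the stem group requires at least one letter before ꝓ, word-initial cases such as ꝓhibir are left untouched by this pattern.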

How to add new cases

This information can be found in the xsl files. Here the Spanish example:

Test the new pattern in the input text, in this case W0061_001.xml. The words found should not yield exceptions or ambiguities, nor conflict with other cases in this program.
Write the pattern with examples in the list "Cases" above and assign a new mode. It must be different from all modes used before.
Between the last template and "Logging", write a new variable. Its name is usually the same as the new mode. In <xsl:apply-templates/>, select the variable of the previous case and set the new mode:
```
  <xsl:variable name="ExampleNew">
     <xsl:apply-templates select="$lastTemplateVariableName" mode="ExampleNew"/>
  </xsl:variable>
```

Write a template with the identity transform using the new mode:

 <xsl:template match="@*|node()" mode="ExampleNew">
     <xsl:copy>
         <xsl:apply-templates select="@*|node()" mode="ExampleNew"/>
     </xsl:copy>
 </xsl:template>

Write a template that matches only Spanish text that is not yet tagged as an abbreviation or expansion, and add the new mode:

<xsl:template match="text()[not(ancestor::tei:abbr or ancestor::*[@xml:lang = ('la','grc','gr','he','fr','pt','it')])]" mode="ExampleNew">

Regex groups must be placed in parentheses () and distributed among the new elements. See the existing templates for examples.

For logging purposes and to keep track of the newly added expansions, go to the following locations (variables $out and $Expansions) at the very end of the code, in the "Logging" section, and replace the variable name there with the new one:

 <xsl:variable name="out">
     <xsl:copy-of select="$ExampleNew"/>
 </xsl:variable>

Check for unwanted characters in expansions:

 <xsl:variable name="Abbr" as="node()*" select="$ExampleNew//tei:abbr[@rend eq 'abbr' and following-sibling::node()/self::tei:abbr[@rend eq 'expan' and matches(.,'[̃ ãāēẽõōũūꝓđ]+')]]"/>
 
 <xsl:variable name="WrongExpansions" as="node()*" select="$ExampleNew//tei:abbr[@rend eq 'choice']//tei:abbr[@rend eq 'expan' and matches(.,'[̃ ãāēẽõōũūꝓđ]+')]"/>
 
 Abbreviations with no special character (these should be checked):
 $prow//tei:abbr[@rend eq 'abbr' and not(matches(.,'[ãẽõũꝓq̃]+'))]
 
 Update the variable of the last case:
 <xsl:variable name="Expansions" as="xs:integer" select="count($ExampleNew//tei:abbr[@rend eq 'choice']//tei:abbr[@rend eq 'expan'])"/>
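The idea behind these logging checks can be sketched in Python as well. This is not the style-sheet's logging code: the function name wrong_expansions, the sample fragment, and the exact character class are assumptions, and the TEI namespace is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Characters that should no longer occur once a word is fully expanded.
LEFTOVER = re.compile(
    "[\u0303\u00e3\u0101\u0113\u1ebd\u00f5\u014d\u0169\u016b\ua753\u0111]"
)


def wrong_expansions(root: ET.Element) -> list:
    """Return the text of every expansion that still has a special character."""
    return [
        el.text
        for el in root.iter("abbr")
        if el.get("rend") == "expan" and LEFTOVER.search(el.text or "")
    ]


sample = ET.fromstring(
    "<p>"
    '<abbr rend="choice"><abbr rend="abbr">pudi\u1ebddo</abbr>'
    '<abbr rend="expan">pudiendo</abbr></abbr> '
    '<abbr rend="choice"><abbr rend="abbr">c\u00f5ced\u1ebd</abbr>'
    '<abbr rend="expan">c\u00f5ceden</abbr></abbr>'
    "</p>"
)

suspicious = wrong_expansions(sample)
total = sum(1 for el in sample.iter("abbr") if el.get("rend") == "expan")

print(total, suspicious)
```

A word with two abbreviation marks, of which only one was resolved, is the kind of case this check is meant to surface for manual review.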
