usfm-bible / tcdocs

Technical Committee Documents

License: Other

TeX 28.88% Python 69.50% Makefile 0.88% JavaScript 0.74%

tcdocs's Introduction

tcdocs

This is the document repository for the USFM/X Technical Committee.

tcdocs's People

kentspiel klassenjm joelthe1 rmunn irahopkinson freely-given-org kavitharaju chrisvire lordfrishetti1

tcdocs's Issues

Clear, backward-compatible syntax definition for USFM is required.

I recently encountered some strange USFM produced in Paratext that broke my USFM parser. It was something like
\v 1\f + \ft note\f*Bible text
Note the missing space after the verse number. Having seen this, and having verified that it is not flagged by any Paratext marker check, it seems that I now need to adjust my USFM parser to accept this as normal. Is that the desired response?
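If the answer is yes, a parser would have to treat the space after the verse number as optional. The following is a minimal sketch of such a tolerant match, not a statement of what the specification requires. The pattern and the function name `split_verse` are hypothetical, and the sketch assumes a verse number is a run of digits with an optional numeric range.

```python
import re

# Hypothetical pattern: "\v", whitespace, a verse number (digits with an
# optional "-range"), then OPTIONAL whitespace, so "\v 1\f" is accepted.
VERSE_RE = re.compile(r"\\v\s+(\d+(?:-\d+)?)\s*")


def split_verse(text):
    """Return (verse_number, remaining_text), or None if no verse marker starts the text."""
    match = VERSE_RE.match(text)
    if match is None:
        return None
    return match.group(1), text[match.end():]


print(split_verse(r"\v 1\f + \ft note\f*Bible text"))
print(split_verse(r"\v 1 Bible text"))
```

Both inputs yield verse number "1"; only the remaining text differs. Verse numbers with letter suffixes are deliberately left out, because without a mandatory space they cannot be told apart from the first letter of the verse text.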

Zero padding for Strong's Numbers

Michael Johnson suggested this issue in an email:

I just remembered another ambiguity in the current USX/USFM as I was working on a Bible translation with Strong's numbers. The requirement to zero pad numbers is not documented except in the checking code. The real-life example that prompted this is \w Abraham|strong="G11"\w*. That can't pass checks to get into the DBL, but \w Abraham|strong="G0011"\w* can. This is not so great for backwards compatibility, but something I can live with. However, it really should be properly documented. I have real-world texts in at least 5 languages with Strong's number tagging, and multiple translations within some of those languages having such tags, so this isn't just academic.
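Until the padding rule is documented, a converter could normalize the attribute before submission. This is a minimal sketch that assumes a padding width of four digits, inferred only from the G0011 example above; the real rule in the checking code may differ. The names `STRONG_RE` and `pad_strongs` are hypothetical, and attributes holding several comma-separated numbers are not handled.

```python
import re

# Matches a single Strong's number such as strong="G11" or strong="H7225".
STRONG_RE = re.compile(r'strong="([GH])(\d+)"')


def pad_strongs(usfm, width=4):
    """Zero-pad Strong's numbers to `width` digits (width=4 is an assumption)."""
    return STRONG_RE.sub(
        lambda m: 'strong="{}{}"'.format(m.group(1), m.group(2).zfill(width)),
        usfm,
    )


print(pad_strongs(r'\w Abraham|strong="G11"\w*'))
```

Numbers that are already padded pass through unchanged, so the function can be applied repeatedly to the same text.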

Implicit marker closure

Implicit closure is nice. Given we always know what is a starting marker and we also know whether something is embedded, it is possible to implicitly close things. For example, the start of a new paragraph implicitly closes all character styles in the previous paragraph. Starting a new character style closes all open character styles including any currently open embedded character styles.

The difficulty is with parsing. Most parsers are based on some notion of recursive descent. This makes actual implicit closure hard and can turn run sequences into embedded runs. For example:

\f + \fr 1:17 \ft This is a footnote\fr*\f*

This is obviously invalid, since the \ft closes the \fr. But if we simply say that \fr* and \ft* are optional and use a typical recursive descent parser, then this example is usually valid and the \ft section is assumed to be embedded within the \fr. Adding support to invalidate this example takes a lot of work in a grammar. One has to say the end of a run is either the closing marker or the start of what might possibly come next. That 'what might possibly come next' can be a tricky and long list to come up with in each context.
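Outside a grammar, the same check is short when done as a flat scan rather than by recursive descent. The sketch below is only an illustration: the set of structural markers is an assumed, incomplete list, and `check_footnote` is a hypothetical name. Each structural marker implicitly closes the previous run, so a closing marker for a run that is no longer open is reported.

```python
import re

# Structural note markers assumed for this sketch (not a complete list).
NOTE_STRUCTURAL = {"fr", "ft", "fk", "fq", "fqa"}
MARKER_RE = re.compile(r"\\(\+?[a-z]+\d*)(\*?)")


def check_footnote(note):
    """Return a list of error strings for a single footnote."""
    errors = []
    open_run = None  # the structural run currently open, if any
    for match in MARKER_RE.finditer(note):
        name, closing = match.group(1), match.group(2) == "*"
        if name == "f":
            open_run = None  # the note marker itself opens or closes everything
        elif name in NOTE_STRUCTURAL:
            if not closing:
                open_run = name  # implicitly closes the previous run
            elif name == open_run:
                open_run = None
            else:
                errors.append(
                    f"\\{name}* closes a run that is not open (open run: {open_run})"
                )
    return errors


print(check_footnote(r"\f + \fr 1:17 \ft This is a footnote\fr*\f*"))
print(check_footnote(r"\f + \fr 1:17 \ft This is a footnote\f*"))
```

The first call reports one error for the stray \fr*; the second returns an empty list.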

Based on this, it is proposed to tighten the USFM specification to remove more implicit closure than has already been removed. The proposed rules are:

1. All embedded and non-embedded character styles must be explicitly closed.
2. Notes (footnotes, cross references, etc.) have internal structural markers like \fr and \ft. These must not be explicitly closed and runs are separated by other structural markers or the end note marker, which is required. Note \xt is structural within a cross reference, but may also be used as a character style elsewhere, where it must be explicitly closed. Notes have required explicit closure.
3. Paragraphs have no end paragraph marker and are implicitly closed by the start of another paragraph or by a chapter milestone.
4. Table rows are terminated by another table row or by the start of a non-table paragraph marker.
5. Table cells are terminated by the start of another cell or row or by a non-table paragraph marker. This is problematic if we want to support multi-paragraph table cells.

The astute reader will have caught the implication of rule 1. By explicitly closing character styles, the need for + type markers is removed. While it is planned to remove them (or at least treat them redundantly as equivalent to their non-plussed cousins), this change is not planned as part of the first phase of documenting the USFM standard as it stands.

USFM and USX Validators

For the Scripture Burrito WG, it would be very helpful to have USFM and USX Validators.

BIDI and USFM Markup

Do we need a way to identify text direction or other language characteristics in USFM markup?

Make page on Whitespace handling easier to read

I referred someone to our whitespace page and it was hard for them to understand, since we use the term "whitespace" in a limited sense for most of the page, but the tendency is to read it as "Unicode Whitespace".

So, seems we need to say "X Whitespace" when we mean the 4 characters defined at the top of the page. I haven't come up with a good name for X.

This will make the text a little longer, but I think it will be clearer.

Do we want partial USFM in our tests?

I modified the Paratext unit tests to write out the USFM, USX and a metadata file with the name of the tests - see the attached .zip file.

There are 130 tests (and maybe more if I look harder), but a lot of them are partial USFM like this in test46:
\p
\v 1 \va 3\va*\vp A\vp*

These will definitely not validate, but I'm not sure whether we should expect others to handle USFM in this way. Paratext does a lot of parsing of text ranges, so we need it to work this way.
Attachment: ParatextRoundTripTests.zip

Proposal to (formally) allow `\cat` outside footnotes and sidebars

USFM3.0 includes \cat category\cat* for applying special formatting to (extended) footnotes and entire side bars.

I would like to suggest that this is overly restrictive.

In order to differentiate the formatting between, say, the table of contents and some other front-matter tables, PTXprint has (already) extended its application to tables thus:

\tr \cat toc\cat* \tc1 Genesis  \tc2 Gen \tc3 1
\tr \tc1 Exodus \tc2 Exod \tc3 23

We also allow applying \cat to a single paragraph, where there may be some special, distinct purpose for the text, but a full side-bar would be overkill:

\ip A note from the translation team:
\ipi \cat motivation\cat*  We have translated this text not for our own glory or for Earthly riches,
but for the glory of God, taking great care to respect the original text and our own language, 
testing and revising the text many times.  We pray earnestly that our work brings you 
encouragement in your personal daily walk with our saviour, Jesus Christ, and that the Holy 
Spirit uses it to teach you salvation and wisdom. But despite our care, this remains the work 
of (many) human hands; if you find what seems to be a mistake, please do get in contact.

Similarly, it seems to me that there are different contexts for lists, even those enumerating a total, and what is appropriate formatting for the census in Numbers 1 might not be appropriate for contributions to the temple and so on.

I would therefore suggest that the standard also allow \cat ... \cat* to be applied:

To an individual paragraph or section heading (scope: that paragraph/heading)
To a table (scope: the entire table, until a non-table paragraph)
To a list (scope: the entire list, until non-list paragraph)
To all note types

For each of these, I would suggest that the \cat ... \cat* should appear before any content, immediately after the (first) \p, \tr, \lh or \li as applicable.
(I would not suggest permitting this use inside a sidebar, on the basis that sidebars are complicated enough!)

File-level copyright and license notices

USFM / USX do not have a standard place for a copyright / license statement. Several organizations do this in comments but there is no standard way to do this.

Should we add tags for this?

Can paragraph's text come before \v?

While trying to model the USFM specification with formal grammars, I noticed one major issue related to clearly defining hierarchy: the two overlapping structures of paragraphs and book-chapter-verse. For instance, in the paragraph-based structure, \v markers are just character-level markup. So, within a chapter, is it valid for text to occur before the first \v marker? I don't see any such restriction being enforced in the spec (or is there one?). As I understand it, it is correct for paragraphs to have text not bound by a \v marker when inside a sidebar (\esb), peripherals, etc., but not within a chapter. It would be good to have the specification clearly define the rules here.

Confusing documentation of attributes

Comparing the documentation for \fig in 3.0:
https://ubsicap.github.io/usfm/characters/index.html#fig-fig

to the new documentation for \fig:
https://docs.usfm.bible/usfm-usx-docs/latest/fig/fig.html

The new documentation has confusing @attribute stuff. For someone new to USFM, what are they supposed to do?

There is no explanation about required vs. optional attributes. In the old documentation, there was a statement:

Required attributes are indicated in the list below with a red asterisk *.

This is missing from the new documentation. There are asterisks after the attribute names, but no explanation that it means they are required. I actually missed the fact that they were there until someone pointed them out.

Scripture Reference Settings needed for roundtrip between USFM and USX

What is the relationship between the Scripture Reference Settings and the USFM standards? I assume these are considered business rules. Nevertheless, to convert from USFM to valid USX, the USFM verse number must be decoded in order to create a valid sid and eid. In other words, round tripping is not possible without the Scripture Reference Settings information. Should this be stated in the specification? Should USFM provide a place for this information in metadata?

Regex in diagrams and Glossary

We have agreed that we want to minimize the amount of regex in diagrams by using Terms as defined in the Glossary of Terms

I also believe we want to simplify the regex wherever possible by using character classes and short escapes rather than Unicode code point escapes for characters. So \n instead of \u000A, or

[\t\n\v\f\r\p{Zs}]

instead of

[\u0009-\u000D\u0020\u00A0\u1680\u2000-\u200B\u2028\u2029\u202F\u205F\u3000]

Note: \u200B is a Zero Width Space and is a formatting character not a white space character so I exclude it from the simplified version.
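The difference between the two classes can be checked mechanically. Python's standard `re` module does not support \p{Zs}, so this sketch builds both sets from `unicodedata` instead; it is a quick comparison under that substitution, not a regex test. The variable names are illustrative only.

```python
import sys
import unicodedata

# Code points matched by the explicit class quoted above.
explicit = set(range(0x0009, 0x000E)) | {0x0020, 0x00A0, 0x1680}
explicit |= set(range(0x2000, 0x200C)) | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}

# Code points matched by [\t\n\v\f\r\p{Zs}]: five controls plus category Zs.
simplified = {0x09, 0x0A, 0x0B, 0x0C, 0x0D}
simplified |= {
    cp
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
}

only_explicit = sorted(explicit - simplified)
only_simplified = sorted(simplified - explicit)
print(["U+%04X" % cp for cp in only_explicit])
print(["U+%04X" % cp for cp in only_simplified])
```

Besides U+200B, the simplified class also drops U+2028 (line separator) and U+2029 (paragraph separator), because their general categories are Zl and Zp rather than Zs. If those two should still count, the simplified class would need \p{Zl} and \p{Zp} (or \p{Z}) added.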

@mhosken says there is a reason we cannot make chapter and verse numbers a Term. However, I would still like to find a way to simplify the definition of the constraints on C:V numbers so we do not need regex in the diagrams

Need word-level identifiers

For our data, we frequently need to point to ranges of words, e.g. the first three words of a given verse. We are currently using an extension of OSIS identifiers that supports this, but want to change to USFM. Can we introduce a syntax that identifies words and ranges of words?

User extensions

https://docs.usfm.bible/usfm-usx-docs/latest/extensions.html

Any 'private-use' marker/style extensions should begin with z. Example: \zMyPara or . Markers/styles in the z namespace are not considered part of the USFM/USX standard. An application or processor may provided support for z extensions, but are not expected to handle the markup or its associated text in any particular way, and are also free to ignore this markup when it is encountered in a text.

This does not tell me how to define a new extension. I can't find the place in the specification that tells me that.

In particular, is there a way to define a new element and say where it can be used?

USFM has an escape problem.

If you wanted to put a \ in Bible text, could you? No. Not even with \.
If you wanted to put a ~ in Bible text, could you? Note: this is in the orthography for one Papua New Guinean language.
If you wanted to put a // in Bible text, could you? This is needed in front matter often when an Internet URL is included.
If you wanted to put a | in Bible text, could you? Note: this is used as punctuation in at least 3 Bible translations.

Sure, there are work-arounds. For the PNG language, I replaced tilde with math operator tilde. Wrong symbol, but it looks right. Bad use of Unicode.
For //, I programmed Haiola to ignore // as a line break when preceded by P: or S: (case insensitive).
For |, I am narrowing the scope of where I recognize this as markup. Does Paratext do it the same way? I doubt it.

Michael Johnson on + Notation

Michael Johnson said this to me in an email response:

The worst problem with USFM is the + notation, and especially the + notation in footnotes. It is incomprehensible to the ordinary working linguist. Although technically, I can unambiguously convert USFM to USFX and back in a similar manner that Paratext converts from USFM to USX and back, far more often than not I have to edit the text to make it comply with the broken standard. (USFX is an alternate way to represent USFM in XML that I invented before USX was invented, and which I still use as an internal hub standard, because it sheds some of the defects of USFM that USX retains.)

Here are the roots of the problem:

At one point in time, no nesting of character attributes was allowed. This was inadequate, but it made it easy to tell when a character attribute ended: either with the next character attribute (which may have been a "revert to normal" tag), or when the paragraph style ended, or when the footnote/cross reference/end note ended.

When conversion to XML was anticipated, we switched to explicit end tags in the main body of the text using the * on the end of the opening tag, like \bk ...\bk* instead of a "return to normal" character tag. Except we didn't do that in footnotes, so the "end a character style with the next character style" rule still applied there.

When it became apparent that nesting of character tags was needed for real (and it is to avoid a huge set of combined tags to do the same thing in a less understandable way), then instead of doing like XML does and explicitly ending all character styles and requiring proper nesting, we introduced the + notation that I hate. I hate it because although it solved a backward compatibility issue (but wasn't the only possible solution), it has wasted many man-hours of my time tediously correcting problems caused by ordinary working linguists not understanding that any non-footnote character style in a footnote needs the "+" even though it doesn't seem to be nested from their point of view, like \f + \fr 1:1 \ft xxx \tl yyy\tl* zzz\f*. The intent is clear, but Paratext chokes on it with a schema check spewing incomprehensible barf because there was no + between the \ and the tl.

Haiola has an option to relax the "+" rules and assume all non-footnote-specific character styles are nested if begun before another is ended, then generate the + syntax on export.

The requirement of the "+" notation is accepted by the Paratext team as a good solution for backward compatibility. I think it is not, because (1) its rules are inconsistent because of the inconsistent handling of footnote vs. main text character styles with respect to end tags, (2) it is not handled automatically by Paratext and most other USFM software, and (3) very few people really understand it.

USFM and USX suffer another technical compatibility issue which seems to be less of a practical problem. Strict nesting of markers is required by XML syntax, but not USFM tag syntax, although Paratext seems to handle enforcing that OK. In other words, \qt \wj XXX\qt* YYY\wj* would make sense to a human, but it should be \wj \qt XXX\qt* YYY\wj* in an ideal world, or \wj +qt XXX+qt* YYY\wj* in the fantasy land where OWLs understand "+" notation.

Now pointing out the problem without suggesting a solution would be obnoxious, right? So I won't do that.

I suggest making the "+" notation optional. Full stop.

Instead of making the linguists (or more likely, publishing personnel, like me) do the disambiguation work, implement some more logic based on class of tag to determine where the XML end tags belong. USX, being XML doesn't suffer from this ambiguity. It is just getting from USFM to XML that is the trick. Here is some logic that could solve backward compatibility issues:

If there are "+" marks in the USFM, they can be used as they are now, or ignored.

Best practice is to always explicitly end character styles in USFM as is done in XML, but this is not always required because of the following rules.

Character styles each have their own class, except for footnote/cross reference/end note character styles, which are all in the same class.

If a character style is begun before a character style of a different class ends, it is assumed to be nested.

If there is no character style explicit end marker before the end of a paragraph style or verse marker, the character style is assumed to end there but a warning is generated.

If a character style of the same class is started before another of the same class ends, the first one is assumed to end before the next one started.
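The rules above can be sketched as a small resolver that turns a stream of marker events into explicit, properly nested open and close events. This is one reading of the proposal, not an existing implementation: the event format, the names `style_class` and `resolve`, and the list of note styles are all assumptions made for illustration.

```python
NOTE_STYLES = {"fr", "ft", "fk", "fq", "fqa", "xo", "xt"}  # assumed, not exhaustive


def style_class(name):
    """Note-internal styles share one class; every other style is its own class."""
    return "note" if name in NOTE_STYLES else name


def resolve(events):
    """Turn ('open' | 'close' | 'end', name) events into explicit open/close events."""
    stack, out, warnings = [], [], []

    def close_down_to(index):
        while len(stack) > index:
            out.append(("close", stack.pop()))

    for kind, name in events:
        name = name.lstrip("+")  # "+" is accepted but carries no meaning
        if kind == "open":
            for i, opened in enumerate(stack):
                if style_class(opened) == style_class(name):
                    close_down_to(i)  # same class: the earlier run ends here
                    break
            stack.append(name)  # different class: nested inside what is open
            out.append(("open", name))
        elif kind == "close":
            if name in stack:
                close_down_to(stack.index(name))  # also closes inner runs
        else:  # "end": a paragraph, verse, or note boundary closes everything
            warnings.extend(
                f"\\{s} was not explicitly closed"
                for s in stack
                if style_class(s) != "note"
            )
            close_down_to(0)
    return out, warnings


# \f + \fr 1:1 \ft xxx \tl yyy\tl* zzz\f*  (no "+" before tl)
events = [("open", "fr"), ("open", "ft"), ("open", "tl"), ("close", "tl"), ("end", "")]
print(resolve(events))
```

For the footnote from the email, \fr is closed when \ft starts, \tl is nested inside \ft without needing a "+", and no warning is produced.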

ChapterContent missing description of Figure

On https://docs.usfm.bible/usfm-usx-docs/latest/doc/index.html#doc-book-chapter-content, there is a picture of the grammar which includes Figure:

But the description of the sections below that has all the elements in the picture except for Figure.

Book Numbers - and NT numbering

MARBLE and many other internal numbering systems use 40 to identify Matthew, e.g.

<MARBLELinks xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <MARBLELink Id="04000100100000">
    <ThematicLinks />
    <LexicalLinks />
    <TextualLinks />
    <ImageLinks>
      <ImageLink>tb042403039</ImageLink>
      <ImageLink>Genealogy_of_Jacob</ImageLink>
      <ImageLink>Genealogy-Jesus</ImageLink>
      <ImageLink>JesseTree</ImageLink>
      <ImageLink>AT Map00305</ImageLink>
    </ImageLinks>
    <MapLinks />
    <ArticleLinks />
    <SectionLinks>
      <SectionLink>GNSBUK001639</SectionLink>
    </SectionLinks>
  </MARBLELink>
  <MARBLELink Id="04000100100002">
    <ThematicLinks />
    <LexicalLinks>
      <LexicalLink>SDBG:βίβλος:000002:Communication</LexicalLink>
    </LexicalLinks>
    <TextualLinks />

But the USFM specification says the book code for Matthew is 041. What is the intended use of that number? Is anything broken here?

I am creating a lot of datasets that need to know whether to treat Matthew as 40 or 41 internally, so I'd like to know if this is an issue ...

Well-formed versus Valid USFM

Do we need a concept of well-formed USFM, analogous to well-formed XML? This would be USFM that does not validate against the schema but can be parsed in order to correct errors.

Marker \ipc for centered paragraphs in introductory material

I usually mark additional material (including books like FRT and INT) with introductory codes like \ip, \im and \ib. In this introductory material, it's common to have centered paragraphs, e.g. for the publisher's name on the title page, or various information on the verso page. It would help consistency if a marker \ipc were available to use in addition to \ip, \im etc.

Liturgical marker (\lit) also needs to be supported in USX in introductions

One of the projects of the Palestinian Bible Society is the Greek Orthodox Version in the Arabic language. A Greek Orthodox Monastery is the owner of this traditional text.

In the introductory texts on each gospel, quotations from the apostolic fathers can be found, next to liturgical remarks like "Needs to be read after sunset". In my view, we could use the \lit marker here. It is supported in the regular USFM style sheet, but it isn't covered in the USX.

New Features - add a futures tag

Question: What kinds of changes do we consider in scope?

I'm inclined to say this: if making the grammar more orthogonal and less
persnickety allows markup in new settings where it makes sense, I am OK
with that. It's basically improving the grammar and reaping benefits. But
I think we should be very reticent to add new functionality beyond that.

We said that was not in scope for our first release. And I think we should
be careful to avoid scope creep.

But we also need to be prepared to add new features once we have released
our first formal specification.

I have added the label "future" for marking features we will consider after
the next release.

Best practice for marking internal links?

Paratext and the DBL currently demand that all human-readable Scripture references conform to a computer-like syntax. While I understand that this is a nice attempt at maintaining backward compatibility with older texts, it also causes all kinds of errors with historical texts, demanding that history be changed or that the text be rejected. The new extension of an h-ref attribute to \xt markup provides a welcome reprieve from that problem, at a cost of breaking all old USFM readers, including mine. I can fix that, but I found another problem in that \xt ...\xt* seems to have taken on part of what \jmp ...\jmp* does, and \xt no longer seems to have anything to do with cross-reference text. Right now, this is a confusing mess that I'm not sure what to do with...

Proposal: a subparagraph marker

I propose that a tag be added to indicate subparagraph division.

In some formats, like web pages, good style requires very frequent paragraph breaks in order to avoid a wall of text. (This is a very common web site design guideline.) On the other hand, in a print format, an excessive number of paragraph breaks would make the printed version of a book very long.

If there were a subparagraph marker, the publisher could decide how to interpret it—and might choose to interpret it differently for different publication media. For instance, on a computer screen they could format them as full paragraph breaks since space is not an issue. Or in a printed text they could ignore them altogether to save space. (Or any other way: the Nestle-Aland typesets subparagraph breaks with a large horizontal space, for instance.)

The regular \p tag would be maintained for (full) paragraph breaks.

I envision an empty tag like \subp for USFM or a self-closing tag like <subb/> in USX. I don't think there would be a need to create additional hierarchical structure in USX files.

I'm aware that I could use a custom tag, but if it becomes an official USFM tag, then people who write the publishing software would make an effort to implement it.

Can USX and USFM be improved to handle contexts?

One of my biggest bugaboos is the need to flatten text before publishing. The only way to properly style a text in HTML or IDML is to create all of the styles that result from nesting. For example, we commonly have Hebrew transliterated words in the Psalm superscriptions: \d xxx \tl yyy\tl*. In publication, yyy needs to be regular text, not italic. The need to transition between hierarchical/nested and flattened expressions of the USX should not be overlooked, since the final goal of the format is not archiving but publication.

Perhaps there should be a limit on the amount of nesting allowed.
Perhaps the USFM stylesheet should allow one to specify the formatting of certain nested styles if they differ from a straight CSS stylesheet application e.g. :
- \tl in \d
  read Transliterated Word embedded in Hebrew Subtitle.

A second concern is that context matters in formatting. For example: we always add extra vertical space when going from prose to poetry and then back again.

Perhaps the USFM stylesheet should allow one to specify the formatting of certain styles based on context e.g. :
- \q1 following \p
- \q1 following \m
- \p following \q1
- \p following \q2
- \m following \q1
- \m following \q2

This could convert to USX with a context specific attribute.
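One way to picture such context rules is a lookup keyed on the marker and the marker before it. The sketch below is purely hypothetical: neither the rule table nor the `space_before` function corresponds to any existing stylesheet format, and the value of 6 points is a placeholder.

```python
# Hypothetical context rules: (marker, previous marker) -> extra space before, in points.
CONTEXT_RULES = {
    ("q1", "p"): 6,
    ("q1", "m"): 6,
    ("p", "q1"): 6,
    ("p", "q2"): 6,
    ("m", "q1"): 6,
    ("m", "q2"): 6,
}


def space_before(markers):
    """Return (marker, extra_space) for each paragraph marker, in order."""
    result = []
    previous = None
    for marker in markers:
        result.append((marker, CONTEXT_RULES.get((marker, previous), 0)))
        previous = marker
    return result


print(space_before(["p", "q1", "q2", "p"]))
```

Extra space is added only at the prose-to-poetry and poetry-to-prose transitions; consecutive poetry lines get none.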

The desired result is improved formatting both in the PT editing environment and in print and electronic publications.

Issues with List headers, List footers, and List entry totals

The current Documentation does not describe the Biblica use case for these markers.

List header

The current USFM 3.0 documentation for \lh states:

Some lists include an introductory and concluding remark (\lf). They are an integral part of the list content, but are not list items. A list does not require either or both of these elements.

Although these are not list items they are list elements. Biblica calls them List Introductions and List Conclusions, formats them using list indents, and outdents verse numbers to show their integration with the list.
This example from EXO 6:14 illustrates a List Introduction which we mark as \lh followed by a \b

This example from EZR 2:55 illustrates a Sublist Introduction which we mark as \lh. It is not followed by a \b [Note: replace this image from an older typeset with one from a newer typeset that does not have a space following.]

I would like an illustration from the NIV to be used to illustrate proper and allowable uses and formatting of \lh

List footer

The current Documentation for \lf states:

Some lists include an introductory (\lh) and concluding remark. They are an integral part of the list content, but are not list items. A list does not require either or both of these elements.

The current USFM 3.0.4 does not support \lf with \litl ...\litl*. We need to use \lf as a Total line. [Appropriate illustration from EZR 2:55 to be added.] I believe all list types should allow List Totals including Headings. (A translation may want to put the total first.)

I would like the \OccursUnder attribute of \Marker litl to include \lh and \lf

List entry total

The current documentation shows this markup:

\lim1
\v 11 of Pahath-Moab (through the line of Jeshua and Joab) - \litl 2,818\litl*
\lim1
\v 12 of Elam - \litl 1,254\litl*

The hyphen disappears and is replaced by dotted leader.

It is unclear from the documentation why that is. Some of Biblica's projects use dotted leader (Russian) and others do not (English). I am currently using three leader dots in place of the hyphen to represent a dotted leader. I am using three tildes ~~~ to represent no leader dots. This formats reasonably well in Paratext, but I do not think that will communicate to partners using the USX representation for publication.

I would like a formal way of specifying the preferred leader type for a List entry total.

Allow Keyword Value markup to be used in Introductory Lists

The USFM standard does not include introductory lists as valid parents of \lik ...\lik* and \liv# ...\liv#*. These types of lists are needed for Living Bible type introductory text. For example:

\ili1 \lik Autor:\lik* \liv1 Matúš\liv1*
\ili1 \lik Dátum:\lik* \liv1 roky 60–70 n.~l.\liv1*
\ili1 \lik Miesto:\lik* \liv1 pravdepodobne Antiochia\liv1*

I recommend that USFM standard should include introductory lists as valid parents of \lik ...\lik* and \liv# ...\liv#* or else an introductory version of these markers be added to the standard. \ilik ...\ilik* and \iliv# ...\iliv#*

User defined attributes - "may"?

https://docs.usfm.bible/usfm-usx-docs/latest/char/attributes.html#_user_defined_attributes

Using the general syntax above, attributes may be added to any character markers beyond the formalized set in the current version of the USFM/USX specification. These will not be considered strictly canonical, and software supporting USFM/USX may not process user-defined attributes. (Future versions of USFM may formally provide additional attributes within the specification.)

What does "may not process" mean here?

Software must not, or
Software is not required to

Proposal: Markup for non-vernacular words

[Moved here from old site]

There is already \tl, which is for transliterated words intended to be pronounceable in the vernacular orthography.
I would like to propose that there also be an \ol for "other language", not written in the vernacular orthography. I briefly considered calling it \wf (word foreign), but my use-case assumption is that at least some readers know the language and may not consider it foreign; it is simply not the vernacular language of the publication.
It might be in the majority language of the region, a trade language, an international language, or that of a neighbouring area or group.

Summary

Description

Other language (non-vernacular) text, written in unaltered form, often one known and understood by at least a fraction of the target audience.

Notes

Other language text may be marked for a given language via an attribute, lang, which specifies the source language according to ISO639-1 (2 letter codes) or 639-3 (3 letter codes historically known as ethnologue codes).
Other language text is not transliterated into the vernacular orthography (c.f. \tl), it is instead given in a form that readers of the language find it easiest to understand.
If no language attribute is given, the other language may be assumed to be the national language. However, specifying the language is nevertheless commended.
If the scripture editor checks character inventories or sequences, other language text should either not be included in those, or should be considered separately. Thus other language text may contain letters not permitted in the main text, and should not trigger warnings about unacceptable characters, sequences or spelling (unless the preparation system has appropriate spelling dictionaries available).
Other language text may require an alternative font or presentation. The language attribute and paragraph style should give sufficient information to select the font.
If the typesetting system uses pattern-based hyphenation, other language text should not be hyphenated using patterns developed for another language, (avoiding unfortunate breaks)

Syntax

USFM \ol content \ol*
-or- \ol content |lang="code" \ol*
USX <char style="ol" lang="code"> content</char>

Style type

Character

Valid in

[Section] [Para] [Table] [List] [Footnotes]

Example

\f + \fr 1:1 \fk Circumcised \ft A sign of the Abrahamic covenant.
 Romanian:\+ol tăiat împrejur|lang="ro"\+ol*  \f*

Move \em from Character Styling to Special Text

The emphasized text marker does not inherently specify any particular format, therefore it should not be classified with the pure formatting markers:

\bd …\bd*
\it …\it*
\bdit …\bdit*
\no …\no*
\sc …\sc*
\sup …\sup*

Move it under Special text.

Transition from intro to body

Currently, the USX specification has a strong ordering requirement for introductory paragraphs and then requires a chapter marker to precede any chapter content (or body content) like normal paragraphs and verses. This is problematic in modules, where there are often no chapter markers. I would suggest therefore that we remove the requirement for chapter content to be preceded by a chapter marker and instead say that once you hit the first chapter content, then you are done with the introductory paragraphs. Thus you can have \ip \p rather than requiring \ip \c \p.

Must \v always be preceded by a space?

Current normalized (canonical) USFM requires \v to start on a new line. If the \v is preceded by text, this means a space is always displayed before the verse number, and when converted to USX the preceding text span ends with a space. This is not always desired. I have several hundred examples in my data. The following is common markup for the doubtful verse ACT 8:37:
³⁶ Text1 [³⁷ Text2] becomes in standardized/normalized USFM:
\v 36 Text1 [
\v 37 Text2]
which then gets exported in USX with a space before \v 37
<verse number="36" style="v" sid="ACT 8:36" />Text1 [ <verse eid="ACT 8:36" /><verse number="37" style="v" sid="ACT 8:37" />Text2]<verse eid="ACT 8:37" />

I propose that \v is not required to be on a new line in normalized USFM
OR
A verse style marker be defined to include any preceding new line/line feed characters: [\r\n]*\\v\s, which will not be converted into a space in USX.
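The effect of the second option can be shown with two regex substitutions. This is a minimal sketch of the two readings, assuming a simplified verse pattern; the variable and function names are illustrative and not taken from any existing converter.

```python
import re

normalized = "\\v 36 Text1 [\n\\v 37 Text2]"

# Conventional reading: the line break before \v is whitespace, i.e. one space.
as_space = re.sub(r"[\r\n]+", " ", normalized)

# Proposed reading: line breaks directly before "\v " belong to the marker.
as_marker = re.sub(r"[\r\n]+(?=\\v\s)", "", normalized)


def text_only(usfm):
    """Drop verse markers and numbers to show the text a reader would see."""
    return re.sub(r"\\v\s\d+\s", "", usfm)


print(repr(text_only(as_space)))
print(repr(text_only(as_marker)))
```

The first reading leaves a space between the bracket and the second verse's text; the second reading does not, which matches the intended ³⁶ Text1 [³⁷ Text2] layout.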

What are our deliverables?

We need to figure out who is doing what ... and some of that starts with "what". What are our deliverables? I suspect we need at least:

A revised USFM specification
- Include fixes people have asked for and others that come up
- Be more explicit about hierarchy
- Include a formal grammar
- Define the transformation from USFM to USX
A reference implementation
A test suite

In order to finish, I think we should want two independent implementations that report the same results on the test suite.

What do the rest of you think? What do we need to deliver to be able to say we have done our job well? We should also be thinking about who would be willing to do what.

Proposal: Add new milestone style type to represent an anchor point or range for external data

The main use case I have in mind for this proposal is to provide a clear point or selected range of text for someone to attach a comment to. Current versions of Paratext simply record text before and after the intended target (e.g., if the word "gamma" is the target inside the text "alpha beta gamma delta epsilon", we could record "alpha beta" as before and "delta epsilon" as after), but that is fragile, since any changes to the text around the target can leave us unable to identify the target itself. Custom milestones could be used for this, too, but given that we already have a milestone type that is purely for helping translation teams and is not printed (i.e., the ts style type), adding another seems like a reasonable proposal.
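
To make the fragility concrete, here is a small Python sketch of context-based anchoring using the "gamma" example. The function `find_by_context` is hypothetical and only mimics the idea of recording before/after text; it is not Paratext's actual code.

```python
def find_by_context(text: str, before: str, target: str, after: str) -> int:
    """Return the index of target if it sits between the recorded context, else -1.

    Hypothetical illustration of before/after anchoring.
    """
    needle = f"{before} {target} {after}"
    pos = text.find(needle)
    if pos == -1:
        return -1
    # Skip the recorded "before" text and the space that follows it.
    return pos + len(before) + 1

original = "alpha beta gamma delta epsilon"
edited = "alpha beta gamma delta zeta"  # only the trailing context changed

found = find_by_context(original, "alpha beta", "gamma", "delta epsilon")
lost = find_by_context(edited, "alpha beta", "gamma", "delta epsilon")
# found == 11, lost == -1: the target word is untouched, yet the anchor is lost.
```

An anchor stored in the text itself would not depend on the neighbouring words staying the same.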

Presumably there are others who want to tie metadata to particular phrases in scripture that don't align to verse boundaries, so I don't see this as purely a question about commenting. It's more of a question of how the standard should support this sort of tying/anchoring to data outside of the scripture data.

When looking at what word processing file formats specify for how comments are targeted, I found the following.

The OpenDocument file format specification appears to embed annotations at a given point in a document to denote where an annotation applies. Text within the annotation element is the referenced text.

The Office Open XML (OOXML) format that was proposed by Microsoft and is used in Word seems to use IDs embedded in the text. Comments are stored in a separate comments XML file and linked to the main document using the ID tags.

Extend USFM marker \mi to \mi#

Cloning an issue from the legacy usfm repo : ubsicap/usfm#131, from @rfpng

In August 2018 I made a proposal to the Paratext support group to extend the USFM marker \mi to \mi#. Apparently this is now recorded as PTXS-16994 (The original confirmation I received had the number PTXS-17401).

Unfortunately I never heard if my proposal was considered. The issue is still important to me:
In an NT I typeset, \mi1 and \mi2 were used in the glossary. This worked very well, but in electronic publishing it causes problems because the process doesn't know about \mi#.

Here is the structure of a glossary entry:
\li1 - Main Entry Headline
\pi1 - Main Entry Description
\mi1 - Indented non-first-line-indented paragraph

The above works fine, but some of the entries also have sub-entries as follows:
\li2 - Sub Entry Headline
\pi2 - Sub Entry Description
\mi2 - Further indented non-first-line-indented paragraph ← Doesn't exist, and that's the problem!

Many paragraph markers have numbered counterparts, but \mi doesn't. It would be very helpful to me if it did!

Thank you very much for considering this.

PS: I'll gladly send you more information about this if needed.

Morphological Segmentation

Translation projects have invented a variety of ways to indicate morphological segmentation in order to support word lists, biblical terms, etc.

Do we need a standard character for this? Perhaps a Unicode joiner?

Milestone diagram includes char marker `fp`

Both the USFM and USX diagrams for a milestone include the char footnote marker fp, see https://docs.usfm.bible/usfm-usx-docs/latest/ms/index.html.

I'm pretty sure that needs to be removed from both.

I'm not sure how to update a diagram; otherwise I'd make a PR.

Text Definition

Define Text. I suggest Text is a string of characters that does not begin with white space (WS) but can end with an optional WS. Two or more consecutive WS within a Text are treated (normalized?) as a single regular space (U+0020). Moreover, Text cannot contain

tilde ~
backslash \ or
double forward slash //

all of which have special functions (Text does not provide for escape characters).
Nor can Text contain a tab character or other ASCII/Unicode control characters.
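
The suggested definition could be sketched in Python roughly as follows. The function names `normalize_text` and `text_problems` are made up for this example, and the sketch makes two assumptions: space, CR and LF count as WS to be collapsed, while a tab or any other control character (Unicode category Cc) is rejected rather than normalized.

```python
import re
import unicodedata

# Assumption: WS to be collapsed means space, CR and LF (tab is rejected instead).
WS_RUN = re.compile(r"[ \r\n]+")

def normalize_text(raw: str) -> str:
    """Collapse each WS run to a single U+0020 and drop leading WS."""
    return WS_RUN.sub(" ", raw).lstrip(" ")

def text_problems(text: str) -> list[str]:
    """List the ways a string violates the proposed Text definition."""
    problems = []
    if text[:1].isspace():
        problems.append("starts with whitespace")
    if "~" in text:
        problems.append("contains tilde")
    if "\\" in text:
        problems.append("contains backslash")
    if "//" in text:
        problems.append("contains double forward slash")
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        problems.append("contains control character")
    return problems

clean = normalize_text("  In the   beginning\n God  ")
# clean == "In the beginning God " (a single trailing space is kept)
```

Running `text_problems` after `normalize_text` reports only what normalization cannot fix, such as a stray tilde or tab.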

USFM canonical form should allow only a single normal space at the end of a Text. Paratext has a tool under Project Menu > Advanced > Standardized whitespace which normalizes WS.

Currently Paratext allows and retains extra spaces and paragraph returns in the underlying USFM. This is for the convenience, and at the request, of certain power users who find this feature useful when working on the underlying USFM. Our USFM standard needs to allow this as a non-canonical form of USFM. However, normalization of WS should occur before transformation of USFM to USX. Non-canonical WS will not be round-tripped: USFM->USX->USFM