
cldf's Introduction

CLDF: Cross-linguistic Data Formats

CLDF is a specification of data formats suitable to encode cross-linguistic data in a way that maximizes interoperability and reusability, thus contributing to FAIR Cross-Linguistic Data.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Conformance Levels

CLDF is based on W3C's suite of specifications for CSV on the Web, or CSVW for short. Thus, cross-linguistic data in CLDF is modeled as interrelated tabular data. A CLDF dataset is a set of UTF-8 encoded CSV files described by metadata provided as a JSON file.

The main content of the metadata is the description of the dataset's schema, i.e. the tables, columns and relations between them, also known as schema objects. The following typographical conventions are used when referring to schema objects:

  • Properties and property values as used in CLDF metadata files are typeset in a monospaced font.
  • Filenames or column names as they appear in CSV data are typeset in italics.

While the JSON-LD dialect used for metadata according to the Metadata Vocabulary for Tabular Data can be edited by hand, this may already be beyond what can be expected of regular users. Thus, CLDF specifies two conformance levels for datasets: metadata-free and extended.

Metadata-free conformance

A dataset can be CLDF conformant without providing a separate metadata description file. To do so, the dataset MUST follow the default specification for the appropriate module regarding:

  • filenames
  • column names (for specified columns)
  • CSV dialect

Thus, rather than not having any metadata, the dataset does not specify any; instead it falls back to using the defaults. Such single-CSV file datasets MAY contain additional columns not specified in the default module descriptions.

The default filenames and column names are described in components. The default CSV dialect is RFC 4180 using the UTF-8 character encoding, i.e. the CSV dialect specified as:

{
  "encoding": "utf-8",
  "lineTerminators": ["\r\n", "\n"],
  "quoteChar": "\"",
  "doubleQuote": true,
  "skipRows": 0,
  "commentPrefix": "#",
  "header": true,
  "headerRowCount": 1,
  "delimiter": ",",
  "skipColumns": 0,
  "skipBlankRows": false,
  "skipInitialSpace": false,
  "trim": false
}

For a single CSV file to be a CLDF-compliant dataset without metadata

  • the first line must contain the comma-separated list of column names,
  • and no comment lines are allowed.

Tip

Thus, a minimal metadata-free CLDF StructureDataset will consist of a CSV file named values.csv, with content looking like the example below:

ID,Language_ID,Parameter_ID,Value
1,stan1295,wals-1A,average

Extended conformance

A dataset is CLDF conformant if

  • it contains a metadata file, derived from the default profile for the appropriate module,
  • it contains at least the minimal set of components (i.e. CSV data files) specified for the module.

The metadata MUST contain a dc:conformsTo property with one of the CLDF module URLs as value.

Tip

Thus, a minimal extended CLDF StructureDataset will consist of

  • a JSON file containing the metadata (with a freely chosen name),
  • a CSV file containing the dataset's ValueTable (with a name as specified in the metadata).

Providing a metadata file allows for considerable flexibility in describing the data files, because the following aspects can be customized (within the boundaries of the CSVW specification):

  • the CSV dialect description (possibly per table), e.g. to:
    • allow comment lines (if appropriately prefixed with commentPrefix)
    • omit a header line (if appropriately indicated by "header": false)
    • use tab-separated data files (if appropriately indicated by "delimiter": "\t")
  • the table property url
  • the column property titles
  • the inherited column properties
  • adding common properties,
  • adding foreign keys, to specify relations between tables of the dataset.

Thus, using extended conformance via metadata, a dataset may:

  • use tab-separated data files,
  • use non-default file names,
  • use non-default column names,
  • add metadata describing attribution and provenance of the data,
  • specify relations between multiple tables in a dataset,
  • supply default values for required columns like languageReference, using virtual columns.
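
For example, a dataset in which every row pertains to the same language could supply the languageReference via a virtual column with a constant valueUrl (a minimal sketch; the column name and the Glottocode stan1295 are purely illustrative):

{
  "name": "Language_ID",
  "virtual": true,
  "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference",
  "valueUrl": "stan1295"
}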

In particular, since the metadata description resides in a separate file, it is often possible to retrofit existing CSV files into the CLDF framework by adding a metadata description.

Thus, conformant CLDF processing software MUST implement support for the CSVW specification to the extent necessary.

Tip

So, under extended conformance, the minimal example from the previous section may consist of the following two files. A metadata description file cldf-metadata.json:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",
  "dialect": {"commentPrefix": "#", "delimiter":  ";"},
  "tables": [
    {
      "url": "data.csv",
      "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ValueTable",
      "tableSchema": {
        "columns": [
          {"name": "No", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"},
          {"name": "LID", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference"},           
          {"name": "PID", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference"},
          {"name": "Val", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value"}
        ]
      }
    }
  ]
}

and ValueTable in a file data.csv:

No;LID;PID;Val
# Comments are allowed now!
1;stan1295;wals-1A;average

CLDF Ontology

CLDF metadata uses terms from the CLDF Ontology, as specified in the file terms.rdf, to mark tables as CLDF components and columns as CLDF properties.

While many of these properties are similar (or identical) to properties defined elsewhere - most notably in the General Ontology for Linguistic Description (GOLD) - we opted to include them in our own ontology to avoid ambiguity, but made sure to reference the related properties in the ontology.

Important

The CLDF-specific meaning of tables and columns in a dataset is determined by the ontology terms they are associated with, i.e. URLs specified as dc:conformsTo property for tables or as propertyUrl property for columns in the metadata file. The filenames and the column names of the CSV files are only used to connect metadata and actual data. Thus, while it is possible (and intentionally easy) to use CLDF data in a CLDF-agnostic way (e.g. importing data files of a CLDF dataset into a spreadsheet program), CLDF conformant tools MUST reference CLDF tables and columns by ontology terms and not by file or column name.

Note

Ontology terms are the values for the rdf:about property of rdf:Class and rdf:Property objects in terms.rdf. Often we refer to ontology terms using just the URL fragment or local name, rather than the full URL.

Note

While filenames and column names in CLDF datasets (with metadata) can be freely chosen, the ontology recommends defaults for these as values of the csvw:url and csvw:name properties in terms.rdf.

Caution

In an ill-advised attempt to version the ontology, v1.0 has been baked into the term URIs. While this may be a good idea in case of incompatible changes (e.g. if the semantics of a term changed), it presents an obstacle for interoperability in case of backwards-compatible changes. So starting with CLDF 1.1, we will keep http://cldf.clld.org/v1.0/terms.rdf as namespace for all versions of the 1.x series, and specify the particular version when a term was introduced using dc:hasVersion properties per term.

Tip

For better human readability the CLDF Ontology should be visited with a browser capable of rendering XSLT - such as Firefox.

CLDF Dataset

CLDF Metadata file

A CLDF dataset is described with metadata provided as a JSON file following the Metadata Vocabulary for Tabular Data. To make tooling simpler, we restrict the metadata specification as follows:

  • Metadata files MUST specify a tables property at the top level, i.e. MUST describe a TableGroup. While this adds a bit of verbosity to the metadata description, it makes it possible to describe multiple tables in one metadata file.
  • The common property dc:conformsTo of the TableGroup is used to indicate the CLDF module, e.g. "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist"
  • The common property dc:conformsTo of a Table is used to associate tables with a particular role in a CLDF module using appropriate classes from the CLDF Ontology.
  • If each row in the data file corresponds to a resource on the web (i.e. a resource identified by a dereferenceable HTTP URI), the tableSchema property SHOULD provide an aboutUrl property.
  • If individual cells in a row correspond to resources on the web, the corresponding column specification SHOULD provide a valueUrl property.

Each dataset SHOULD provide a dataset distribution description using the DCAT vocabulary. This will make it easy to
catalog cross-linguistic datasets. In particular, each dataset description SHOULD include properties such as dc:bibliographicCitation and dc:license, as shown in the example below.

Thus, an example for a CLDF dataset description could look as follows:

{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",
  "dc:title": "The Dataset",
  "dc:bibliographicCitation": "Cite me like this!",
  "dc:license": "http://creativecommons.org/licenses/by/4.0/",
  "null": "?",
  "tables": [
    {
      "url": "ds1.csv",
      "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ValueTable",
      "tableSchema": {
        "columns": [
          {
            "name": "ID",
            "datatype": "string",
            "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"
          },
          {
            "name": "Language_ID",
            "datatype": "string",
            "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference",
            "valueUrl": "http://glottolog.org/resource/languoid/id/{Language_ID}"
          },
          {
            "name": "Parameter_ID",
            "datatype": "string",
            "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference"
          },
          {
            "name": "Value",
            "datatype": "string",
            "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value"
          },
          {
            "name": "Comment",
            "datatype": "string",
            "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#comment"
          },
          {
            "name": "Source",
            "datatype": "string",
            "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#source"
          },
          {
            "name": "Glottocode",
            "virtual": true,
            "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#glottocode",
            "valueUrl": "{Language_ID}"
          }
        ],
        "aboutUrl": "http://example.org/valuesets/{ID}",
        "primaryKey": "ID"
      }
    }
  ]
}

CLDF Data files

It is possible to add any kind of CSV file to a CLDF dataset (by virtue of CLDF being an extension of CSVW). While the CLDF standard recognizes (and attaches specified semantics to) tables described with a common property dc:conformsTo with one of the component URIs of the CLDF Ontology as value, additional tables lacking this property in their metadata are acceptable.

Similarly, while CLDF semantics can be assigned to individual columns by assigning one of the property URIs defined in the CLDF Ontology as propertyUrl, additional columns - also in CLDF components - are acceptable. CLDF conformant software MUST detect CLDF-specific columns by matching the propertyUrl to CLDF Ontology terms and NOT by matching column name to default column names recommended in the ontology.
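
For instance, the columns list of a ValueTable could mix a CLDF-recognized column with a dataset-specific one; software identifies the first by its propertyUrl and simply carries the second along (a sketch, with a made-up extra column name):

[
  {"name": "Val", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value"},
  {"name": "Coder_Initials", "datatype": "string"}
]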

Column specifications

  • CLDF column properties are assumed to have a complete row (or rather the entity a row stores data about) as scope; e.g. a source column is assumed to link to source information for any piece of data in the row. Thus, each property can be used only once per table, which makes processing simpler.
  • More generally, CLDF assumes column names (not just propertyUrls) in a table to be unique.
  • Since CLDF is designed to enable data reuse, data creators should assume that schema information like table or column names ends up in all sorts of environments, e.g. as names in SQL databases or as parts of URLs of a web application. Thus, it is RECOMMENDED to stick to ASCII characters in such names and avoid usage of punctuation other than :._-.
  • Cardinality: CSVW allows specifying columns as "multivalued", i.e. as containing a list of values (of the same datatype), using the separator property. Thus, CLDF consumers MUST consult a column's separator property to figure out whether a value must be interpreted as a list or not (see the sketch after this list). Note that this also applies to foreign keys. However, CLDF may restrict the cardinality as follows:
    • The specification of a property in the ontology MAY contain a dc:extent property with value singlevalued or multivalued, fixing cardinality of any instance of such a column in any table.
    • The specification of a column in the default metadata of a component MAY contain a dc:extent property with value singlevalued or multivalued, fixing cardinality.
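
For example, the following column description marks the Source column as multivalued, so a cell like meier2015;mueller2014 must be read as a list of two source references (a sketch using the default semicolon separator described in the Sources section below; the citation key mueller2014 is illustrative):

{
  "name": "Source",
  "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#source",
  "datatype": "string",
  "separator": ";"
}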

Identifier

Each CLDF data table SHOULD contain a column which uniquely identifies a row in the table. This column SHOULD be marked using:

  • a propertyUrl of http://cldf.clld.org/v1.0/terms.rdf#id
  • the column name ID in the case of metadata-free conformance.

To allow usage of identifiers as path components of URIs and ensure they are portable across systems, identifiers SHOULD be composed of alphanumeric characters, underscore _ and hyphen - only, i.e. match the regular expression [a-zA-Z0-9\-_]+ (see RFC 3986).

Following our design goal to reference rather than duplicate data, identifiers may be used to reference existing entities (e.g. Glottolog languages, WALS features, etc.). This can be done as follows:

  • If the identifier can be interpreted as a link to another entity, e.g. using the WALS three-letter language codes to identify languages, this should be indicated by assigning the column an appropriate valueUrl property, e.g. http://wals.info/languoid/lect/wals_code_{ID} (see the sketch after this list)
  • If the identifier follows a specified identification scheme, e.g. ISO 639-3 for languages, this can be indicated by adding a virtual column with a suitable propertyUrl to the table's list of columns.
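
Putting these recommendations together, an id column whose values are WALS codes might be described as follows (a sketch combining the identifier format above with the WALS valueUrl example):

{
  "name": "ID",
  "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id",
  "datatype": {"base": "string", "format": "[a-zA-Z0-9_\\-]+"},
  "valueUrl": "http://wals.info/languoid/lect/wals_code_{ID}"
}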

Missing data

Data creators often want to distinguish two kinds of missing data, in particular when the data is extracted from sources:

  1. data that is missing/unknown because it was never extracted from the source,
  2. data that is indicated in the source as unknown.

The CSVW data model can be used to express this difference as follows:

  • Case 1 can be modeled by not including the relevant data as a row at all.
  • Case 2 can be modeled using the null property of the relevant column specification (defaulting to the empty string) as value in a data row.
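
For example, a Value column for which ? in a data row means "indicated as unknown in the source" could be described like this (a sketch; the choice of ? as null marker is up to the data creator):

{
  "name": "Value",
  "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value",
  "datatype": "string",
  "null": ["?"]
}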

Sources

Considering that any single step in collecting (cross-)linguistic data involves some amount of analysis and judgement calls, it is essential to make it easy to trace assertions back to their source.

Each CLDF data table may contain a column listing sources for the data asserted in the row. This column - if present - MUST be marked using:

  • a propertyUrl of http://cldf.clld.org/v1.0/terms.rdf#source
  • the column name Source in the case of metadata-free conformance.

Sources are specified as semicolon-separated (unless the metadata specifies a different separator) source specifications, of the form source_ID[source context], e.g. meier2015[3-12] where meier2015 is a citation key in the accompanying sources file.

Foreign keys

Often cross-linguistic data is relational, e.g. cognate judgements group forms into cognate sets, creating a many-to-many relationship between a FormTable and a CognatesetTable.

To make such relations explicit, the CLDF Ontology provides a set of reference properties.

Reference properties MUST be interpreted as foreign keys, e.g. a propertyUrl http://cldf.clld.org/v1.0/terms.rdf#languageReference specified for column Col1 of a table with url table1.csv is equivalent to a CSVW foreign key constraint

  "foreignKeys": [
       {
           "columnReference": "Col1",
           "reference": {
               "resource": "languages.csv",
               "columnReference": "ID"
           }
       }
   ]

assuming that the LanguageTable component has url languages.csv and a column ID with propertyUrl http://cldf.clld.org/v1.0/terms.rdf#id.
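
The column description that triggers this implicit foreign key is simply the one assigning the reference property (a sketch matching the example above):

{
  "name": "Col1",
  "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference"
}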

While spelling out foreign key constraints may feel cumbersome, it is still RECOMMENDED that metadata creators do so, to make the data compatible with CSVW tools. The foreign key constraints MUST be specified explicitly if the referenced column does not have a propertyUrl http://cldf.clld.org/v1.0/terms.rdf#id.

Note

Columns for reference properties may still be "nullable", i.e. contain null values, to allow for rows where no reference can be specified.

Sources reference file

References to sources can be supplied as part of a CLDF dataset as a UTF-8 encoded BibTeX file (with the citation keys serving as local source identifiers). The filename of this BibTeX file MUST be either:

  • sources.bib in case of metadata-free conformance
  • or specified as a path relative to the metadata file, given as the top-level common property dc:source in the dataset's metadata (see the sketch below).
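
In the second case, the top level of the metadata would contain something like the following (a sketch; the filename references.bib is illustrative and the table descriptions are omitted):

{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",
  "dc:source": "references.bib",
  "tables": []
}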

Compressed data or reference files

CLDF datasets may contain large data or reference files which may be inconvenient to handle (e.g. because the size exceeds GitHub's 100 MB file size limit). In such cases, the dataset creator may compress individual files using the ZIP format (which works really well on CSV and BibTeX files). The resulting ZIP archive MUST contain the zipped file and nothing else, and MUST be named after the original file, adding .zip as filename extension. The filename references in the metadata MUST be kept unchanged.

CLDF processing software MAY implement zip-file discovery, i.e. if a filename referenced in the metadata cannot be found, but a file filename_with_extension.zip is found, processing MUST proceed with the unzipped content of filename_with_extension.zip.

If CLDF processing software does not support zip-file discovery, it should signal the corresponding error in a transparent way, i.e. it should be clear to the user that the ZIP archive should be unzipped before running the processing software.

CLDF Modules

Much like Dublin Core Application Profiles, CLDF Modules group terms of the CLDF Ontology into tables. Thus, CLDF module specifications are recommendations for groups of tables modeling typical cross-linguistic datatypes. Currently, the CLDF specification recognizes the following modules:

In addition, a CLDF dataset can be specified as Generic, imposing no requirements on tables or columns. Thus, Generic datasets are a way to evolve new data types (to become recognized modules), while already providing (generic) tool support.

In the CLDF Ontology, modules are modeled as subclasses of dcat:Distribution; thus, additional metadata as recommended in the DCAT specification SHOULD be provided.

CLDF Components

Some types of cross-linguistic data may be part of different CLDF modules. These types are specified as CLDF components in a way that can be re-used across modules (typically as table descriptions, which can be appended to the tables property of a module's metadata). A component is a CSVW table description with a dc:conformsTo property having one of the component terms in the CLDF Ontology as value. Each component listed below is described in a README and specified by the default metadata file in the respective directory.

A component corresponds to a certain type of data. Thus, to make sure all instances of such a type have the same set of properties, we allow at most one component for each type in a CLDF dataset.

Extensions

In addition to the specification of the CLDF data model and its representation on disk, several "mini-specifications" extend the scope of the CLDF specification, by describing best practices and recommendations for common usage patterns of CLDF data.

Reference implementation

In order to be able to assess the validity of CLDF datasets, i.e. to check datasets for CLDF conformance, a reference implementation of CLDF is available as Python package pycldf.

This package provides command-line functionality to validate datasets as well as a Python API to programmatically read and write CLDF datasets.

Compatibility

Using CSV as basic format for data files ensures compatibility of CLDF with many off-the-shelf computing tools.

  • Using UTF-8 as the character encoding means that editing these files with MS Excel is not completely trivial, because Excel assumes cp1252 as the default character encoding - LibreOffice Calc, on the other hand, handles these files just fine.
  • The tool support for CSV files is getting better and better due to increasing interest in data science. Some particularly useful tools are:

Versioning

Changes to the CLDF specification will be released as new versions, using a Semantic Versioning number scheme. While older versions can be accessed via releases of this repository or from ZENODO, where releases will be archived, the latest released version is also reflected in the master branch of this repository, i.e. whatever you see navigating the directory tree at https://github.com/cldf/cldf reflects the latest released version of the specification.

History

Work on this proposal for a cross-linguistic data format was triggered by the LANCLID 2 workshop held in April 2015 in Leipzig - in particular by Harald Hammarström's presentation A Proposal for Data Interface Formats for Cross-Linguistic Data.

cldf's People

Contributors

anaphory, ar-jan, bambooforest, blurks, chrzyki, cysouw, lingulist, simongreenhill, xrotwang


cldf's Issues

How to handle the CLDF ontology?

What I have done for now is add markdown files with some (hopefully) automatically parsable structure to cldf in the folder ontology. The question is: are there any style guides for this kind of endeavour? There are some relations, etc., that I am aware we should define, but I don't know how to encode them in the most principled way.

Wordlist: Value with multiple Forms

I am still somewhat confused about what to make of multiple comma-separated forms in the word list Value column. Let's say my source says

[de] Uncle: Onkel, Ohm/Oheim

What would be the natural way to put an entry like that in CLDF, where some forms are obvious variants of each other and other forms are not obviously related?

(1)

ID,Language_ID,Parameter_ID,Value,Form
1,stan1295,1327,"Onkel,Ohm,Oheim",Onkel
1,stan1295,1327,"Onkel,Ohm,Oheim",Ohm
1,stan1295,1327,"Onkel,Ohm,Oheim",Oheim

(2)

ID,Language_ID,Parameter_ID,Value,Form
1,stan1295,1327,"Onkel,Ohm,Oheim",Onkel
2,stan1295,1327,"Onkel,Ohm,Oheim",Ohm
3,stan1295,1327,"Onkel,Ohm,Oheim",Oheim

(3)

ID,Language_ID,Parameter_ID,Value,Form
1,stan1295,1327,"Onkel",Onkel
2,stan1295,1327,"Ohm,Oheim",Ohm
2,stan1295,1327,"Ohm,Oheim",Oheim

(4)

ID,Language_ID,Parameter_ID,Value,Form
1,stan1295,1327,"Onkel",Onkel
2,stan1295,1327,"Ohm,Oheim",Ohm
3,stan1295,1327,"Ohm,Oheim",Oheim

(5) Something else?

Provide URIs for all properties used in the CLDF standard in the Ontology

All properties used in tables specified by CLDF should have URIs in the CLDF Ontology - possibly as suitable subPropertyOf existing properties in other vocabularies. This will make sure our tools can recognize these properties unambiguously and that, vice versa, other tools cannot mistake these properties.

This should include ID, forcing a format similar to the one for column names (in addition to adding a primaryKey property implicitly). To mark IDs as URIs referencing entities defined outside of the dataset, an appropriate aboutUrl should be used.

Slicing, indexing, subsetting: counting!

Proposal

Concerning slicing as used here:

This is yet another "tab vs. comma" discussion, as there are at least two different approaches to counting: first, slicing as in Python/C, where the counter starts at zero and the end is not included (basically you are counting boundaries), and second, indexing/subsetting as in Matlab/R, where the counter starts at one and the end is included (basically you are counting elements).

Please note: I am perfectly aware that the two approaches are equivalent, and I also know that this is very much a question of what you are used to. The crucial point is which approach will more quickly lead to misunderstandings when somebody uses the data without reading the documentation.

I think it is important to realize that CLDF is not a programming language, but a data description. So, it should be easily understandable for people outside a particular coding framework. I think that the 'element-counting' Matlab/R approach is much easier to grasp intuitively than the 'boundary-counting' Python/C approach.

If you have seven elements (with six explicit boundaries coded in the data) and you want to refer to the third element, then using "3" seems more natural than using "2:3". The same holds if you want to refer to elements 1 through 4: "1:4" seems more natural than "0:4" (the more so as the boundary at "0" does not exist in the data).

So: my proposal is to not use "slicing" but "element counting" in the CLDF.

Drop subSequence property

The subSequence property isn't used anywhere and should be made obsolete and replaced by specific slice properties referring to a sequence property.

Addresses #54

Specify components

Components could be specified as an orthogonal aspect to modules, i.e. pieces of specification which can be reused across modules. Candidates include:

  • examples in IGT
  • forms (could be re-used by dictionary, wordlist, and potentially even structure-dataset)

In terms of architecture, components could just be Table descriptions - as opposed to modules, which are TableGroup descriptions.

Diving deeper

May I point you to a more profound way to deal with tabular data?
Core idea: column storage with support for sparse data.
Step one: define anchor points for the most granular units in your corpus.
Step two: define nodes for all subsets of anchor points that you want to annotate.
Step three: annotate in the form of mappings from nodes to feature values.
Every feature is 1 column of data, implicitly linked to the set of anchor points and nodes.
I have used this model extensively for the Hebrew Bible (400,000 anchor points, 1,000,000 extra nodes, 20,000,000 feature values), and it enables very flexible explorative data processing, e.g. in a Jupyter notebook.
By isolating columns, it becomes easier to produce them in a distributed fashion, and share them easily.
Have a look at the model

Specs need copyediting

The specs contain several inconsistencies and apparent mistakes.

  • Nowhere do you specify which columns are required generically (Word lists have 4, but the readme states “While the file may contain any number of columns”, while implicitly also making at least ID and Language_ID required)
  • There are half sentences: “Value: the word form, the main value, for the given language and the given concept (may have different degrees of ” (wordlist.md) – I think I saw another one, but can't retrace it right now.
  • structure_dataset.md has a word list example even though it speaks of “(often typological)” values, and the description of multi-dimensional features is very short

Macroarea as subtype of a more general “named area” property?

I have named regions in my database, and I think that might be a data type worth including in future CLDF releases. Macroareas (https://github.com/glottobank/cldf/blob/ff33e40dbccabaf1191f08eb8f486a13343a6f5b/terms.rdf#L272) are then an obvious, semantically well-defined subtype of named areas. Countries could be another type of named area.

While that type of data can be programmatically extracted from the geo-coordinates using some set of covering polygons, it does have value in itself, I think.

Drop rdfs:label property in Ontology

Since we already have two more name-like attributes for each property, csvw:name and the local name in the CLDF namespace, an additional rdfs:label seems superfluous, considering that the other names are fairly human readable as well.

Addresses #54

Component architecture

We should think about some sort of component model tying together the core format with domain-specific components. Components that come to mind are

  • Interlinear glossed text (IGT)
  • cognate judgements

Specify the referenced table for reference properties

Currently, a weird hack involving the csvw:name of reference properties is used to determine the component which is referenced. This should be replaced by explicitly specifying the referenced component in the Ontology. With this in place, we should also regularize the csvw:name of reference properties (see #54).

General Representation of Cognate Sets (including partial cognacy)

The more I think about it and the more I talk about it with colleagues, the more evident it becomes that cognacy needs to be expanded by the aspect of partial cognacy. In Indo-European, we have only spurious accounts, but in Tukano, Sino-Tibetan, and many other families, compounds are not just a little noise but constitute the basis of the lexical structure of the languages (30% in the nouns of Chinese dialects, but compounds are also pervasive in other families, like Tupi-Guarani).

Now, for partial cognacy, there is a straightforward representation. If we have words like "she-goat" and "he-goat", we can assign two IDs, which would render them as "1 2" and "1 3". Note that we have cases in which we have the opposite order in the same language family, that is, we have "she-goat" in language A and "he-goat" in language B. Here, we would write "1 2" and "2 1". Note also that we need to keep the information in one cell, since we want to store the order of the elements and keep track of it for the alignments, which ideally would only be made for identical parts of the compounds. So we would have one alignment for cogid 1 and one for cogid 2, etc. In a file, we could represent this in a column by keeping the order of the original tokens, but adding a separator for morphemes:


TOKENS          COGID  ALIGNMENT
sh e + g oa t   1 2    sh e + g oa t
h e + g oa t    1 3    h e + g oa t
g oa t + h e    3 1    g oa t + h e

So the alignment is no longer immediately visible when looking at the text file, but it is automatically accessible and can be displayed with tools that offer the service of comparing the cogid column with the column of segments (TOKENS). We lose transparency for those who want to read text files, but we gain power for representing cognacy truthfully.

Furthermore, the format is consistent with unique-ID representations of cognacy:

  • in a strict version, we require identity of compound structure, which is nothing other than a check for identical strings in the cogid representation, which gives multiple numerical IDs in one cell separated by a space
  • in a loose version, we can make a connected-component analysis of the underlying network and label as cognate those items which share direct or indirect links with other items

My most recent idea was that this should essentially be the basis, since it is so frequent in many language families, and also useful where it is less frequent (but never underestimate the degree of partial cognacy, also for IE languages, once you start being strict with morphemes!). And since those datasets where partial cognacy doesn't play a role would still simply show one ID per cognate set, it would be largely consistent with the rest.

But at the risk of repeating myself: we cannot ignore the problem of handling partial cognates. They are just too massively present in too many languages, and they can also soon be fully handled by automatic cognate detection algorithms (provided morpheme boundaries are known: I have recent, highly promising results for lingpy).

How to handle partial cognate sets and cognate sets with alignments in general?

In lexibank, we have an extra file for cognate sets, with foreign keys (Word_ID). It seems like this can be specified in a straightforward way in CSV metadata. However, there are a few open questions:

  1. we put cognate sets in an extra file to be able to handle different sources for cognate sets (e.g., one automatic version vs. one non-automatic version, which would be two extra columns in a flat CSV in LingPy format), but this somehow feels strange: we would have the potential to assign one word to multiple cognate sets, but we would only allow this if the source is the same? Somehow, too much depends on the source as the discriminating factor here. With two different series of cognate judgments in one file, the only way to distinguish cognate sets would be to look at the source when evaluating, and to hope that no word_id with an identical source is assigned to multiple cognate sets (which would be the fuzzy annotation often used in ABVD, and difficult to interpret).

  2. in partial cognates, we assign one word form to multiple cognate sets, and the order is important. So in principle, we could either put one word_id in multiple rows, one for each partial cognate set (but then we would have to record the index), or we accept that the morpheme structure of the "Segments" follows the order of the list of cognate identifiers in the partial cognate set (which is space-separated).

Apart from the fact that I'd like to preserve some base functionality for re-using the columns in flat CSV files, if cognate sets do not require specific source annotation, it seems we need to add at least a Cognate_Set_Type column indicating whether a cognate set is a) fuzzy (ABVD-style, assigning one word ID to multiple cognate sets), b) unique, or c) partial.

But I'm having some trouble making this a clear-cut example that would reflect all things neatly, specifically since the fuzzy/unique/partial distinction would require that, based on that distinction, the content of the field Cognate_Set changes (string vs. list, or integer vs. list in the case of LingPy/EDICTOR).

Column with unset data type

I noticed that I accidentally created a table without specifying the datatype for some column.
Is that valid?

CLDF does not say anything, cldf validate lets it pass, but pycldf.db assumes that column.datatype.base is defined.

Interlinear glossed text

Cross-linguistic datasets often contain examples as interlinear glossed text (IGT) according to the Leipzig Glossing Rules (LGR). While this kind of data clearly can be modelled as tabular data and would thus fit into CSV files, this format would probably go against the idea of having formats that can easily be edited by hand. To satisfy this requirement, a format that groups examples as blocks of aligned IGT constituents might be better suited. One candidate for such a format is the Toolbox variant exported by ELAN:

\utterance_id ...
\ELANBegin 5.708
\ELANEnd 8.974
\ELANParticipant
\utterance <text in source language>
\gramm_units <morphemes split by whitespace>
\rp_gloss <glosses split by whitespace>
\ft <translation>
\comment

Release

  • fill in CHANGELOG.md
  • draft release 1.0.0
  • archive on zenodo

Describe use cases

An explicit list of use cases should help in motivating design decisions and making an evaluation of a proposed standard easier.

New module "ParallelText"

Proposal

I would like to propose a new module ParallelText for the CLDF, which uses two components: forms (i.e. FormTable) and a new component for functional equivalents (FunctionalEquivalentsTable).

The structure of parallel texts is very similar to that of wordlists in that there are parallel instances ("translations") of the same content, so they basically can be encoded using a FormTable.

The analysis of a parallel text is somewhat similar to the analysis of cognates, in that parts of the strings are to be aligned. However, these alignments are not cognates, but functional equivalents. Also, the alignments are mostly not nicely linear as with sound alignments, so it is mostly not possible to solve the multi-alignment by inserting gaps (there are also a lot of many-to-many alignments). Because of this I would like to propose a new component, the FunctionalEquivalentsTable (any improvements on this cumbersome name are welcome!).

The FunctionalEquivalentsTable encodes the parts from the different parallel texts that are functionally equivalent. In its most general form, this table is a list of counts/slices from the parallel texts (separated by spaces) that are aligned into FunctionalEquivalentSets.

{
    "url": "functionalEquivalents.csv", 
    "dc:conformsTo": "TODO", 
    "tableSchema": {
        "columns": [
            {
                "name": "ID", 
                "required": true, 
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id", 
                "datatype": {
                    "base": "string", 
                    "format": "[a-zA-Z0-9_\\-]+"
                }
            }, 
            {
                "name": "Form_ID",
                "required": true, 
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#formReference", 
                "datatype": "string"
            }, 
            {
                "name": "Slice/Count", 
                "required": true, 
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#slice", 
                "datatype": {
                    "base": "string", 
                    "format": "\\d+(:\\d+)?"
                }, 
                "separator": ","
            }, 
            {
                "name": "FunctionalEquivalentSet", 
                "required": true, 
                "propertyUrl": "TODO", 
                "datatype": "string"
            }
        ]
    }
}

See for example this multi-alignment of Germanic bibles (note that the format there is not yet in accordance with CLDF). Also nicely illustrated there is the non-linearity of the alignments, e.g. in this visualisation.

Annotation of secondary columns

Proposal

Given a column in a CSV file that uses secondary separators, sometimes these separated parts will be aligned, i.e. each string has the same number of chunks, and the chunks have some kind of equivalent status in each row. Such a multi-alignment is for example used in the component cognates.

In such a situation I would like to be able to refer to such a column in order to add extra information to it. This might be solved by adding an alignmentAnnotationTable to CLDF.

  • representation in standard orthography (STANDARD)
  • mark columns as uninteresting for further analysis, because the data is not consistently encoded (IGNORE)
  • mark sets of columns as "identical", which is necessary in case of metathesis (MERGE)
  • mark sets of columns as "complex segment", which happens in situations like a multi-alignment of Germanic school: It is probably best to align two columns for the onset here (with languages having /sk/ or /sx/), but in some languages these two columns are merged into one (e.g. /ʃ/). (COMPLEX)
  • etcetera

Technically it is not problematic to make such annotations using column counting/slicing, for example like this:

{
    "url": "alignmentAnnotation.csv", 
    "dc:conformsTo": "TODO", 
    "tableSchema": {
        "columns": [
            {
                "name": "ID", 
                "required": true, 
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id", 
                "datatype": {
                    "base": "string", 
                    "format": "[a-zA-Z0-9_\\-]+"
                }
            }, 
            {
                "name": "Form_ID", 
                "required": true, 
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#formReference", 
                "datatype": "string"
            }, 
            {
                "name": "Slice/Count", 
                "required": true, 
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#slice", 
                "datatype": {
                    "base": "string", 
                    "format": "\\d+(:\\d+)?"
                }, 
                "separator": ","
            }, 
            {
                "name": "AnnotationKind", 
                "required": true, 
                "propertyUrl": "TODO", 
                "datatype": "string"
            },
            {
                "name": "AnnotationContent", 
                "required": false, 
                "propertyUrl": "TODO", 
                "datatype": "string"
            }
        ]
    }
}

Examples of the kind of information that I would like to add to each column are those listed above (see for example here, though note that this is not according to CLDF yet!).

check to which degree phoible describes existing datasets

With the benchmark database of phonetic alignments and the benchmark data for cognate detection, there are two large datasets which provide phonetically segmented data. The question, also with respect to the idea that we might link individual sounds to phoible, is now: to which degree does phoible cover the sounds in those, and maybe also other, datasets?

By carrying out this small analysis, we would have a better estimate of how difficult it is to use things like phoible for reference in our future endeavours (or whether we need to come up with an alternative system of tracking sounds ourselves, one that may eventually build upon phoible).

One further important point on mapping things to phoible is that we should not underestimate the problems of IPA variation. IPA has, for example, versions of /ts/ as a single character, as two characters, and as two characters with a combining character. There are also two versions of g, different ways of describing tone, and all other kinds of problems. Now, in order to use phoible as a reference, we need to come up with a mapping of all these variants to the respective phoible symbols, and preferably also with a mapping of non-standard IPA variants, like /th/ for aspirated t, which we will frequently encounter in the lexical data. This could turn out to be quite tedious, but also useful. Anyway, the starting point is to check to which degree we find agreement between the above-mentioned databases and phoible, and potentially also Paul's sound comparison databases.

Annotating graph-like structures (word families)

We can annotate the motivation of compound structures now, but we can't annotate word families yet. I am thinking of things like "walk" vs. "walker", etc., which are by default best handled as directed graphs, where a given word form is annotated by adding a source, a relation, and a target. Source and target serve as identifiers for the nodes in the network.

ID  Segments            Cognate Set  Source   Target    Relation
1   w a l k             1                     walk
2   w a l k + e r       1            walk     walker    er-nominalization
3   j u m p             2                     jump
4   j u m p + e r                    jump     jumper    er-nominalization
5   j u m p + e r + s                jumper   jumpers   plural

This may seem similar to the compound motivation we use, for which I gave examples in #21, but it is essentially different insofar as the relation between source and target form may not be linear (think of Umlaut, Ablaut, ellipsis, etc.). So the source form defines an origin of the derivation, which is then rendered as a node ID in the directed network. We could ignore the target if we just use the Word_ID, but I think that the user-defined, language-internal IDs are easier to use for annotation, also in terms of readability.

As a rule, validation of these relations would require derivation rules, which are usually pretty language-specific and cannot really be handled cross-linguistically. But ideally, an application would have code to derive potential targets.

Working examples for this will be produced soon in CALC, but this is probably nothing we need to consider for the first publication of the CLDF specs.

Allow for a second separator inside csv-cells

Allow for a second separator inside CSV cells (the semicolon is already proposed as such for the Source column). It makes files more powerful, but also more complex. If we decide on a second-level separator, then I would propose a highly unusual UTF-8 symbol, e.g. \u2005 (FOUR-PER-EM SPACE) or \u204F (REVERSED SEMICOLON).

Specify MediaTable component

Media files should be referenced via URI, which would even allow including them directly using data URIs.

In addition to the MediaTable component, we'd need a mediaReference property in the ontology.

The CLDF name property - if assigned to a column in a media table - is interpreted as a filename and can thus be used to derive a filename extension, etc. Alternatively, this can be derived from the return value of Dataset.get_row_url(...).

Splitting/merging files in CLDF

Proposal

Depending on the data, it is sometimes easier to add multiple datasets to a single file and use a specific column to mark the different datasets. For example, one might either make a different alignment file for each cognate set, or one might combine many different alignments into one file and add a column like "COGNATE ID". Both options have their use cases. For example, in dialect data with thousands of cognates for a specific meaning it often makes more sense to make separate files for each cognate set.

It is of course even possible to mix and match those two options, i.e. splitting data into different files in some situations, but using internal separation into groups in other files.

I could not find any mention of this in https://www.w3.org/TR/tabular-metadata/ (maybe it is too trivial?). It might make sense for CLDF to specify one of the following:

  • to always merge files with identical structure into one file (which might lead to very large files: I have actual use-cases that easily lead to files of multiple GB)
  • to always split files into subsets (which might lead to very many files: I have use cases that lead to ten-thousands of files)
  • allow both approaches, and for the many-files usage specify that sub-directories in a CLDF directory are named according to the component, e.g. a sub-directory cognates with many files for the different cognate sets (which is what I would propose)

Note that this decision for the CLDF is independent of the practical work with those files, as it should be trivial to either merge multiple files into one or to split a large files into many separate files.

Should I add something about this to the section "Data Files" in the readme.md?

How do I cite CLDF in general?

I am preparing a paper on BEASTling for submission to PLoS One, which has a software paper category. I plan to mention that BEASTling supports the CLDF file format, as part of a general theme in the paper of pushing the merits of sharing linguistic data: referring to languages using standard identifiers (ISO or Glottocode), storing data in simple, standardised formats (CSV, and CLDF in particular), and publishing data under permissive copyright licenses (CC in particular). How should I best cite the CLDF concept? Should I ideally describe it as part of Glottobank? Or part of CLLD? Is CLLD formally affiliated with MPI-EVA? Is there a public URL for CLDF?

Chunking of strings

Questions

Within strings in a cell in a CSV file, it is possible to use specific symbols to specify chunking of the string. The explanation in the FAQ is either badly formulated or confused. Or maybe I am confused :-).

The given example there is the following:

A third level of structured content can be achieved for cells of datatype string by specifying a regular expression which the values of a cell must match. E.g. the following column specification

{
    "name": "segments",
    "separator": "+",
    "datatype": {
        "base": "string",
        "format": "([\\S]+)( [\\S]+)*"
    }
}

This will make sure cell content of the form `abc def+geh` can be split into the nested lists `[['abc', 'def'], ['geh']]`.

In my understanding, this describes the plus sign as a secondary separator, and within each of the substrings there is an optional space (which is not defined as a separator in this simple example yet; it is just implicitly assumed in the regex). Just this declaration would analyse the string abc def+geh into ['abc def', 'geh'].

Actually, for a tertiary separator to work, it would need something like this (I doubt that this is formally correct: please improve if necessary!):

{
    "name": "primary segments",
    "separator": "+",
    "datatype": {
        "base": "string",
        "format": "([\\S]+)( [\\S]+)*"
    }
    {
        "name": "secondary segments",
        "separator": " ",
        "datatype": "string"
    }
}

With this description the string abc def+geh will be split into the nested lists [['abc', 'def'], ['geh']].

When I'm not confused, I'll slightly reformulate the FAQ.

Minor issues in the ontology

I'm going through the ontology; a few comments, please advise:

  • the name Concepticon_ID for a Concept Set: I think we should only use the _ID for Reference properties. So maybe this property should be of dc:type='reference-property'?
  • both Sound Sequence and Subsequence have the name Segments. Is that intentional?
  • the description of Source is the only one that describes the secondary separator in the text.
  • the name glottocode is the only one without capitalisation :-).

Overall, there are three names attached to each property: the rdfs:label, the csvw:name and the identifier in the rdf:about. In very many cases these three are identical (except for regular camelCase/noSpace/Underscore replacements). However, there are exceptions and I find that confusing. Some are clearly problematic, but others seem to be unnecessary shortcuts.

  • Problematic: ISO 639-3 code: this label is rendered as iso639P3Code, apparently following lexvo.org. The csvw:name is Iso, which I find strange. I would propose:

    • rdfs:label ISO 639 part 3 code
    • rdf: about iso639P3Code
    • csvw:name ISO_639_Part_3_Code
  • Probably difficult, because already in use are these mismatches:

    • rdfs:label Lexical Unit, but csvw:name Form. I think the label is too restrictive, and a label Form would be much more suitable.
    • rdfs:label Sound Sequence, but csvw:name Segments. I would urge to decide for one or the other. There is no reason to confuse people even more ;-).
  • Unnecessary:

    • rdfs:label Primary Text: why not consequently csvw:name Primary_Text
    • rdfs:label Analyzed Word: why not consequently csvw:name Analyzed_Word
    • rdfs:label Translated Text: why not consequently csvw:name Translated_Text
    • rdfs:label Motivation Structure: why not consequently csvw:name Motivation_Structure
    • rdfs:label Prosodic Structure: why not consequently csvw:name Prosodic_Structure
  • there is an almost regular replacement of Reference into _ID for the csvw:name, but why not

    • Meta-language Reference -> Meta_Language_ID (currently: Language_ID_Meta)
    • Source Form Reference -> Source_Form_ID (currently: Form_ID_Source)
    • Target Form Reference -> Target_Form_ID (currently: Form_ID_Target)
  • Finally: the link between label Concept Set and csvw:name Concepticon_ID is confusing. Can we come up with something better here?

Keep naming of columns similar to names in ontology

This just took me some time to figure out, and left me with quite some confusion, so maybe it would be good to change this:

The names of columns in the description of the tables (e.g. formTable) are linked to properties in the ontology in terms.rdf. For clarity, I would propose to use the ontology names as column names to avoid confusion.

For example, in formTable:

  • "name" = "Form" can simply be changed to "name" = "LexicalUnit"
  • "name" = "Segments" can simply be changed to "name" = "soundSequence"

As a general rule-of-thumb, we can keep _ID instead of Reference, so we use "name" = "Language_ID" instead of the literal copy from the ontology, which would be "name" = "languageReference".

Of course these are just names, and they have to be explicitly referenced in the meta-data of each dataset, so you can change it to whatever you want. However, in the description of the CLDF I think it is better to make this easier to grasp immediately.

If you agree, I'll make a pull request with such changes

It is unclear (also from the pycldf implementation) whether metadata-free datasets may have additional columns

Followup on cldf/pycldf#50 from the formal side:

The specs state that

A dataset can be CLDF conformant without providing a separate metadata description file. To do so, the dataset must follow the default specification for the appropriate module regarding

  • file names
  • CSV dialect
  • column names
    exactly. Thus, rather than not having any metadata, the dataset does not specify any; and instead falls back to using the defaults, i.e. "free" as in "beer" not as in "gluten-free".

A close reading of that seems to imply that additional columns outside those defined in the corresponding module .json should not be permitted.

However, several other bits of the specs (including the Wordlists.md with its talk about optional columns) give me the impression that additional columns should be permitted, and cldf validate does not complain, either.
