We found a dataset where the GNU sort does not work as expected because it does not take the fieldsEnclosedBy
character into account.
I'll explain the issue first, for context:
This is a sampling event dataset with 2 extensions: occurrence and measurementOrFact (it is only in DEV: https://registry.gbif-dev.org/dataset/10fd6b56-99fd-49e1-863e-09480dfb67c9).
Most of the IDs of the dataset are like these:
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418097:event"
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418098:event"
...
and there are 2 that are:
The occurrences are linked only to the events with IDs like:
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418097:event"
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418098:event"
...
And the records in the measurementOrFact extension are only linked to the events:
When we read the archive and it is sorted with the GNU sort, the "urn:catalog:... IDs always come first and the other 2 come last. I think this is because the sort takes the quotes into account. Then, when we parse the measurementOrFact extension in Java (the occurrence extension is parsed correctly), the records with the "urn:catalog:... IDs come first and find no match in the extension, which makes the extension iterator reach its end. By the time the records with the other IDs come, the iterator has no more values. So the measurementOrFact extension is always empty for all the records when reading the archive.
In other words, all the extension records fall into this else-if branch, because it starts by comparing the extension IDs (2087 and CPJGI0057476) with the urn:catalog:... IDs:
https://github.com/gbif/dwca-io/blob/7cd05e21ebbc0dece62c9e73be41e2e898959073/src/main/java/org/gbif/dwc/StarRecordIterator.java#L126
} else if (id.compareTo(extId) > 0) {
// this extension id is smaller than the core id and should have been picked up by a core record already
// seems to have no matching core record, so lets skip it
it.next();
extensionRecordsSkipped.put(rowType, extensionRecordsSkipped.get(rowType) + 1);
}
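To make the failure mode concrete, here is a minimal sketch of a merge-join over core IDs and extension IDs. It is not the dwca-io implementation: `MergeJoinSketch` and `countMatches` are hypothetical names, and the `urn:catalog:A:event` style IDs are shortened stand-ins for the real event IDs. It only assumes that the extension is consumed by a single forward-only iterator, as in the branch quoted above.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class MergeJoinSketch {

  // Counts extension rows that get matched to a core row.
  // Simplified illustration; both lists are expected to be sorted the same way.
  static int countMatches(List<String> coreIds, List<String> extIds) {
    int matched = 0;
    Iterator<String> it = extIds.iterator();
    String extId = it.hasNext() ? it.next() : null;
    for (String id : coreIds) {
      while (extId != null) {
        int cmp = id.compareTo(extId);
        if (cmp == 0) {
          matched++;
        } else if (cmp < 0) {
          break; // extension id is larger: keep it for a later core record
        }
        // cmp > 0 lands here without counting: the extension id is smaller
        // than the core id, so it is skipped, as in the branch quoted above
        extId = it.hasNext() ? it.next() : null;
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    List<String> ext = Arrays.asList("2087", "CPJGI0057476");
    // Core order when the enclosing quote was part of the sort key (hypothetical short IDs).
    List<String> quoteOrder = Arrays.asList(
        "urn:catalog:A:event", "urn:catalog:B:event", "2087", "CPJGI0057476");
    // Core order when the unquoted values are sorted.
    List<String> plainOrder = Arrays.asList(
        "2087", "CPJGI0057476", "urn:catalog:A:event", "urn:catalog:B:event");
    System.out.println(countMatches(quoteOrder, ext));
    System.out.println(countMatches(plainOrder, ext));
  }
}
```

With the quote-influenced order, the first urn ID is greater than both extension IDs, so the iterator is drained before the matching core records are reached and nothing is matched. With the plain order both extension rows find their core record.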
I tested it using the Java sort and it works as expected, since it treats the urn IDs as Strings that do not contain the quotes.
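The difference between the two orderings can be reproduced with a small sketch. It assumes that only the values containing a comma are quoted in the file and that the external sort compares raw bytes (as GNU sort does under a C locale; whether that is the locale in use here is an assumption). `QuoteOrderDemo` and `sorted` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class QuoteOrderDemo {

  // Returns a sorted copy using String's natural (UTF-16 code unit) order,
  // which agrees with byte order for the ASCII keys used here.
  static List<String> sorted(List<String> keys) {
    List<String> copy = new ArrayList<>(keys);
    Collections.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    String urn = "urn:catalog:NSW Dept of Planning, Industry and Environment:"
        + "BioNet Atlas of NSW Wildlife:SPJGI4418097:event";

    // Raw keys as they would appear in the file: the urn value is enclosed in quotes.
    System.out.println(sorted(Arrays.asList("2087", "CPJGI0057476", "\"" + urn + "\"")));

    // Parsed keys as Java sees them: the enclosing quotes are gone.
    System.out.println(sorted(Arrays.asList("2087", "CPJGI0057476", urn)));
  }
}
```

The quote character (0x22) sorts before digits and letters, so the quoted urn key comes first in the raw ordering, while the unquoted urn value starts with `u` and comes last in the parsed ordering.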
So before considering other options, I was wondering whether it is possible to ignore the quotes (or whatever character is defined in fieldsEnclosedBy) when doing the GNU sort?
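For illustration only, this is what "ignoring the enclosure character" would mean for the sort key, expressed in Java rather than as a GNU sort option. `EnclosureKey`, `strip` and `ignoringEnclosure` are hypothetical helpers, not part of dwca-io, and the sketch does not handle escaped enclosure characters inside a value.

```java
import java.util.Comparator;

class EnclosureKey {

  // Drops one leading and one trailing enclosure character when both are present.
  static String strip(String raw, char enclosedBy) {
    int n = raw.length();
    if (n >= 2 && raw.charAt(0) == enclosedBy && raw.charAt(n - 1) == enclosedBy) {
      return raw.substring(1, n - 1);
    }
    return raw;
  }

  // Compares raw keys by their value without the enclosure character.
  static Comparator<String> ignoringEnclosure(char enclosedBy) {
    return Comparator.comparing((String raw) -> strip(raw, enclosedBy));
  }
}
```

Sorting the raw keys with `EnclosureKey.ignoringEnclosure('"')` would give the same order as sorting the parsed values, which is the order the star record iteration expects.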