
gbif-common

The gbif-common shared library provides:

  • Utility classes for files (Compression, Charset, Iterator, Properties, ...)
  • Utility classes for collections (Arrays, compact HashSet, ...)
  • Utility classes for text (String, Email, line, ...)

To build the project

mvn clean install

Note on Jackson 2

This project shades Jackson 2 into its own artifact. gbif-common is used in projects where other third-party dependencies are compiled against earlier versions of Jackson 2. We decided to shade it to avoid those conflicts, knowing that the final jar will be larger (still < 4 MB).
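Shading of this kind is normally done with the maven-shade-plugin's relocation feature. The snippet below is an illustrative sketch only; the relocated package name is an assumption for the example, not the configuration gbif-common actually uses.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Rewrites Jackson's packages inside the jar so they cannot
               clash with another Jackson 2 version on the classpath. -->
          <relocation>
            <pattern>com.fasterxml.jackson</pattern>
            <shadedPattern>org.gbif.shaded.com.fasterxml.jackson</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```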

Change Log

Documentation

JavaDocs

gbif-common's People

Contributors

ansell, djtfmartin, fmendezh, gbif-jenkins, marcos-lg, mattblissett, mdoering, mike-podolskiy90, omeyn, timrobertson100


gbif-common's Issues

Information Request: Resource usage (esp. disk) for `sortInJava`?

Description:

We are contemplating using dwca-io to read DWCA files. However, our file-processing pipeline is limited in processing power/time, memory and disk availability (AWS Lambda has a max 15 min runtime, 10 GB RAM and 10 GB disk). It looks like the dwca-io package sorts the files in the archive to facilitate a "join" operation, which we will need. Preferably, the sort would be the in-Java one, since vanilla Lambda doesn't have access to GNU sort. Reading the method, it seems it may make up to 2 copies of a file on disk. As an example, the Taxon.tsv file in the DWCA for the GBIF Backbone is currently 2.1 GB unzipped; 3 copies would then be 6.3 GB, which is within the 10 GB limit but doesn't leave a lot of wiggle room.

I see that there is logging information in this method that reports on disk usage. I was hoping that perhaps you all might be able to provide more information on disk usage from existing logs that you might have.

Request:

  1. Is my x3 disk usage estimate approximately accurate?
  2. In your experience, are there other DWCA core files that are significantly larger than the 2.1G GBIF Backbone that I should be worried about?
  3. If you have any other statistics about runtime and memory usage and disk usage that would also be greatly appreciated.

I understand if you don't have readily available information, and I can run my own experiments. I was just hoping that maybe y'all had vast troves of data on this subject just lying around already. :) Thanks in advance for any information at all.

GNU sort using the fieldsEnclosedBy character

We found a case of a dataset where the GNU sort is not working as expected because it's not taking the fieldsEnclosedBy character into account.

I'm explaining the issue first for context:

This is a sampling event dataset and it has 2 extensions: occurrence and measurementOrFact (it's in DEV only https://registry.gbif-dev.org/dataset/10fd6b56-99fd-49e1-863e-09480dfb67c9).

Most of the IDs of the dataset are like these:

"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418097:event"
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418098:event"
...

and there are 2 that are:

2087
CPJGI0057476

The occurrences are linked only to the events with IDs like:

"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418097:event"
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418098:event"
...

And the records in the measurementOrFact extension are only linked to the events:

2087
CPJGI0057476

When we read the archive sorted with GNU sort, the "urn:catalog:... IDs always sort first and the other 2 come last, and I think that's because the quotes are being taken into account. Then, when we parse the measurementOrFact extension in Java (the occurrence extension is parsed correctly), the records with "urn:catalog:... IDs come first and find no match in the extension, so the extension iterator reaches its end; when the records with the other IDs come, the iterator has no more values. As a result, the measurementOrFact extension is empty for all the records when reading the archive.

In other words, all the extension records fall into this if branch, because the comparison starts by matching the extension IDs (2087 and CPJGI0057476) against the urn:catalog:... IDs:

https://github.com/gbif/dwca-io/blob/7cd05e21ebbc0dece62c9e73be41e2e898959073/src/main/java/org/gbif/dwc/StarRecordIterator.java#L126

} else if (id.compareTo(extId) > 0) {
  // this extension id is smaller than the core id and should have been picked up by a core record already
  // seems to have no matching core record, so lets skip it
  it.next();
  extensionRecordsSkipped.put(rowType, extensionRecordsSkipped.get(rowType) + 1);
}

I tested it using the Java sort and it works as expected, since it compares the urn IDs as plain Strings without the quotes.

So before considering other options, I was wondering if it's possible not to take the quotes (or whatever character defined in fieldsEnclosedBy) into account when doing the GNU sort?
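The ordering difference is easy to confirm with plain String comparison: the double-quote character (ASCII 34) sorts before every digit and letter, so an ID that keeps its enclosing quotes jumps to the front. A minimal sketch (the urn is shortened for illustration):

```java
public class QuotedSortDemo {
  public static void main(String[] args) {
    // '"' (ASCII 34) sorts before '2' (50), so a quoted id jumps to the front.
    String quoted = "\"urn:catalog:example:event\"";
    if (quoted.compareTo("2087") >= 0) throw new AssertionError();

    // Without the quotes, 'u' (117) sorts after '2' (50) and 'C' (67),
    // giving the order the StarRecordIterator join expects.
    String bare = "urn:catalog:example:event";
    if (bare.compareTo("2087") <= 0) throw new AssertionError();
    if (bare.compareTo("CPJGI0057476") <= 0) throw new AssertionError();

    System.out.println("ordering confirmed");
  }
}
```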

TabularDataFileReader.close should throw IOException

TabularDataFileReader currently defines its close method without the ability to throw an IOException, but it also overrides java.io.Closeable, which is otherwise allowed to throw an IOException from its close method.

void close();

It would be simpler if TabularDataFileReader didn't change the close method definition. The reasoning is that close method implementations will otherwise need to silently swallow/log (not ideal), "sneaky rethrow" (not ideal), or wrap exceptions as RuntimeException (which won't be handled in the same way as IOException by client code).

The reasoning in the AutoCloseable documentation is that implementors should feel free to drop the checked exception from their throws clause if they know it will not be thrown ( https://docs.oracle.com/javase/7/docs/api/java/lang/AutoCloseable.html ). However, TabularDataFileReader implementations will be dealing with IO in almost all situations, except for in-memory ones, and hence should be free to simply let their IOExceptions propagate to somewhere the final user can handle them.
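A sketch of the proposed signature change, with illustrative names rather than the actual gbif-common API:

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Sketch of the proposal: keep Closeable's checked IOException on close()
// instead of narrowing it to "void close();". All names here are
// illustrative, not the real gbif-common types.
interface RowReader extends Closeable {
  String read() throws IOException;

  @Override
  void close() throws IOException; // propagate, don't swallow or wrap
}

class ReaderBackedRowReader implements RowReader {
  private final java.io.BufferedReader in;

  ReaderBackedRowReader(Reader in) {
    this.in = new java.io.BufferedReader(in);
  }

  @Override
  public String read() throws IOException {
    return in.readLine();
  }

  @Override
  public void close() throws IOException {
    in.close(); // the underlying stream's IOException now propagates naturally
  }
}

public class CloseSignatureDemo {
  public static void main(String[] args) throws IOException {
    // try-with-resources handles the checked exception the normal way
    try (RowReader r = new ReaderBackedRowReader(new StringReader("a\nb"))) {
      System.out.println(r.read()); // prints "a"
    }
  }
}
```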

Move extractCsvMetadata from CSVReaderFactory

CSVReaderFactory and CSVReader will soon be deprecated (TabularFiles is the replacement).
The function extractCsvMetadata should be moved, since it is used and useful.

Once moved, TabularFiles should make use of it and offer a static method like newTabularFileReader(Reader).

CSVReader.Next() method isn't in conformance with RFC 4180

Reading a CSV file using Next() fails on any field that contains a \n. While this may be acceptable, most software follows the RFC 4180 recommendations (https://tools.ietf.org/html/rfc4180); LibreOffice and Excel seem to accept \n if the field is quoted.
Also, empty CSV lines should be skipped in more cases than just when row.length() == 0 is true:

"",""

and

,

are valid empty lines with two fields.
Can you add support for this? The IPT uses this class to read input source data and some users are complaining.
I'm reporting it here because other GBIF tools can show the same behavior.
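For reference, RFC 4180's rule is that a newline inside a quoted field is data, not a record terminator. A minimal illustrative parser of that rule (not the CSVReader implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class Rfc4180Sketch {

  /**
   * Splits one logical CSV record into fields, honouring quoted fields that
   * may contain commas, newlines and doubled quotes. Minimal illustration
   * only, not the gbif-common implementation.
   */
  public static List<String> parseRecord(String record) {
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < record.length(); i++) {
      char c = record.charAt(i);
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
            field.append('"'); // doubled quote -> literal quote
            i++;
          } else {
            inQuotes = false; // closing quote
          }
        } else {
          field.append(c); // commas and \n are plain data inside quotes
        }
      } else if (c == '"') {
        inQuotes = true;
      } else if (c == ',') {
        fields.add(field.toString());
        field.setLength(0);
      } else {
        field.append(c);
      }
    }
    fields.add(field.toString());
    return fields;
  }

  public static void main(String[] args) {
    // An embedded newline inside quotes stays part of the field.
    List<String> f = parseRecord("\"line1\nline2\",b");
    if (!f.get(0).equals("line1\nline2") || !f.get(1).equals("b")) {
      throw new AssertionError(f);
    }
    System.out.println("ok");
  }
}
```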

TabularDataFileReader read method should throw ParseException

Currently, TabularDataFileReader's read method only throws IOException, but if a line can be read correctly yet can't be parsed (the only example I know of is when quoting is used), a RuntimeException is thrown. This is problematic for other projects (e.g. the gbif-data-validator), since the exception is too general.

The suggestion is to change the read method to also throw java.text.ParseException.

TabularFiles instances should support escaping of the quote char

Currently it is impossible to include the quote char in a value and indicate to the TabularDataFileReader that it should be unescaped:
"2", " is it ""expected"""

rfc4180 says:

  1. If double-quotes are used to enclose fields, then a double-quote
    appearing inside a field must be escaped by preceding it with
    another double quote.

A backslash escape character could also be used, so we might want to offer that as an option.
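Under the RFC 4180 convention, unescaping is a simple replacement once the enclosing quotes are stripped. A tiny sketch using the example above (names are illustrative):

```java
public class QuoteEscapeDemo {

  // RFC 4180: a quote inside a quoted field is escaped by doubling it, so
  // unescaping the field body is a plain replacement.
  static String unescape(String quotedBody) {
    return quotedBody.replace("\"\"", "\"");
  }

  public static void main(String[] args) {
    // Body of the field " is it ""expected""" after removing outer quotes:
    String raw = " is it \"\"expected\"\"";
    if (!unescape(raw).equals(" is it \"expected\"")) throw new AssertionError();
    System.out.println(unescape(raw));
  }
}
```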

Shade and upgrade Guava

The use of Google Guava as an explicit dependency limits reuse of gbif-common by other GBIF libraries and third parties such as the ALA, since a large testing regime is needed before verifying that new Guava versions with bug fixes and new features can be used here.

Guava should be shaded in the same way as Jackson is shaded currently, removing it as an obstacle to reuse of this library.

Potential for encoding corruption, depending on the environment

Reported in gbif/portal-feedback#3191, but also affecting other datasets.

Mac OS sets the locale environment variable LC_CTYPE=UTF-8, which is not recognized on Linux. Linux would use en_US.UTF-8 or similar, or leave it unset and use LANG.

When Java starts up on Linux with the Mac OS LC_CTYPE=UTF-8, Charset.defaultCharset() is US-ASCII. This causes problems wherever the default character set is used: System.out, I/O streams without a specified character set, convenience classes like FileReader and FileWriter, etc.

In the case above, a FileWriter is used to output sorted DWCA data. With the mixed environment variables, that leads to the file being written in ASCII, and corrupted data.

In other words, gbif-common assumes a correctly configured UTF-8 environment.
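On the library side, the defensive fix is to avoid charset-defaulting conveniences such as FileWriter and always name the charset explicitly. A sketch:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetDemo {
  public static void main(String[] args) throws IOException {
    Path out = Files.createTempFile("sorted", ".txt");

    // Unlike "new FileWriter(file)", this writes UTF-8 regardless of what
    // file.encoding / LC_CTYPE resolved to at JVM startup.
    try (Writer w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
      w.write("Ábcñ\n");
    }

    // Round-trip check: the non-ASCII characters survive intact.
    String roundTrip = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
    if (!roundTrip.equals("Ábcñ\n")) throw new AssertionError(roundTrip);

    Files.delete(out);
    System.out.println("utf-8 round trip ok");
  }
}
```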

UnknownDelimitersException - make it a checked exception?

@cgendreau @mdoering

UnknownDelimitersException is an unchecked exception in gbif-common (a duplicate copy UnkownDelimitersException exists in dwca-io and can likely be removed).

This exception gets thrown when a CSV file is inspected for its delimiter but the delimiter cannot be determined; this can happen, for example, when reading an empty CSV file.

extractCsvMetadata() inside CSVReaderFactory specifies UnknownDelimitersException in its signature. Various other build() methods inside this class also call extractCsvMetadata(); however, they do NOT specify it in their own signatures.

Since the file has been considered invalid, why not declare it in the signature and give the client the chance to recover from it? This is the way that UnsupportedArchiveException is handled in dwca-io, which is a similar unchecked exception that always appears in the method signature.

Apart from this issue of consistency, I wonder if UnknownDelimitersException and UnsupportedArchiveException wouldn't be better made into checked exceptions?

Allow the temporary working directory for GNU sorts to be passed through

It's not currently possible to specify a temporary working directory other than /tmp/ for the GNU sort, because of this section of code that clears all current environment variables before starting the process:

https://github.com/gbif/gbif-common/blob/master/src/main/java/org/gbif/utils/file/FileUtils.java#L825

This affects the loading of large DwCAs.

As /tmp usually resides on a root partition, which is typically small (20 GB) in comparison to other mounted block devices, it would be useful to allow users to use another area.
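For context, GNU sort already accepts an alternative temp directory via -T (or the TMPDIR environment variable), so the fix is mainly to let callers pass it through. An illustrative sketch of building such a command (not how FileUtils actually constructs it):

```java
import java.util.Arrays;
import java.util.List;

public class SortCommandSketch {

  /**
   * Builds a GNU sort command that spills its temporary files to workDir
   * instead of /tmp. Illustrative only; gbif-common's FileUtils assembles
   * its sort command differently.
   */
  static List<String> sortCommand(String input, String output, String workDir) {
    return Arrays.asList("sort", "-T", workDir, "-o", output, input);
  }

  public static void main(String[] args) {
    List<String> cmd = sortCommand("in.tsv", "out.tsv", "/data/tmp");
    if (!cmd.contains("-T") || !cmd.contains("/data/tmp")) throw new AssertionError();
    System.out.println(String.join(" ", cmd));
  }
}
```

Alternatively, instead of clearing the environment, the code could preserve or set TMPDIR on the ProcessBuilder's environment before starting the process.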
