pastaplus / dex

Explore and subset CSV tables using associated EML metadata

License: Apache License 2.0

Shell 0.73% Python 67.28% CSS 6.29% JavaScript 15.07% HTML 10.64%

eml csv data-science tabular-data metadata data-visualization data-analysis

dex's Introduction

DeX

Caching

Handle initial retrieval and caching of the source CSV and EML docs that can be processed by DeX.

Processing in DeX is always based on a single URL which references a CSV file in a PASTA Data Package. When DeX receives a request to open a URL that it has not opened previously, it requires the contents of the referenced CSV file along with the contents of its associated EML file.

The CSV is available at the provided URL, and the EML is available at a predefined location relative to the CSV, so both can be downloaded directly. However, the two files may also be available in the local filesystem, and if so, should be used since it improves performance.

During the initial request for a new CSV, DeX may need to refer to the CSV and EML documents multiple times, so they should be kept cached locally for that time if they had to be downloaded. On the other hand, if the files were already available in the local filesystem, we want to use them from where they are, without copying them into a local cache first.

During the initial request, the CSV and EML docs are processed into a number of other objects which are cached on disk. Only these derived objects are required for serving later requests, so there is no longer a need for the original files.
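
The lookup order described above (use a local file in place if one exists, otherwise download into a cache) could be sketched roughly as follows. This is only an illustration, not DeX's actual code: `resolve_source`, `local_root`, `cache_dir` and the mapping from URL to filename are all hypothetical.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def resolve_source(url: str, local_root: Path, cache_dir: Path) -> Path:
    """Return a path to the file behind `url`, preferring a local copy.

    Hypothetical helper: a real deployment would map PASTA URLs to its
    own storage layout instead of using the last URL path segment.
    """
    name = url.rstrip("/").rsplit("/", 1)[-1]

    local_path = local_root / name
    if local_path.is_file():
        # Already on the local filesystem: use it where it is, no copy.
        return local_path

    cached_path = cache_dir / name
    if not cached_path.is_file():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Download to a temporary file first, so that a partial download
        # is never mistaken for a complete cache entry.
        with urllib.request.urlopen(url) as response, tempfile.NamedTemporaryFile(
            dir=cache_dir, delete=False
        ) as tmp:
            shutil.copyfileobj(response, tmp)
        Path(tmp.name).replace(cached_path)
    return cached_path
```

Once the derived objects have been written, the downloaded copy in `cache_dir` can be deleted, while a file found under `local_root` is left untouched.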

Conda

Managing the Conda environment in a production environment

Start and stop the dex service as root:

# systemctl start dex.service
# systemctl stop dex.service

Remove and rebuild the dex venv:

conda env remove --name dex
conda env create --file environment-min.yml

Update the dex venv in place:

conda env update --file environment-min.yml --prune

Activate and deactivate the dex venv:

conda activate dex
conda deactivate

Managing the Conda environment in a development environment

Update the environment-min.yml:

conda env export --no-builds > environment-min.yml

Update Conda itself:

conda update --name base conda

Update all packages in environment:

conda update --all

Create or update the requirements.txt file (for use by GitHub Dependabot, and for pip based manual installs):

pip list --format freeze > requirements.txt

Procedure for updating the Conda environment and all dependencies

conda update -n base -c conda-forge conda
conda activate dex
conda update --all
conda env export --no-builds > environment.yml
pip list --format freeze > requirements.txt

If Conda base won't update to latest version, try:

conda update -n base -c defaults conda --repodata-fn=repodata.json

API

Flush cached objects for a given PackageID

DELETE /<packageId>

Example:

Flush all cached objects for the package with the ID https://pasta-d.lternet.edu/package/data/eml/edi/748/2:

curl -X DELETE https://dex-d.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F748%2F2

Note that the package ID is URL-encoded and that package scope, identifier and revision are all required, and separated by slashes.
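
For callers that build this URL programmatically, a minimal Python sketch of the encoding step is shown below. The helper name `flush_url` is an assumption; the base URL and package ID are the ones from the example above.

```python
from urllib.parse import quote

DEX_BASE_URL = "https://dex-d.edirepository.org"


def flush_url(package_id: str, base_url: str = DEX_BASE_URL) -> str:
    """Build the URL to send a DELETE request to (hypothetical helper)."""
    # safe="" makes quote() percent-encode "/" and ":" as well, so the
    # whole package ID becomes a single path segment.
    return f"{base_url}/{quote(package_id, safe='')}"


print(flush_url("https://pasta-d.lternet.edu/package/data/eml/edi/748/2"))
```

The printed URL is the same one passed to curl in the example.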

Open DeX with external data and metadata

DeX can be opened by providing links to a data table in CSV format, along with its associated metadata in EML format. DeX will download the data and metadata from the provided locations.

This API is used by posting a JSON document with the required information to dex/api/preview. DeX will return an identifier, which the browser can then use to form the complete URL to open.

Example JavaScript event handler for a button that opens DeX:

window.onload = function () {
  document.getElementById('open-dex').addEventListener('click', function () {
    // Base URL for the DeX instance to use
    let dexBaseUrl = 'https://dex-d.edirepository.org';
    let data = {
      // Link to metadata document in EML format
      eml: 'https://pasta-s.lternet.edu/package/metadata/eml/edi/5/1',
      // Link to data table in CSV (or closely related) format
      csv: 'https://pasta-s.lternet.edu/package/data/eml/edi/5/1/88e508f7d25a90aa25b0159608187076',
      // As a single EML may contain metadata for multiple CSVs, this value is required and must
      // match the physical/distribution/online/url of the section in the EML which describes the
      // table.
      dist: 'https://pasta-s.lternet.edu/package/data/eml/edi/5/1/88e508f7d25a90aa25b0159608187076',
    };

    let options = {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(data),
    };

    fetch(`${dexBaseUrl}/dex/api/preview`, options)
        .then(response => {
          return response.text()
        })
        .then(body => {
          // Open DeX in new tab
          window.open(`${dexBaseUrl}/dex/profile/${body}`);
        })
        .catch(error => alert(error))
    ;
  });
};
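
The same request can also be made from a script. The following Python sketch mirrors the JavaScript handler using only the standard library; the function names are assumptions, while the endpoint paths and JSON keys are taken from the example above.

```python
import json
import urllib.request


def build_preview_request(dex_base_url, eml, csv, dist):
    """Build the POST request for dex/api/preview (hypothetical helper)."""
    payload = {"eml": eml, "csv": csv, "dist": dist}
    return urllib.request.Request(
        f"{dex_base_url}/dex/api/preview",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_profile_url(dex_base_url, request):
    """Send the request and return the URL to open in a browser."""
    with urllib.request.urlopen(request) as response:
        identifier = response.read().decode("utf-8")
    return f"{dex_base_url}/dex/profile/{identifier}"
```

Building the request and sending it are kept separate so the payload can be inspected without contacting a DeX instance.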

dex's Issues

Record original link that CSV was opened from and use it for redirect back

Don't assume an original location for the CSV.

Add ability to select multiple columns for plot

Fix ISO to C date format string conversion for 3 character month

D-MMM-YY currently becomes D-%mM-%y. Should become something like %d-%b-%y (day, 3-character month name, 2-digit year).
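
One way to avoid this kind of partial match is to replace the longest tokens first. The sketch below is not the DeX implementation; the token table and the function name `to_strftime` are assumptions, and only date tokens are covered.

```python
import re
from datetime import datetime

# Hypothetical token table mapping ISO-like tokens to C/strftime codes.
TOKEN_TO_STRFTIME = {
    "YYYY": "%Y",
    "YY": "%y",
    "MMM": "%b",  # three-character month name, e.g. "Jan"
    "MM": "%m",
    "M": "%m",
    "DD": "%d",
    "D": "%d",
}

# Longest tokens come first in the alternation, so "MMM" is matched as a
# whole instead of as "MM" followed by a stray "M".
_TOKEN_RX = re.compile(
    "|".join(sorted(TOKEN_TO_STRFTIME, key=len, reverse=True))
)


def to_strftime(fmt: str) -> str:
    """Convert an ISO-like date format string to a strftime format."""
    return _TOKEN_RX.sub(lambda m: TOKEN_TO_STRFTIME[m.group(0)], fmt)


c_format = to_strftime("D-MMM-YY")
print(c_format)
print(datetime.strptime("5-Jan-21", c_format).date())
```

Note that %b is locale dependent, so month names such as "Jan" parse only when the process locale uses English month abbreviations.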

Exception when CSV has duplicate column names

E.g., this CSV has two columns named "TN":

https://pasta.lternet.edu/package/data/eml/knb-lter-luq/75/11034157/841789827e5c7abc50502f8aa26de080
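
A possible workaround is to make the column names unique before they are used as keys. This is a sketch only; the function name and the ".1" suffix scheme are assumptions, not something DeX is known to do.

```python
from collections import Counter


def dedupe_column_names(names):
    """Return the names with repeats suffixed as name.1, name.2, ...

    Hypothetical helper. It does not guard against a generated name
    colliding with a column that is literally called "TN.1".
    """
    counts = Counter()
    unique = []
    for name in names:
        seen = counts[name]
        unique.append(name if seen == 0 else f"{name}.{seen}")
        counts[name] += 1
    return unique


print(dedupe_column_names(["Date", "TN", "TP", "TN"]))
```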

Properly formatted time column (hh:mm) displays as 1900-01-01 hh:mm

https://dex.edirepository.org/https%3A%2F%2Fpasta-s.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F748%2F2%2Fe7209e56ec3548d0f5e5b7d4939018de

There are missing values in the tables, but they are correctly noted in the metadata

Change (no value) to (empty) in "Filter by query" table

Create GitHub issue for missing values section, show missing instead of present values

First row of data does not appear in "Sample" table

Example dataset here, but I am also seeing it with other datasets here.

Profile fails to load with 504 Gateway Time-out

When attempting to load the profile for https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-ble%2F17%2F5%2Fa43ae92c34e99b8b03a0ff28b08a916a, the result is a 504 Gateway Time-out even though there are only 45 rows in this CSV file.

Apply query filter in the filtered subset download

Filter by category returns incorrect subset

Filtering by category returns an incorrect number of rows.

For example, when working with Dex https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F104%2F1%2F222171e0623f59c7dc6b1148d917931f, subsetting by category using LAKENAME with category value BIRCH returns a subset of 515 rows containing multiple LAKENAME categories, not just BIRCH.

Remove pandas profiling from profile view

Pandas profiling is a great tool for analyzing a raw data table, but often results in conflicting data types as compared to what is explicitly declared in the EML metadata for the given data table (see https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F1%2F1%2Fcba4645e845957d015008e7bccf4f902 as an example). For this reason, pandas profiling should be removed completely from Dex and be replaced with a profile that is rooted in the table description as declared within the EML metadata. A simple head/tail view and describe of numeric columns would also be useful. In addition, perhaps a declared column attribute name and found column attribute view would be helpful.

Plot by filter

Provide for the ability to plot (scatter, XY, time-series) data that have been filtered. If these data are categorical, then allow each category to be plotted on a separate graph line.

Include subset provenance metadata when using the subset interface

Include subset provenance metadata when using the subset interface to download a subsetted data table. The provenance metadata (TBD) should include the filtering properties that had been used to create the subset.

Add checkbox for drawing lines between points in plot

Large CSV tables result in "Out of Memory" exceptions on server

Large CSV tables result in "Out of Memory" exceptions on the server. Specifically, this table in DeX loads in Pandas Profile and Subset, but results in an exception when attempting to view the plot.

The corresponding data package in the EDI Data Portal (staging) is: https://portal-s.edirepository.org/nis/mapbrowse?packageid=edi.143.11

Fix datetime parsing of ISO timezone (%z)

Gracefully handle CSV parsing error

Display an error message with any information we have about what went wrong during parsing, and a button to close the tab. The portal opens Dex in a new tab.

Subsetted data table missing header row

Subsetted data table does not contain the header row showing column names. This results in a mismatch between the index number (of the index column) and the first data line - see attached image. The header row should be included in all subsetted data tables.

Some valid PASTA data URLs give "not a valid URL" errors

Example URL: https://pasta.lternet.edu/package/data/eml/knb-lter-nin/1/1/DailyWaterSample-NIN-LTER-1978-1992

Support map plot if table has lat/long

Generate new EML metadata for filtered data

Generate new EML metadata for any filtered data including, but not limited to, provenance information of what was clipped. This metadata would use the original package metadata as a basis for any new metadata generated.

Modify column names in Profiling to point to the rendered EML for the columns in PASTA

Fix bug where category filter remains disabled even when there are valid categories

Bug: Error when selecting/deselecting columns for subset

Observed the error but wasn't able to immediately reproduce it. May be when combining individual and all select/deselect on a freshly loaded page.

Investigate if performant zoom into plot can be supported without subsample on large tables

Exception in pandas.read_csv()

Traceback (most recent call last):
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 2091, in __call__
return self.wsgi_app(environ, start_response)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 2076, in wsgi_app
response = self.handle_exception(e)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/home/dahl/dev/dex/dex/views/profile.py", line 26, in profile
csv_df, raw_df, eml_ctx = dex.csv_parser.get_parsed_csv_with_context(rid)
File "/home/dahl/dev/dex/dex/csv_parser.py", line 29, in get_parsed_csv_with_context
csv_df = get_parsed_csv(rid, eml_ctx)
File "/home/dahl/dev/dex/dex/csv_parser.py", line 149, in get_parsed_csv
return _get_csv(rid, eml_ctx, do_parse=True)
File "/home/dahl/dev/dex/dex/csv_parser.py", line 261, in _get_csv
csv_df = pd.read_csv(**arg_dict)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 581, in _read
return parser.read(nrows)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 273, in read
conv_data = self._convert_data(data)
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 331, in _convert_data
return self._convert_to_ndarrays(
File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 564, in _convert_to_ndarrays
values = lib.map_infer(values, conv_f)
File "pandas/_libs/lib.pyx", line 2870, in pandas._libs.lib.map_infer

File "/home/dahl/dev/dex/dex/csv_parser.py", line 116, in float_parser
return float(x)
TypeError: float() argument must be a string or a number, not 'NoneType'
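
The final frame shows float() being called on None. A tolerant converter could look like the sketch below. Mapping missing cells to NaN is an assumption about the desired behavior, and this is not the actual body of float_parser in csv_parser.py.

```python
import math


def float_parser(x):
    """Parse a CSV cell as a float, treating missing cells as NaN."""
    if x is None:
        # The parser can hand the converter None for a missing cell.
        return math.nan
    if isinstance(x, str) and not x.strip():
        # Empty or whitespace-only cells are treated as missing too.
        return math.nan
    return float(x)
```

Strings that are neither empty nor numeric still raise ValueError, so genuinely malformed values remain visible.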

Add note on Plot page if plot will be from a subsample due to large table

Include modified EML in subset downloads

Create EML:

Copy complete source EML, then remove DataTable elements other than the one used for the CSV.
Add or update: Number of rows, size, checksum (authentication element)
For any column that is removed, remove the corresponding attribute branch
Mark will provide a template for provenance info (pointing back to the source dataset)

Normalize metadata that describes a CSV subset

When downloading a subset, the download includes a JSON file that describes the filters that were applied in order to create the subset. The JSON file format and content should be normalized to whatever we decide (it currently reflects the format used internally by Dex).

Filter by category fails to show available categories

When processing https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-ble%2F17%2F5%2Fa43ae92c34e99b8b03a0ff28b08a916a, the subset/filter_by_category does not show categories even though the column label shows three categories:

Cells incorrectly flagged as invalid in "Filter by Query"

Wrong parser/formatter selected for ratio field

The result is that the columns don't show up for plots.

Add vertical cell dividers to "Filter by query" table

Move from clevercsv to EML for finding header row and CSV dialect

File extensions on downloads

Downloading subsets of data packages from dex returns a zip of "csv", "eml" and "json". It would be more convenient on the user end to receive "csv.csv" and "eml.xml" so they open in their respective editors without extra fuss.
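
A sketch of writing the archive with extensions on the member names, using Python's zipfile module. The function name and the member names are placeholders chosen for illustration.

```python
import io
import zipfile


def build_subset_zip(csv_text: str, eml_xml: str, filters_json: str) -> bytes:
    """Return a zip archive whose members carry file extensions."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical member names; the extensions are what let the
        # files open in the matching editor.
        archive.writestr("subset.csv", csv_text)
        archive.writestr("metadata.xml", eml_xml)
        archive.writestr("filters.json", filters_json)
    return buffer.getvalue()
```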

See if Pandas Profiling can be made to not gray out any columns

Grayed out columns look like they contain invalid data, while they may contain valid data that just can't be used for generating any stats. We'd like for PP to show those columns as normal, just without any profiling info.

If a filter is disabled, add a brief note explaining the reason

Include CSV filename in the menu bar

Use the objectName field in EML.

Right justified: Package ID: knb-lter-luq.75.11034157 Table: FERNBLEOCC.csv

Fix representation of only date and only time of day

These currently become datetimes with date 1900-01-01 and time 00:00:00.
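
The 1900-01-01 and 00:00:00 values are the defaults that Python's datetime.strptime fills in for fields missing from the format string. One possible fix, sketched here with hypothetical helper names, is to keep only the part that was actually parsed:

```python
from datetime import datetime


def parse_date_only(text: str, fmt: str = "%Y-%m-%d"):
    """Parse a date and drop the default 00:00:00 time part."""
    return datetime.strptime(text, fmt).date()


def parse_time_only(text: str, fmt: str = "%H:%M"):
    """Parse a time of day and drop the default 1900-01-01 date part."""
    return datetime.strptime(text, fmt).time()


print(datetime.strptime("13:45", "%H:%M"))  # full datetime with default date
print(parse_time_only("13:45"))             # time of day only
```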

Add disclaimer on Profile page

Disclaimer: This analysis is not based on information from the EML metadata.

Remove Missing Value bar graph from Profile

Add special case list of date format codes

Sub-setting should be based on a simplified query language

The data table sub-setting functionality should be based on a simple query syntax language (/Backus–Naur form context free grammar) that can easily be executed by the underlying Pandas Python package in lieu of multiple selection tables.

As it turns out, Pandas already supports a query language that may be used for data table filtering. An example of such a query, run against the Data Carpentries Python ecology surveys.csv table, is:

df.query("(year == 1990 | year == 1991) & sex == 'M' & (species_id == 'BA' | species_id == 'RM' | species_id == 'DO')")
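
To show how selections made in the UI could be turned into such an expression, here is a small sketch that builds the query string from a mapping of column names to accepted values. The function name and the input shape are assumptions. Column names that are not valid Python identifiers would need backtick quoting in a pandas query, which this sketch does not handle.

```python
def build_query(filters):
    """Build a pandas query string from {column: [accepted values]}."""
    clauses = []
    for column, values in filters.items():
        if not values:
            continue  # no restriction on this column
        # repr() quotes strings and leaves numbers bare.
        alternatives = " | ".join(f"{column} == {value!r}" for value in values)
        clauses.append(f"({alternatives})" if len(values) > 1 else alternatives)
    return " & ".join(clauses)


query = build_query(
    {
        "year": [1990, 1991],
        "sex": ["M"],
        "species_id": ["BA", "RM", "DO"],
    }
)
print(query)
```

The resulting string can be passed to df.query() unchanged.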

Check for and flag apparent mismatches between EML and actual content

For the initial implementation, we're targeting only ISO-like datetimes that are not declared as dates in the EML.
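
A rough sketch of such a check is shown below, assuming the column values are available as strings and the set of columns declared as dates in the EML is already known. The function names and the 90% threshold are assumptions. The set of strings accepted by datetime.fromisoformat differs between Python versions, so results can vary for less common ISO variants.

```python
from datetime import datetime


def looks_like_iso_datetimes(values, min_fraction=0.9):
    """Return True if most non-empty values parse as ISO datetimes."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return False
    hits = 0
    for value in non_empty:
        try:
            datetime.fromisoformat(value)
        except ValueError:
            continue
        hits += 1
    return hits / len(non_empty) >= min_fraction


def undeclared_datetime_columns(columns, declared_date_columns):
    """List columns that look like datetimes but are not declared as dates.

    `columns` maps column name to a list of string values.
    """
    return [
        name
        for name, values in columns.items()
        if name not in declared_date_columns and looks_like_iso_datetimes(values)
    ]
```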

Subsetting for columns not working

Column subsetting using the checkbox interface is not working. Download returns all columns of the table.

Dex fails to load for entity in ver 2 of staged data package

I get the 502 bad gateway message when I try to access Dex from the link in the resources section of this ver 2 package in the staging environment (only the first table, which was changed from ver1 of the staged package, fails to open.) There were no problems opening it from the ver 1 package. Date and time columns were reformatted between versions, and missing value codes + explanations were added to the metadata.

package: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=748
problematic entity: https://pasta-s.lternet.edu/package/data/eml/edi/748/2/e7209e56ec3548d0f5e5b7d4939018de

Notes:

I tried putting the package in portal-d and it still doesn't work.
I tried loading the table from dex.edirepository.org using the entity URL and it still doesn't work.

Change EML to link to PASTA's EML rendering engine

Sample URL:

https://portal-d.edirepository.org/nis/metadataviewer?packageid=knb-lter-cap.664.1

Pass list of codes for missing values to Pandas Profiling

Hopefully, that will fix the issue we see in Profiling, where it does not see the missing values as such, and classifies the codes as a category.

Package ID link in header should point to Data Portal landing page

The Package ID link should point to the data package landing page on the corresponding Data Portal (production, staging, or development). It currently points to the data file PURL.

For example, the Dex link https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F104%2F1%2F222171e0623f59c7dc6b1148d917931f contains the following Package ID link: https://pasta-d.lternet.edu/package/data/eml/edi/104/1/222171e0623f59c7dc6b1148d917931f. It should point to the development Data Portal at https://portal-d.edirepository.org/nis/mapbrowse?scope=edi&identifier=104&revision=1.

In addition, only the package identifier (edi.104.1) should be hyperlinked, not the Package ID: text.
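
A sketch of deriving both the package identifier and the landing page URL from the data file URL follows. The development and staging host pairs are taken from URLs quoted in these issues; the production pair and the function name are assumptions.

```python
import re
from urllib.parse import urlsplit

# PASTA host -> Data Portal host.
PORTAL_HOSTS = {
    "pasta-d.lternet.edu": "portal-d.edirepository.org",
    "pasta-s.lternet.edu": "portal-s.edirepository.org",
    "pasta.lternet.edu": "portal.edirepository.org",  # assumed
}

# /package/data/eml/<scope>/<identifier>/<revision>[/<entity>]
_DATA_PATH_RX = re.compile(r"^/package/data/eml/([^/]+)/(\d+)/(\d+)(?:/.*)?$")


def package_links(data_url: str):
    """Return (package_id, landing_page_url) for a PASTA data URL."""
    parts = urlsplit(data_url)
    match = _DATA_PATH_RX.match(parts.path)
    if match is None or parts.netloc not in PORTAL_HOSTS:
        raise ValueError(f"Not a recognized PASTA data URL: {data_url}")
    scope, identifier, revision = match.groups()
    package_id = f"{scope}.{identifier}.{revision}"
    landing_page = (
        f"https://{PORTAL_HOSTS[parts.netloc]}/nis/mapbrowse"
        f"?scope={scope}&identifier={identifier}&revision={revision}"
    )
    return package_id, landing_page


print(package_links(
    "https://pasta-d.lternet.edu/package/data/eml/edi/104/1/222171e0623f59c7dc6b1148d917931f"
))
```

Only the first element of the returned pair would be rendered as the hyperlink text, with the second as its target.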