
dex's Introduction

DeX

Caching

Handle initial retrieval and caching of the source CSV and EML docs that can be processed by DeX.

Processing in DeX is always based on a single URL which references a CSV file in a PASTA Data Package. When DeX receives a request to open a URL that it has not opened previously, it requires the contents of the referenced CSV file along with the contents of its associated EML file.

The CSV is available at the provided URL, and the EML is available at a predefined location relative to the CSV, so both can be downloaded directly. However, the two files may also be available in the local filesystem, and if so, the local copies should be used, since that improves performance.

During the initial request for a new CSV, DeX may need to refer to the CSV and EML documents multiple times, so they should be kept cached locally for that time if they had to be downloaded. On the other hand, if the files were already available in the local filesystem, we want to use them from where they are, without copying them into a local cache first.

During the initial request, the CSV and EML docs are processed into a number of other objects which are cached on disk. Only these derived objects are required for serving later requests, so there is no longer a need for the original files.
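
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not DeX's actual implementation: the function name `resolve_source`, the `local_root` and `cache_dir` layout, and the hashed cache key are all hypothetical, and the downloader is passed in as a callable so the sketch runs without network access.

```python
import hashlib
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


def resolve_source(
    url: str,
    local_root: Path,
    cache_dir: Path,
    download: Callable[[str], bytes],
) -> Path:
    """Return a path to the file behind `url`.

    Hypothetical lookup order:
    1. Use the file in place if it exists under `local_root`.
    2. Reuse a previously downloaded copy in `cache_dir`.
    3. Otherwise download it into `cache_dir`.
    """
    # Assumption: the local store mirrors the path component of the URL.
    local_path = local_root / urlparse(url).path.lstrip("/")
    if local_path.is_file():
        return local_path  # used from where it is, nothing is copied

    # Cache key derived from the URL, so repeated lookups hit the same file.
    cache_path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not cache_path.is_file():
        cache_dir.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(download(url))
    return cache_path
```

Once the derived objects have been written, the downloaded copies in `cache_dir` could be deleted, while files found under `local_root` are never touched.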

Conda

Managing the Conda environment in a production environment

Start and stop the dex service as root:

# systemctl start dex.service
# systemctl stop dex.service

Remove and rebuild the dex Conda environment:

conda env remove --name dex
conda env create --file environment-min.yml

Update the dex Conda environment in place:

conda env update --file environment-min.yml --prune

Activate and deactivate the dex Conda environment:

conda activate dex
conda deactivate

Managing the Conda environment in a development environment

Update the environment-min.yml:

conda env export --no-builds > environment-min.yml

Update Conda itself:

conda update --name base conda

Update all packages in environment:

conda update --all

Create or update the requirements.txt file (for use by GitHub Dependabot, and for pip-based manual installs):

pip list --format freeze > requirements.txt

Procedure for updating the Conda environment and all dependencies

conda update -n base -c conda-forge conda
conda activate dex
conda update --all
conda env export --no-builds > environment.yml
pip list --format freeze > requirements.txt

If Conda base won't update to latest version, try:

conda update -n base -c defaults conda --repodata-fn=repodata.json

API

Flush cached objects for a given PackageID

DELETE /<packageId>

Example:

Flush all cached objects for the package with the ID https://pasta-d.lternet.edu/package/data/eml/edi/748/2:

curl -X DELETE https://dex-d.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F748%2F2

Note that the package ID is URL-encoded and that package scope, identifier and revision are all required, and separated by slashes.
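
As a small illustration of the encoding step, the URL above can be produced with Python's standard library. The helper name `flush_url` is hypothetical. The sketch only builds the URL and does not send the DELETE request.

```python
from urllib.parse import quote


def flush_url(dex_base_url: str, package_id: str) -> str:
    """Build the URL for the DELETE request (hypothetical helper)."""
    # safe="" makes quote() percent-encode "/" as well as ":".
    return f"{dex_base_url.rstrip('/')}/{quote(package_id, safe='')}"


url = flush_url(
    "https://dex-d.edirepository.org",
    "https://pasta-d.lternet.edu/package/data/eml/edi/748/2",
)
print(url)
```

The printed value is the same URL that is passed to curl in the example above.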

Open DeX with external data and metadata

DeX can be opened by providing links to a data table in CSV format, along with its associated metadata in EML format. DeX will download the data and metadata from the provided locations.

This API is used by posting a JSON document with the required information to dex/api/preview. DeX will return an identifier, which the browser can then use to form the complete URL to open.

Example JavaScript event handler for a button that opens DeX:

window.onload = function () {
  document.getElementById('open-dex').addEventListener('click', function () {
    // Base URL for the DeX instance to use
    let dexBaseUrl = 'https://dex-d.edirepository.org';
    let data = {
      // Link to metadata document in EML format
      eml: 'https://pasta-s.lternet.edu/package/metadata/eml/edi/5/1',
      // Link to data table in CSV (or closely related) format
      csv: 'https://pasta-s.lternet.edu/package/data/eml/edi/5/1/88e508f7d25a90aa25b0159608187076',
      // As a single EML may contain metadata for multiple CSVs, this value is required and must
      // match the physical/distribution/online/url of the section in the EML which describes the
      // table.
      dist: 'https://pasta-s.lternet.edu/package/data/eml/edi/5/1/88e508f7d25a90aa25b0159608187076',
    };

    let options = {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(data),
    };

    fetch(`${dexBaseUrl}/dex/api/preview`, options)
        .then(response => {
          return response.text()
        })
        .then(body => {
          // Open DeX in new tab
          window.open(`${dexBaseUrl}/dex/profile/${body}`);
        })
        .catch(error => alert(error))
    ;
  });
};
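
The same request can be prepared outside the browser. The following is a minimal Python sketch using only the standard library. The helper names `build_preview_request` and `profile_url` are hypothetical, and the sketch assumes, as in the JavaScript example, that the response body is the identifier as plain text.

```python
import json
from urllib.request import Request


def build_preview_request(dex_base_url: str, eml: str, csv: str, dist: str) -> Request:
    """Prepare the POST to dex/api/preview (hypothetical helper)."""
    body = json.dumps({"eml": eml, "csv": csv, "dist": dist}).encode("utf-8")
    return Request(
        f"{dex_base_url}/dex/api/preview",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def profile_url(dex_base_url: str, identifier: str) -> str:
    """Form the URL to open from the identifier returned by DeX."""
    return f"{dex_base_url}/dex/profile/{identifier}"


csv_url = "https://pasta-s.lternet.edu/package/data/eml/edi/5/1/88e508f7d25a90aa25b0159608187076"
request = build_preview_request(
    "https://dex-d.edirepository.org",
    eml="https://pasta-s.lternet.edu/package/metadata/eml/edi/5/1",
    csv=csv_url,
    dist=csv_url,
)
# To send it:
#   from urllib.request import urlopen
#   identifier = urlopen(request).read().decode("utf-8")
#   print(profile_url("https://dex-d.edirepository.org", identifier))
```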

dex's People

Contributors

rogerdahl, servilla

dex's Issues

Profile fails to load with 504 Gateway Time-out

When attempting to load the profile for https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-ble%2F17%2F5%2Fa43ae92c34e99b8b03a0ff28b08a916a, the result is a 504 Gateway Time-out even though there are only 45 rows in this CSV file.

Filter by category returns incorrect subset

Filtering by category returns an incorrect number of rows.

For example, when working with Dex https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F104%2F1%2F222171e0623f59c7dc6b1148d917931f, subsetting by category using LAKENAME with category value BIRCH returns a subset of 515 rows containing multiple LAKENAME categories, not just BIRCH.

Remove pandas profiling from profile view

Pandas profiling is a great tool for analyzing a raw data table, but it often infers data types that conflict with those explicitly declared in the EML metadata for the given data table (see https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F1%2F1%2Fcba4645e845957d015008e7bccf4f902 as an example). For this reason, pandas profiling should be removed completely from Dex and be replaced with a profile that is rooted in the table description as declared within the EML metadata. A simple head/tail view and a describe of the numeric columns would also be useful. In addition, a view comparing the declared column attribute names with the column names found in the data might be helpful.

Plot by filter

Provide for the ability to plot (scatter, XY, time-series) data that have been filtered. If these data are categorical, then allow each category to be plotted on a separate graph line.

Gracefully handle CSV parsing error

Display an error message with any information we have about what went wrong during parsing, and a button to close the tab. The portal opens Dex in a new tab.

Subsetted data table missing header row

Subsetted data table does not contain the header row showing column names. This results in a mismatch between the index number (of the index column) and the first data line - see attached image. The header row should be included in all subsetted data tables.

image

Generate new EML metadata for filtered data

Generate new EML metadata for any filtered data including, but not limited to, provenance information of what was clipped. This metadata would use the original package metadata as a basis for any new metadata generated.

Exception in pandas.read_csv()

Traceback (most recent call last):
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 2091, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 2076, in wsgi_app
    response = self.handle_exception(e)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/dahl/dev/dex/dex/views/profile.py", line 26, in profile
    csv_df, raw_df, eml_ctx = dex.csv_parser.get_parsed_csv_with_context(rid)
  File "/home/dahl/dev/dex/dex/csv_parser.py", line 29, in get_parsed_csv_with_context
    csv_df = get_parsed_csv(rid, eml_ctx)
  File "/home/dahl/dev/dex/dex/csv_parser.py", line 149, in get_parsed_csv
    return _get_csv(rid, eml_ctx, do_parse=True)
  File "/home/dahl/dev/dex/dex/csv_parser.py", line 261, in _get_csv
    csv_df = pd.read_csv(**arg_dict)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1250, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 273, in read
    conv_data = self._convert_data(data)
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 331, in _convert_data
    return self._convert_to_ndarrays(
  File "/home/dahl/miniconda3/envs/dex/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 564, in _convert_to_ndarrays
    values = lib.map_infer(values, conv_f)
  File "pandas/_libs/lib.pyx", line 2870, in pandas._libs.lib.map_infer
  File "/home/dahl/dev/dex/dex/csv_parser.py", line 116, in float_parser
    return float(x)
TypeError: float() argument must be a string or a number, not 'NoneType'
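
The last frame shows `float_parser` receiving `None`. One possible way to tolerate this is sketched below, assuming that missing fields should become NaN. Whether that is the right treatment for DeX (for example in combination with missing value codes declared in the EML) is not established here, and `safe_float_parser` is a hypothetical name, not the function in csv_parser.py.

```python
import math


def safe_float_parser(x: object) -> float:
    """Convert a CSV field to float, mapping None and blank strings to NaN."""
    if x is None:
        return math.nan
    if isinstance(x, str) and not x.strip():
        return math.nan
    # Anything else is handed to float(), which still raises ValueError
    # for strings that are not numbers.
    return float(x)
```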

Include modified EML in subset downloads

Create EML:

  • Copy complete source EML, then remove DataTable elements other than the one used for the CSV.
  • Add or update: Number of rows, size, checksum (authentication element)
  • For any column that is removed, remove the corresponding attribute branch
  • Mark will provide a template for provenance info (pointing back to the source dataset)
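
The first three steps could be sketched as follows with Python's standard library. This is an illustration under assumptions, not an agreed design: `prune_eml` is a hypothetical name, the table is matched on its physical/distribution/online/url value, only the row count is updated (size, checksum and provenance are left out), and the EML elements are assumed to be unqualified, as in typical EML documents.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def prune_eml(eml_xml: str, dist_url: str, kept_columns: Iterable[str], row_count: int) -> str:
    """Return a copy of the EML that describes only the subsetted table."""
    root = ET.fromstring(eml_xml)
    kept = set(kept_columns)
    # Snapshot the element list first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for table in parent.findall("dataTable"):
            url = table.findtext("physical/distribution/online/url", default="")
            if url.strip() != dist_url:
                # Drop dataTable elements other than the one used for the CSV.
                parent.remove(table)
                continue
            attr_list = table.find("attributeList")
            if attr_list is not None:
                for attr in attr_list.findall("attribute"):
                    name = attr.findtext("attributeName", default="").strip()
                    if name not in kept:
                        # Remove the attribute branch of a removed column.
                        attr_list.remove(attr)
            records = table.find("numberOfRecords")
            if records is not None:
                records.text = str(row_count)
    return ET.tostring(root, encoding="unicode")
```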

Normalize metadata that describes a CSV subset

When downloading a subset, the download includes a JSON file that describes the filters that were applied in order to create the subset. The JSON file format and content should be normalized to whatever we decide (it currently reflects the format used internally by Dex).

Filter by category fails to show available categories

When processing https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-ble%2F17%2F5%2Fa43ae92c34e99b8b03a0ff28b08a916a, the subset/filter_by_category view does not show any categories, even though the column label shows three categories:

image

File extensions on downloads

Downloading subsets of data packages from dex returns a zip of "csv", "eml" and "json". It would be more convenient on the user end to receive "csv.csv" and "eml.xml" so they open in their respective editors without extra fuss.

Sub-setting should be based on a simplified query language

The data table sub-setting functionality should be based on a simple query language (for example, one defined by a context-free grammar in Backus–Naur form) that can easily be executed by the underlying Pandas Python package, in lieu of multiple selection tables.

As it turns out, Pandas already supports a query language that may be used for data table filtering. An example of such a query, run against the Data Carpentry Python ecology surveys.csv table:

df.query("(year == 1990 | year == 1991) & sex == 'M' & (species_id == 'BA' | species_id == 'RM' | species_id == 'DO')")

Dex fails to load for entity in ver 2 of staged data package

I get the 502 bad gateway message when I try to access Dex from the link in the resources section of this ver 2 package in the staging environment (only the first table, which was changed from ver1 of the staged package, fails to open.) There were no problems opening it from the ver 1 package. Date and time columns were reformatted between versions, and missing value codes + explanations were added to the metadata.

package: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=748
problematic entity: https://pasta-s.lternet.edu/package/data/eml/edi/748/2/e7209e56ec3548d0f5e5b7d4939018de

Notes:

I tried putting the package in portal-d and it still doesn't work.
I tried loading the table from dex.edirepository.org using the entity URL and it still doesn't work.

Package ID link in header should point to Data Portal landing page

The Package ID link should point to the data package landing page on the corresponding Data Portal (production, staging, or development). It currently points to the data file PURL.

For example, the Dex link https://dex.edirepository.org/https%3A%2F%2Fpasta-d.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F104%2F1%2F222171e0623f59c7dc6b1148d917931f contains the following Package ID link: https://pasta-d.lternet.edu/package/data/eml/edi/104/1/222171e0623f59c7dc6b1148d917931f. It should point to the development Data Portal at https://portal-d.edirepository.org/nis/mapbrowse?scope=edi&identifier=104&revision=1.

In addition, only the package identifier (edi.104.1) should be hyperlinked, not the Package ID: text.
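
The requested mapping could look roughly like this. It is a sketch, not the fix itself: `landing_page` is a hypothetical helper, the development and staging host pairs are taken from the URLs quoted in these issues, and the production pair is an assumption.

```python
import re
from typing import Tuple
from urllib.parse import urlencode, urlparse

# PASTA host -> Data Portal host. The production entry is assumed.
PORTAL_HOSTS = {
    "pasta-d.lternet.edu": "portal-d.edirepository.org",
    "pasta-s.lternet.edu": "portal-s.edirepository.org",
    "pasta.lternet.edu": "portal.edirepository.org",
}

# Scope, identifier and revision as they appear in a data file URL.
_PATH_RX = re.compile(
    r"^/package/data/eml/(?P<scope>[^/]+)/(?P<identifier>\d+)/(?P<revision>\d+)(?:/|$)"
)


def landing_page(data_url: str) -> Tuple[str, str]:
    """Return (link text, landing page URL) for a PASTA data file URL."""
    parts = urlparse(data_url)
    match = _PATH_RX.match(parts.path)
    if match is None or parts.hostname not in PORTAL_HOSTS:
        raise ValueError(f"Not a recognized PASTA data URL: {data_url}")
    fields = match.groupdict()
    label = "{scope}.{identifier}.{revision}".format(**fields)
    query = urlencode(
        {
            "scope": fields["scope"],
            "identifier": fields["identifier"],
            "revision": fields["revision"],
        }
    )
    return label, f"https://{PORTAL_HOSTS[parts.hostname]}/nis/mapbrowse?{query}"
```

The first element of the returned pair is the text to hyperlink, and the second is the link target.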
