podaac / data-subscriber
Subscribe and bulk download collections of data at PO.DAAC
License: Apache License 2.0
Currently the subscriber uses the bounding box information to search for granules of interest. With the addition of subset capabilities in Harmony and our use of them, it would be great if the bounding box capability were enhanced to call Harmony, perform the actual subset on the data, and return only the data of interest WITHIN the bounding box.
from Jinbo:
"data_subscriber" can be a potential one-stop tool for all 'download and analysis' groups and scenarios. To me as a user, subsetting capability like this will complete this 'swiss army knife'.
Update 4/11/2022
Needs:
Dates currently need a format like 2002-07-04T00:00:00Z
as in
podaac-data-subscriber -c MODIS_A-JPL-L2P-v2019.0 -d ./data/MODIS_A-JPL-L2P-v2019.0 --start-date 2002-07-04T00:00:00Z
Let's be a little more relaxed in allowing the user to specify times:
2002-07-04 --> 2002-07-04T00:00:00Z
2002-07-04T00:00:00.000Z -> 2002-07-04T00:00:00Z
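A minimal sketch of that normalization (the helper name and the exact list of accepted formats are assumptions based on the examples above):

```python
from datetime import datetime

def normalize_datetime(value: str) -> str:
    """Normalize a user-supplied date/time string to the strict
    YYYY-MM-DDTHH:MM:SSZ form passed to CMR. Accepts a bare date,
    a fractional-second timestamp, or an already-strict timestamp."""
    formats = (
        "%Y-%m-%dT%H:%M:%S.%fZ",  # 2002-07-04T00:00:00.000Z
        "%Y-%m-%dT%H:%M:%SZ",     # 2002-07-04T00:00:00Z
        "%Y-%m-%d",               # 2002-07-04
    )
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
            return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")
```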
Hi! I took the liberty of putting this package on conda-forge:
https://anaconda.org/conda-forge/podaac-data-subscriber
I hope someone else finds it helpful too!
The subscriber wasn't written with massive bulk downloads in mind, so there is no concept of scrolling through large result sets. Utilize the CMR scrolling mechanism to fetch listings of files, though this could be millions of files if the search criteria are broad enough.
I use Python, I use the command line. I installed data-subscriber, ran podaac-data-subscriber -h, quickly read the syntax, and tried to download all files for a single ECCO dataset into the current directory. According to the docs, the -sd STARTDATE and -ed ENDDATE arguments are OPTIONAL, but just try to run without them -- failure:
Also failing (not shown) is only using the -ed flag to get ALL available files from the beginning to the specified end date.
The only combination that works is when both -sd and -ed are specified.
$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01T00:00:00Z -d newDirectory
NOTE: Making new data directory at newDirectory(This is the first run.)
Traceback (most recent call last):
File "/home/ifenty/anaconda3/envs/ecco/bin/podaac-data-subscriber", line 8, in <module>
sys.exit(run())
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/site-packages/subscriber/podaac_data_subscriber.py", line 345, in run
with urlopen(url) as f:
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Currently we need to specify the -e .tiff parameter to successfully download OPERA products. We should add this extension to the defaults list.
I'm not sure if it's possible to use this tool for datasets not on the cloud, but there are some nice features I'd like to use. I specified the provider as PODAAC and got an error, whereas it seemed to do nothing with the default POCLOUD provider. I can download the files it lists as FAILURE through a web browser, which suggests a bug in the download code.
With default provider:
$ ./podaac_data_subscriber.py -c SMAP_JPL_L2B_SSS_CAP_V5 --start-date 2021-11-07T00:00:00Z -d test -dydoy
DEBUG:root:Log level set to DEBUG
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "POST /legacy-services/rest/tokens HTTP/1.1" 201 None
WARN: No .update in the data directory. (Is this the first run?)
Downloaded: 0 files
Files Failed to download:0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "DELETE /legacy-services/rest/tokens/F7305582-9214-29AC-2F57-BE0E2C226EFB HTTP/1.1" 204 0
CMR token successfully deleted
With the PODAAC provider:
$ ./podaac_data_subscriber.py -c SMAP_JPL_L2B_SSS_CAP_V5 --start-date 2021-11-07T00:00:00Z -d test -dydoy -p "PODAAC"
DEBUG:root:Log level set to DEBUG
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "POST /legacy-services/rest/tokens HTTP/1.1" 201 None
WARN: No .update in the data directory. (Is this the first run?)
2021-11-11 15:30:37.061956 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/313/SMAP_L2B_SSS_36180_20211109T091710_R18240_V5.0.h5
list index out of range
2021-11-11 15:30:37.062248 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/313/SMAP_L2B_SSS_36179_20211109T073843_R18240_V5.0.h5
list index out of range
...
2021-11-11 15:30:37.068847 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/311/SMAP_L2B_SSS_36146_20211107T012930_R18240_V5.0.h5
list index out of range
Downloaded: 0 files
Files Failed to download:32
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "DELETE /legacy-services/rest/tokens/7485FF97-5B2A-1150-65BE-0B6AE0D84E1A HTTP/1.1" 204 0
CMR token successfully deleted
Instead of 'subscribing', we should allow the ability to set a start/stop time and download all files within that temporal window.
Running the command $ podaac-data-downloader -c GRACEFO_L2_CSR_MONTHLY_0060 -sd 2020-01-01T00:00:00Z -ed 2020-01-02T00:00:00Z -d ~/grace --limit 1 --verbose -e 00
causes 8 files to be downloaded even with --limit 1
The --limit option used to work by setting the page_size in the CMR request; this worked because the downloader made only one CMR request. But now that the downloader/subscriber have implemented paging, setting page_size merely splits the same result set across multiple CMR requests. The subscriber doesn't have a --limit option, so it's not affected.
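One way the downloader could restore --limit under paging is to cap results while accumulating pages instead of relying on page_size. A sketch, with fetch_page standing in (hypothetically) for the real per-page CMR request:

```python
def collect_results(fetch_page, limit=None):
    """Accumulate paged results, stopping once `limit` items are
    gathered. `fetch_page` returns one page (a list) per call and
    an empty list when the result set is exhausted."""
    results = []
    while limit is None or len(results) < limit:
        page = fetch_page()
        if not page:
            break
        results.extend(page)
    # A page may overshoot the cap, so trim before returning.
    return results if limit is None else results[:limit]
```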
Update the CMR query to use updated_since rather than created_at in the CMR query:
params = {
'scroll': "true",
'page_size': 2000,
'sort_key': "-start_date",
'provider': 'POCLOUD',
'ShortName': short_name,
'created_at': data_within_last_timestamp,
'token': token,
'bounding_box': bounding_extent,
}
https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-updated-since
here:
https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_subscriber.py#L298
as well as here:
https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_subscriber.py#L310
Users may not be using the awesome new functionality of the subscriber! Let them know that a new version exists when applicable. This will require a user to grab the version with this capability before they get notified...
Assumptions
release tags will always use semantic versioning: 1.7.0, 1.7.1, 1.9.0, etc
Github releases API
curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/podaac/data-subscriber/releases
Semantic versioning comparison
as seen here: https://stackoverflow.com/questions/11887762/how-do-i-compare-version-numbers-in-python
>>> from packaging import version
>>> version.parse("2.3.1") < version.parse("10.1.2")
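Putting the releases API and the version comparison together, a hedged sketch of the check (the tag handling and function names are assumptions; packaging.version does the comparison as above):

```python
import json
from urllib.request import Request, urlopen

from packaging import version

RELEASES_URL = "https://api.github.com/repos/podaac/data-subscriber/releases"

def latest_release_tag(url=RELEASES_URL):
    """Fetch release tag names from the GitHub releases API (network call)."""
    req = Request(url, headers={"Accept": "application/vnd.github.v3+json"})
    with urlopen(req) as resp:
        releases = json.load(resp)
    tags = [r["tag_name"].lstrip("v") for r in releases]
    return max(tags, key=version.parse) if tags else None

def update_available(current, latest):
    """True when `latest` is a strictly newer semantic version."""
    return version.parse(latest) > version.parse(current)
```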
Allow mapping and usage of the tool to collections.
Say a user changes the search parameters; the .update file might conflict and no files get downloaded. If a .update file is present, we need to compare it to the search parameters and decide whether to use the .update file or show a warning. For example:
"There are X files matching the search parameters used, but all occur outside the latest date in the .update file (YR-MO-DY). Delete the .update file if you wish to proceed."
The other option is to add a flag that says: ignore the .update file.
Not much guidance comes back from CMR if a collection doesn't exist. We should add a check, when 0 files are found, to ensure the collection exists. This could save time for a user querying a non-PO.DAAC provider.
Example log
A warning is printed to the log when downloading many granules that says only the first 2000 will be downloaded. But all granules get downloaded.
Limits were removed in #65 so this warning should likely be removed.
1 ============== Wed Aug 3 22:24:01 UTC 2022 ===============
2 [2022-08-03 22:24:01,555] {podaac_data_subscriber.py:165} INFO - NOTE: Making new data directory at /cloud/ghrsst/open/data/GDS2/L3U/AVHRRMTC/STAR/v2.80(This is the first run.)
3 [2022-08-03 22:24:01,558] {podaac_data_subscriber.py:206} INFO - Temporal Range: 2022-07-01T00:00:00Z,2022-08-03T22:24:01Z
4 [2022-08-03 22:24:01,558] {podaac_data_subscriber.py:212} INFO - Provider: POCLOUD
5 [2022-08-03 22:24:01,558] {podaac_data_subscriber.py:213} INFO - Updated Since: 2022-07-01T00:00:00Z
6 [2022-08-03 22:24:01,558] {podaac_access.py:301} INFO - https://cmr.earthdata.nasa.gov/search/granules.umm_json?page_size=2000&sort_key=-start_date&provider=POCLOUD&updated_since=2022-07-01T00%3A00%3A00Z&ShortName=AVHRRF_MC-STAR-L3U-v2.80&temporal=2022-07-01T00%3A00%3A00Z%2C2022-08-03T22%3A24%3A01Z&token=D5A7A608-AFCD-719D-7998-B46207622CB1
7 [2022-08-03 22:24:06,112] {podaac_data_subscriber.py:228} INFO - 4850 new granules found for AVHRRF_MC-STAR-L3U-v2.80 since 2022-07-01T00:00:00Z
>> 8 [2022-08-03 22:24:06,277] {podaac_data_subscriber.py:254} WARNING - Only the most recent 2000 granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.
9 [2022-08-03 22:24:06,283] {podaac_data_subscriber.py:270} INFO - Found 4850 total files to download
10 [2022-08-03 22:24:06,284] {podaac_data_subscriber.py:272} INFO - Downloading files with extensions: ['.nc']
11 [2022-08-03 22:24:10,666] {podaac_data_subscriber.py:299} INFO - 2022-08-03 22:24:10.666259 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/AVHRRF_MC-STAR-L3U-v2.80/2022/215/20220803195000-STAR-L3U_GHRSST-SSTsubskin-AVHRRF_MC-ACSPO_V2.80-v02.0-fv01.0.nc
...
4928 [2022-08-04 00:13:32,702] {podaac_data_subscriber.py:299} INFO - 2022-08-04 00:13:32.702847 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/AVHRRF_MC-STAR-L3U-v2.80/2022/182/20220701000000-STAR-L3U_GHRSST-SSTsubskin-AVHRRF_MC-ACSPO_V2.80-v02.0-fv01.0.nc
>> 4929 [2022-08-04 00:13:32,703] {podaac_data_subscriber.py:314} INFO - Downloaded Files: 4848
4930 [2022-08-04 00:13:32,703] {podaac_data_subscriber.py:315} INFO - Failed Files: 2
4931 [2022-08-04 00:13:32,703] {podaac_data_subscriber.py:316} INFO - Skipped Files: 0
4932 [2022-08-04 00:13:33,051] {podaac_access.py:118} INFO - CMR token successfully deleted
4933 [2022-08-04 00:13:33,052] {podaac_data_subscriber.py:318} INFO - END
This bug only affects restricted collections.
CMR Search parameters defined at https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_downloader.py#L175 do not include the generated EDL Token required for restricted searches.
https://podaac.jpl.nasa.gov/forum/viewtopic.php?f=6&t=1418
Good evening!
We download a lot of data from PODAAC, and occasionally something goes wrong partway through the download (real-world stuff like a bad network connection or the system going down at the wrong moment).
Does the cloud data subscriber script have mechanisms to deal with these cases? Ideally it should try to get the failed download again but quickly "give up" if the problem persists.
Subscriber should be able to 'resume' during a download failure. Currently, if any of the downloads fail during a subscriber run, the subscriber "exits" without updating its last run, and the next time it runs, it will attempt to download all files from the previous, "failed" run, even if only one out of N files actually failed.
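One possible resume strategy (a sketch of an idea, not current behavior): advance the .update marker only up to the first failed granule, so the next run re-requests just the failures and anything after them rather than the whole previous window. The result shape here is hypothetical:

```python
def next_run_timestamp(results, fallback):
    """Pick the timestamp to store in .update after a run.

    `results` is a list of (granule_start_time, succeeded) pairs
    sorted oldest-first. Stopping the marker at the first failure
    means a rerun retries that granule and everything after it,
    while fully successful prefixes are never re-downloaded."""
    last_ok = fallback
    for start_time, succeeded in results:
        if not succeeded:
            return last_ok
        last_ok = start_time
    return last_ok
```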
Ran into an error when testing phase 2 dataset downloads in dev.
Because we have multiple collections that share a similar destination path (found this error with MODIS_AQUA_L3_SST_THERMAL_DAILY_4KM_NIGHTTIME_V2019.0), there is a chance that two instances of data-subscriber both try to call makedirs for the same directory at the same time. One succeeds while the other throws an error: [Errno 17] File exists.
The easy fix is to pass the exist_ok=True parameter to ignore an already existing directory.
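The fix could look like this (helper name is illustrative):

```python
import os

def ensure_data_dir(path):
    # exist_ok=True makes concurrent calls race-free: if another
    # subscriber instance created the directory first, makedirs
    # silently succeeds instead of raising [Errno 17] File exists.
    os.makedirs(path, exist_ok=True)
```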
If a user uses the same directory (-d) for multiple collections, there is no way to know which collection the ".update" file belongs to. We should update this to include the collection name in the marker file, e.g. a .update_collectionshortname file, to ensure that specifying the same directory doesn't break functionality.
Originally posted by @mike-gangl in #38 (comment)
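A sketch of the per-collection marker-file naming (function name is hypothetical; the double-underscore form matches the .update__SHORTNAME file that appears in the subscriber logs):

```python
import os

def update_filename(data_dir, collection=None):
    """Build the marker-file path: '.update__SHORTNAME' when a
    collection is given, falling back to the legacy '.update' so
    existing data directories keep working."""
    name = f".update__{collection}" if collection else ".update"
    return os.path.join(data_dir, name)
```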
Instead of asking for data between DATE-1 and DATE-2, some users may want data for CYCLE-23 to CYCLE-26. Expand the subscriber to search/download by cycle as well. This would really be helpful for the JASON-series missions. Other users may want data by pass, too. For missions that don't include pass and cycle, appropriate warning messages have to be displayed.
Similar to:
-dc Flag to use cycle number for directory where data products will be downloaded.
-dydoy Flag to use start time (Year/DOY) of downloaded data for directory where data products will be downloaded.
-dymd Flag to use start time (Year/Month/Day) of downloaded data for directory where data products will be downloaded.
Add new -dy flag for just YEAR as the output directory structure.
The provider for the tool is currently hardcoded to POCLOUD. We should make this configurable so that other DAACs/users can use this tool. If no provider is given, we should default to POCLOUD. A provider is needed to prevent issues where the same collection name is used by multiple providers.
Users would like to see just how much data they are going to download. Adding a --dry-run option would run the search, create the download lists, and let a user know how many files to expect.
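A sketch of how a --dry-run flag might be wired in (the search callable and everything except -c and --dry-run are illustrative):

```python
import argparse

def make_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--collection-shortname", required=True)
    parser.add_argument("--dry-run", action="store_true",
                        help="Search and report matching granules without downloading.")
    return parser

def run(args, search):
    """`search` is a hypothetical callable returning the list of
    matching granule URLs for a collection."""
    granules = search(args.collection_shortname)
    print(f"Found {len(granules)} total files to download")
    if args.dry_run:
        return []       # stop before any transfer happens
    return granules     # downstream code would download these
```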
I got "Error getting the token - check user name and password". My username and password in .netrc are correct. The download was not affected and started anyway.
podaac-data-subscriber -c SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1 -d ./ --start-date 2011-11-25T00:00:00Z -b="-140,20,-110,40"
Error getting the token - check user name and password
WARN: No .update in the data directory. (Is this the first run?)
Warning: only the most recent 2000 granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.
2021-11-16 07:10:54.856155 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1/SWOT_L2_LR_SSH_Expert_368_011_20121111T230805_20121111T235910_DG10_01.nc
2021-11-16 07:10:56.091361 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1/SWOT_L2_LR_SSH_Expert_368_010_20121111T221659_20121111T230804_DG10_01.nc
Whether or not a bounding box is specified, we still send the 'bounding_box' parameter to CMR. This is not supported by some collections, for example:
podaac-data-subscriber -c AU_Ocean_NRT_R01 -sd 2022-03-01T00:00:00Z -ed 2022-04-30T23:59:59Z -e '.nc' -p LANCEAMSR2 -d data --verbose
with a CMR query of:
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=LANCEAMSR2&updated_since=2022-03-01T00%3A00%3A00Z&ShortName=AU_Ocean_NRT_R01&temporal=2022-03-01T00%3A00%3A00Z%2C2022-04-30T23%3A59%3A59Z&bounding_box=-180%2C-90%2C180%2C90
Yields 0 results.
if we take that same CMR query, and remove the bounding box parameter, we get results:
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=LANCEAMSR2&updated_since=2022-03-01T00%3A00%3A00Z&ShortName=AU_Ocean_NRT_R01&temporal=2022-03-01T00%3A00%3A00Z%2C2022-04-30T23%3A59%3A59Z
...
{
"hits": 370,
"took": 507,
"items": [
{
"meta": {
"concept-type": "granule",
"concept-id": "G2250539839-LANCEAMSR2",
"revision-id": 1,
"native-id": "AMSR_U2_L2_Ocean_R01_202204120051_D.he5",
"provider-id": "LANCEAMSR2",
"format": "application/echo10+xml",
"revision-date": "2022-04-12T03:23:40.993Z"
Even though this is not a PO.DAAC dataset, we should support this query. The granule uses OrbitCalculatedSpatialDomains rather than a geodetic or horizontal spatial domain.
Had user feedback that after podaac-data-subscriber was installed for them on a Unix machine, the module showed up as installed, but the commands "podaac-data-subscriber" and "podaac-data-downloader" could not be found.
Was fixed by adding $HOME/.local/bin to PATH.
This situation could come up for anyone doing a pip install --user podaac-data-subscriber as well. Would be good to have something in the README covering this scenario.
Here's one potential reference to use: https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-to-the-user-site
And since I’m emailing, I may as well ask about the problem I expected to meet. The GRACE series has four files per month:
Is there any way to use this downloader to only download, say, the GSM “BA” format and the GAC file, as I used to via FTP/Drive? (GRACE is a small enough data series that this isn’t crucial, but I thought I’d ask.)
An ECCO dataset of ancillary files has no natural 'start date' and 'end date'. Users shouldn't be required to specify them to download.
$ podaac-data-subscriber -c ECCO_L4_ANCILLARY_DATA_V4R4 -d anc
WARN: No .update in the data directory. (Is this the first run?)
Downloaded: 0 files
Files Failed to download:0
CMR token successfully deleted
Oh, it also fails when I specify a start date and end date that spans the entire ECCO period:
$ podaac-data-subscriber -c ECCO_L4_ANCILLARY_DATA_V4R4 -d anc -sd 1990-01-01T00:00:00Z -ed 2021-01-01T01:01:01Z
NOTE: .update found in the data directory. (The last run was at 2022-01-27T20:07:48Z.)
Downloaded: 0 files
Files Failed to download:0
CMR token successfully deleted
ECHO-Token Deprecation Notice
CMR Legacy Services' ECHO tokens will be deprecated soon. Please use EDL tokens and send them with the Authorization header. This document contains many mentions of ECHO-Tokens, which will soon be out of date. Instructions on how to generate an EDL token are here
We will need to update the subscriber code base to either accept an EDL token or generate one. Generating one fits most seamlessly with the current way of working.
Is it possible to add a new optional argument that allows users to search available datasets from CMR based on keywords and return short_names?
The descriptions for --start-date and --end-date options are not correct in the online help and the md files.
Currently shows:
-sd STARTDATE, --start-date STARTDATE
The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z
-ed ENDDATE, --end-date ENDDATE
The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z.
Check the existence of .netrc (_netrc) file. If it does not exist, prompt a message with two options (1) if one has an EDL, ask for username and password then generate .netrc file for the user (2) if no EDL, redirect users to apply an EDL then come back to the data-subscriber.
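A hedged sketch of the existence check and the entry that would be written for option (1) (file locations follow the usual .netrc/_netrc convention; helper names are made up):

```python
import os
import platform

def find_netrc():
    """Return the path of the user's .netrc (_netrc on Windows), or
    None if absent -- the cue to prompt for EDL credentials or point
    the user at Earthdata Login registration."""
    filename = "_netrc" if platform.system() == "Windows" else ".netrc"
    path = os.path.join(os.path.expanduser("~"), filename)
    return path if os.path.exists(path) else None

def netrc_entry(username, password, machine="urs.earthdata.nasa.gov"):
    """Render the netrc entry to append for Earthdata Login."""
    return f"machine {machine} login {username} password {password}\n"
```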
The current subscriber only looks at the 'created_at' time, which is fine for initial ingest but will miss updates to data files.
Installed version 1.9.0 with pip but it did not install the required tenacity dependency.
pip install --force --user -U podaac-data-subscriber
Collecting podaac-data-subscriber
Using cached podaac_data_subscriber-1.9.0-py3-none-any.whl (22 kB)
Installing collected packages: podaac-data-subscriber
Attempting uninstall: podaac-data-subscriber
Found existing installation: podaac-data-subscriber 1.9.0
Uninstalling podaac-data-subscriber-1.9.0:
Successfully uninstalled podaac-data-subscriber-1.9.0
Successfully installed podaac-data-subscriber-1.9.0
If I try to run the new version:
Traceback (most recent call last):
File "/home/DEV-cloud/.local/bin/podaac-data-subscriber", line 5, in <module>
from subscriber.podaac_data_subscriber import main
File "/home/DEV-cloud/.local/lib/python3.6/site-packages/subscriber/podaac_data_subscriber.py", line 25, in <module>
from subscriber import podaac_access as pa
File "/home/DEV-cloud/.local/lib/python3.6/site-packages/subscriber/podaac_access.py", line 21, in <module>
import tenacity
ModuleNotFoundError: No module named 'tenacity'
Add a helper flag for shifting the timestamp used, on a collection-by-collection basis, when creating the DOY folder.
I use Python, I use the command line. I installed data-subscriber, ran podaac-data-subscriber -h, quickly read the syntax, and tried to download a single ECCO dataset into the current directory with -d ./
Note that ./ is perfectly normal syntax meaning the current directory. The error message is not helpful.
$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01 -d .
$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01T00:00:00Z -d ./
WARN: No .update in the data directory. (Is this the first run?)
Traceback (most recent call last):
File "/home/ifenty/anaconda3/envs/ecco/bin/podaac-data-subscriber", line 8, in <module>
sys.exit(run())
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/site-packages/subscriber/podaac_data_subscriber.py", line 345, in run
with urlopen(url) as f:
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Took a while to discover that I need to specify -d DIRECTORY_NAME and that the DIRECTORY_NAME will be created.
Running the following command:
podaac-data-subscriber -c AU_Ocean_NRT_R01 -sd 2021-06-10T00:00:00Z -e '.he5' -p LANCEAMSR2 -d ./LANCE --verbose
we get no search results. The underlying query sets a default bounding box of 'global', which returns 0 results because the underlying granules do not use horizontal spatial domains; they use the orbit number, crossing time, and crossing latitude for search.
Until we can support that kind of search, we should remove the 'bounding box' from default searches.
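A sketch of the proposed behavior: include 'bounding_box' in the CMR parameters only when the user actually supplied one (function and parameter names are illustrative):

```python
def build_params(short_name, provider, temporal, bounding_box=None):
    """Build CMR granule-search parameters, omitting 'bounding_box'
    unless the user passed one. Collections indexed only by
    OrbitCalculatedSpatialDomains return 0 results for any
    bounding-box query, even the global extent."""
    params = {
        'page_size': 2000,
        'sort_key': "-start_date",
        'provider': provider,
        'ShortName': short_name,
        'temporal': temporal,
    }
    if bounding_box is not None:
        params['bounding_box'] = bounding_box
    return params
```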
I spoke too soon. The downloader is still working flawlessly for the Sentinel-6 data (thank you!). With optimism in mind, I moved to switch over my GRACE monthly spherical harmonic downloads to PODAAC as well. And… can’t get it to work.
Here’s the command I typed in:
podaac-data-downloader -c GRACEFO_L2_CSR_MONTHLY_0060 -d /Volumes/DataDisk/GRACE_RL06/CSR_SPHARM_60 -sd 2018-01-01T00:00:00Z -ed 2018-12-31T00:00:00Z
I’m looking for this data (which is supposed to be “cloud enabled” now – I checked this time!)
https://podaac.jpl.nasa.gov/dataset/GRACEFO_L2_CSR_MONTHLY_0060
And downloading by year, so for the first run, I was looking for all the data from 2018-1-1 to 2018-12-31.
When I called that, all the output I got was:
Found 0 total files to download
Downloaded: 0 files
Files Failed to download:0
It will be helpful to display the total number of found granules, especially when the warning "Warning: only the most recent 2000 granules will be downloaded;" is shown.
Allow the option to download a subscribed event to a user's S3 bucket; no data should leave the cloud when this happens.
These two defaults simplify use of the tool to an extreme: "data-subscriber -c shortname". This will not affect experienced users but significantly lowers the level of effort for new users.
The downloader does not seem to identify granules correctly during a request with '-e' for specific extensions. In contrast, the subscriber identifies and downloads granules using all the same parameters. Examples below:
Requesting .nc files using downloader
podaac-data-downloader -c JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F -d S6_L2_HR_STD_NRT -e .nc -sd 2021-06-01T00:46:02Z -ed 2021-06-01T03:00:00Z
[2022-09-02 16:02:12,020] {podaac_data_downloader.py:242} INFO - Found 0 total files to download
[2022-09-02 16:02:12,021] {podaac_data_downloader.py:284} INFO - Downloaded Files: 0
[2022-09-02 16:02:12,025] {podaac_data_downloader.py:285} INFO - Failed Files: 0
[2022-09-02 16:02:12,029] {podaac_data_downloader.py:286} INFO - Skipped Files: 0
[2022-09-02 16:02:12,297] {podaac_access.py:118} INFO - CMR token successfully deleted
[2022-09-02 16:02:12,297] {podaac_data_downloader.py:288} INFO - END
Requesting .nc files using subscriber
podaac-data-subscriber -c JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F -d S6_L2_HR_STD_NRT -e .nc -sd 2021-06-01T00:46:02Z -ed 2021-06-01T03:00:00Z
[2022-09-02 16:01:06,815] {podaac_data_subscriber.py:179} WARNING - No .update__JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F in the data directory. (Is this the first run?)
[2022-09-02 16:01:07,953] {podaac_data_subscriber.py:270} INFO - Found 10 total files to download
[2022-09-02 16:01:12,063] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:12.063932 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T025238_20210601T025438_F02.nc
[2022-09-02 16:01:13,701] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:13.701817 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T024238_20210601T025238_F02.nc
[2022-09-02 16:01:15,301] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:15.301483 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T023308_20210601T024238_F02.nc
[2022-09-02 16:01:16,771] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:16.769997 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T022320_20210601T022637_F02.nc
[2022-09-02 16:01:18,283] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:18.283086 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T020158_20210601T020254_F02.nc
[2022-09-02 16:01:20,296] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:20.296783 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T014632_20210601T015135_F02.nc
[2022-09-02 16:01:21,884] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:21.884365 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T013541_20210601T013645_F02.nc
[2022-09-02 16:01:23,508] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:23.508012 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T012546_20210601T013428_F02.nc
[2022-09-02 16:01:25,340] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:25.340850 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_179_20210601T005400_20210601T010128_F02.nc
[2022-09-02 16:01:27,031] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:27.031012 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_179_20210601T004400_20210601T005400_F02.nc
[2022-09-02 16:01:27,032] {podaac_data_subscriber.py:314} INFO - Downloaded Files: 10
[2022-09-02 16:01:27,035] {podaac_data_subscriber.py:315} INFO - Failed Files: 0
[2022-09-02 16:01:27,036] {podaac_data_subscriber.py:316} INFO - Skipped Files: 0
[2022-09-02 16:01:27,358] {podaac_access.py:118} INFO - CMR token successfully deleted
[2022-09-02 16:01:27,364] {podaac_data_subscriber.py:318} INFO - END
It would be convenient if I could optionally provide dates without times for --start-date and --end-date. If no time is provided, T00:00:00Z/T23:59:59Z would be automatically added to the provided start and end dates. EDSC does this, as does the C2C CLI tooling. The C2C CLI code can probably be reused here.
User posted an issue on the po.daac forum.
meteo@BOIRA:~/PROJECTES/SST/NCEI/DATA/SST/NC2$ podaac-data-subscriber -c AVHRR_OI-NCEI-L4-GLOB-v2.1 -d ./ --verbose
NOTE: .update found in the data directory. (The last run was at 2021-10-29T00:05:03Z
.)
Provider: POCLOUD
Updated Since: 2021-10-29T00:05:03Z
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=POCLOUD&ShortName=AVHRR_OI-NCEI-L4-GLOB-v2.1&updated_since=2021-10-29T00%3A05%3A03Z%0A&token=****&bounding_box=-180%2C-90%2C180%2C90
Traceback (most recent call last):
File "/home/meteo/.local/bin/podaac-data-subscriber", line 11, in <module>
load_entry_point('podaac-data-subscriber==1.6.0', 'console_scripts', 'podaac-data-subscriber')()
File "/home/meteo/.local/lib/python3.6/site-packages/subscriber/podaac_data_subscriber.py", line 339, in run
with urlopen(url) as f:
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
The trailing %0A in the 'updated_since' value was causing the issue. Suggest we strip any whitespace characters when reading the .update file.
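The strip could be as simple as (helper name is illustrative):

```python
def read_update_timestamp(path):
    """Read the last-run timestamp from a .update file, stripping any
    trailing newline/whitespace so it can't leak into the CMR query
    as an encoded '%0A'."""
    with open(path) as f:
        return f.read().strip()
```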
Saw this during regression testing:
WARNING root:podaac_data_subscriber.py:307 2022-08-04 14:20:03.485885 FAILURE: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_STD_OST_NRT_F/S6A_P4_2__LR_STD__NR_042_083_20220101T104242_20220101T123506_F04.nc
Traceback (most recent call last):
File "/Users/runner/work/data-subscriber/data-subscriber/subscriber/podaac_data_subscriber.py", line 302, in run
urlretrieve(f, output_path)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 239, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 523, in open
response = meth(req, response)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 632, in http_response
response = self.parent.error(
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 555, in error
result = self._call_chain(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 494, in _call_chain
result = func(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 747, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 523, in open
response = meth(req, response)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 632, in http_response
response = self.parent.error(
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 561, in error
return self._call_chain(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 494, in _call_chain
result = func(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
We should catch the 503 and retry. A 503 could happen for any number of reasons, but we're interested in addressing transient issues that occur occasionally.
Some more information can be found here: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/http-503-service-unavailable.html
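A sketch of what the retry could look like; retrieve_with_retry and flaky_fetch are hypothetical names, and the fetch callable stands in for the downloader's urlretrieve call:

```python
import time
import urllib.error

def retrieve_with_retry(fetch, retries=3, backoff=0.0):
    """Hypothetical retry wrapper: call fetch() and retry on a
    transient HTTP 503, up to `retries` attempts, sleeping
    backoff * attempt seconds between tries."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            # Only retry the transient 503; re-raise everything else,
            # or give up once we are out of attempts
            if e.code != 503 or attempt == retries:
                raise
            time.sleep(backoff * attempt)

# Simulated download that fails twice with 503, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError("http://example", 503,
                                     "Service Unavailable", None, None)
    return "ok"

print(retrieve_with_retry(flaky_fetch))
# ok
```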
Add an option to the subscriber to enable non-flat output directories, i.e. some default layouts.
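One way the default layouts could be keyed off the granule start time; layout_path and the layout names are hypothetical, not existing subscriber options:

```python
from datetime import datetime
from pathlib import Path

def layout_path(data_dir, granule_start, layout="year/doy"):
    """Hypothetical helper: map a granule start time onto one of a
    few default non-flat directory layouts under the data directory."""
    t = datetime.strptime(granule_start, "%Y-%m-%dT%H:%M:%SZ")
    if layout == "year/doy":            # e.g. 2022/001
        sub = t.strftime("%Y/%j")
    elif layout == "year/month/day":    # e.g. 2022/01/01
        sub = t.strftime("%Y/%m/%d")
    else:                               # flat (current behavior)
        sub = ""
    return Path(data_dir) / sub

print(layout_path("data", "2022-01-01T10:42:42Z"))
```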
Add the citation information for a dataset when we download the data. An example is a file named <dataset_short_name>.citation.txt, which can, at minimum, include the citation information.
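A minimal sketch of writing such a file; write_citation is a hypothetical helper, and in practice the citation text would come from the collection's metadata:

```python
import tempfile
from pathlib import Path

def write_citation(data_dir, short_name, citation_text):
    """Hypothetical sketch: write <dataset_short_name>.citation.txt
    alongside the downloaded granules."""
    path = Path(data_dir) / f"{short_name}.citation.txt"
    path.write_text(citation_text)
    return path

with tempfile.TemporaryDirectory() as d:
    p = write_citation(d, "MODIS_A-JPL-L2P-v2019.0",
                       "Example citation text.")
    print(p.name)
# MODIS_A-JPL-L2P-v2019.0.citation.txt
```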
-start-date should be --start-date in README.md (see https://github.com/podaac/data-subscriber#your-first-run).
When running the downloader/subscriber with a collection supporting SHA-512, we run into errors:
podaac-data-downloader --verbose -c GRACEFO_L2_CSR_MONTHLY_0060 -d ./podaac_csr -sd 2018-01-01T00:00:00Z -ed 2022-06-14T16:11:58Z -e "00"
[2022-06-13 11:02:44,494] {podaac_data_downloader.py:158} INFO - NOTE: Making new data directory at ./podaac_csr(This is the first run.)
[2022-06-13 11:02:44,699] {podaac_data_downloader.py:192} INFO - Temporal Range: 2018-01-01T00:00:00Z,2022-06-14T16:11:58Z
[2022-06-13 11:02:44,699] {podaac_data_downloader.py:195} INFO - Provider: POCLOUD
[2022-06-13 11:02:44,700] {podaac_access.py:300} INFO - https://cmr.earthdata.nasa.gov/search/granules.umm_json?page_size=2000&sort_key=-start_date&provider=POCLOUD&ShortName=GRACEFO_L2_CSR_MONTHLY_0060&temporal=2018-01-01T00%3A00%3A00Z%2C2022-06-14T16%3A11%3A58Z&token=5896F157-242C-A41B-F04D-45D86713C6ED&bounding_box=-180%2C-90%2C180%2C90
[2022-06-13 11:02:46,826] {podaac_data_downloader.py:209} INFO - 176 granules found for GRACEFO_L2_CSR_MONTHLY_0060
[2022-06-13 11:02:46,827] {podaac_data_downloader.py:249} INFO - Found 176 total files to download
[2022-06-13 11:02:46,827] {podaac_data_downloader.py:251} INFO - Downloading files with extensions: ['00']
[2022-06-13 11:02:52,528] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:52.528421 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022060-2022090_GRFO_UTCSR_BC01_0600
[2022-06-13 11:02:54,344] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:54.344359 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022060-2022090_GRFO_UTCSR_BB01_0600
[2022-06-13 11:02:56,177] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:56.177678 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022060-2022090_GRFO_UTCSR_BA01_0600
[2022-06-13 11:02:58,036] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:58.036421 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAC-2_2022060-2022090_GRFO_UTCSR_BC01_0600
[2022-06-13 11:03:00,256] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:03:00.256610 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022032-2022059_GRFO_UTCSR_BB01_0600
[2022-06-13 11:03:02,585] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:03:02.585296 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022032-2022059_GRFO_UTCSR_BC01_0600
Running the same command again, we get an error:
WARNING - 2022-06-13 11:03:32.626580 FAILURE: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022060-2022090_GRFO_UTCSR_BC01_0600
Traceback (most recent call last):
File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_data_downloader.py", line 271, in run
if(exists(output_path) and not args.force and pa.checksum_does_match(output_path, checksums)):
File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_access.py", line 418, in checksum_does_match
computed_checksum = make_checksum(file_path, checksum["Algorithm"])
File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_access.py", line 431, in make_checksum
hash_alg = getattr(hashlib, algorithm.lower())()
AttributeError: module 'hashlib' has no attribute 'sha-512'