data-subscriber's Issues

Add subsetting capabilities to the subscriber

Currently the subscriber uses the bounding box information to search for granules of interest. With the addition of subsetting capabilities in Harmony and our use of them, it would be great if the bounding box capability were enhanced to call Harmony, perform the actual subset on the data, and return only the data of interest within the bounding box (see the sketch at the end of this issue).

from Jinbo:
"data_subscriber" can be a potential one-stop tool for all 'download and analysis' groups and scenarios. To me as a user, a subsetting capability like this would complete this 'Swiss Army knife'.

Update 4/11/2022
Needs:

  • Validation of BBOX required for Subsetter
  • Adding the subsetting capability to "subscriber" not just "downloader"
  • Testing against large requests (hundreds to thousands of jobs)
  • Better error handling - print out links to the Harmony workflow UI or job requests; remove a job from the .harmony file when it fails
  • Work with the Harmony team on exposing "fail on error" -> false through harmony-py
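
A minimal sketch of what the Harmony subset call could look like through harmony-py (the collection concept ID and bounding box are placeholders, and how this would be wired into the subscriber is an open question):

from harmony import BBox, Client, Collection, Request

# Sketch only: request a bounding-box subset through Harmony.
client = Client()  # credentials resolved from .netrc / environment
request = Request(
    collection=Collection(id="C1940473819-POCLOUD"),  # placeholder concept ID
    spatial=BBox(-140, 20, -110, 40),                 # placeholder bbox (W, S, E, N)
)
job_id = client.submit(request)
client.wait_for_processing(job_id)
futures = client.download_all(job_id, directory="./data", overwrite=True)
file_names = [f.result() for f in futures]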

don't be so uptight with date formats

Dates currently need a format like 2002-07-04T00:00:00Z

as in

podaac-data-subscriber -c MODIS_A-JPL-L2P-v2019.0 -d ./data/MODIS_A-JPL-L2P-v2019.0 --start-date 2002-07-04T00:00:00Z

Let's be a little more relaxed in allowing the user to specify times:

2002-07-04 -> 2002-07-04T00:00:00Z
2002-07-04T00:00:00.000Z -> 2002-07-04T00:00:00Z
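
A minimal sketch of the normalization this would take (the function name is hypothetical, not part of the subscriber):

from datetime import datetime

def normalize_date(user_input: str) -> str:
    """Accept a few common ISO-ish forms and return YYYY-MM-DDTHH:MM:SSZ."""
    formats = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S.%fZ")
    for fmt in formats:
        try:
            return datetime.strptime(user_input, fmt).strftime("%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {user_input}")

print(normalize_date("2002-07-04"))                # 2002-07-04T00:00:00Z
print(normalize_date("2002-07-04T00:00:00.000Z"))  # 2002-07-04T00:00:00Z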

Scroll for result sizes > 2000

The subscriber wasn't written with massive bulk downloads in mind, so there is no concept of scrolling through large result sets.

Utilize the CMR scrolling mechanism to fetch listings of files, though this could be millions of files if the search criteria are broad enough. A sketch of the mechanism is below.
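
A rough sketch of the scrolling mechanics per the CMR search API (the first scroll=true request returns a CMR-Scroll-Id header that is echoed back on subsequent requests until a page comes back empty; the collection here is a placeholder):

import requests

url = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"
params = {"scroll": "true", "page_size": 2000,
          "ShortName": "MODIS_A-JPL-L2P-v2019.0", "provider": "POCLOUD"}
headers = {}
granules = []
while True:
    resp = requests.get(url, params=params, headers=headers)
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if not items:
        break
    granules.extend(items)
    # Reuse the scroll session CMR handed back on the first response
    headers["CMR-Scroll-Id"] = resp.headers["CMR-Scroll-Id"]
print(f"{len(granules)} granules found")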

non-intuitive syntax with supposedly OPTIONAL arguments

I use Python, I use the command line. I installed data-subscriber, ran podaac-data-subscriber -h, quickly read the syntax, and tried to download all files for a single ECCO dataset into the current directory. According to the docs, -sd STARTDATE and -ed ENDDATE are OPTIONAL, but just try to run without them -- failure:

Also failing (not shown): using only the -ed flag to get ALL available files from the beginning of the record to the specified end date.

The only combination that works is when both -sd and -ed are specified.

$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01T00:00:00Z -d newDirectory
NOTE: Making new data directory at newDirectory(This is the first run.)
Traceback (most recent call last):
  File "/home/ifenty/anaconda3/envs/ecco/bin/podaac-data-subscriber", line 8, in <module>
    sys.exit(run())
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/site-packages/subscriber/podaac_data_subscriber.py", line 345, in run
    with urlopen(url) as f:
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

`machine` name in `.netrc`

The .netrc in the following examples lists two different names for machine. Which should it be? Or are they the same?

  • Step 2 lists as: machine urs.earthdata.nasa.gov
  • Notebook has uat. added as in: machine uat.urs.earthdata.nasa.gov

Thanks!
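
For reference, urs.earthdata.nasa.gov is the production Earthdata Login host, while uat.urs.earthdata.nasa.gov is the separate UAT (test) environment that the notebook targets. A minimal production entry looks like:

machine urs.earthdata.nasa.gov
    login your_edl_username
    password your_edl_password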

"list index out of range" with "PODAAC" provider

I'm not sure if it's possible to use this tool for datasets that aren't in the cloud, but there are some nice features I'd like to use. I specified the provider as PODAAC and got an error, whereas with the default POCLOUD provider it seemed to find nothing at all. I can download the files it lists as FAILURE through a web browser, which suggests a bug in the download code.

With default provider:

$ ./podaac_data_subscriber.py -c SMAP_JPL_L2B_SSS_CAP_V5 --start-date 2021-11-07T00:00:00Z -d test -dydoy
DEBUG:root:Log level set to DEBUG
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "POST /legacy-services/rest/tokens HTTP/1.1" 201 None
WARN: No .update in the data directory. (Is this the first run?)
Downloaded: 0 files

Files Failed to download:0

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "DELETE /legacy-services/rest/tokens/F7305582-9214-29AC-2F57-BE0E2C226EFB HTTP/1.1" 204 0
CMR token successfully deleted

With PODAAC provider

$ ./podaac_data_subscriber.py -c SMAP_JPL_L2B_SSS_CAP_V5 --start-date 2021-11-07T00:00:00Z -d test -dydoy -p "PODAAC"

DEBUG:root:Log level set to DEBUG
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "POST /legacy-services/rest/tokens HTTP/1.1" 201 None
WARN: No .update in the data directory. (Is this the first run?)
2021-11-11 15:30:37.061956 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/313/SMAP_L2B_SSS_36180_20211109T091710_R18240_V5.0.h5
list index out of range
2021-11-11 15:30:37.062248 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/313/SMAP_L2B_SSS_36179_20211109T073843_R18240_V5.0.h5
list index out of range
...
2021-11-11 15:30:37.068847 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/311/SMAP_L2B_SSS_36146_20211107T012930_R18240_V5.0.h5
list index out of range
Downloaded: 0 files

Files Failed to download:32

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "DELETE /legacy-services/rest/tokens/7485FF97-5B2A-1150-65BE-0B6AE0D84E1A HTTP/1.1" 204 0
CMR token successfully deleted

--limit option doesn't work for downloader

Running the command podaac-data-downloader -c GRACEFO_L2_CSR_MONTHLY_0060 -sd 2020-01-01T00:00:00Z -ed 2020-01-02T00:00:00Z -d ~/grace --limit 1 --verbose -e 00 downloads 8 files, even with --limit 1.

The --limit option used to work by setting the page_size in the CMR request; the downloader made only one CMR request, so page_size acted as a cap.

But now that the downloader/subscriber implement paging, setting page_size just changes how many CMR requests are made.

The subscriber doesn't have a --limit option, so it's not affected.

Ensure subscriber can detect and re-download updated granules based on collection redeliveries

Update the CMR query to use updated_since rather than created_at:

    params = {
        'scroll': "true",
        'page_size': 2000,
        'sort_key': "-start_date",
        'provider': 'POCLOUD',
        'ShortName': short_name,
        'created_at': data_within_last_timestamp,
        'token': token,
        'bounding_box': bounding_extent,
    }

https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-updated-since

here:
https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_subscriber.py#L298

as well as here:
https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_subscriber.py#L310
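
The change itself is a single parameter name; a sketch of the updated params:

    params = {
        'scroll': "true",
        'page_size': 2000,
        'sort_key': "-start_date",
        'provider': 'POCLOUD',
        'ShortName': short_name,
        'updated_since': data_within_last_timestamp,  # was 'created_at'
        'token': token,
        'bounding_box': bounding_extent,
    }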

Notify users of updated versions

Users may not be using the awesome new functionality of the subscriber! Let them know that a new version exists when applicable. This will require users to first install a version with this capability before they can be notified...

Assumptions
Release tags will always use semantic versioning: 1.7.0, 1.7.1, 1.9.0, etc.

Github releases API
curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/podaac/data-subscriber/releases

Semantic versioning comparison
as seen here: https://stackoverflow.com/questions/11887762/how-do-i-compare-version-numbers-in-python

>>> from packaging import version
>>> version.parse("2.3.1") < version.parse("10.1.2")
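
Putting the two together, a sketch of a startup check (the __version__ constant is an assumption; the endpoint is GitHub's releases/latest):

import requests
from packaging import version

__version__ = "1.9.0"  # assumed: the running tool's own version

def check_for_new_version():
    resp = requests.get(
        "https://api.github.com/repos/podaac/data-subscriber/releases/latest",
        headers={"Accept": "application/vnd.github.v3+json"},
        timeout=10,
    )
    latest = resp.json().get("tag_name", "").lstrip("v")
    if latest and version.parse(latest) > version.parse(__version__):
        print(f"NOTE: podaac-data-subscriber {latest} is available (you have {__version__}).")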

Changes to .update file

Say the user changes the search parameters: the .update file might conflict with them and cause no files to be downloaded. If an .update file is present, we need to compare it to the search parameters and decide whether to use the .update file or show a warning, for example:

"There are X files matching the search parameters used, but all occur outside of the latest date in the .update file (YR-MO-DY). Delete the .update file if you wish to proceed."

The other option is to add a flag that says: ignore the .update file.

Helpful error message if a collection doesn't exist

Not much guidance comes back from CMR if a collection doesn't exist. We should add a check when 0 files are found to ensure the collection exists. This could save time for a user pointing this tool at a non-PO.DAAC provider.
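
A sketch of such a check against the CMR collections endpoint (the exact message wording is up for discussion):

import requests

def collection_exists(short_name: str, provider: str) -> bool:
    resp = requests.get(
        "https://cmr.earthdata.nasa.gov/search/collections.umm_json",
        params={"short_name": short_name, "provider": provider},
    )
    return resp.json().get("hits", 0) > 0

if not collection_exists("ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4", "POCLOUD"):
    print("Collection not found for this provider; check the short name and -p/--provider.")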

Subscriber and downloader warn that not all granules will be downloaded, but all granules get downloaded

Example log

A warning is printed to the log when downloading many granules that says only the first 2000 will be downloaded. But all granules get downloaded.

Limits were removed in #65 so this warning should likely be removed.

============== Wed Aug  3 22:24:01 UTC 2022 ===============
[2022-08-03 22:24:01,555] {podaac_data_subscriber.py:165} INFO - NOTE: Making new data directory at /cloud/ghrsst/open/data/GDS2/L3U/AVHRRMTC/STAR/v2.80(This is the first run.)
[2022-08-03 22:24:01,558] {podaac_data_subscriber.py:206} INFO - Temporal Range: 2022-07-01T00:00:00Z,2022-08-03T22:24:01Z
[2022-08-03 22:24:01,558] {podaac_data_subscriber.py:212} INFO - Provider: POCLOUD
[2022-08-03 22:24:01,558] {podaac_data_subscriber.py:213} INFO - Updated Since: 2022-07-01T00:00:00Z
[2022-08-03 22:24:01,558] {podaac_access.py:301} INFO - https://cmr.earthdata.nasa.gov/search/granules.umm_json?page_size=2000&sort_key=-start_date&provider=POCLOUD&updated_since=2022-07-01T00%3A00%3A00Z&ShortName=AVHRRF_MC-STAR-L3U-v2.80&temporal=2022-07-01T00%3A00%3A00Z%2C2022-08-03T22%3A24%3A01Z&token=D5A7A608-AFCD-719D-7998-B46207622CB1
[2022-08-03 22:24:06,112] {podaac_data_subscriber.py:228} INFO - 4850 new granules found for AVHRRF_MC-STAR-L3U-v2.80 since 2022-07-01T00:00:00Z
>> [2022-08-03 22:24:06,277] {podaac_data_subscriber.py:254} WARNING - Only the most recent 2000 granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.
[2022-08-03 22:24:06,283] {podaac_data_subscriber.py:270} INFO - Found 4850 total files to download
[2022-08-03 22:24:06,284] {podaac_data_subscriber.py:272} INFO - Downloading files with extensions: ['.nc']
[2022-08-03 22:24:10,666] {podaac_data_subscriber.py:299} INFO - 2022-08-03 22:24:10.666259 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/AVHRRF_MC-STAR-L3U-v2.80/2022/215/20220803195000-STAR-L3U_GHRSST-SSTsubskin-AVHRRF_MC-ACSPO_V2.80-v02.0-fv01.0.nc
...
[2022-08-04 00:13:32,702] {podaac_data_subscriber.py:299} INFO - 2022-08-04 00:13:32.702847 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/AVHRRF_MC-STAR-L3U-v2.80/2022/182/20220701000000-STAR-L3U_GHRSST-SSTsubskin-AVHRRF_MC-ACSPO_V2.80-v02.0-fv01.0.nc
>> [2022-08-04 00:13:32,703] {podaac_data_subscriber.py:314} INFO - Downloaded Files: 4848
[2022-08-04 00:13:32,703] {podaac_data_subscriber.py:315} INFO - Failed Files:     2
[2022-08-04 00:13:32,703] {podaac_data_subscriber.py:316} INFO - Skipped Files:    0
[2022-08-04 00:13:33,051] {podaac_access.py:118} INFO - CMR token successfully deleted
[2022-08-04 00:13:33,052] {podaac_data_subscriber.py:318} INFO - END

Resume

https://podaac.jpl.nasa.gov/forum/viewtopic.php?f=6&t=1418

Good evening!

We download a lot of data from PO.DAAC, and occasionally something goes wrong partway through the download (real-world stuff like a bad network connection or the system going down at the wrong moment).

Does the cloud data subscriber script have mechanisms to deal with these cases? Ideally it should retry a failed download but quickly "give up" if the problem persists.

The subscriber should be able to 'resume' after a download failure. Currently, if any downloads fail during a subscriber run, the subscriber exits without updating its last-run timestamp, and the next run will attempt to download all files from the previous, "failed" run, even if only one out of N files actually failed. A sketch of one possible approach follows.
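
A rough sketch, with hypothetical names rather than the subscriber's actual internals:

from urllib.request import urlretrieve

def download_all(urls):
    """Return the URLs that failed instead of aborting the whole run."""
    failed = []
    for url in urls:
        try:
            urlretrieve(url, url.split("/")[-1])
        except Exception:
            failed.append(url)
    return failed

failed = download_all(urls_from_search)  # urls_from_search: hypothetical input
if failed:
    with open(".failed", "w") as f:  # retry just these on the next run
        f.write("\n".join(failed))
# Only advance the .update timestamp when nothing failed.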

[Errno 17] File exists

Ran into an error when testing phase 2 dataset downloads in dev.

Because we have multiple collections that share a similar destination path (found with MODIS_AQUA_L3_SST_THERMAL_DAILY_4KM_NIGHTTIME_V2019.0), there is a chance that two instances of data-subscriber call makedirs for the same directory at the same time. One of them succeeds and the other throws [Errno 17] File exists.

The easy fix is to pass the exist_ok=True parameter so an already-existing directory is ignored.
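
A sketch of the fix (data_path is a placeholder):

import os

data_path = "./data/MODIS_AQUA_L3_SST_THERMAL_DAILY_4KM_NIGHTTIME_V2019.0"
# exist_ok=True tolerates a directory another subscriber instance created
# first, instead of raising [Errno 17] File exists
os.makedirs(data_path, exist_ok=True)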

Create better '.update' file name for allowing multiple subscribers to download to the same directory.

If a user uses the same directory (-d) for multiple collections, there is no way to know which collection the .update file belongs to. We should incorporate the collection name into the file name so that pointing multiple subscribers at the same directory doesn't break functionality, e.g.:

.update_collectionshortname file

Originally posted by @mike-gangl in #38 (comment)
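
A sketch of the naming (short_name is a placeholder; later logs in this tracker show a .update__<ShortName> convention in use):

short_name = "MODIS_A-JPL-L2P-v2019.0"
# One .update file per collection, so a shared -d directory stays unambiguous
update_file = f".update__{short_name}"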

Download data by cycle and pass

Instead of asking for data between DATE-1 and DATE-2, some users may want data for CYCLE-23 to CYCLE-26. Expand the subscriber to search/download by cycle as well; this would be really helpful for the Jason-series missions. Other users may want data by pass, too. For missions that don't have pass and cycle, appropriate warning messages have to be displayed.
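
For collections that index them, CMR's granule search does expose cycle and passes parameters that could back this feature; a hedged sketch of such a query (collection and cycle number are placeholders):

https://cmr.earthdata.nasa.gov/search/granules.umm_json?ShortName=JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F&provider=POCLOUD&cycle[]=23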

add YEAR output directory structure

Similar to:
-dc Flag to use cycle number for directory where data products will be downloaded.
-dydoy Flag to use start time (Year/DOY) of downloaded data for directory where data products will be downloaded.
-dymd Flag to use start time (Year/Month/Day) of downloaded data for directory where data products will be downloaded.

Add new -dy flag for just YEAR as the output directory structure.

Make provider configurable

The provider for the tool is currently hardcoded to POCLOUD. We should make this configurable so that other DAACs/users can use the tool, defaulting to POCLOUD if no provider is given. A provider is needed to prevent issues where the same collection name is used by multiple providers.

Got Error but download started without problem

I got "Error getting the token - check user name and password". My username and password are correct in .netrc. The download was not affected and started anyways.

podaac-data-subscriber -c SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1 -d ./ --start-date 2011-11-25T00:00:00Z -b="-140,20,-110,40"
Error getting the token - check user name and password
WARN: No .update in the data directory. (Is this the first run?)
Warning: only the most recent 2000 granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.
2021-11-16 07:10:54.856155 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1/SWOT_L2_LR_SSH_Expert_368_011_20121111T230805_20121111T235910_DG10_01.nc
2021-11-16 07:10:56.091361 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1/SWOT_L2_LR_SSH_Expert_368_010_20121111T221659_20121111T230804_DG10_01.nc

global bounding box is not supported by some datasets

Whether or not a bounding box is specified, we still send the 'bounding_box' parameter to CMR. This is not supported by some collections, for example:

podaac-data-subscriber -c AU_Ocean_NRT_R01 -sd 2022-03-01T00:00:00Z -ed 2022-04-30T23:59:59Z -e '.nc' -p LANCEAMSR2 -d data --verbose

with a CMR query of:

https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=LANCEAMSR2&updated_since=2022-03-01T00%3A00%3A00Z&ShortName=AU_Ocean_NRT_R01&temporal=2022-03-01T00%3A00%3A00Z%2C2022-04-30T23%3A59%3A59Z&bounding_box=-180%2C-90%2C180%2C90

Yields 0 results.

If we take the same CMR query and remove the bounding box parameter, we get results:

https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=LANCEAMSR2&updated_since=2022-03-01T00%3A00%3A00Z&ShortName=AU_Ocean_NRT_R01&temporal=2022-03-01T00%3A00%3A00Z%2C2022-04-30T23%3A59%3A59Z

...

{
  "hits": 370,
  "took": 507,
  "items": [
    {
      "meta": {
        "concept-type": "granule",
        "concept-id": "G2250539839-LANCEAMSR2",
        "revision-id": 1,
        "native-id": "AMSR_U2_L2_Ocean_R01_202204120051_D.he5",
        "provider-id": "LANCEAMSR2",
        "format": "application/echo10+xml",
        "revision-date": "2022-04-12T03:23:40.993Z"

Even though this is not a PO.DAAC dataset, we should support this query. The granules use OrbitCalculatedSpatialDomains rather than a geodetic or horizontal spatial domain.

Add docs that cover running scripts after pip install --user

Had user feedback that after podaac-data-subscriber was installed for them on a Unix machine, the module showed up as installed, but the commands "podaac-data-subscriber" and "podaac-data-downloader" could not be found.

Was fixed by adding $HOME/.local/bin to PATH (e.g., export PATH="$HOME/.local/bin:$PATH").

This situation could come up for anyone doing pip install --user podaac-data-subscriber. It would be good to have something in the README covering this scenario.

Here's one potential reference to use: https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-to-the-user-site

Download files by matching filename portion
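
CMR's granule search supports a readable_granule_name parameter with a wildcard pattern option, which could back this feature; a hedged sketch of such a query (collection and pattern are placeholders):

https://cmr.earthdata.nasa.gov/search/granules.umm_json?ShortName=MODIS_A-JPL-L2P-v2019.0&provider=POCLOUD&readable_granule_name=*2021310*&options[readable_granule_name][pattern]=true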

data-subscriber fails when dataset has no real start date and end date

An ECCO dataset of ancillary files has no natural 'start date' or 'end date'. Users shouldn't be required to specify them in order to download.

$ podaac-data-subscriber -c ECCO_L4_ANCILLARY_DATA_V4R4  -d anc
WARN: No .update in the data directory. (Is this the first run?)
Downloaded: 0 files

Files Failed to download:0

CMR token successfully deleted

Oh, it also fails when I specify a start date and end date that span the entire ECCO period:

$ podaac-data-subscriber -c ECCO_L4_ANCILLARY_DATA_V4R4  -d anc -sd 1990-01-01T00:00:00Z -ed 2021-01-01T01:01:01Z
NOTE: .update found in the data directory. (The last run was at 2022-01-27T20:07:48Z.)
Downloaded: 0 files

Files Failed to download:0

CMR token successfully deleted

move away from CMR tokens in favor of EDL tokens

ECHO-Token Deprecation Notice
CMR Legacy Services' ECHO tokens will be deprecated soon. Please use EDL tokens and send them with the Authorization header. This document contains many mentions of ECHO-Tokens, which will soon be out of date. Instructions on how to generate an EDL token are here

We will need to update the subscriber code base to either accept an EDL token or generate one. Generating one fits most seamlessly with the current way of working; a sketch is below.
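
A sketch of generating one, assuming EDL's user-token endpoint (/api/users/token) and credentials from .netrc:

import requests
from netrc import netrc

# Create an EDL token with HTTP Basic auth against Earthdata Login
username, _, password = netrc().authenticators("urs.earthdata.nasa.gov")
resp = requests.post(
    "https://urs.earthdata.nasa.gov/api/users/token",
    auth=(username, password),
)
edl_token = resp.json()["access_token"]

# ...and send it to CMR via the Authorization header
headers = {"Authorization": f"Bearer {edl_token}"}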

add new optional argument

Is it possible to add a new optional argument that allows users to search available datasets from CMR based on keywords and return short_names?
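
This could likely be backed by CMR's collection search, which accepts a keyword parameter; a sketch (keyword and provider are placeholders):

import requests

resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/collections.umm_json",
    params={"keyword": "sea surface temperature", "provider": "POCLOUD", "page_size": 20},
)
for item in resp.json()["items"]:
    print(item["umm"]["ShortName"])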

start-date and end-date descriptions are backwards

The descriptions for the --start-date and --end-date options are swapped in the online help and the md files.

Currently shows:
-sd STARTDATE, --start-date STARTDATE
The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z
-ed ENDDATE, --end-date ENDDATE
The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z.

"before" and "after" should be exchanged: --start-date is the time after which data should be retrieved, and --end-date the time before which.

Assist Users in creating "netrc" files if one doesn't exist

Check for the existence of a .netrc (_netrc on Windows) file. If it does not exist, prompt the user with two options: (1) if they have an EDL account, ask for username and password and generate the .netrc file for them; (2) if they have no EDL account, redirect them to register for one and then come back to the data-subscriber.
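
A sketch of option (1), assuming prompts on stdin (not the subscriber's actual behavior):

import os
import stat
from getpass import getpass

netrc_path = os.path.join(os.path.expanduser("~"), ".netrc")  # "_netrc" on Windows
if not os.path.exists(netrc_path):
    username = input("Earthdata Login username: ")
    password = getpass("Earthdata Login password: ")
    with open(netrc_path, "w") as f:
        f.write(f"machine urs.earthdata.nasa.gov login {username} password {password}\n")
    # .netrc must be owner-readable only, or many clients will reject it
    os.chmod(netrc_path, stat.S_IRUSR | stat.S_IWUSR)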

v1.9.0 fails because of missing dependency

Installed version 1.9.0 with pip, but it did not install the required tenacity dependency.

pip install --force --user -U podaac-data-subscriber
Collecting podaac-data-subscriber
  Using cached podaac_data_subscriber-1.9.0-py3-none-any.whl (22 kB)
Installing collected packages: podaac-data-subscriber
  Attempting uninstall: podaac-data-subscriber
    Found existing installation: podaac-data-subscriber 1.9.0
    Uninstalling podaac-data-subscriber-1.9.0:
      Successfully uninstalled podaac-data-subscriber-1.9.0
Successfully installed podaac-data-subscriber-1.9.0

If I try to run the new version:

Traceback (most recent call last):
  File "/home/DEV-cloud/.local/bin/podaac-data-subscriber", line 5, in <module>
    from subscriber.podaac_data_subscriber import main
  File "/home/DEV-cloud/.local/lib/python3.6/site-packages/subscriber/podaac_data_subscriber.py", line 25, in <module>
    from subscriber import podaac_access as pa
  File "/home/DEV-cloud/.local/lib/python3.6/site-packages/subscriber/podaac_access.py", line 21, in <module>
    import tenacity
ModuleNotFoundError: No module named 'tenacity'
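
The likely fix is declaring the dependency in the package metadata so pip installs it; a sketch assuming a setuptools-style setup.py (other fields elided):

from setuptools import find_packages, setup

setup(
    name="podaac-data-subscriber",
    packages=find_packages(),
    install_requires=[
        "requests",
        "tenacity",  # the missing dependency
    ],
)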

non-intuitive syntax with REQUIRED arguments

I use Python, I use the command line. I installed data-subscriber, ran podaac-data-subscriber -h, quickly read the syntax, and tried to download a single ECCO dataset into the current directory (-d ./). Note that ./ is perfectly normal syntax meaning the current directory. The error message is not helpful.

$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01 -d .
$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01T00:00:00Z -d ./
WARN: No .update in the data directory. (Is this the first run?)
Traceback (most recent call last):
  File "/home/ifenty/anaconda3/envs/ecco/bin/podaac-data-subscriber", line 8, in <module>
    sys.exit(run())
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/site-packages/subscriber/podaac_data_subscriber.py", line 345, in run
    with urlopen(url) as f:
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

It took a while to discover that I needed to specify -d DIRECTORY_NAME and that DIRECTORY_NAME will be created.

searching non-spatial types results in no results

Running the following command:

podaac-data-subscriber -c AU_Ocean_NRT_R01 -sd 2021-06-10T00:00:00Z  -e '.he5' -p LANCEAMSR2 -d ./LANCE --verbose

we get no search results. The underlying query sets a default 'global' bounding box, which returns 0 results because the underlying granules do not use horizontal spatial domains; they use the orbit number, crossing time, and crossing latitude for search.

Until we can support that kind of search, we should remove the bounding box from default searches, as in the sketch below.
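
A sketch of the change, mirroring the params dict quoted elsewhere in this tracker (the args.bbox attribute name is an assumption):

params = {
    'page_size': 2000,
    'sort_key': "-start_date",
    'provider': provider,
    'ShortName': short_name,
    'token': token,
}
# Only constrain spatially when the user actually passed -b/--bounds
if args.bbox is not None:
    params['bounding_box'] = args.bbox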

Not finding data to download in GRACEFO_L2_CSR_MONTHLY_0060 dataset

I spoke too soon. The downloader is still working flawlessly for the Sentinel-6 data (thank you!). With optimism in mind, I moved to switch my GRACE monthly spherical-harmonic downloads over to PO.DAAC as well. And… can’t get it to work.

Here’s the command I typed in:
podaac-data-downloader -c GRACEFO_L2_CSR_MONTHLY_0060 -d /Volumes/DataDisk/GRACE_RL06/CSR_SPHARM_60 -sd 2018-01-01T00:00:00Z -ed 2018-12-31T00:00:00Z

I’m looking for this data (which is supposed to be “cloud enabled” now – I checked this time!)
https://podaac.jpl.nasa.gov/dataset/GRACEFO_L2_CSR_MONTHLY_0060

And downloading by year, so for the first run, I was looking for all the data from 2018-1-1 to 2018-12-31.

When I called that, all the output I got was:
Found 0 total files to download
Downloaded: 0 files
Files Failed to download:0

display the total number of found granules

It would be helpful to display the total number of granules found, especially when the warning "only the most recent 2000 granules will be downloaded" is shown.

Improve/fix functionality of download by extension in podaac-data-downloader

The downloader does not seem to identify granules correctly during a request with -e for specific extensions. In contrast, the subscriber identifies and downloads granules using all the same parameters. Examples below:

Requesting .nc files using downloader
podaac-data-downloader -c JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F -d S6_L2_HR_STD_NRT -e .nc -sd 2021-06-01T00:46:02Z -ed 2021-06-01T03:00:00Z
[2022-09-02 16:02:12,020] {podaac_data_downloader.py:242} INFO - Found 0 total files to download
[2022-09-02 16:02:12,021] {podaac_data_downloader.py:284} INFO - Downloaded Files: 0
[2022-09-02 16:02:12,025] {podaac_data_downloader.py:285} INFO - Failed Files: 0
[2022-09-02 16:02:12,029] {podaac_data_downloader.py:286} INFO - Skipped Files: 0
[2022-09-02 16:02:12,297] {podaac_access.py:118} INFO - CMR token successfully deleted
[2022-09-02 16:02:12,297] {podaac_data_downloader.py:288} INFO - END

Requesting .nc files using subscriber
podaac-data-subscriber -c JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F -d S6_L2_HR_STD_NRT -e .nc -sd 2021-06-01T00:46:02Z -ed 2021-06-01T03:00:00Z
[2022-09-02 16:01:06,815] {podaac_data_subscriber.py:179} WARNING - No .update__JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F in the data directory. (Is this the first run?)
[2022-09-02 16:01:07,953] {podaac_data_subscriber.py:270} INFO - Found 10 total files to download
[2022-09-02 16:01:12,063] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:12.063932 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T025238_20210601T025438_F02.nc
[2022-09-02 16:01:13,701] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:13.701817 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T024238_20210601T025238_F02.nc
[2022-09-02 16:01:15,301] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:15.301483 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T023308_20210601T024238_F02.nc
[2022-09-02 16:01:16,771] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:16.769997 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T022320_20210601T022637_F02.nc
[2022-09-02 16:01:18,283] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:18.283086 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T020158_20210601T020254_F02.nc
[2022-09-02 16:01:20,296] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:20.296783 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T014632_20210601T015135_F02.nc
[2022-09-02 16:01:21,884] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:21.884365 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T013541_20210601T013645_F02.nc
[2022-09-02 16:01:23,508] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:23.508012 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T012546_20210601T013428_F02.nc
[2022-09-02 16:01:25,340] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:25.340850 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_179_20210601T005400_20210601T010128_F02.nc
[2022-09-02 16:01:27,031] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:27.031012 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_179_20210601T004400_20210601T005400_F02.nc
[2022-09-02 16:01:27,032] {podaac_data_subscriber.py:314} INFO - Downloaded Files: 10
[2022-09-02 16:01:27,035] {podaac_data_subscriber.py:315} INFO - Failed Files: 0
[2022-09-02 16:01:27,036] {podaac_data_subscriber.py:316} INFO - Skipped Files: 0
[2022-09-02 16:01:27,358] {podaac_access.py:118} INFO - CMR token successfully deleted
[2022-09-02 16:01:27,364] {podaac_data_subscriber.py:318} INFO - END

Support start/end dates without times

It would be convenient if I could optionally provide dates without times for --start-date and --end-date. If no time is provided, T00:00:00Z/T23:59:59Z would be appended automatically to the provided start and end dates. EDSC does this, as does the C2C CLI tooling. The C2C CLI code can probably be reused here.

User created .update file resulting in 'HTTP 400' error

User posted an issue on the PO.DAAC forum.

meteo@BOIRA:~/PROJECTES/SST/NCEI/DATA/SST/NC2$ podaac-data-subscriber -c AVHRR_OI-NCEI-L4-GLOB-v2.1 -d ./ --verbose
NOTE: .update found in the data directory. (The last run was at 2021-10-29T00:05:03Z
.)
Provider: POCLOUD
Updated Since: 2021-10-29T00:05:03Z

https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=POCLOUD&ShortName=AVHRR_OI-NCEI-L4-GLOB-v2.1&updated_since=2021-10-29T00%3A05%3A03Z%0A&token=****&bounding_box=-180%2C-90%2C180%2C90
Traceback (most recent call last):
  File "/home/meteo/.local/bin/podaac-data-subscriber", line 11, in <module>
    load_entry_point('podaac-data-subscriber==1.6.0', 'console_scripts', 'podaac-data-subscriber')()
  File "/home/meteo/.local/lib/python3.6/site-packages/subscriber/podaac_data_subscriber.py", line 339, in run
    with urlopen(url) as f:
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

The trailing %0A in the 'updated_since' value was causing the issue. Suggest we strip any whitespace characters when reading the .update file.
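
A sketch of the suggested fix:

with open(".update", "r") as f:
    # strip() drops the trailing newline that became %0A in the CMR URL
    data_within_last_timestamp = f.read().strip()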

add retry to 503 error in downloads

saw this during regression testing:

WARNING  root:podaac_data_subscriber.py:307 2022-08-04 14:20:03.485885 FAILURE: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_STD_OST_NRT_F/S6A_P4_2__LR_STD__NR_042_083_20220101T104242_20220101T123506_F04.nc
Traceback (most recent call last):
  File "/Users/runner/work/data-subscriber/data-subscriber/subscriber/podaac_data_subscriber.py", line 302, in run
    urlretrieve(f, output_path)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 523, in open
    response = meth(req, response)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 632, in http_response
    response = self.parent.error(
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 555, in error
    result = self._call_chain(*args)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 747, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 523, in open
    response = meth(req, response)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 632, in http_response
    response = self.parent.error(
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 561, in error
    return self._call_chain(*args)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

We should catch the 503 and retry. This could happen for any number of reasons, but we're interested in addressing transient issues that occur occasionally. A sketch using tenacity is below.

some more information can be found here: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/http-503-service-unavailable.html
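
A sketch of what the retry could look like with tenacity (a dependency as of v1.9.0), retrying only on HTTP 503:

import urllib.error
from urllib.request import urlretrieve

from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def _is_503(exc: BaseException) -> bool:
    return isinstance(exc, urllib.error.HTTPError) and exc.code == 503

@retry(retry=retry_if_exception(_is_503),
       wait=wait_exponential(multiplier=1, max=60),  # back off between attempts
       stop=stop_after_attempt(3))                   # then give up and report FAILURE
def download(url: str, output_path: str) -> None:
    urlretrieve(url, output_path)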

sha-512 checksum not supported

When running the downloader/subscriber against a collection that uses SHA-512 checksums, we run into errors:

podaac-data-downloader --verbose -c GRACEFO_L2_CSR_MONTHLY_0060 -d ./podaac_csr -sd 2018-01-01T00:00:00Z -ed 2022-06-14T16:11:58Z -e "00"
[2022-06-13 11:02:44,494] {podaac_data_downloader.py:158} INFO - NOTE: Making new data directory at ./podaac_csr(This is the first run.)
[2022-06-13 11:02:44,699] {podaac_data_downloader.py:192} INFO - Temporal Range: 2018-01-01T00:00:00Z,2022-06-14T16:11:58Z
[2022-06-13 11:02:44,699] {podaac_data_downloader.py:195} INFO - Provider: POCLOUD
[2022-06-13 11:02:44,700] {podaac_access.py:300} INFO - https://cmr.earthdata.nasa.gov/search/granules.umm_json?page_size=2000&sort_key=-start_date&provider=POCLOUD&ShortName=GRACEFO_L2_CSR_MONTHLY_0060&temporal=2018-01-01T00%3A00%3A00Z%2C2022-06-14T16%3A11%3A58Z&token=5896F157-242C-A41B-F04D-45D86713C6ED&bounding_box=-180%2C-90%2C180%2C90
[2022-06-13 11:02:46,826] {podaac_data_downloader.py:209} INFO - 176 granules found for GRACEFO_L2_CSR_MONTHLY_0060
[2022-06-13 11:02:46,827] {podaac_data_downloader.py:249} INFO - Found 176 total files to download
[2022-06-13 11:02:46,827] {podaac_data_downloader.py:251} INFO - Downloading files with extensions: ['00']
[2022-06-13 11:02:52,528] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:52.528421 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022060-2022090_GRFO_UTCSR_BC01_0600
[2022-06-13 11:02:54,344] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:54.344359 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022060-2022090_GRFO_UTCSR_BB01_0600
[2022-06-13 11:02:56,177] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:56.177678 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022060-2022090_GRFO_UTCSR_BA01_0600
[2022-06-13 11:02:58,036] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:58.036421 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAC-2_2022060-2022090_GRFO_UTCSR_BC01_0600
[2022-06-13 11:03:00,256] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:03:00.256610 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022032-2022059_GRFO_UTCSR_BB01_0600
[2022-06-13 11:03:02,585] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:03:02.585296 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022032-2022059_GRFO_UTCSR_BC01_0600

Running the same command again, we get an error:

WARNING - 2022-06-13 11:03:32.626580 FAILURE: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022060-2022090_GRFO_UTCSR_BC01_0600
Traceback (most recent call last):
  File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_data_downloader.py", line 271, in run
    if(exists(output_path) and not args.force and pa.checksum_does_match(output_path, checksums)):
  File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_access.py", line 418, in checksum_does_match
    computed_checksum = make_checksum(file_path, checksum["Algorithm"])
  File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_access.py", line 431, in make_checksum
    hash_alg = getattr(hashlib, algorithm.lower())()
AttributeError: module 'hashlib' has no attribute 'sha-512'
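
A sketch of a fix that normalizes the CMR algorithm name before the hashlib lookup:

import hashlib

def make_checksum(file_path: str, algorithm: str) -> str:
    # CMR reports names like "SHA-512", but hashlib exposes "sha512"
    hash_alg = getattr(hashlib, algorithm.lower().replace("-", ""))()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hash_alg.update(chunk)
    return hash_alg.hexdigest()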
