podaac / data-subscriber
Subscribe and bulk download collections of data at PO.DAAC
License: Apache License 2.0
Currently the subscriber uses the bounding box information to search for granules of interest. With the addition of subset capabilities in Harmony and our use of them, it would be great if the bounding box capability were enhanced to call Harmony, perform the actual subset on the data, and return only the data of interest WITHIN the bounding box.
from Jinbo:
"data_subscriber" can be a potential one-stop tool for all 'download and analysis' groups and scenarios. To me as a user, subsetting capability like this will complete this 'swiss army knife'.
Update 4/11/2022
Needs:
Dates currently need a format like 2002-07-04T00:00:00Z
as in
podaac-data-subscriber -c MODIS_A-JPL-L2P-v2019.0 -d ./data/MODIS_A-JPL-L2P-v2019.0 --start-date 2002-07-04T00:00:00Z
Let's be a little more relaxed in allowing the user to specify times:
2002-07-04 --> 2002-07-04T00:00:00Z
2002-07-04T00:00:00.000Z -> 2002-07-04T00:00:00Z
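A minimal sketch of that normalization (the helper name and the exact list of accepted formats are assumptions based on the examples above):

```python
from datetime import datetime

def normalize_datetime(value: str) -> str:
    """Normalize a user-supplied date/time string to the strict
    YYYY-MM-DDTHH:MM:SSZ form passed to CMR. Accepts a bare date,
    a fractional-second timestamp, or an already-strict timestamp."""
    formats = (
        "%Y-%m-%dT%H:%M:%S.%fZ",  # 2002-07-04T00:00:00.000Z
        "%Y-%m-%dT%H:%M:%SZ",     # 2002-07-04T00:00:00Z
        "%Y-%m-%d",               # 2002-07-04
    )
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
            return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")
```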
Hi! I took the liberty of putting this package on conda-forge:
https://anaconda.org/conda-forge/podaac-data-subscriber
I hope someone else finds it helpful too!
The subscriber wasn't written with massive bulk downloads in mind, so there is no concept of scrolling through large result sets. Utilize the CMR scrolling mechanism to fetch listings of files, though this could be millions of files if the search criteria are broad enough.
I use Python, I use the command line. I installed data-subscriber, ran podaac-data-subscriber -h, quickly read the syntax, and tried to download all files for a single ECCO dataset into the current directory. According to the docs, the -sd STARTDATE and -ed ENDDATE arguments are OPTIONAL, but just try to run without them -- failure:
Also failing (not shown) is only using the -ed flag to get ALL available files from the beginning to the specified end date.
The only combination that works is when both -sd and -ed are specified.
$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01T00:00:00Z -d newDirectory
NOTE: Making new data directory at newDirectory(This is the first run.)
Traceback (most recent call last):
File "/home/ifenty/anaconda3/envs/ecco/bin/podaac-data-subscriber", line 8, in <module>
sys.exit(run())
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/site-packages/subscriber/podaac_data_subscriber.py", line 345, in run
with urlopen(url) as f:
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Currently we need to specify the -e .tiff parameter to successfully download OPERA products. We should add this extension to the defaults list.
I'm not sure if it's possible to use this tool for datasets not on the cloud, but there are some nice features I'd like to use. I specified the provider as PODAAC and got an error, whereas it seemed to do nothing with the default POCLOUD provider. I can download the files it lists as FAILURE through a web browser, which suggests a bug in the download code.
With default provider:
$ ./podaac_data_subscriber.py -c SMAP_JPL_L2B_SSS_CAP_V5 --start-date 2021-11-07T00:00:00Z -d test -dydoy
DEBUG:root:Log level set to DEBUG
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "POST /legacy-services/rest/tokens HTTP/1.1" 201 None
WARN: No .update in the data directory. (Is this the first run?)
Downloaded: 0 files
Files Failed to download:0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "DELETE /legacy-services/rest/tokens/F7305582-9214-29AC-2F57-BE0E2C226EFB HTTP/1.1" 204 0
CMR token successfully deleted
With the PODAAC provider:
$ ./podaac_data_subscriber.py -c SMAP_JPL_L2B_SSS_CAP_V5 --start-date 2021-11-07T00:00:00Z -d test -dydoy -p "PODAAC"
DEBUG:root:Log level set to DEBUG
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "POST /legacy-services/rest/tokens HTTP/1.1" 201 None
WARN: No .update in the data directory. (Is this the first run?)
2021-11-11 15:30:37.061956 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/313/SMAP_L2B_SSS_36180_20211109T091710_R18240_V5.0.h5
list index out of range
2021-11-11 15:30:37.062248 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/313/SMAP_L2B_SSS_36179_20211109T073843_R18240_V5.0.h5
list index out of range
...
2021-11-11 15:30:37.068847 FAILURE: https://podaac-tools.jpl.nasa.gov/drive/files/allData/smap/L2/JPL/V5.0/2021/311/SMAP_L2B_SSS_36146_20211107T012930_R18240_V5.0.h5
list index out of range
Downloaded: 0 files
Files Failed to download:32
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cmr.earthdata.nasa.gov:443
DEBUG:urllib3.connectionpool:https://cmr.earthdata.nasa.gov:443 "DELETE /legacy-services/rest/tokens/7485FF97-5B2A-1150-65BE-0B6AE0D84E1A HTTP/1.1" 204 0
CMR token successfully deleted
Instead of 'subscribing', we should allow the ability to set a start/stop time and download all files within that temporal window.
Running the command $ podaac-data-downloader -c GRACEFO_L2_CSR_MONTHLY_0060 -sd 2020-01-01T00:00:00Z -ed 2020-01-02T00:00:00Z -d ~/grace --limit 1 --verbose -e 00
causes 8 files to be downloaded even with --limit 1
The --limit option used to work by setting the page_size in the CMR request; this worked because the downloader made only one CMR request. But now that the downloader/subscriber have implemented paging, setting page_size merely splits the same result set across multiple CMR requests. The subscriber doesn't have a --limit option, so it's not affected.
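One way the downloader could restore --limit under paging is to cap results while accumulating pages instead of relying on page_size. A sketch, with fetch_page standing in (hypothetically) for the real per-page CMR request:

```python
def collect_results(fetch_page, limit=None):
    """Accumulate paged results, stopping once `limit` items are
    gathered. `fetch_page` returns one page (a list) per call and
    an empty list when the result set is exhausted."""
    results = []
    while limit is None or len(results) < limit:
        page = fetch_page()
        if not page:
            break
        results.extend(page)
    # A page may overshoot the cap, so trim before returning.
    return results if limit is None else results[:limit]
```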
Update the CMR query to use updated_since rather than created_at in the CMR query:
params = {
'scroll': "true",
'page_size': 2000,
'sort_key': "-start_date",
'provider': 'POCLOUD',
'ShortName': short_name,
'created_at': data_within_last_timestamp,
'token': token,
'bounding_box': bounding_extent,
}
https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-updated-since
here:
https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_subscriber.py#L298
as well as here:
https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_subscriber.py#L310
Users may not be using the awesome new functionality of the subscriber! Let them know that a new version exists when applicable. This will require a user to grab the version with this capability before they get notified...
Assumptions
release tags will always use semantic versioning: 1.7.0, 1.7.1, 1.9.0, etc
Github releases API
curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/podaac/data-subscriber/releases
Semantic versioning comparison
as seen here: https://stackoverflow.com/questions/11887762/how-do-i-compare-version-numbers-in-python
>>> from packaging import version
>>> version.parse("2.3.1") < version.parse("10.1.2")
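Putting the releases API and the version comparison together, a hedged sketch of the check (the tag handling and function names are assumptions; packaging.version does the comparison as above):

```python
import json
from urllib.request import Request, urlopen

from packaging import version

RELEASES_URL = "https://api.github.com/repos/podaac/data-subscriber/releases"

def latest_release_tag(url=RELEASES_URL):
    """Fetch release tag names from the GitHub releases API (network call)."""
    req = Request(url, headers={"Accept": "application/vnd.github.v3+json"})
    with urlopen(req) as resp:
        releases = json.load(resp)
    tags = [r["tag_name"].lstrip("v") for r in releases]
    return max(tags, key=version.parse) if tags else None

def update_available(current, latest):
    """True when `latest` is a strictly newer semantic version."""
    return version.parse(latest) > version.parse(current)
```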
Allow mapping and usage of the tool to collections.
Say a user changes the search parameters; the .update file might conflict and no files get downloaded. If a .update file is present, we need to compare it to the search parameters and decide whether to use the .update file or show a warning. For example:
"There are X files matching the search parameters used, but all occur outside the latest date in the .update file (YR-MO-DY). Delete the .update file if you wish to proceed."
The other option is to add a flag that says: ignore the .update file.
Not much guidance comes back from CMR if a collection doesn't exist. We should add a check, when 0 files are found, to ensure the collection exists. This could save time for a user querying a non-PO.DAAC provider.
Example log
A warning is printed to the log when downloading many granules that says only the first 2000 will be downloaded. But all granules get downloaded.
Limits were removed in #65 so this warning should likely be removed.
1 ============== Wed Aug 3 22:24:01 UTC 2022 ===============
2 [2022-08-03 22:24:01,555] {podaac_data_subscriber.py:165} INFO - NOTE: Making new data directory at /cloud/ghrsst/open/data/GDS2/L3U/AVHRRMTC/STAR/v2.80(This is the first run.)
3 [2022-08-03 22:24:01,558] {podaac_data_subscriber.py:206} INFO - Temporal Range: 2022-07-01T00:00:00Z,2022-08-03T22:24:01Z
4 [2022-08-03 22:24:01,558] {podaac_data_subscriber.py:212} INFO - Provider: POCLOUD
5 [2022-08-03 22:24:01,558] {podaac_data_subscriber.py:213} INFO - Updated Since: 2022-07-01T00:00:00Z
6 [2022-08-03 22:24:01,558] {podaac_access.py:301} INFO - https://cmr.earthdata.nasa.gov/search/granules.umm_json?page_size=2000&sort_key=-start_date&provider=POCLOUD&updated_since=2022-07-01T00%3A00%3A00Z&ShortName=AVHRRF_MC-STAR-L3U-v2.80&temporal=2022-07-01T00%3A00%3A00Z%2C2022-08-03T22%3A24%3A01Z&token=D5A7A608-AFCD-719D-7998-B46207622CB1
7 [2022-08-03 22:24:06,112] {podaac_data_subscriber.py:228} INFO - 4850 new granules found for AVHRRF_MC-STAR-L3U-v2.80 since 2022-07-01T00:00:00Z
>> 8 [2022-08-03 22:24:06,277] {podaac_data_subscriber.py:254} WARNING - Only the most recent 2000 granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.
9 [2022-08-03 22:24:06,283] {podaac_data_subscriber.py:270} INFO - Found 4850 total files to download
10 [2022-08-03 22:24:06,284] {podaac_data_subscriber.py:272} INFO - Downloading files with extensions: ['.nc']
11 [2022-08-03 22:24:10,666] {podaac_data_subscriber.py:299} INFO - 2022-08-03 22:24:10.666259 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/AVHRRF_MC-STAR-L3U-v2.80/2022/215/20220803195000-STAR-L3U_GHRSST-SSTsubskin-AVHRRF_MC-ACSPO_V2.80-v02.0-fv01.0.nc
...
4928 [2022-08-04 00:13:32,702] {podaac_data_subscriber.py:299} INFO - 2022-08-04 00:13:32.702847 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/AVHRRF_MC-STAR-L3U-v2.80/2022/182/20220701000000-STAR-L3U_GHRSST-SSTsubskin-AVHRRF_MC-ACSPO_V2.80-v02.0-fv01.0.nc
>> 4929 [2022-08-04 00:13:32,703] {podaac_data_subscriber.py:314} INFO - Downloaded Files: 4848
4930 [2022-08-04 00:13:32,703] {podaac_data_subscriber.py:315} INFO - Failed Files: 2
4931 [2022-08-04 00:13:32,703] {podaac_data_subscriber.py:316} INFO - Skipped Files: 0
4932 [2022-08-04 00:13:33,051] {podaac_access.py:118} INFO - CMR token successfully deleted
4933 [2022-08-04 00:13:33,052] {podaac_data_subscriber.py:318} INFO - END
This bug only affects restricted collections.
CMR Search parameters defined at https://github.com/podaac/data-subscriber/blob/main/subscriber/podaac_data_downloader.py#L175 do not include the generated EDL Token required for restricted searches.
https://podaac.jpl.nasa.gov/forum/viewtopic.php?f=6&t=1418
Good evening!
We download a lot of data from PODAAC, and occasionally something goes wrong partway through the download (real-world stuff like a bad network connection or the system going down at the wrong moment).
Does the cloud data subscriber script have mechanisms to deal with these cases? Ideally it should try to get the failed download again but quickly "give up" if the problem persists.
Subscriber should be able to 'resume' during a download failure. Currently, if any of the downloads fail during a subscriber run, the subscriber "exits" without updating its last run, and the next time it runs, it will attempt to download all files from the previous, "failed" run, even if only one out of N files actually failed.
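One possible resume strategy (a sketch of an idea, not current behavior): advance the .update marker only up to the first failed granule, so the next run re-requests just the failures and anything after them rather than the whole previous window. The result shape here is hypothetical:

```python
def next_run_timestamp(results, fallback):
    """Pick the timestamp to store in .update after a run.

    `results` is a list of (granule_start_time, succeeded) pairs
    sorted oldest-first. Stopping the marker at the first failure
    means a rerun retries that granule and everything after it,
    while fully successful prefixes are never re-downloaded."""
    last_ok = fallback
    for start_time, succeeded in results:
        if not succeeded:
            return last_ok
        last_ok = start_time
    return last_ok
```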
Ran into an error when testing phase 2 dataset downloads in dev.
Because we have multiple collections that share a similar destination path (found this error with MODIS_AQUA_L3_SST_THERMAL_DAILY_4KM_NIGHTTIME_V2019.0), there is a chance that two instances of data-subscriber both try to call makedirs for the same directory at the same time. One succeeds while the other throws an error: [Errno 17] File exists.
The easy fix is to pass the exist_ok=True parameter to ignore an already existing directory.
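The fix could look like this (helper name is illustrative):

```python
import os

def ensure_data_dir(path):
    # exist_ok=True makes concurrent calls race-free: if another
    # subscriber instance created the directory first, makedirs
    # silently succeeds instead of raising [Errno 17] File exists.
    os.makedirs(path, exist_ok=True)
```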
If a user uses the same directory (-d) for multiple collections, there is no way to know which collection the ".update" file belongs to. We should update this to include the collection name in the marker file, e.g. a .update_collectionshortname file, to ensure that specifying the same directory doesn't break functionality.
Originally posted by @mike-gangl in #38 (comment)
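A sketch of the per-collection marker-file naming (function name is hypothetical; the double-underscore form matches the .update__SHORTNAME file that appears in the subscriber logs):

```python
import os

def update_filename(data_dir, collection=None):
    """Build the marker-file path: '.update__SHORTNAME' when a
    collection is given, falling back to the legacy '.update' so
    existing data directories keep working."""
    name = f".update__{collection}" if collection else ".update"
    return os.path.join(data_dir, name)
```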
Instead of asking for data between DATE-1 and DATE-2, some users may want data for CYCLE-23 to CYCLE-26. Expand the subscriber to search/download by cycle as well. This would really be helpful for the JASON-series missions. Other users may want data by pass, too. For missions that don't include pass and cycle, appropriate warning messages have to be displayed.
Similar to:
-dc Flag to use cycle number for directory where data products will be downloaded.
-dydoy Flag to use start time (Year/DOY) of downloaded data for directory where data products will be downloaded.
-dymd Flag to use start time (Year/Month/Day) of downloaded data for directory where data products will be downloaded.
Add new -dy flag for just YEAR as the output directory structure.
The provider for the tool is currently hardcoded to POCLOUD. We should make this configurable so that other DAACs/users can use this tool. If no provider is given, we should default to POCLOUD. A provider is needed to prevent issues where the same collection name is used by multiple providers.
Users would like to see just how much data they are going to download. Adding a --dry-run option would run the search, create the download lists, and let a user know how many files to expect.
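A sketch of how a --dry-run flag might be wired in (the search callable and everything except -c and --dry-run are illustrative):

```python
import argparse

def make_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--collection-shortname", required=True)
    parser.add_argument("--dry-run", action="store_true",
                        help="Search and report matching granules without downloading.")
    return parser

def run(args, search):
    """`search` is a hypothetical callable returning the list of
    matching granule URLs for a collection."""
    granules = search(args.collection_shortname)
    print(f"Found {len(granules)} total files to download")
    if args.dry_run:
        return []       # stop before any transfer happens
    return granules     # downstream code would download these
```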
I got "Error getting the token - check user name and password". My username and password in .netrc are correct. The download was not affected and started anyway.
podaac-data-subscriber -c SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1 -d ./ --start-date 2011-11-25T00:00:00Z -b="-140,20,-110,40"
Error getting the token - check user name and password
WARN: No .update in the data directory. (Is this the first run?)
Warning: only the most recent 2000 granules will be downloaded; try adjusting your search criteria (suggestion: reduce time period or spatial region of search) to ensure you retrieve all granules.
2021-11-16 07:10:54.856155 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1/SWOT_L2_LR_SSH_Expert_368_011_20121111T230805_20121111T235910_DG10_01.nc
2021-11-16 07:10:56.091361 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1/SWOT_L2_LR_SSH_Expert_368_010_20121111T221659_20121111T230804_DG10_01.nc
Whether or not a bounding box is specified, we still send the 'bounding_box' parameter to CMR. This is not supported by some collections, for example:
podaac-data-subscriber -c AU_Ocean_NRT_R01 -sd 2022-03-01T00:00:00Z -ed 2022-04-30T23:59:59Z -e '.nc' -p LANCEAMSR2 -d data --verbose
with a CMR query of:
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=LANCEAMSR2&updated_since=2022-03-01T00%3A00%3A00Z&ShortName=AU_Ocean_NRT_R01&temporal=2022-03-01T00%3A00%3A00Z%2C2022-04-30T23%3A59%3A59Z&bounding_box=-180%2C-90%2C180%2C90
Yields 0 results.
if we take that same CMR query, and remove the bounding box parameter, we get results:
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=LANCEAMSR2&updated_since=2022-03-01T00%3A00%3A00Z&ShortName=AU_Ocean_NRT_R01&temporal=2022-03-01T00%3A00%3A00Z%2C2022-04-30T23%3A59%3A59Z
...
{
"hits": 370,
"took": 507,
"items": [
{
"meta": {
"concept-type": "granule",
"concept-id": "G2250539839-LANCEAMSR2",
"revision-id": 1,
"native-id": "AMSR_U2_L2_Ocean_R01_202204120051_D.he5",
"provider-id": "LANCEAMSR2",
"format": "application/echo10+xml",
"revision-date": "2022-04-12T03:23:40.993Z"
Even though this is not a PO.DAAC dataset, we should support this query. The granule uses OrbitCalculatedSpatialDomains rather than a geodetic or horizontal spatial domain.
Had user feedback that after podaac-data-subscriber was installed for them on a Unix machine, the module showed up as installed, but the commands "podaac-data-subscriber" and "podaac-data-downloader" could not be found.
Was fixed by adding $HOME/.local/bin to PATH.
This situation could come up for anyone doing a pip install --user podaac-data-subscriber as well. Would be good to have something in the README covering this scenario.
Here's one potential reference to use: https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-to-the-user-site
And since I’m emailing, I may as well ask about the problem I expected to meet. The GRACE series has four files per month:
Is there any way to use this downloader to only download, say, the GSM “BA” format and the GAC file, as I used to via FTP/Drive? (GRACE is a small enough data series that this isn’t crucial, but I thought I’d ask.)
An ECCO dataset of ancillary files has no natural 'start date' and 'end date'. Users shouldn't be required to specify them to download.
$ podaac-data-subscriber -c ECCO_L4_ANCILLARY_DATA_V4R4 -d anc
WARN: No .update in the data directory. (Is this the first run?)
Downloaded: 0 files
Files Failed to download:0
CMR token successfully deleted
Oh, it also fails when I specify a start date and end date that spans the entire ECCO period:
$ podaac-data-subscriber -c ECCO_L4_ANCILLARY_DATA_V4R4 -d anc -sd 1990-01-01T00:00:00Z -ed 2021-01-01T01:01:01Z
NOTE: .update found in the data directory. (The last run was at 2022-01-27T20:07:48Z.)
Downloaded: 0 files
Files Failed to download:0
CMR token successfully deleted
ECHO-Token Deprecation Notice
CMR Legacy Services' ECHO tokens will be deprecated soon. Please use EDL tokens and send them with the Authorization header. This document contains many mentions of ECHO-Tokens, which will soon be out of date. Instructions on how to generate an EDL token are here
We will need to update the subscriber code base to either accept an EDL token or generate one. Generating one fits most seamlessly with the current way of working.
Is it possible to add a new optional argument that allows users to search available datasets from CMR based on keywords and return short_names?
The descriptions for --start-date and --end-date options are not correct in the online help and the md files.
Currently shows:
-sd STARTDATE, --start-date STARTDATE
The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z
-ed ENDDATE, --end-date ENDDATE
The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z.
Check the existence of .netrc (_netrc) file. If it does not exist, prompt a message with two options (1) if one has an EDL, ask for username and password then generate .netrc file for the user (2) if no EDL, redirect users to apply an EDL then come back to the data-subscriber.
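A hedged sketch of the existence check and the entry that would be written for option (1) (file locations follow the usual .netrc/_netrc convention; helper names are made up):

```python
import os
import platform

def find_netrc():
    """Return the path of the user's .netrc (_netrc on Windows), or
    None if absent -- the cue to prompt for EDL credentials or point
    the user at Earthdata Login registration."""
    filename = "_netrc" if platform.system() == "Windows" else ".netrc"
    path = os.path.join(os.path.expanduser("~"), filename)
    return path if os.path.exists(path) else None

def netrc_entry(username, password, machine="urs.earthdata.nasa.gov"):
    """Render the netrc entry to append for Earthdata Login."""
    return f"machine {machine} login {username} password {password}\n"
```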
The current subscriber only looks at the 'created_at' time, which is fine for initial ingest but will miss updates to data files.
Installed version 1.9.0 with pip but it did not install the required tenacity dependency.
pip install --force --user -U podaac-data-subscriber
Collecting podaac-data-subscriber
Using cached podaac_data_subscriber-1.9.0-py3-none-any.whl (22 kB)
Installing collected packages: podaac-data-subscriber
Attempting uninstall: podaac-data-subscriber
Found existing installation: podaac-data-subscriber 1.9.0
Uninstalling podaac-data-subscriber-1.9.0:
Successfully uninstalled podaac-data-subscriber-1.9.0
Successfully installed podaac-data-subscriber-1.9.0
If I try to run the new version:
Traceback (most recent call last):
File "/home/DEV-cloud/.local/bin/podaac-data-subscriber", line 5, in <module>
from subscriber.podaac_data_subscriber import main
File "/home/DEV-cloud/.local/lib/python3.6/site-packages/subscriber/podaac_data_subscriber.py", line 25, in <module>
from subscriber import podaac_access as pa
File "/home/DEV-cloud/.local/lib/python3.6/site-packages/subscriber/podaac_access.py", line 21, in <module>
import tenacity
ModuleNotFoundError: No module named 'tenacity'
Add a helper flag for shifting the timestamp used, on a collection-by-collection basis, when creating the DOY folder.
I use Python, I use the command line. I installed data-subscriber, ran podaac-data-subscriber -h, quickly read the syntax, and tried to download a single ECCO dataset into the current directory with -d ./
Note that ./ is perfectly normal syntax meaning the current directory. The error message is not helpful.
$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01 -d .
$ podaac-data-subscriber -c ECCO_L4_ATM_STATE_05DEG_DAILY_V4R4 -ed 1993-01-01T00:00:00Z -d ./
WARN: No .update in the data directory. (Is this the first run?)
Traceback (most recent call last):
File "/home/ifenty/anaconda3/envs/ecco/bin/podaac-data-subscriber", line 8, in <module>
sys.exit(run())
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/site-packages/subscriber/podaac_data_subscriber.py", line 345, in run
with urlopen(url) as f:
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/home/ifenty/anaconda3/envs/ecco/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Took a while to discover that I need to specify -d DIRECTORY_NAME and that the DIRECTORY_NAME will be created.
Running the following command:
podaac-data-subscriber -c AU_Ocean_NRT_R01 -sd 2021-06-10T00:00:00Z -e '.he5' -p LANCEAMSR2 -d ./LANCE --verbose
we get no search results. The underlying query sets a default bounding box of 'global', which returns 0 results because the underlying granules do not use horizontal spatial domains; they use the orbit number, crossing time, and crossing latitude for search.
Until we can support that kind of search, we should remove the 'bounding box' from default searches.
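A sketch of the proposed behavior: include 'bounding_box' in the CMR parameters only when the user actually supplied one (function and parameter names are illustrative):

```python
def build_params(short_name, provider, temporal, bounding_box=None):
    """Build CMR granule-search parameters, omitting 'bounding_box'
    unless the user passed one. Collections indexed only by
    OrbitCalculatedSpatialDomains return 0 results for any
    bounding-box query, even the global extent."""
    params = {
        'page_size': 2000,
        'sort_key': "-start_date",
        'provider': provider,
        'ShortName': short_name,
        'temporal': temporal,
    }
    if bounding_box is not None:
        params['bounding_box'] = bounding_box
    return params
```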
I spoke too soon. The downloader is still working flawlessly for the Sentinel-6 data (thank you!). With optimism in mind, I moved to switch over my GRACE monthly spherical harmonic downloads to PODAAC as well. And… can’t get it to work.
Here’s the command I typed in:
podaac-data-downloader -c GRACEFO_L2_CSR_MONTHLY_0060 -d /Volumes/DataDisk/GRACE_RL06/CSR_SPHARM_60 -sd 2018-01-01T00:00:00Z -ed 2018-12-31T00:00:00Z
I’m looking for this data (which is supposed to be “cloud enabled” now – I checked this time!)
https://podaac.jpl.nasa.gov/dataset/GRACEFO_L2_CSR_MONTHLY_0060
And downloading by year, so for the first run, I was looking for all the data from 2018-1-1 to 2018-12-31.
When I called that, all the output I got was:
Found 0 total files to download
Downloaded: 0 files
Files Failed to download:0
It will be helpful to display the total number of found granules, especially when the warning "Warning: only the most recent 2000 granules will be downloaded;" is shown.
Allow the option to download a subscribed event to a user's S3 bucket; no data should leave the cloud when this happens.
These two defaults simplify use of the tool to an extreme: "data-subscriber -c shortname". This will not affect experienced users but significantly lowers the level of effort for new users.
The downloader does not seem to identify granules correctly during a request with '-e' for specific extensions. In contrast, the subscriber identifies and downloads granules using all the same parameters. Examples below:
Requesting .nc files using downloader
podaac-data-downloader -c JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F -d S6_L2_HR_STD_NRT -e .nc -sd 2021-06-01T00:46:02Z -ed 2021-06-01T03:00:00Z
[2022-09-02 16:02:12,020] {podaac_data_downloader.py:242} INFO - Found 0 total files to download
[2022-09-02 16:02:12,021] {podaac_data_downloader.py:284} INFO - Downloaded Files: 0
[2022-09-02 16:02:12,025] {podaac_data_downloader.py:285} INFO - Failed Files: 0
[2022-09-02 16:02:12,029] {podaac_data_downloader.py:286} INFO - Skipped Files: 0
[2022-09-02 16:02:12,297] {podaac_access.py:118} INFO - CMR token successfully deleted
[2022-09-02 16:02:12,297] {podaac_data_downloader.py:288} INFO - END
Requesting .nc files using subscriber
podaac-data-subscriber -c JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F -d S6_L2_HR_STD_NRT -e .nc -sd 2021-06-01T00:46:02Z -ed 2021-06-01T03:00:00Z
[2022-09-02 16:01:06,815] {podaac_data_subscriber.py:179} WARNING - No .update__JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F in the data directory. (Is this the first run?)
[2022-09-02 16:01:07,953] {podaac_data_subscriber.py:270} INFO - Found 10 total files to download
[2022-09-02 16:01:12,063] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:12.063932 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T025238_20210601T025438_F02.nc
[2022-09-02 16:01:13,701] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:13.701817 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T024238_20210601T025238_F02.nc
[2022-09-02 16:01:15,301] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:15.301483 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T023308_20210601T024238_F02.nc
[2022-09-02 16:01:16,771] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:16.769997 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_181_20210601T022320_20210601T022637_F02.nc
[2022-09-02 16:01:18,283] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:18.283086 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T020158_20210601T020254_F02.nc
[2022-09-02 16:01:20,296] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:20.296783 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T014632_20210601T015135_F02.nc
[2022-09-02 16:01:21,884] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:21.884365 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T013541_20210601T013645_F02.nc
[2022-09-02 16:01:23,508] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:23.508012 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_180_20210601T012546_20210601T013428_F02.nc
[2022-09-02 16:01:25,340] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:25.340850 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_179_20210601T005400_20210601T010128_F02.nc
[2022-09-02 16:01:27,031] {podaac_data_subscriber.py:299} INFO - 2022-09-02 16:01:27.031012 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_HR_STD_OST_NRT_F/S6A_P4_2__HR_STD__NR_020_179_20210601T004400_20210601T005400_F02.nc
[2022-09-02 16:01:27,032] {podaac_data_subscriber.py:314} INFO - Downloaded Files: 10
[2022-09-02 16:01:27,035] {podaac_data_subscriber.py:315} INFO - Failed Files: 0
[2022-09-02 16:01:27,036] {podaac_data_subscriber.py:316} INFO - Skipped Files: 0
[2022-09-02 16:01:27,358] {podaac_access.py:118} INFO - CMR token successfully deleted
[2022-09-02 16:01:27,364] {podaac_data_subscriber.py:318} INFO - END
It would be convenient if I could optionally provide dates without times for --start-date and --end-date. If no time is provided, T00:00:00Z/T23:59:59Z would be automatically added to the provided start and end dates. EDSC does this, as does the C2C CLI tooling. The C2C CLI code can probably be reused here.
User posted an issue on the po.daac forum.
meteo@BOIRA:~/PROJECTES/SST/NCEI/DATA/SST/NC2$ podaac-data-subscriber -c AVHRR_OI-NCEI-L4-GLOB-v2.1 -d ./ --verbose
NOTE: .update found in the data directory. (The last run was at 2021-10-29T00:05:03Z
.)
Provider: POCLOUD
Updated Since: 2021-10-29T00:05:03Z
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&provider=POCLOUD&ShortName=AVHRR_OI-NCEI-L4-GLOB-v2.1&updated_since=2021-10-29T00%3A05%3A03Z%0A&token=****&bounding_box=-180%2C-90%2C180%2C90
Traceback (most recent call last):
File "/home/meteo/.local/bin/podaac-data-subscriber", line 11, in <module>
load_entry_point('podaac-data-subscriber==1.6.0', 'console_scripts', 'podaac-data-subscriber')()
File "/home/meteo/.local/lib/python3.6/site-packages/subscriber/podaac_data_subscriber.py", line 339, in run
with urlopen(url) as f:
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
The trailing %0A in the 'updated_since' value was causing the issue. Suggest we strip any whitespace characters when reading the .update file.
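The strip could be as simple as (helper name is illustrative):

```python
def read_update_timestamp(path):
    """Read the last-run timestamp from a .update file, stripping any
    trailing newline/whitespace so it can't leak into the CMR query
    as an encoded '%0A'."""
    with open(path) as f:
        return f.read().strip()
```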
Saw this during regression testing:
WARNING root:podaac_data_subscriber.py:307 2022-08-04 14:20:03.485885 FAILURE: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_STD_OST_NRT_F/S6A_P4_2__LR_STD__NR_042_083_20220101T104242_20220101T123506_F04.nc
Traceback (most recent call last):
File "/Users/runner/work/data-subscriber/data-subscriber/subscriber/podaac_data_subscriber.py", line 302, in run
urlretrieve(f, output_path)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 239, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 523, in open
response = meth(req, response)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 632, in http_response
response = self.parent.error(
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 555, in error
result = self._call_chain(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 494, in _call_chain
result = func(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 747, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 523, in open
response = meth(req, response)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 632, in http_response
response = self.parent.error(
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 561, in error
return self._call_chain(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 494, in _call_chain
result = func(*args)
File "/Users/runner/hostedtoolcache/Python/3.9.13/x64/lib/python3.9/urllib/request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
We should catch the 503 and retry. A 503 could happen for any number of reasons, but we're interested in addressing transient issues that occur occasionally.
Some more information can be found here: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/http-503-service-unavailable.html
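A sketch of what the retry could look like; retrieve_with_retry and flaky_fetch are hypothetical names, and the fetch callable stands in for the downloader's urlretrieve call:

```python
import time
import urllib.error

def retrieve_with_retry(fetch, retries=3, backoff=0.0):
    """Hypothetical retry wrapper: call fetch() and retry on a
    transient HTTP 503, up to `retries` attempts, sleeping
    backoff * attempt seconds between tries."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            # Only retry the transient 503; re-raise everything else,
            # or give up once we are out of attempts
            if e.code != 503 or attempt == retries:
                raise
            time.sleep(backoff * attempt)

# Simulated download that fails twice with 503, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError("http://example", 503,
                                     "Service Unavailable", None, None)
    return "ok"

print(retrieve_with_retry(flaky_fetch))
# ok
```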
Add an option to the subscriber to enable non-flat output directories, i.e. some default layouts.
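One way the default layouts could be keyed off the granule start time; layout_path and the layout names are hypothetical, not existing subscriber options:

```python
from datetime import datetime
from pathlib import Path

def layout_path(data_dir, granule_start, layout="year/doy"):
    """Hypothetical helper: map a granule start time onto one of a
    few default non-flat directory layouts under the data directory."""
    t = datetime.strptime(granule_start, "%Y-%m-%dT%H:%M:%SZ")
    if layout == "year/doy":            # e.g. 2022/001
        sub = t.strftime("%Y/%j")
    elif layout == "year/month/day":    # e.g. 2022/01/01
        sub = t.strftime("%Y/%m/%d")
    else:                               # flat (current behavior)
        sub = ""
    return Path(data_dir) / sub

print(layout_path("data", "2022-01-01T10:42:42Z"))
```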
Add the citation information for a dataset when we download the data. An example is a file named <dataset_short_name>.citation.txt, which can, at minimum, include the citation information.
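A minimal sketch of writing such a file; write_citation is a hypothetical helper, and in practice the citation text would come from the collection's metadata:

```python
import tempfile
from pathlib import Path

def write_citation(data_dir, short_name, citation_text):
    """Hypothetical sketch: write <dataset_short_name>.citation.txt
    alongside the downloaded granules."""
    path = Path(data_dir) / f"{short_name}.citation.txt"
    path.write_text(citation_text)
    return path

with tempfile.TemporaryDirectory() as d:
    p = write_citation(d, "MODIS_A-JPL-L2P-v2019.0",
                       "Example citation text.")
    print(p.name)
# MODIS_A-JPL-L2P-v2019.0.citation.txt
```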
-start-date should be --start-date in README.md (see https://github.com/podaac/data-subscriber#your-first-run).
When running the downloader/subscriber with a collection supporting SHA-512, we run into errors:
podaac-data-downloader --verbose -c GRACEFO_L2_CSR_MONTHLY_0060 -d ./podaac_csr -sd 2018-01-01T00:00:00Z -ed 2022-06-14T16:11:58Z -e "00"
[2022-06-13 11:02:44,494] {podaac_data_downloader.py:158} INFO - NOTE: Making new data directory at ./podaac_csr(This is the first run.)
[2022-06-13 11:02:44,699] {podaac_data_downloader.py:192} INFO - Temporal Range: 2018-01-01T00:00:00Z,2022-06-14T16:11:58Z
[2022-06-13 11:02:44,699] {podaac_data_downloader.py:195} INFO - Provider: POCLOUD
[2022-06-13 11:02:44,700] {podaac_access.py:300} INFO - https://cmr.earthdata.nasa.gov/search/granules.umm_json?page_size=2000&sort_key=-start_date&provider=POCLOUD&ShortName=GRACEFO_L2_CSR_MONTHLY_0060&temporal=2018-01-01T00%3A00%3A00Z%2C2022-06-14T16%3A11%3A58Z&token=5896F157-242C-A41B-F04D-45D86713C6ED&bounding_box=-180%2C-90%2C180%2C90
[2022-06-13 11:02:46,826] {podaac_data_downloader.py:209} INFO - 176 granules found for GRACEFO_L2_CSR_MONTHLY_0060
[2022-06-13 11:02:46,827] {podaac_data_downloader.py:249} INFO - Found 176 total files to download
[2022-06-13 11:02:46,827] {podaac_data_downloader.py:251} INFO - Downloading files with extensions: ['00']
[2022-06-13 11:02:52,528] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:52.528421 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022060-2022090_GRFO_UTCSR_BC01_0600
[2022-06-13 11:02:54,344] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:54.344359 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022060-2022090_GRFO_UTCSR_BB01_0600
[2022-06-13 11:02:56,177] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:56.177678 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022060-2022090_GRFO_UTCSR_BA01_0600
[2022-06-13 11:02:58,036] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:02:58.036421 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAC-2_2022060-2022090_GRFO_UTCSR_BC01_0600
[2022-06-13 11:03:00,256] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:03:00.256610 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GSM-2_2022032-2022059_GRFO_UTCSR_BB01_0600
[2022-06-13 11:03:02,585] {podaac_data_downloader.py:278} INFO - 2022-06-13 11:03:02.585296 SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022032-2022059_GRFO_UTCSR_BC01_0600
Running the same command again, we get an error:
WARNING - 2022-06-13 11:03:32.626580 FAILURE: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/GRACEFO_L2_CSR_MONTHLY_0060/GAD-2_2022060-2022090_GRFO_UTCSR_BC01_0600
Traceback (most recent call last):
File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_data_downloader.py", line 271, in run
if(exists(output_path) and not args.force and pa.checksum_does_match(output_path, checksums)):
File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_access.py", line 418, in checksum_does_match
computed_checksum = make_checksum(file_path, checksum["Algorithm"])
File "/Users/gangl/miniconda3/lib/python3.8/site-packages/subscriber/podaac_access.py", line 431, in make_checksum
hash_alg = getattr(hashlib, algorithm.lower())()
AttributeError: module 'hashlib' has no attribute 'sha-512'