landsat-pds / landsat_ingestor
Scripts and other artifacts for landsat data ingestion into Amazon public hosting.
License: Apache License 2.0
Currently each scene is processed individually. Part of the processing is requesting an authenticated download url from USGS. It is currently done like this:
usgs.api.download('LANDSAT_8', 'EE', [scene_root], 'STANDARD')
Rather than submitting 1 request per scene, we can group multiple scenes together in a single request.
usgs.api.download('LANDSAT_8', 'EE', [scene_root_1, scene_root_2, ..., scene_root_n], 'STANDARD')
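The grouping above could be sketched as a small helper that chunks scene ids and issues one `usgs.api.download` call per chunk. This is a sketch, not the ingestor's code; `api` stands in for the `usgs.api` module, and the batch size is an assumed parameter.

```python
def batched(items, size):
    """Yield successive chunks of `items`, each of length <= size."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def request_urls(api, scene_roots, batch_size=10):
    """Request download urls in batches instead of one call per scene.

    `api.download` stands in for usgs.api.download; one call now covers
    a whole chunk of scene ids.
    """
    urls = []
    for chunk in batched(scene_roots, batch_size):
        urls.extend(api.download('LANDSAT_8', 'EE', chunk, 'STANDARD'))
    return urls
```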
/cc @warmerdam
When we're bringing in nighttime data, the results look a bit odd (e.g., http://landsat-pds.s3.amazonaws.com/L8/140/209/LC81402092016228LGN00/index.html). Is there a way we can make it obvious in the scene_list which scenes are NIGHT scenes? Looking for a cloudCover of -1 may be a fine proxy, but it could be nice to make it more obvious what's going on. This data is already in the individual metadata files, just not in the top-level scene_list.
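Using the cloudCover proxy suggested above, night scenes could be flagged from the existing scene_list with a few lines. This is a sketch; the column names `entityId` and `cloudCover` are assumed to match the scene_list header.

```python
import csv
import io

def night_scene_ids(scene_list_text):
    """Return entityIds whose cloudCover is negative.

    A negative cloudCover (-1) is used here as a proxy for NIGHT scenes,
    per the issue; `entityId`/`cloudCover` column names are assumptions.
    """
    reader = csv.DictReader(io.StringIO(scene_list_text))
    return [row['entityId'] for row in reader
            if float(row['cloudCover']) < 0]
```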
I am getting errors like this from time to time. Possibly it would help to add some retry logic?
+ l8_process_scene.py --verbose -s s3queue --clean --list-file job_20719258.csv LC80011152015035LGN00
.....
Traceback (most recent call last):
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 110, in <module>
status = main(sys.argv[1:])
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 101, in main
overwrite = args.overwrite)
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 56, in process
verbose=verbose)
File "/opt/planet/programs/landsat_ingestor/ingestor/puller.py", line 18, in pull
return puller_s3queue.pull(scene_root, scene_dict, verbose=verbose)
File "/opt/planet/programs/landsat_ingestor/ingestor/puller_s3queue.py", line 34, in pull
for d in rv.iter_content(chunk_size=1024 * 1024 * 10):
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 616, in generate
decode_content=True):
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 236, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 183, in read
data = self._fp.read(amt)
File "/usr/lib/python2.7/httplib.py", line 561, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/lib/python2.7/ssl.py", line 241, in recv
return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 160, in read
return self._sslobj.read(len)
socket.error: [Errno 104] Connection reset by peer
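A minimal retry wrapper around the streaming download could absorb transient resets like the one above. This is a sketch, not the ingestor's code; attempt count and delay are assumed values, and `socket.error` is an alias of `OSError` on Python 3, so connection resets fall under the caught exceptions.

```python
import time

def with_retries(fn, attempts=3, delay=5, exceptions=(IOError, OSError)):
    """Call fn(), retrying on transient network errors.

    Re-raises the last error once `attempts` is exhausted; sleeps
    `delay` seconds between tries.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)
```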
If I download the scene list from https://landsat-pds.s3.amazonaws.com/c1/L8/scene_list.gz it contains productId as a column name. If I download the scene list from s3://landsat-pds/scene_list.gz it does NOT contain productId as a column name. Which scene list is considered best?
Right now a new API token is generated each time a download url is requested. The ingestor doesn't need to hit the auth endpoint each time:
landsat_ingestor/ingestor/puller_usgs.py
Lines 45 to 56 in 1d3876f
API tokens are valid for 1 hour after the initial request and reset each time a request is made. Let's be kind to USGS servers, and only request a single token.
/cc @warmerdam
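A single cached token could be kept for the lifetime of a run. The sketch below is hypothetical: it uses a plain TTL refresh rather than modeling the reset-on-use behavior, and `login_fn` stands in for whatever authenticates against USGS.

```python
import time

class TokenCache:
    """Cache a login token; re-authenticate only after `ttl` seconds."""

    def __init__(self, login_fn, ttl=3600):
        self._login_fn = login_fn  # stand-in for the USGS login call
        self._ttl = ttl            # tokens are valid for 1 hour
        self._token = None
        self._fetched_at = 0.0

    def get(self):
        if self._token is None or time.time() - self._fetched_at > self._ttl:
            self._token = self._login_fn()
            self._fetched_at = time.time()
        return self._token
```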
I’ve just subscribed to the Landsat PDS AWS SNS feed, but it looks like I’m getting lots of old scenes through. In the last 12hrs I’ve received hundreds of events, but the most recent acquisition date is 2017-01-12.
Can anyone confirm whether this is expected behaviour? Are new scenes coming through the SNS feed at the moment?
Every two hours, as we try to reprocess the tarq contents, this corrupt scene is retried and fails. We need better logic to migrate such corrupt files to the tarq_corrupt area, like we do for the quick test in puller_s3queue.py. It is a little trickier in this case since the failure happens significantly later.
l8_process_scene.py --verbose -s s3queue --clean --overwrite --list-file job_33049805.csv LC81790442014335LGN00
Scene LC81790442014335LGN00 already exists on destination bucket.
Processing scene: LC81790442014335LGN00
Fetching: http://s3-us-west-2.amazonaws.com/landsat-pds/tarq/LC81790442014335LGN00.tar.gz
.....
LC81790442014335LGN00_B1.TIF
LC81790442014335LGN00_B2.TIF
LC81790442014335LGN00_B3.TIF
LC81790442014335LGN00_B4.TIF
LC81790442014335LGN00_B5.TIF
LC81790442014335LGN00_B6.TIF
LC81790442014335LGN00_B7.TIF
LC81790442014335LGN00_B8.TIF
LC81790442014335LGN00_B9.TIF
LC81790442014335LGN00_B10.TIF
LC81790442014335LGN00_B11.TIF
LC81790442014335LGN00_BQA.TIF
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
LC81790442014335LGN00.tar.gz successfully downloaded (942628864 bytes)
tar xvf LC81790442014335LGN00.tar.gz --directory=LC81790442014335LGN00
Traceback (most recent call last):
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 110, in <module>
status = main(sys.argv[1:])
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 101, in main
overwrite = args.overwrite)
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 58, in process
local_dir = splitter.split(scene_root, local_tarfile, verbose=verbose)
File "/opt/planet/programs/landsat_ingestor/ingestor/splitter.py", line 63, in split
verbose=verbose)
File "/opt/planet/programs/landsat_ingestor/ingestor/splitter.py", line 13, in run_command
raise Exception('command "%s" failed with code %d.' % (cmd, result))
Exception: command "tar xvf LC81790442014335LGN00.tar.gz --directory=LC81790442014335LGN00 " failed with code 512.
Task ended with status 1
+ STATUS=1
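The quarantine step described above could be sketched as a move from `tarq/` to `tarq_corrupt/` once extraction fails, so the scene is not retried every two hours. This assumes a boto-style bucket object with `copy_key`/`delete_key`; the key layout follows the log output in this issue.

```python
def quarantine_corrupt(bucket, scene_root):
    """Move a corrupt tarball out of the retry queue.

    `bucket` is assumed to behave like a boto S3 Bucket (copy_key then
    delete_key); returns the destination key name.
    """
    src = 'tarq/%s.tar.gz' % scene_root
    dst = 'tarq_corrupt/%s.tar.gz' % scene_root
    bucket.copy_key(dst, bucket.name, src)
    bucket.delete_key(src)
    return dst
```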
Hi
On Landsat on AWS you say that "all Landsat-8 scenes from 2015 are available along with a selection of cloud-free scenes from 2013 and 2014". However, I have come across several scenes acquired after 2015 that I can't find at Landsat on AWS.
Here are two examples acquired 2016-07-08:
Am I doing something wrong/looking at the wrong place, or is it correct that the scenes are not there?
Regards,
Vebjørn
A proposed fix is adding:
key.content_disposition = 'attachment'
after key.content_type = 'image/tiff' is set for TIFF files in pusher.py.
(originally proposed by Jed).
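The proposed change can be sketched as a small header-setting helper. `key` stands in for a boto S3 Key object; it is modeled minimally here so the logic is checkable, and the `.TIF` suffix check is an assumption.

```python
def set_tiff_headers(key, filename):
    """Apply Content-Type plus the proposed Content-Disposition to TIFFs.

    `attachment` makes browsers download the file instead of trying to
    render it inline.
    """
    if filename.lower().endswith('.tif'):
        key.content_type = 'image/tiff'
        key.content_disposition = 'attachment'  # proposed fix
    return key
```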
Circa March 1st (in 91e9f50) we set things up to tile and build overviews for ingested scenes; however, we still haven't gone back and reprocessed existing scenes.
There is now an ingestor/for_each_scene.py that can iterate over scenes, and a reprocess_scene.py that can fix up scenes. Work out a technique to run this for the existing outdated scenes but not newer scenes.
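Selecting only the outdated scenes could be sketched as a date filter fed by the iteration script. This is hypothetical: the `(scene_id, processed_date)` pairs and the March 1 cutoff tied to 91e9f50 are assumptions about how the driver would be wired up.

```python
import datetime

# Tiling/overview support landed circa March 1st (91e9f50); scenes
# processed before then need reprocess_scene.py run against them.
CUTOFF = datetime.date(2015, 3, 1)

def outdated_scenes(scenes, cutoff=CUTOFF):
    """Yield scene ids whose last processing predates the overview change.

    `scenes` is an iterable of (scene_id, processed_date) pairs; in
    practice these would come from iterating with for_each_scene.py.
    """
    for scene_id, processed in scenes:
        if processed < cutoff:
            yield scene_id
```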
+cc @kapadia
I'm seeing lots of job failures like this:
gdaladdo -ro -r average --config COMPRESS_OVERVIEW DEFLATE --config PREDICTOR_OVERVIEW 2 --config GDAL_TIFF_OVR_BLOCKSIZE 512 LC82200762015113LGN00/LC82200762015113LGN00_B10.TIF 3 9 27 81
Traceback (most recent call last):
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 115, in <module>
status = main(sys.argv[1:])
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 106, in main
overwrite = args.overwrite)
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_scene.py", line 65, in process
scene_info.add_mtl_info(scene_dict, scene_root, local_dir)
File "/opt/planet/programs/landsat_ingestor/ingestor/scene_info.py", line 74, in add_mtl_info
mtl_dict['PRODUCT_METADATA']['SCENE_CENTER_TIME'])
TypeError: combine() argument 2 must be datetime.time, not str
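The TypeError arises because `datetime.datetime.combine` needs a `datetime.time`, while the MTL value is still a string. A tolerant parse could look like the sketch below; the quoted `"HH:MM:SS.ffffffZ"` input shape is an assumption based on typical MTL SCENE_CENTER_TIME values.

```python
import datetime

def parse_scene_center_time(value):
    """Convert an MTL SCENE_CENTER_TIME string to datetime.time.

    Handles surrounding quotes, a trailing 'Z', and fractional seconds,
    so the result can be passed safely to datetime.datetime.combine.
    """
    value = value.strip().strip('"').rstrip('Z')
    hh, mm, ss = value.split(':')
    secs = float(ss)
    micros = min(int(round((secs - int(secs)) * 1e6)), 999999)
    return datetime.time(int(hh), int(mm), int(secs), micros)
```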
Some scenes do not include all of the bands necessary to render a thumbnail, and their index pages look broken as a result.
Here are a few scenes with insufficient bands as an example:
https://s3-us-west-2.amazonaws.com/landsat-pds/L8/116/206/LT81162062015025LGN00/index.html
https://s3-us-west-2.amazonaws.com/landsat-pds/L8/116/202/LT81162022015025LGN00/index.html
https://s3-us-west-2.amazonaws.com/landsat-pds/L8/141/218/LT81412182015024LGN00/index.html
In instances where we do not have enough bands to produce a thumbnail, we should simply not insert a reference to a jpg. In the future, we may consider including messaging explaining that a limited set of bands is available for the scene.
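The gating described above could be sketched as an index-page helper that only emits the image tag when the needed bands exist. The B4/B3/B2 natural-color requirement and the thumbnail filename are assumptions for illustration.

```python
# Bands assumed necessary to render the natural-color thumbnail.
REQUIRED_BANDS = ('B4', 'B3', 'B2')

def thumbnail_html(scene_root, available_files):
    """Return the <img> reference, or '' when bands are insufficient."""
    names = set(available_files)
    if all('%s_%s.TIF' % (scene_root, band) in names
           for band in REQUIRED_BANDS):
        return '<img src="%s_thumb_small.jpg">' % scene_root
    return ''  # omit the reference instead of linking a broken jpg
```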
An unbound variable is left lingering when the tarball is corrupt.
+ l8_process_scene.py --verbose -s s3queue --clean --overwrite --list-file job_33861984.csv LC80750192015105LGN00
LC80750192015105LGN00_B1.TIF
LC80750192015105LGN00_B2.TIF
LC80750192015105LGN00_B3.TIF
LC80750192015105LGN00_B4.TIF
LC80750192015105LGN00_B5.TIF
LC80750192015105LGN00_B6.TIF
LC80750192015105LGN00_B7.TIF
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
LC80750192015105LGN00.tar.gz successfully downloaded (402891096 bytes)
tar xvf LC80750192015105LGN00.tar.gz --directory=LC80750192015105LGN00
Traceback (most recent call last):
File "landsat_ingestor/ingestor/l8_process_scene.py", line 115, in <module>
status = main(sys.argv[1:])
File "landsat_ingestor/ingestor/l8_process_scene.py", line 106, in main
overwrite = args.overwrite)
File "landsat_ingestor/ingestor/l8_process_scene.py", line 65, in process
scene_info.add_mtl_info(scene_dict, scene_root, local_dir)
UnboundLocalError: local variable 'local_dir' referenced before assignment
Task ended with status 1
/cc @warmerdam
Can we reacquire and replace the 53,206 scenes listed at http://landsat-pds.s3.amazonaws.com/L8.reprocessed.2015.txt?
I have seen a couple of duplicates showing up in scene_list.gz. It doesn't seem to be tied to the date. Maybe items are getting queued up twice?
$ grep LC80200312015200LGN00 scene_list
LC80200312015200LGN00,2015-07-19 16:16:07.837833,65.39,L1T,20,31,40.62882,-85.17706,42.79844,-82.23444,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/020/031/LC80200312015200LGN00/index.html
LC80200312015200LGN00,2015-07-19 16:16:07.837833,65.39,L1T,20,31,40.62882,-85.17706,42.79844,-82.23444,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/020/031/LC80200312015200LGN00/index.html
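One way to guard the published list would be to dedupe rows on the entityId (the first CSV column) before writing scene_list out, keeping the first occurrence. A sketch:

```python
def dedupe_scene_list(lines):
    """Drop duplicate scene_list rows, keyed on the first CSV column."""
    seen = set()
    out = []
    for line in lines:
        entity_id = line.split(',', 1)[0]
        if entity_id not in seen:
            seen.add(entity_id)
            out.append(line)
    return out
```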
A new error is being propagated from USGS servers. It appears that changes have been made on their end and downloads are now being throttled.
If this is the case, we'll need to rework certain areas of the landsat_ingestor to request no more than 10 download urls at a time. I'll investigate a little more to find out the extent of this new constraint.
usgs.USGSError: User currently has more than 10 downloads that have not been attempted in the past 10 minutes.
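The constraint in that error could be honored with a sliding-window limiter: track when download urls were requested and refuse new ones while 10 are outstanding within 10 minutes. This is a hypothetical sketch; the `clock` parameter exists only to make the behavior testable.

```python
import time

class DownloadLimiter:
    """Allow at most `max_requests` download-url requests per `window` seconds."""

    def __init__(self, max_requests=10, window=600, clock=time.time):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.issued = []  # timestamps of issued requests

    def ready(self):
        """True when another request may be issued right now."""
        now = self.clock()
        self.issued = [t for t in self.issued if now - t < self.window]
        return len(self.issued) < self.max_requests

    def record(self):
        """Record that a request was just issued."""
        self.issued.append(self.clock())
```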
I get stuff like the following at least occasionally. Perhaps there is something that needs retry logic?
+ l8_process_run.py -v -s auto --start-date=2015-01-15 --end-date=2015-02-06 --queue
logging in...
Traceback (most recent call last):
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_run.py", line 178, in <module>
status = main(sys.argv[1:])
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_run.py", line 155, in main
limit=args.limit)
File "/opt/planet/programs/landsat_ingestor/ingestor/l8_process_run.py", line 21, in query_for_scenes
os.environ['USGS_PASSWORD'])
File "/opt/planet/programs/landsat_ingestor/ingestor/usgs/api.py", line 154, in login
api_key = element.text
AttributeError: 'NoneType' object has no attribute 'text'
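Beyond retrying, the login parser could check for the missing element explicitly so a transient server hiccup produces a clear, retryable error instead of an AttributeError. This is a sketch of the shape of such a check, not the actual `usgs/api.py` code; `element` is whatever the XML lookup returned.

```python
def extract_api_key(element):
    """Pull the API key text from a parsed login response element.

    Raises a descriptive error when the element is absent (e.g. a
    transient USGS server failure) so callers can decide to retry.
    """
    if element is None:
        raise RuntimeError('USGS login response missing API key element; '
                           'possibly transient, retry the login')
    return element.text
```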