
ingest_deploy: Introduction

Harvesting infrastructure components

Consult the harvesting infrastructure diagram for an illustration of the key components. Ask Mark Redar for access to them; note that you will need to log onto the blackstar machine to run commands, using these PuTTY connection instructions (on SharePoint).

As of February 2016, the process to publish a collection to production is as follows:

  1. Create collection, add harvest URL & mapping/enrichment chain
  2. Select "Queue harvest for collection on normal queue" on the registry page for the collection
  3. Check that there is a worker listening on the queue. If not, start one (see Stage Worker).
  4. Wait until the harvest job finishes, hopefully without error. Now the collection has been harvested to the stage CouchDB.
  5. The first round of QA can be performed there, in the stage CouchDB (see CouchDB stage).
  6. Push the new CouchDB docs into the stage Solr index: select "Queue sync solr index for collection(s) on normal-stage" on the registry page for the collection (see Updating Solr).
  7. QA the stage Solr index in the public interface (see Solr stage).
  8. When ready to publish to production, edit Collection in the registry and check the "Ready for publication" box and save.
  9. Select "Queue sync to production couchdb for collection" (see Syncing CouchDB).
  10. Check that there is a worker in the production environment listening on the normal prod queue; if not, start one (see Production Worker).
  11. Wait until the sync job finishes. Now the collection has been harvested to the production CouchDB.
  12. Sync the new docs to the production Solr by starting the sync from the registry for the new collections. At this point the Collection is in the new, candidate Calisphere Solr index
  13. Once QA is done on the candidate index and ready to push new one to Calisphere, push the index to S3
  14. Clone the existing Solr API Elastic Beanstalk and point to the packaged index on S3
  15. Swap the URLs between the older Solr API Elastic Beanstalk and the new Elastic Beanstalk.

UCLDC Harvesting operations guide

User accounts

Preliminary steps: add collection to the Collection Registry and define harvesting endpoint

Conducting a harvest to stage

Moving a harvest to production

Updating Elastic Beanstalk with candidate Solr index

Removing items or collections (takedown requests)

Restoring collections from production

Additional resources

Fixes for Common Problems

[Addendum: Creating new AMI images - Developers only]

Pull the ucldc/ingest_deploy project and get the ansible vault password from Mark. It's easiest if you create a file (perhaps ~/.vault-password-file, with mode 600) to store it in and alias ansible-playbook to ansible-playbook --vault-password-file=~/.vault-password-file.

create an htdigest entry by running

htdigest -c tmp.pswd ingest <username>

This will prompt for a password, which is easy to generate with pwgen. Copy the resulting line from tmp.pswd.
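For example, a minimal sketch of that workflow (the username jdoe is hypothetical):

# generate one 16-character password to use at the htdigest prompt
pwgen -s 16 1

# create the htdigest entry; enter the generated password when prompted
htdigest -c tmp.pswd ingest jdoe

# print the resulting line so it can be copied into digest_auth_users.yml
cat tmp.pswd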

Then run:

ansible-vault edit --vault-password-file=~/.vault-password-file
  ingest_deploy/ansible/roles/ingest_front/vars/digest_auth_users.yml

Entries in this file are htdigest lines, preceded by a - to make a YAML list, e.g.:

---
digest_auth_users:
  - "u1:ingest:435srrr3db7b180366ce7e653493ca39"
  - "u1:ingest:rrrr756e5aacde0262130e79a888888c"
  - "u2:ingest:rrrr1cd0cd7rrr7a7839a5c1450bb8bc"

From a machine that can already access the ingest front machine with ssh run:

ansible-playbook -i hosts --vault-password-file=~/.vault_pass_ingest provision_front.yml

This will install the users.digest to allow access for the monitoring user.

add your public ssh key to the keys file at https://github.com/ucldc/appstrap/tree/master/cdl/ucldc-operator-keys.txt

From a machine that can already access the ingest front machine with ssh run:

ansible-playbook -i hosts --vault-password-file=~/.vault_pass_ingest provision_front.yml

This will add your public key to the ~/.ssh/authorized_keys for the ec2-user on the ingest front machine.

The first step in the harvesting process is to add the collection(s) for harvesting into the Collection Registry. This process is described further in Section 8 of our OAC/Calisphere Operations and Maintenance Procedures.

When establishing the entries, you'll need to determine the harvesting endpoint: Nuxeo, OAC, or an external source.

We use "transient" Redis Queue-managed (RQ) worker instances to process harvesting jobs in either a staging or production environment. They can be created as needed and then deleted after use. Once the workers have been created and provisioned, they will automatically look for jobs in the queue and run the full harvester code for those jobs.

  • Log onto blackstar and run sudo su - hrv-stg
  • To start some worker machines (bare ec2 spot instances), run: ansible-playbook ~/code/ansible/start_ami.yml --extra-vars="count=1" .
    • For on-demand instances, run: snsatnow ansible-playbook ~/code/ansible/start_ami_ondemand.yml --extra-vars="count=1"
    • For an extra large (and costly!) on-demand instance (e.g., m4.2xlarge, m4.4xlarge), run: ansible-playbook ~/code/ansible/start_ami_ondemand.yml --extra-vars="worker_instance_type=m4.2xlarge" . If you create an extra large instance, make sure you terminate it after the harvesting job is completed!

The count=## parameter will set the number of instances to create. For harvesting one small collection you can set this to count=1. To re-harvest all collections, you can set this to count=20. For anything in between, use your judgment.

The default instance creation will attempt to get instances from the "spot" market so that it is cheaper to run the workers. Sometimes the spot market price can get very high and the spot instances won't work. You can check the pricing by issuing the following command on blackstar, hrv-stg user:

aws ec2 describe-spot-price-history --instance-types m3.large --availability-zone us-west-2c --product-description "Linux/UNIX (Amazon VPC)" --max-items 2
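To pull out just the most recent prices, a hedged sketch with jq (assuming the AWS CLI is configured for JSON output):

aws ec2 describe-spot-price-history --instance-types m3.large --availability-zone us-west-2c --product-description "Linux/UNIX (Amazon VPC)" --max-items 2 | jq -r '.SpotPriceHistory[] | "\(.Timestamp)  \(.SpotPrice)"'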

Our spot bid price is set to $0.133, which is the current (as of 2016-08-03) on-demand price. If the history of spot prices is greater than that, or if you see large fluctuations in the pricing, you can request an on-demand instance instead by running the ondemand playbook (NOTE: when the command is wrapped with snsatnow, the double quotes must be escaped with a backslash \):

ansible-playbook ~/code/ansible/start_ami_ondemand.yml --extra-vars="count=3"

Sometimes the status of the worker instances is unclear.

To check the processing status for a given worker, log into Blackstar and SSH to the particular stage or prod machine.

cd to /var/local/rqworker and locate the worker.log file.
Run tail -f worker.log to view the logs.

You can also use the ec2.py dynamic ansible inventory script with jq to parse the json to find info about the state of the worker instances.

First, refresh the cache for the dynamic inventory:

~/code/ec2.py --refresh-cache

To see the current info for the workers:

get_worker_info.sh

This will report each worker's state (running or not), IP address, EC2 instance ID, and instance size.

You can then see the state of the instance by using jq to filter on the IP:

~/code/ec2.py | jq '._meta.hostvars["<ip address for instance>"].ec2_state'

This will tell you if it is running or not.

To get more information about the instance, just do less filtering:

~/code/ec2.py | jq -C '._meta.hostvars["<ip address for instance>"]' | less -R
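For example, a hedged jq sketch (based on the output structure shown above) that lists the IP addresses of all instances currently in the "running" state:

~/code/ec2.py | jq -r '._meta.hostvars | to_entries[] | select(.value.ec2_state == "running") | .key'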

Once harvesting jobs are completed (see steps below), terminate the worker instances.

  • Log into blackstar and run sudo su - hrv-stg
  • To just stop (rather than terminate) instances, run: ansible-playbook -i ~/code/ec2.py ~/code/ansible/stop_workers.yml (see also below)
  • Run: ansible-playbook -i ~/code/ec2.py ~/code/ansible/terminate_workers.yml <--limit=10.60.?.?> . You can use the limit parameter to specify a range of IP addresses for deletion.
  • To force terminate an instance, append --tags=terminate-instances
  • You'll receive a prompt to confirm that you want to spin down the instance; hit Return to confirm.

We should now leave one instance in a "stopped" state: terminate all but one of the instances, then run:

ansible-playbook -i ~/code/ec2.py ~/code/ansible/stop_workers.yml

This will stop the instance so it can be brought up easily. get_worker_info.sh should report the instance as "stopping" or "stopped".

Before initiating a harvest, confirm if the collection has previously been harvested -- or if it's a new collection.

If the collection has previously been harvested and is viewable in the Calisphere stage UI (http://calisphere-data.cdlib.org/), then delete the collection from CouchDB stage and Solr stage:

  • Log into the Collection Registry and look up the collection
  • Run Queue deletion of documents from CouchDB stage.
  • Then run Queue deletion of documents from Solr stage.
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put the jobs on a different queue), you can use the following commands on the dsc-blackstar role account:

./bin/delete_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26275
./bin/queue_delete_solr_collection.py [email protected] high-stage 26275

This process will harvest metadata from the target system into a resulting CouchDB record.

  • From the Collection Registry, select Queue harvest to CouchDB stage
  • You should then get a feedback message verifying that the collections have been queued
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put on a different queue), you can use the following command syntax on the dsc-blackstar role account:

queue_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943

This process will hit the URL referenced in isShownAt in the CouchDB record to derive a small preview image (used for the object landing page); that preview image is also used for thumbnails in search/browse and related item results.

  • From the Collection Registry, select Queue image harvest
  • You should then get a feedback message verifying that the collections have been queued
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put on a different queue), you can use the following command syntax on the dsc-blackstar role account:

queue_image_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943

Before initiating a harvest, confirm if the collection has previously been harvested -- or if it's a new collection.

If the collection has previously been harvested and is viewable in the Calisphere stage UI (http://calisphere-data.cdlib.org/), then delete the collection from CouchDB stage and Solr stage:

  • Log into the Collection Registry and look up the collection
  • Run Queue deletion of documents from CouchDB stage.
  • Then run Queue deletion of documents from Solr stage.
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put the jobs on a different queue), you can use the following commands on the dsc-blackstar role account:

./bin/delete_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26275
./bin/queue_delete_solr_collection.py [email protected] high-stage 26275

The process pulls files from the "Main Content File" section in Nuxeo, and formats them into access files for display in Calisphere. If you only need to pick up metadata changes in Nuxeo, skip this step. Here's what the process does:

  1. It stashes a high quality copy of any associated media or text files on S3. These files appear on the object landing page, for interactive viewing:
  • If image, creates a zoomable jp2000 version and stashes it on S3 for use with our IIIF-compatible Loris server. Tools used to convert the image include ImageMagick and Kakadu.
  • If audio, stashes an mp3 on S3.
  • If file (i.e. PDF), stashes it on S3.
  • If video, stashes an mp4 on S3.
  2. Creates a small preview image (used for the object landing page) and complex object component thumbnails, and stashes them on S3. For these particular formats, it does the following:
  • If video, creates a thumbnail and stashes it on S3. The thumbnail is created by capturing the middle frame of the video using the ffmpeg tool.
  • If PDF, creates a thumbnail and stashes it on S3. The thumbnail is created from an image of the first page of the PDF, using ImageMagick.
  3. Compiles full metadata and structural information (such as component order) for all complex objects, in the form of a media.json file. To view the media.json for a given object, use this URL syntax, where <UID> is the Nuxeo unique identifier (e.g., 70d7f57a-db0b-4a1a-b089-cce1cc289c9e); a curl sketch follows this list: https://s3.amazonaws.com/static.ucldc.cdlib.org/media_json/<UID>-media.json
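For example, fetching and pretty-printing the media.json for the example UID above from the command line:

curl -s https://s3.amazonaws.com/static.ucldc.cdlib.org/media_json/70d7f57a-db0b-4a1a-b089-cce1cc289c9e-media.json | jq .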

To run the "deep harvest" process:

  • Log into the Collection Registry and look up the collection
  • Select Queue Nuxeo deep harvest from the drop-down.
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put on a different queue), you can use the following command syntax on the dsc-blackstar role account:

queue_deep_harvest.py [email protected] high-stage 26959

If there are problems with individual items, you can do a deep harvest for just one object by its Nuxeo path. You need to log onto dsc-blackstar and sudo to the hrv-stg role account. Then:

queue_deep_harvest_single_object.py "<path to asset, wrapped in quotes>"

e.g.

queue_deep_harvest_single_object.py "/asset-library/UCR/Manuscript Collections/Godoi/box_01/curivsc_003_001_005.pdf"

This will run 4 jobs: one to grab the files, one to create the jp2000 for access & IIIF, one to create thumbnails, and finally one to produce the media.json file.

This process will harvest metadata from Nuxeo into a resulting CouchDB record.

  • From the Collection Registry, select Queue harvest to CouchDB stage
  • You should then get a feedback message verifying that the collections have been queued
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put on a different queue), you can use the following command syntax on the dsc-blackstar role account:

queue_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943

This process will hit the URL referenced in isShownBy in the CouchDB record to derive a small preview image (used for the object landing page); that preview image is also used for thumbnails in search/browse and related item results.

  • From the Collection Registry, select Queue image harvest
  • You should then get a feedback message verifying that the collections have been queued
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put on a different queue), you can use the following command syntax on the dsc-blackstar role account:

queue_image_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943

If there are problems with individual items, you can run the process on a specific object (or multiple objects) by referencing the harvest ID. You need to log onto dsc-blackstar and sudo to the hrv-stg role account. Then:

python ~/bin/queue_image_harvest_for_doc_ids.py [email protected] normal-stage 23065--http://ark.cdlib.org/ark:/13030/k600073n

For multiple items, separate the harvest IDs with commas:

python ~/bin/queue_image_harvest_for_doc_ids.py [email protected] normal-stage 23065--http://ark.cdlib.org/ark:/13030/k600073n,23065--http://ark.cdlib.org/ark:/13030/k6057mxb

  • Query CouchDB stage using this URL syntax, replacing the key parameter with the key for the collection (a curl sketch follows this list): https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name_count?key="26189"
  • Results in the "value" parameter indicate the total number of metadata records harvested; this should align with the expected results.
  • If you have results, continue with QA checking the collection in CouchDB stage and Solr stage.
  • If there are no results, you will need to troubleshoot and re-harvest. See the "What to do when harvests fail" section for details.
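A minimal command-line version of the same count check (collection 26189 is the example key from above; any CouchDB credentials the endpoint may require are omitted here):

curl -s 'https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name_count?key="26189"' | jq '.rows[0].value'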

The objective of this part of the QA process is to ensure that source metadata (from a harvesting target) is correctly mapped through to CouchDB. The suggested method is to review 1) the source metadata (e.g., original MARC21 record, original XTF-indexed metadata*) vis-a-vis 2) a random sample of CouchDB results and 3) the metadata crosswalk. Things to check:

  • Verify if metadata from the source record was carried over into CouchDB correctly: did any metadata get dropped?
  • Verify the metadata mappings: was the mapping handled correctly, going from the source metadata through to CouchDB, as defined in the metadata crosswalk?
  • Verify if any needed metadata remediation was completed (as defined in the metadata crosswalk) -- e.g., were rights statuses and statements globally applied?
  • Verify DPLA/CDL required data values -- are they present? If not, we may need to go back to the data provider to supply the information -- or potentially supply it for them (through the Collection Registry)
  • Verify the data values used within the various metadata elements:
  • Do the data values look "correct" (e.g., for Type, data values are drawn from the DCMI Type Vocabulary)?
  • Any funky characters or problems with formatting of the data?
  • Any data coming through that looks like it may have underlying copyright issues (e.g., full-text transcriptions)?
  • Are there any errors or noticeable problems?

NOTE: To view the original XTF-indexed metadata for content harvested from Calisphere:

  • Go to Collection Registry, locate the collection that was harvested from XTF, and skip to the "URL harvest" field -- use that URL to generate a result of the XTF-indexed metadata (view source code to see raw XML)
  • Append the following to the URL, to set the number of results: docsPerPage=###

Required Data QA Views

The Solr update process checks for a number of fields and will reject records that are missing these required values.

Image records without a harvested image

Objects with a sourceResource.type value of 'image' without a stored image (no 'object' field in the record) are not put into the Solr index. This view identifies these objects in CouchDB.

https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object

The base view will report the total count of image-type records without harvested images. To see the counts per collection, add "?group=true" to the URL.

https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?group=true

To find the number for a given collection use the "key" parameter:

https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?key="<collection id>"

NOTE: the double quotes are necessary in the URL.

To see the ids of the records with this issue, turn off the reduce function:

https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?key="<collection id>"&reduce=false

Use the include_docs parameter to add the records to the view output:

https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?key="<collection id>"&reduce=false&include_docs=true
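Putting those parameters together, a hedged sketch that pulls just the document ids for one collection (26189 is the example collection id used elsewhere in this guide; credentials, if required, are omitted):

curl -s 'https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?key="26189"&reduce=false' | jq -r '.rows[].id'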
Records missing isShownAt
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/missing_isShownAt

As with the view above, you can add various parameters to get different information in the result.

Records missing title
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/missing_title

Querying CouchDB stage

  • Generate a count of all objects for a given collection in CouchDB: https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name_count?key="26189"
  • Generate a results set of metadata records for a given collection in CouchDB, using this URL syntax: https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_list/has_field_value/by_provider_name_wdoc?key="10046"&field=originalRecord.subject&limit=100. Each metadata record in the results set will have a unique ID (e.g., 26094--00000001). This can be used for viewing the metadata within the CouchDB UI.
  • Parameters:
  • field: Optional. Limit the display output to a particular field.
  • key: Optional. Limits by collection, using the Collection Registry numeric ID.
  • limit: Optional. Sets the number of results
  • originalRecord: Optional. Limit the display output to a particular metadata field; specify the CouchDB data element (e.g., title, creator)
  • include_docs="true": Optional. Will include complete metadata record within the results set (JSON output)
  • value: Optional. Search for a particular value, within a results set of metadata records from a particular collection. Note: exact matches only!
  • group=true: Group the results by key
  • reduce=false: do not count up the results, display the individual result rows
  • To generate a results set of data values within a particular element (e.g., Rights), for metadata records from all collections: https://harvest-stg.cdlib.org/couchdb/ucldc/_design/qa_reports/_view/sourceResource.rights_value?limit=100&group_level=2
  • To check if there are null data values within a particular element (e.g., isShownAt), for metadata records from all collections: https://harvest-stg.cdlib.org/couchdb/ucldc/_design/qa_reports/_view/isShownAt_value?limit=100&group_level=2&start_key=["__MISSING__"]
  • To view a result of raw CouchDB JSON output: https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name?key="26094"&limit=1&include_docs=true
  • Consult the CouchDB guide for additional query details.
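As a concrete example, the rights-value report above can be fetched and summarized from the command line (a sketch; any required credentials are omitted):

curl -s 'https://harvest-stg.cdlib.org/couchdb/ucldc/_design/qa_reports/_view/sourceResource.rights_value?limit=10&group_level=2' | jq '.rows[] | {key, value}'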

Viewing metadata for an object in CouchDB stage

  • Log into CouchDB
  • In the "Jump to" box, enter the unique ID for a given metadata record (e.g., 26094--00000001)
  • You can now view the metadata in either its source format or mapped to CouchDB fields

This process will update the Solr stage index with records from CouchDB stage:

  • From the Collection Registry, select Queue sync from CouchDB stage to Solr stage
  • You should then get a feedback message verifying that the collections have been queued
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put on a different queue), you can run queue_sync_to_solr.py on the dsc-blackstar role account:

queue_sync_to_solr.py [email protected] high-stage 26943

You can view the raw results in Solr stage; this may be helpful to verify mapping issues or discrepancies in data between CouchDB and Solr stage.

  • Log into Solr to conduct queries
  • Generate a count of all objects for a given collection in Solr: https://harvest-stg.cdlib.org/solr/dc-collection/query?q=collection_url:%22https://registry.cdlib.org/api/v1/collection/26559/%22
  • Generate counts for all collections: https://harvest-stg.cdlib.org/solr/dc-collection/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.query=true&facet.field=collection_url&facet.limit=-1&facet.sort=count
  • Consult the Solr guide for additional query details.
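For example, the per-collection count query above can be run with curl, using jq to extract just the number of records found (authentication, if the Solr endpoint requires it, is omitted):

curl -s 'https://harvest-stg.cdlib.org/solr/dc-collection/query?q=collection_url:%22https://registry.cdlib.org/api/v1/collection/26559/%22&rows=0&wt=json' | jq '.response.numFound'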

You can preview the Solr stage index in the Calisphere UI at http://calisphere-data.cdlib.org/.

To immediately view results, you can QA the Solr stage index on your local workstation, following these steps ("Windows install"). In the run.bat configuration file, point UCLDC_SOLR_URL to https://harvest-stg.cdlib.org/solr_api.

Follow the steps outlined above for starting and managing worker instances -- but once logged into blackstar, use sudo su - hrv-prd to create workers in the production environment.

Once the CouchDB and Solr stage data looks good and the collection looks ready to publish to Calisphere, start by syncing CouchDB stage to the CouchDB production:

  • In the Registry, edit the collection and check the box "Ready for publication" and save the collection.
  • Then select Queue Sync to production CouchDB for collection from the action on the Collection page.

If you need more control of the process (i.e. to put on a different queue), you can run queue_sync_couchdb_collection.py on the dsc-blackstar role account:

./bin/queue_sync_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26681/

This process will update the Solr production index ("candidate Solr index") with records from CouchDB production:

  • From the Collection Registry, select Queue sync from CouchDB production to Solr production
  • You should then get a feedback message verifying that the collections have been queued
  • You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.

If you need more control of the process (i.e. to put on a different queue), you can run queue_sync_to_solr.py on the dsc-blackstar role account:

queue_sync_to_solr.py [email protected] high-stage 26943

You can preview the candidate Solr index in the Calisphere UI at http://calisphere-test.cdlib.org/.

To immediately view results, you can QA the candidate Solr index on your local workstation, following these steps ("Windows install"). In the run.bat configuration file, point UCLDC_SOLR_URL to https://harvest-prd.cdlib.org/solr_api.

Generate and review a QA report for the candidate Solr index, following these steps. The main QA report in particular summarizes differences in item counts in the candidate Solr index compared with the current production index.

This section describes how to update an Elastic Beanstalk configuration to point to a new candidate Solr index stored on S3. This will update the specified Calisphere front-end web application so that it points to the data from Solr:

TODO: add how to run the QA spreadsheet generating code

Removing items or collections involves deleting records from the CouchDB stage and production environments as well as the Solr stage and production environments, and then updating the Elastic Beanstalk.

To remove individual items:

  • Log into CouchDB stage; search for and delete the specific item record. Repeat the process on CouchDB production -or-
  • Create a list of the CouchDB identifiers for the items, and add them to a file (one per line). Then run delete_couchdb_id_list.py with the file as input: delete_couchdb_id_list.py <file with list of ids> (a sketch appears below)
  • From the Collection Registry, select Queue sync from CouchDB stage to Solr stage and Queue sync from CouchDB production to Solr production
  • Update Elastic Beanstalk with the updated Solr index

To remove entire collections:

  • From the Collection Registry, select Queue deletion of documents from CouchDB stage, Queue deletion of documents from Solr stage, Queue deletion of documents from CouchDB production, and Queue deletion of documents from Solr production
  • Update the Collection Registry entry, setting "Ready to publish" to "None" -- and change the harvesting endpoint to "None"
  • Update Elastic Beanstalk with the updated Solr index
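A minimal sketch of the id-list approach above (the ids and file name are hypothetical examples; run this from the appropriate role account on dsc-blackstar):

# one CouchDB id per line
cat > ids_to_delete.txt <<EOF
26094--00000001
26094--00000002
EOF

delete_couchdb_id_list.py ids_to_delete.txt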

If you need more control of the process (i.e. to put the jobs on a different queue), you can use the following commands on the dsc-blackstar role account:

./bin/delete_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26275
./bin/queue_delete_solr_collection.py [email protected] high-stage 26275

We've had a couple of cases where a collection was deleted from the pre-production index for re-harvesting, but the re-harvest was not successful and we want to publish a new index anyway. This script takes the documents from one Solr index and pushes them to another Solr index. It can be run from the hrv-stg or hrv-prd account. In both cases, the source documents come from solr.calisphere.org, which drives Calisphere. Depending on which role account you are in, it will update either the "stage" or the pre-production Solr index.

  • Log onto the appropriate role account (hrv-stg or hrv-prd). That will set the context for the originating solr index, from which you want to push data.
  • run sync_solr_documents.py <collection id> to push the data to the target solr index.

The snsatnow wrapper script may be used to run any long-running process. It will background and detach the process so you can log out. When the process finishes or fails, a message will be sent to the dsc_harvesting_report Slack channel.

To use the script, just prepend it to your command invocation:

snsatnow <cmd> --<options> <arg1> <arg2>....

NOTE: if your command has arguments that are surrounded by quotes (") you'll need to escape those by putting a backslash (\) in front of them.
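For example, wrapping one of the worker-start commands from this guide (note the escaped quotes around the extra-vars value):

snsatnow ansible-playbook ~/code/ansible/start_ami.yml --extra-vars=\"count=1\"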

When new harvester or ingest code is pushed, you need to create a new generation of worker machines to pick up the new code:

  • First, terminate the existing machines: ansible-playbook -i ~/code/ec2.py ~/code/ingest_deploy/ansible/terminate_workers.yml <--limit=10.60.?.?>
  • Then go through the worker create process again, creating and provisioning machines as needed.

The solr index is run in a docker container. To make changes to the schema or other configurations, you need to recreate the docker image for the container.

NOTE: THIS NEEDS UPDATING. To do so in the ingest environment, run ansible-playbook -i hosts solr_docker_rebuild.yml. This will remove the Docker container & image, rebuild the image, remove the index files, and run a new container based on the latest Solr config in https://github.com/ucldc/solr_api/.

You will then have to run /usr/local/solr-update.sh --since=0 to reindex the whole couchdb database.

Tracing back to the document source in CouchDB is critical to diagnose problems with data and images.

Get the Solr id for the item. This is the part of the URL after the /item/ without the final slash. For https://calisphere.org/item/32e2220c1e918cf17f0597d181fa7e3e/, the Solr ID is 32e2220c1e918cf17f0597d181fa7e3e.

Now go to the Solr index of interest and query for the id: https://harvest-stg.cdlib.org/solr/dc-collection/select?q=32e2220c1e918cf17f0597d181fa7e3e&wt=json&indent=true

Find the harvest_id_s value, in this case "26094--LAPL00050887". Then plug this into CouchDB for the ucldc database: https://harvest-stg.cdlib.org/couchdb/ucldc/26094--LAPL00050887 (or with the UI - https://harvest-stg.cdlib.org/couchdb/_utils/document.html?ucldc/26094--LAPL00050887)
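A hedged sketch of the same lookup from the command line (the Solr id and harvest id are the examples above; any required credentials are omitted, and harvest_id_s may come back as a list depending on the schema):

# find the harvest_id_s for a given Solr id
curl -s 'https://harvest-stg.cdlib.org/solr/dc-collection/select?q=32e2220c1e918cf17f0597d181fa7e3e&wt=json' | jq '.response.docs[0].harvest_id_s'

# fetch the corresponding CouchDB document
curl -s 'https://harvest-stg.cdlib.org/couchdb/ucldc/26094--LAPL00050887' | jq .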

Sometimes you may need to create one or more "High Stage" workers, for example if the normal stage worker queue is very full and you need to run a harvest job without waiting for the queue to empty. The process is performed from the hrv-stg command line as follows.

Creating high stage workers:

  • Log onto blackstar and run sudo su - hrv-stg
  • Create one or more worker machines just as you would in the "developer" (see below) process: snsatnow ansible-playbook ~/code/ansible/create_worker.yml --extra-vars=\"count=1\" .
  • After the workers are created, run get_worker_info.sh and compare the results with the currently provisioned/running "normal" workers in the RQ dashboard to determine the IP addresses of the new workers.
  • Provision with the --extra-vars="rq_work_queues=['high-stage']" switch to make the new workers high-stage workers. Also use the --limit switch with the IP addresses of the new workers from the step above so that only the new workers are provisioned. Do NOT re-provision running workers! Full example command: snsatnow ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml --limit=10.60.29.* --extra-vars="rq_work_queues=['high-stage']"

Running jobs on high stage workers:

  • From hrv-stg command line, run the following command to queue a high-stage harvest, providing your EMAIL address and collection # to harvest for XXXXX where appropriate: ./bin/queue_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/XXXXX/
  • To queue an image harvest or solr sync, replace the first part of the command above with ./bin/queue_image_harvest.py or ./bin/queue_sync_to_solr.py, respectively
  • More commands can be found in the bin folder by running ls ./bin from command line. Most are self-explanatory from the script titles. Again, just replace the first part of the full command above with ./bin/other-script-here.py as needed
  • When finished harvesting, terminate the high-stage workers as you would any other. EX: ansible-playbook -i ~/code/ec2.py ~/code/ansible/terminate_workers.yml <--limit=10.60.?.?>

First, take a look at the RQ Dashboard; part of the error message will be shown there. Hopefully this identifies the error so you can fix whatever is going wrong.

Common worker error messages

  • Worker forcibly terminated, while job was in-progress: ShutDownImminentException('shut down imminent (signal: %s)' % signal_name(signum), info) ShutDownImminentException: shut down imminent (signal: SIGALRM)
  • (More forthcoming...)

Checking the logs

If you need more extensive access to logs, they are all stored on the AWS CloudWatch platform. The /var/local/rqworker & /var/local/akara directories contain the logs from the worker processes & the Akara server on a worker instance. The logs are named with the instance id & IP address, e.g. ingest-stage-i-127546c9-10.60.28.224

From the blackstar machine you can access the logs on CloudWatch using the scripts in the bin directory

First, get the IPs of the worker machines by running get_worker_info.sh

Then for the worker whose logs you want to examine: get_log_events_for_rqworker.sh <worker ip>

This outputs the rqworker log; for the Akara log, use: get_log_events_for_akara.sh <worker ip>

If you need to go back further in the log history, for now ask Mark.

If this doesn't get you enough information, you can ssh to a worker instance and watch the logs in real time: tail -f /var/local/rqworker/worker.log or tail -f /var/local/akara/logs/error.log.

Verify if and what files were harvested, for a given object

Use the following script in the ucldc_api_data_quality/reporting directory (following the steps at https://github.com/mredar/ucldc_api_data_quality/tree/master/reporting) to generate a report for the object. The <ID> value is the id for the object, as reflected in Solr or CouchDB (e.g., 6d445613-63d3-4144-a530-718900676db9):

python get_couchdata_for_calisphere_id.py <ID>

Example report result:

===========================================================================
Calisphere/Solr ID: 6d445613-63d3-4144-a530-718900676db9
CouchDB ID: 26883--6d445613-63d3-4144-a530-718900676db9
isShownAt: https://calisphere.org/item/6d445613-63d3-4144-a530-718900676db9
isShownBy: https://nuxeo.cdlib.org/Nuxeo/nxpicsfile/default/6d445613-63d3-4144-a530-718900676db9/Medium:content/
object: ce843950f622d303b83256add5b19d34
preview: https://calisphere.org/clip/500x500/ce843950f622d303b83256add5b19d34
===========================================================================

The URL in isShownBy reflects the endpoint to a file, which is used by the harvesting code (the "Queue image harvest to CouchDB stage" action) to derive a small preview image (used for the object landing page); that preview image is also used for thumbnails in search/browse and related item results. Note that you can also verify isShownBy by looking up the object in CouchDB.

The URL in preview points to the resulting preview image.

No preview image, or thumbnail in search/browse results? (Nuxeo and non-Nuxeo sources)

Double-check the URL in the preview field. If there's no functional URL in preview (value indicates "None"), then a file was not successfully harvested. To fix:

For Nuxeo-based objects, the following logic is baked into the process for harvesting preview and thumbnail images:

  1. If object has an image at the parent level, use that. Otherwise, if component(s) have images, use the first one we can find
  2. If an object has a PDF or video at parent level, use the image stashed on S3
  3. Otherwise, return "None"

No access files, preview image (for PDF or video objects), or complex object component thumbnails? (Nuxeo only)

The media.json output created through the "deep harvest" process references URLs that link back to the source files in Nuxeo. If there's no media.json file -- or if the media.json has broken or missing URLs -- then the files could not be successfully harvested. To fix:

  • Try re-running the deep harvest for a single object to regenerate the media.json and files.
  • Check the media.json again, to confirm that it was generated and/or its URLs resolve to files. If AOK, sync from CouchDB stage to Solr stage

Persistent older versions of access files, preview image (for PDF or video objects), or complex object component thumbnails? (Nuxeo only)

If older versions of the files don't clear out after re-running a deep harvest, you can manually queue the image harvest to force it to re-fetch images from Nuxeo. First, clear the "CouchDB ID -> image url" cache, then run the image harvest with the flag --get_if_object (i.e. get the image even if the "object" field already exists in the CouchDB document):

  • Log onto blackstar & sudo su - hrv-stg
  • Run python ~/bin/redis_delete_harvested_images_script.py <collection_id>. This will produce a file called delete_image_cache-<collection_id> in the current directory.
  • Run redis.sh < delete_image_cache-<collection_id>. This will clear the cache of previously harvested URLs.
  • Run python ~/bin/queue_image_harvest.py [email protected] normal-stage https://registry.cdlib.org/api/v1/collection/<collection_id>/ --get_if_object

Development

ingest_deploy

Ansible, Packer and Vagrant project for building and running the ingest environment on AWS and locally. Currently only the Ansible is working; a working local Vagrant version is still needed.

Dependencies

Tools

Addendum: Building new worker images - For Developers

  • Log onto blackstar and run sudo su - hrv-stg
  • To start some worker machines (bare ec2 spot instances), run: snsatnow ansible-playbook ~/code/ansible/create_worker.yml --extra-vars=\"count=1\" .
    • For on-demand instances, run: snsatnow ansible-playbook ~/code/ansible/create_worker_ondemand.yml --extra-vars=\"count=1\"
    • For an extra large (and costly!) on-demand instance (e.g., m4.2xlarge, m4.4xlarge), run: ansible-playbook ~/code/ansible/create_worker_ondemand.yml --extra-vars="worker_instance_type=m4.2xlarge" . If you create an extra large instance, make sure you terminate it after the harvesting job is completed!

The count=## parameter will set the number of instances to create. For harvesting one small collection you can set this to count=1. To re-harvest all collections, you can set this to count=20. For anything in between, use your judgment.

With the snsatnow wrapper, the results will be messaged to the dsc_harvesting_report Slack channel when the instances are created.

The default instance creation will attempt to get instances from the "spot" market so that it is cheaper to run the workers. Sometimes the spot market price can get very high and the spot instances won't work. You can check the pricing by issuing the following command on blackstar, hrv-stg user:

aws ec2 describe-spot-price-history --instance-types m3.large --availability-zone us-west-2c --product-description "Linux/UNIX (Amazon VPC)" --max-items 2

Our spot bid price is set to $0.133, which is the current (as of 2016-08-03) on-demand price. If the history of spot prices is greater than that, or if you see large fluctuations in the pricing, you can request an on-demand instance instead by running the ondemand playbook (NOTE: the backslash \ is required):

snsatnow ansible-playbook ~/code/ansible/create_worker_ondemand.yml --extra-vars=\"count=3\"

If you restarted a stopped instance, you don't need to do the steps below

Once this is done and the stage worker instances are in a state of "running", you'll need to provision the workers: this installs the required software and configuration, and starts Akara and the worker processes that listen on the specified queues:

  • Log onto blackstar and run sudo su - hrv-stg
  • To provision the workers, run: snsatnow ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml
  • Wait for the provisioning to finish; this can take a while, 5-10 minutes is not unusual. If the provisioning process stalls, use ctrl-C to end the process then re-do the ansible command.
  • Check the status of the harvesting process through the RQ Dashboard. You should now see the provisioned workers listed, and acting on the jobs in the queue. You will be able to see the workers running jobs (indicated by a "play" triangle icon) and then finishing (indicated by a "pause" icon).

Limiting provisioning by IP

If you already have provisioned worker machines running jobs, use --limit=<ip range> (e.g. --limit=10.60.22.*) or --limit=<ip>,<ip> (e.g. --limit=10.60.29.109,10.60.18.34) to restrict provisioning to the IPs of the newly created machines, so you don't reprovision a currently running machine. Otherwise, rerunning the provisioning will put the currently running workers in a bad state, and you will then have to log on to the worker and restart the worker process or terminate the machine. Example of a full command: snsatnow ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml --limit=10.60.29.*

AWS assigns unique subnets to the groups of workers you start, so in general, different generations of machines will be distinguished by their class C subnet. This makes the --limit parameter quite useful.

Provisioning workers to specific queues

By default, stage workers will be provisioned to a "normal-stage" queue. To provision them to a different queue -- e.g., "high-stage", use the following command with the --extra-vars parameter:

ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml --limit=10.60.22.123 --extra-vars="rq_work_queues=['high-stage']"

Creating new worker AMI

Once you have a new worker up and running with the new code, you need to create an image from it. From the appropriate environment:

ansible-playbook -i hosts ~/code/ansible/create_worker_ami.yml --extra-vars="instance_id=<running worker instance id>"

You can get the instance_id by running get_worker_info.sh.

This will produce a new image named _worker_YYYYMMDD. Note the image id that is returned by this command.

You now need to update the image id for the environment. Edit the file ~/code/ansible/group_vars/ (either stage or prod). Change the worker_ami value to the new image id, e.g.:

worker_ami: ami-XXXXXX

License

Copyright © 2015, Regents of the University of California All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the University of California nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

