
birdhouse-deploy's People

Contributors

aulemahal, chaamc, cjauvin, crim-jenkins-bot, cwcummings, dbyrns, dchandan, dependabot[bot], fmigneault, francisplt, huard, ldperron, matprov, mishaschwartz, nazim-crim, nikola-rados, tlvu, zeitsperre, zvax


birdhouse-deploy's Issues

Canarie Solr monitoring not working on staging system

Migration from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/207

This is because the staging system does not have the same data as the production system.

Could we change this query to some data that will minimally exist on all PAVICS deployments of all organizations, like Ouranosinc/pavics-sdi#147 https://github.com/Ouranosinc/PAVICS/blob/dd0eefb584724e720ccb3efa648f6b42d43d6f70/birdhouse/config/canarie-api/docker_configuration.py.template#L81-L88

This error is the only one that fails Canarie monitoring on the staging server:
(screenshot: 2020-01-09-132100, failing Canarie monitoring page)

If Canarie monitoring passes on staging servers, then we can have this https://github.com/Ouranosinc/PAVICS/issues/186.

This is a big win because we get a notification when a new deployment is broken on the staging system, so we can react faster and avoid breaking the production system.

[Feature] Add a bird that allows for loose coupling between components while keeping them in synch

This use case became obvious from the fact that a lot of components need to be aware and configured when a new user is created.
Magpie handles those users, but we don't want it to be tightly coupled with other components.
Magpie also has the ability to notify whoever registers a hook, but we don't want other components to be tightly coupled to the Magpie hook either.
Here comes Cowbird (a bird that lays in the nests of other birds), which will be able to register a hook in Magpie and do whatever is necessary for all the components in need when it is notified.
With this paradigm, we can do a lot of things:

  • When a user is created, workspaces will be created in geoserver, thredds and jupyterhub
  • When a user is deleted, a cleanup will be performed
  • When new data is added (via the ingestion service, a process output or a user notebook), a monitoring sub-system will be able to update the components in need.
  • For resources served by more than one service, a permissions synchronizer will be able to keep them aligned across services.

Here is a preliminary design:
(diagram: Cowbird components, preliminary design)

Free disk space problem due to docker images logging

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/152, discussion summarized below:

Containers that generate a lot of logging, for example Twitcher/Magpie on every request, cause the disk to fill up rapidly. The following config (or similar) should be added to containers in the docker-compose file.

    logging:
      driver: "json-file"
      options:
        max-size: "100m"

stdout/stderr logging in docker is by default placed under each container's folder:
du -h /var/lib/docker/containers/*/*.log | sort -h

We do have a few containers logging like crazy here; not clear if those logs are valuable. Maybe a variable max-size depending on each container's expectations/logging needs, ping @moulab88

# This is on Boreas
# du -sh */*.log
96.0K   01f8c3a7c20d234b023d73d798e56c20fa4a54cf9fffa12ff9f71b9cf9c74e18/01f8c3a7c20d234b023d73d798e56c20fa4a54cf9fffa12ff9f71b9cf9c74e18-json.log
16.0K   06b14f1163225843605288e5a8f203612f4dd3ae3761ff27713d6122333a030c/06b14f1163225843605288e5a8f203612f4dd3ae3761ff27713d6122333a030c-json.log
8.0K    0e1cc179a4114d7d10fde79c77cde728e6e37a6006e70b1f21120cbb673cf2c5/0e1cc179a4114d7d10fde79c77cde728e6e37a6006e70b1f21120cbb673cf2c5-json.log
* 155.0M  194af2ac8810d44c463c8da5b5942e48e3bf5485ccbeb47b21c03b8468780057/194af2ac8810d44c463c8da5b5942e48e3bf5485ccbeb47b21c03b8468780057-json.log
2.0M    1d2d23dc5d0dfb1049efdc9efa6e45386e2eea7e12227717e8ddca0935634f22/1d2d23dc5d0dfb1049efdc9efa6e45386e2eea7e12227717e8ddca0935634f22-json.log
32.0K   31814345a0d379063062b2015695ad5f6fb623c5f2c48619a85f8f0823398642/31814345a0d379063062b2015695ad5f6fb623c5f2c48619a85f8f0823398642-json.log
344.0K  353b6f48121b9d6ce6b42a1ed588455e91002225a9ad0e5626628db0f5e17865/353b6f48121b9d6ce6b42a1ed588455e91002225a9ad0e5626628db0f5e17865-json.log
436.0K  3d1a8e0c535ff110399b872d5cd759546c6f2c9356ee0f068841931ce785605f/3d1a8e0c535ff110399b872d5cd759546c6f2c9356ee0f068841931ce785605f-json.log
*** 19.4G   3f2f4c0457ccfd600fb48d245e323f58894ad791427af8387d95400766161747/3f2f4c0457ccfd600fb48d245e323f58894ad791427af8387d95400766161747-json.log
56.0K   3f92224708a09870c0dfa079ebae00461128880ed28ed23b3de9964d4c0a1b6f/3f92224708a09870c0dfa079ebae00461128880ed28ed23b3de9964d4c0a1b6f-json.log
40.0K   49312c398b724442922a8233458eb43f4832e66c1703193d20fe81ce6bd3551c/49312c398b724442922a8233458eb43f4832e66c1703193d20fe81ce6bd3551c-json.log
80.0K   54461d996891c145b912f5ee308ce625cc5367f59b319b42ea4fee603e58762d/54461d996891c145b912f5ee308ce625cc5367f59b319b42ea4fee603e58762d-json.log
4.0K    655a6a7b5978675c5d6c5382b9ee9e42d3e78432916545218f33f51ea7f91820/655a6a7b5978675c5d6c5382b9ee9e42d3e78432916545218f33f51ea7f91820-json.log
44.0K   65e0ac0df5c4da26eaf856b1677ef176ad31d1797697260c633bd8f0e79c1868/65e0ac0df5c4da26eaf856b1677ef176ad31d1797697260c633bd8f0e79c1868-json.log
18.7M   8084e627be8cc40cd30a5e03489c4b29e5190ec101429723038e45835d687f59/8084e627be8cc40cd30a5e03489c4b29e5190ec101429723038e45835d687f59-json.log
6.6M    97d2e7553bc6ee5fb3db183f11702c58d823544f6970d26e9e532f8cc1ade23d/97d2e7553bc6ee5fb3db183f11702c58d823544f6970d26e9e532f8cc1ade23d-json.log
76.0K   99fda25d27c8647732e3e3157d91b506f315ae2f591d8a53327271c7dce9bf2b/99fda25d27c8647732e3e3157d91b506f315ae2f591d8a53327271c7dce9bf2b-json.log
72.0K   a1e4f3b79fdae839f2e8f2367f11945872a14e8fc64393741d00b93335508a2f/a1e4f3b79fdae839f2e8f2367f11945872a14e8fc64393741d00b93335508a2f-json.log
216.0K  a5d35f52bd480a21c1768a90e616456eda307ec7ea05ce37ec63c34dff174850/a5d35f52bd480a21c1768a90e616456eda307ec7ea05ce37ec63c34dff174850-json.log
116.0K  c389703d90ba267d471f7f86f9f03c0ed4470a85fcaec64efc772478e888224f/c389703d90ba267d471f7f86f9f03c0ed4470a85fcaec64efc772478e888224f-json.log
** 551.4M  d237df0d3bb21c72dacd4098f8c2ac04737aac9357740638a06e11242098c077/d237df0d3bb21c72dacd4098f8c2ac04737aac9357740638a06e11242098c077-json.log
180.0K  dfa70f3379cda36dc1c76604b1f5f048e1232d66bb4c8e11ba568d8ea39b0e59/dfa70f3379cda36dc1c76604b1f5f048e1232d66bb4c8e11ba568d8ea39b0e59-json.log
48.0K   e3e758b108a81b06f7d5ed00124cc93066e470858aabff56840558ce22583328/e3e758b108a81b06f7d5ed00124cc93066e470858aabff56840558ce22583328-json.log
24.0K   e630ee7c268dad10d4e9c9d38cffa859595d1b4dee3620439cc2263084bfc815/e630ee7c268dad10d4e9c9d38cffa859595d1b4dee3620439cc2263084bfc815-json.log
1.1M    f36d34e11cd7732f277c2c1031705c745389cdc1a1bc9c4a5c78bb50e1b6c050/f36d34e11cd7732f277c2c1031705c745389cdc1a1bc9c4a5c78bb50e1b6c050-json.log
160.0K  f8dfbe7e3b15c710f6de5d650107be5c8e3eccb9a5772f4754ca165001581e04/f8dfbe7e3b15c710f6de5d650107be5c8e3eccb9a5772f4754ca165001581e04-json.log
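Complementary to the per-service compose snippet above, a host-wide default can also cap every container. A hedged sketch (the json-file driver keys are standard Docker; the sizes are placeholders and a daemon restart is needed):

    # Hedged sketch: default log rotation for all containers on the host, so a
    # container missing its own "logging:" section is still capped.
    sudo tee /etc/docker/daemon.json <<'EOF'
    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "100m",
        "max-file": "3"
      }
    }
    EOF
    # Note: the new default only applies to containers created after the restart.
    sudo systemctl restart docker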

The autodeploy mechanism should also be able to autodeploy itself

Right now, everything can be "autodeployed" except the autodeploy mechanism itself and its environment (cron, git).

This is annoying and breaks the autodeploy "workflow".

Users have additional manual steps to install the autodeploy directly on the host and to manually deploy any updates to it.

We also had to deal with multiple versions of git, see:

# * if on host with git version less than 2.3 (no support for GIT_SSH_COMMAND),
# have to configure your ~/.ssh/config with the following:
# Host github.com
# IdentityFile ~/.ssh/id_rsa_git_ssh_read_only
# UserKnownHostsFile /dev/null
# StrictHostKeyChecking no

All the above annoyances can technically be solved if we run this autodeploy cron job inside a Docker container as well. Enabling/disabling autodeploy can be configured in env.local, centralizing all configurations in the same place.
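A minimal sketch of what that could look like, assuming a containerized git and a plain re-apply of the stack; the image name, repo path and redeploy command are placeholders, not the actual implementation:

    #!/bin/sh
    # Hypothetical autodeploy check run from cron: fetch updates with a
    # containerized git (no git needed on the host), then re-apply the stack.
    REPO=/path/to/birdhouse-deploy

    # Pull the latest changes; abort if the update fails (unclean repo, network down, ...).
    docker run --rm -v "$REPO:/repo" alpine/git -C /repo pull --ff-only || exit 1

    # Re-apply the stack with whatever changed, exactly as an operator would.
    cd "$REPO/birdhouse" && ./pavics-compose.sh up -d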

Having a scheduler as part of PAVICS will also help share clean-up jobs that are useful for all PAVICS deployments, e.g. cleaning up the wps_outputs folders so problems like this do not happen again #4 (comment)

Merge deploy_data_specific_image script with the deploy-data script

The deploy_data_specific_image script found on the pavics-jupyter-base repo was initially implemented as a separate script from the deploy-data script.

It made more sense at first since the specific image script is required to run directly on the specific images, using .yaml config files stored on the images, as opposed to the deploy-data script from birdhouse, which runs on a generic image that has git and docker installed, using .env config files found directly on the birdhouse repo ( example of .env file )

Since both scripts are pretty similar in the kind of task they do, deploy_data_specific_image could be merged into the more generic deploy-data script.

To do (non-exhaustive, taken from this comment):

  • Being able to use yq either as a docker run (deploy-data) or as an installed binary (deploy_data_specific_image); a new switch will have to be added for that (see the sketch after this list).
  • Downloading the folders with svn (deploy_data_specific_image) instead of git pull (deploy-data); another new switch will be needed for that as well.
  • Determine whether we should add a switch between the svn and the git pull/rsync methods, or just settle on one of them. After implementing the deploy_data_specific_image script, some limitations were observed. For example, it didn't seem easy to use * tokens in the svn command, e.g. to download only the notebooks matching "*.ipynb". Also, the current deploy_data_specific_image script is not made to download single files for now, although that would not be hard to add as a feature. The script only supports downloading a full folder's content and sending it to the destination folder.
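A hedged sketch of the yq switch mentioned in the first item (the variable name and the yq docker invocation are assumptions):

    # Hypothetical switch: use a locally installed yq if requested, otherwise run
    # it through docker as deploy-data currently does.
    if [ "${YQ_USE_LOCAL:-false}" = "true" ]; then
      YQ="yq"
    else
      YQ="docker run --rm -i mikefarah/yq"   # reads the YAML document on stdin
    fi

    # Same call either way, e.g. extracting a (hypothetical) field from a config file:
    $YQ '.deploy.dest_dir' < config.yml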

In conclusion, it would be nice to improve this part, but it is not required for now, and not a priority.

Canarie API config uses config env vars incorrectly

Summary

The configuration defined in: config/canarie-api/docker_configuration.py.template is a template, and therefore is expected to be a configurable item per-organization. In that regard, many items are incorrectly defined.

Details

The configuration in the PLATFORM section uses defaults that enforce Ouranos and pavics.ouranos.ca references, support email, support URL, etc.
In other words, any other server that is not pavics.ouranos.ca still identifies itself as being pavics.ouranos.ca, which is just wrong. Other organizations that use the same config would incorrectly have Ouranos receive their server support requests.
The default "demo/dummy" configurations should not directly use the prod server values as they do in those cases.
If every organization needs to completely override the file to define their server, then there is not much point in even having a template file in the first place.

Furthermore, components defined in SERVICES instead use ${SUPPORT_EMAIL} globally.
This doesn't make sense, as individual components should have the email relevant to their own developer community and docs.
For example, Ouranos (or whichever override of SUPPORT_EMAIL gets defined) shouldn't be the support reference that responds to issues about a component such as Thredds. Conversely, when SUPPORT_EMAIL is overridden by organization X, issues specific to finch, raven, etc. should still be forwarded to Ouranos, not org. X, and therefore those settings should be hardcoded in that case.

Monitor all components in Canarie node, both public and internal url

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/172. Lots of discussion, summarized below.

The motivation was the need for some quick dashboard for the working state of all the components, not to get more stats.

Right now we are bypassing Twitcher, which is not real life; it's not what real users will experience.

This is ultra cheap to add and provides very fast and up-to-date (every minute) results. It's like an always-on sanity check that can quickly help debugging any connectivity issues between the components.

We do not intend to scale the canarie-api monitoring to become a full blown monitoring app.

Config updates for DB connections

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/100

I have found, in a few places in the Magpie, Twitcher, etc. repos, some traces of previously set variables to connect to the Postgres DBs of various birds. Most of the time, the 'stubs' are still the same values employed for usr/pwd connection in PAVICS, even if the configs are overridden.

I have replaced some values with actual 'stubs' as I saw them (mostly usr: / pwd: qwerty). Since they are overridden by PAVICS configs, the change is transparent when deployed.
BUT
Anyone could go through other public repo history, retrieve the credentials, and just connect to DBs.
We should update this repo's configs to change database credentials.

Suggested Changes:

  • move all following configs under PAVICS/birdhouse/env.local.example, replace templates elsewhere where needed, and update values in env.local of each bird. (ref issue #59)
  • POSTGRES_DB=pavics => project-api to match the bird/repo-specific naming convention used by all other birds creating a db. Will require a database rename. Would be less confusing to know which code is using this db.
    • PAVICS/birdhouse/config/postgres/docker-entrypoint-initdb.d/create-wps-databases.sql
  • POSTGRES_USER=pavics: not critical, but could be renamed
    (pavics is almost always the 'stub' value found in configs, but persists as the currently used value)
    • PAVICS/birdhouse/config/postgres/credentials.env
  • POSTGRES_PASSWORD to be changed completely (especially not qwerty)
    • PAVICS/birdhouse/config/postgres/credentials.env
    • PAVICS/birdhouse/config/postgres/magpie-credentials.env
  • MAGPIE_SECRET to something else than the obvious magpie
    • PAVICS/birdhouse/config/magpie/magpie.env.template
  • in each bird's env.local change MAGPIE_PW, MAGPIE_ADMIN_PW

note
Changing MAGPIE_SECRET will make all existing usr/pwd invalid because the AuthN cookie (auth_tkt) signing is done with its value. Users, service/resource permissions, group members, etc. will all need to be reassigned (OR maybe regenerating the user security_code will be sufficient, not tested).

Maybe some points became irrelevant with all the major refactoring since the introduction of vagrant.
Still relevant for the case of the magpie secret to update (if not already done), which would invalidate user login credentials.
relates to Ouranosinc/Magpie#229
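A hedged sketch of generating stronger replacement values (openssl is assumed to be available; the variable names come from the list above, and the generated values belong in env.local / credentials.env, not in version control):

    # Generate random replacements for the weak 'stub' credentials.
    export POSTGRES_PASSWORD="$(openssl rand -hex 24)"
    export MAGPIE_SECRET="$(openssl rand -hex 32)"
    export MAGPIE_ADMIN_PW="$(openssl rand -hex 16)"
    # Remember the note above: rotating MAGPIE_SECRET invalidates existing
    # auth_tkt cookies, so the change has to be planned.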

Tracking CHANGES

Sorry, I didn't think about that before bird-house/contributions-guideline (#156) was merged... I would like to amend something.

What came up and made it very obvious during PRs #152, #154, and probably more to come... is that it is not easy to know what changed between versions. There is effectively no easy way to retrieve full changelogs. With upcoming releases by tagged versions and different forks across organizations, these will become increasingly important.

Also, there are no specific guidelines about how changelogs should be tracked, so it is very free-for-all in the commits.
I know that @tlvu does very detailed descriptions in his PRs, but other commit/PR contributors don't (including myself).
I don't think that is the usual (natural?) way for most developers either, and I find the process of opening individual commits to read descriptions very time-wasting/frustrating when I want to find out when a change was introduced.

For this reason, I would like to propose having a standard CHANGES file, as most repos do, with a proper listing of versions and relevant changes each time. The guideline would also add a requirement to list whatever was changed by each PR, making the info still easy to retrieve within individual merge commits.

The biggest advantages would be:

  1. a standard format defined in CHANGES, normalizing how everyone reports them
  2. a common place to look for them, making it easy to retrieve what happened between each version by looking through a single file.
  3. easy auto-doc in ReadTheDocs, allowing the CHANGES to also be exposed in the published/official documentation references
  4. possibility to add bumpversion and other useful tools if the need arises

Thanks for feedback and discussions.
Whomever it concerns: @huard @tlvu @dbyrns @matprov
(tag any others as needed)

[Feature] Sync cache invalidation between Magpie/Twitcher

Description

Because permissions are applied onto Magpie but resolved for request access by Twitcher, any already active caching of older requests will not be immediately synchronized on the Twitcher side right after a permission update on the Magpie side (caches are not shared, and therefore not invalidated on permission update).

Following requests to the same resources (within the caching expiration delay) will hit the cached response produced by the access during the previous requests. The new permission resolution will not be effective until the cache expires.
For example:

GET /twitcher/ows/proxy/thredds/<file-ref>   
    => denied, response is now cached
PUT /magpie/users|groups/{id}/resources/<file-ref-id>/permissions  
    => <file-ref> made 'allowed' for previous request user
GET /twitcher/ows/proxy/thredds/<file-ref>  (cached)  
    =>  denied instead of allowed
... wait ~20s (caching delay) ...
GET /twitcher/ows/proxy/thredds/<file-ref>  
    =>  allowed

Note that the effect goes both ways, i.e. after removing access to a resource, requests will keep being allowed until the delay is reached.

Edit:

For the invalidation to take effect on the Twitcher side, there are 3 methods:

  1. Explicitly set cache-control: no-cache header during the next file access request to enforce reset of cache.
    This works, but should be done only on the first call after permission update, otherwise all caching performance advantages are lost over many repeated access to the same resource.
  2. Share the Magpie/Twitcher caches via file references to allow them to invalidate each other.
    • To do this, we need to have a volume mounted by both images, and have both of them use cache.type = file + corresponding paths for cache.lock_dir and cache.data_dir in their INI configs (see the sketch after this list).
    • More updates to Magpie/Twitcher will be needed to correctly invalidate caches of type file (only memory is tested, and they are not hashed the same way for selective invalidation - e.g.: invalidate only ACL for resource X / user Y, etc.).
  3. (best) Employ redis or mongodb extension with beaker to synchronize caches.
    https://beaker.readthedocs.io/en/latest/modules/redis.html
    https://beaker.readthedocs.io/en/latest/modules/mongodb.html
    Not only would this allow syncing or invalidating caches across Magpie/Twitcher, but also between individual workers of Magpie and Twitcher. At the moment, each worker holds its own in-memory cache depending on which requests it received, meaning cached/non-cached responses won't be the same (and won't expire at the same time) depending on which worker processes the request and when it last received one.
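A hedged sketch of option 2 above; the mount point is arbitrary and the exact Magpie/Twitcher INI layout may differ, only the cache.* keys come from that item:

    # Shared directory, mounted into both the magpie and twitcher containers
    # (e.g. as a common docker volume) so their beaker file caches see each other.
    CACHE_DIR=/shared/magpie-twitcher-cache
    mkdir -p "$CACHE_DIR"

    # Same three settings appended to both INI files (file names are placeholders).
    for ini in magpie.ini twitcher.ini; do
      {
        echo "cache.type = file"
        echo "cache.lock_dir = $CACHE_DIR/lock"
        echo "cache.data_dir = $CACHE_DIR/data"
      } >> "$ini"
    done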

References

Flesh out README

  • Paragraph about what this is and how it is used
  • List supported components
  • Acknowledge contributions

Document ambiguous configuration variables in env.local

Reminder for when I'll work on the Documentation.

Found by @nikola-rados.

The first part of env.local

export SSL_CERTIFICATE="/path/to/ssl/cert.pem" # path to the nginx ssl certificate, path and key bundle
export PAVICS_FQDN="hostname.domainname" # Fully qualified domain name of this Pavics installation
export DOC_URL="https://www.example.com/" # URL where /doc gets redirected
export MAGPIE_SECRET=itzaseekrit
export MAGPIE_ADMIN_USERNAME=admin
export MAGPIE_ADMIN_PASSWORD=qwerty
export TWITCHER_PROTECTED_PATH=/twitcher/ows/proxy
export PHOENIX_PASSWORD=phoenix_pass
export PHOENIX_PASSWORD_HASH=sha256:123456789012:1234567890123456789012345678901234567890123456789012345678901234
export TOMCAT_NCWMS_PASSWORD=ncwmspass
export [email protected]
export CMIP5_THREDDS_ROOT=birdhouse/CMIP5/CCCMA
export JUPYTERHUB_ADMIN_USERS="{'admin'}" # python set syntax
export CATALOG_USERNAME=admin-catalog
export CATALOG_PASSWORD=qwerty
export CATALOG_THREDDS_SERVICE=thredds
export POSTGRES_PAVICS_USERNAME=postgres-pavics
export POSTGRES_PAVICS_PASSWORD=postgres-qwerty
export POSTGRES_MAGPIE_USERNAME=postgres-magpie
export POSTGRES_MAGPIE_PASSWORD=postgres-qwerty
has very few docs.

Some are pretty self-explanatory, for example MAGPIE_ADMIN_USERNAME and MAGPIE_ADMIN_PASSWORD, but some require more documentation, for example MAGPIE_SECRET.
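As an example of the kind of inline documentation that would help (the comment wording below is a best-effort reading to be verified while writing the docs; the MAGPIE_SECRET/cookie-signing note comes from the DB credentials issue above):

    # Hypothetical documented excerpt of env.local
    export MAGPIE_SECRET=itzaseekrit            # secret used to sign the auth cookie (auth_tkt);
                                                # changing it invalidates existing user logins
    export JUPYTERHUB_ADMIN_USERS="{'admin'}"   # python set syntax, as already noted above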

Properly validate required ENV VARs that have defaults

A lot of variables defined in default.env are required by the stack, for example MAGPIE_VERSION, some jupyter-related items, etc. They are not necessarily validated though, because it is assumed that default.env is always present and source'd.

Although most deployments use default.env directly, standalone operations that do not require default.env explicitly (because they override everything in env.local) suffer from having to add a dummy default.env, since those scripts force-source it.

An improvement would be to have an "if exists" check prior to sourcing default.env wherever it is referenced, skipping it if missing. But since the variables defined in there would no longer be guaranteed to exist, the deployment would be prone to errors when forgetting to define one of the needed variables.

A validation check (similar to what is done for VARS items) should be added to explicitly verify that required variables, whether defaulted or overridden, are properly set before deployment.
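A hedged sketch combining both ideas (optional sourcing plus an explicit check); the variable list and file names are illustrative only:

    # Source default.env only if it is present (standalone setups can skip it).
    [ -f default.env ] && . ./default.env

    # Then explicitly validate that everything the stack needs is defined,
    # regardless of whether it came from default.env or env.local.
    for var in MAGPIE_VERSION PAVICS_FQDN SSL_CERTIFICATE; do
      if [ -z "$(eval echo \$$var)" ]; then
        echo "ERROR: required variable $var is not set" >&2
        exit 1
      fi
    done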

Incorrect URL endpoints on birds

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/98, lots of discussions, too much to summarize. Just action items.

Most (if not all?) GetCapabilities responses indicate URL endpoints that are invalid on any bird-server (pluvier, colibri, etc.) for multiple bird-services (flyingpigeon, catalog, etc.). Seems like the twitcher-url wasn't properly set up when they were deployed.

Ex:

GET
https://pluvier.crim.ca:/twitcher/ows/proxy/flyingpigeon/wps?service=WPS&version=1.0.0&request=GetCapabilities

returns:

[...]
<ows:HTTP>
     <ows:Get xlink:href="https://pluvier.crim.ca/ows/proxy/flyingpigeon"/>
     <ows:Post xlink:href="https://pluvier.crim.ca/ows/proxy/flyingpigeon"/>
</ows:HTTP>                               
[...]

but the real URL is: (see twitcher)

https://pluvier.crim.ca/twitcher/ows/proxy/flyingpigeon

Action item:

Yes. MagpieAdapter would need to add purl here:
https://github.com/Ouranosinc/Magpie/blob/master/magpie/adapter/magpieservice.py#L76-L78
Then function replace_caps_url in Twitcher should handle the rest further down the processing chain.

Kubernetes?

Questions

At PCIC we are in the middle of building some new servers that will be capable of running Kubernetes. I feel like there is potential to have birdhouse run on k8s by taking the same infrastructure that is in place for docker and translating it. There are tools like kompose and kustomize that could help with this process. That being said, I imagine there would be a fair amount of work to get something like this operational.
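For reference, kompose can at least produce a starting point from the existing compose file (a rough first pass, not a working deployment; actual compose file names in the repo may differ):

    # Hedged first experiment: convert the compose file into raw Kubernetes
    # manifests, then iterate on them (or wrap them with kustomize).
    kompose convert -f docker-compose.yml -o k8s/
    kubectl apply -k k8s/   # once a kustomization.yaml has been written by hand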

I just wanted to spark this conversation and see what people think. I'm not sure if this is something the wider birdhouse audience would be interested in but at PCIC we are eager to try it out.

Productize Notos: actual go-live

Replay Ansible playbook.
Transfer data.
Point pavics.ouranos.ca to notos.
Shutdown PAVICS stack on old Boreas to ensure everything "pavics.ouranos.ca" comes from the new Notos.
Jenkins end-to-end test.

Automated LetsEncrypt SSL certificate renewal

Looks like the certbot client installed by the distro package has automated renewal by default: https://certbot.eff.org/docs/using.html?highlight=renew#automated-renewals

We are using certbot the docker way, not using any distro package, so this PAVICS docker stack stays self-contained and more or less distro-agnostic.

We'll just have to reverse engineer the distro package and reproduce it in this PAVICS docker stack here.

LetsEncrypt SSL certs expire every 3 months, so this is a good manual annoyance to get rid of.
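A hedged sketch of what that could look like (the certbot/certbot image and the renew subcommand are standard certbot; the mount paths, webroot mode and proxy reload step are assumptions about our setup):

    # Run periodically (e.g. twice a day, like the distro packages do); certbot
    # only actually renews when the certificate is close to expiry.
    docker run --rm \
      -v /path/to/letsencrypt:/etc/letsencrypt \
      -v /path/to/webroot:/webroot \
      certbot/certbot renew --webroot -w /webroot --quiet

    # Reload the proxy so nginx picks up the renewed certificate.
    docker exec proxy nginx -s reload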

Feature Template and auto-labels

Another amendment related to bird-house/contributions-guideline (#156) that I discovered while writing this issue and the one in #157.

There is no template for issues!
One was introduced for PRs, but issues remain a blank canvas, which makes descriptions of new ones either miss important details or follow no format that would ease retrieval of those details.

A few things that I thought could be important to define in the issue template:

  • type of issue (bug, proposition, feature, etc.?)
  • who does it concern ? (a list of tagged users impacted by those changes)
  • related issues, PR, components?
  • related build, environment, server URL, test suite

I think we should also take better advantage of labels.
There are some currently employed (bug, documentation, enhancement, question), but they are added more or less based on good will.
Issue templates allow defining different template formats for different types of issues, and those can be associated with the corresponding labels.

Same thing for the default title format. Some issues have [Feature], others don't... The template can resolve this problem and make it more uniform, and the guidelines should reflect the use of those templates.

@huard @tlvu @dbyrns @matprov
I would like to have your feedback and additional suggestions about other details I could have missed that might be needed in the templates.

Idle Jupyter servers cause open file limit exhaustion

This morning, none of the docker commands were responding because we had exhausted the open file limit of the user that runs the PAVICS platform (the user that does ./pavics-compose.sh up -d).

$ ./pavics-compose.sh ps
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: Resource temporarily unavailable

The immediate work-around is to increase the soft and hard nofile limits for the corresponding user in the /etc/security/limits.conf file and apply the new limit immediately with ulimit -n NEW_LIMIT (ulimit -n shows the current effective limit). Find the current number of open files with sudo lsof -u $USER | wc -l and put something higher than that in the limits.conf file. Reference: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/chap-oracle_9i_and_10g_tuning_guide-setting_shell_limits_for_the_oracle_user
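A hedged sketch of that work-around (the username and the numbers are placeholders):

    # Check how many files the PAVICS user currently has open...
    sudo lsof -u pavics | wc -l

    # ...and raise the soft/hard nofile limits above that value.
    cat <<'EOF' | sudo tee -a /etc/security/limits.conf
    pavics  soft  nofile  65536
    pavics  hard  nofile  65536
    EOF

    # Apply immediately in the current shell (new logins pick up limits.conf).
    ulimit -n 65536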

Then, using a different user than the one that usually runs PAVICS (because that user cannot do anything else), restart the docker daemon (sudo systemctl restart docker). Back as the regular user running PAVICS, if the containers have problems restarting, destroy them and re-create them from scratch (./pavics-compose.sh down && sleep 10 && ./pavics-compose.sh up -d).

A more permanent solution than increasing the limit each time we burst it is to set up culling of idle Jupyter servers, as described here: https://discourse.jupyter.org/t/jupyterhub-doesnt-kill-processes-and-threads-when-notebooks-are-closed-or-user-log-out/2244/2

We also need to add monitoring of the open file limit so we are alerted in advance of near exhaustion, to avoid having to restart the entire docker daemon.

Ping @moulab88 @tlogan2000 if you guys have anything to add.

Edit:

  • add command to see current effective limit and set new limit immediately
  • add reference to redhat docs

Document config for self-signed SSL

Reminder for when I'll work on the Documentation.

@nikola-rados had trouble with this. It's the env var VERIFY_SSL in env.local.

Should also refer to LetsEncrypt or PageKite to get a real SSL cert, because support for self-signed SSL is not 100% bulletproof.

:bulb: [Feature] SMTP configuration for services that need it

Description

The following features require an SMTP configuration to be defined in order to add more functionalities:

Variables should be made available for easy overrides, to be pushed into the configuration files of each relevant service.
Empty defaults should be used to keep those features inactive (current behaviour).
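A hedged sketch of what those overrides could look like in env.local (the variable names are hypothetical, except SMTP_SERVER which already appears in the monitoring settings; empty defaults keep the features off):

    # Hypothetical env.local excerpt -- actual variable names to be decided.
    export SMTP_SERVER=""          # e.g. "smtp.example.com:25"; empty = features stay disabled
    export SMTP_FROM_ADDRESS=""    # sender address used by services that send mail
    export SMTP_USER=""            # credentials, if the relay requires authentication
    export SMTP_PASSWORD=""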

Concerned Organizations

Any that wants to make use of those features.

Productize Notos: Benchmark CPU and IO and document procedure and results

For comparing the performance of different filesystems (XFS vs ZFS) and for detecting regressions in later upgrades to RedHat 8 or RockyLinux (the alternative to CentOS, which does not exist anymore).

Also to test performance later, when the disk is full, versus the current empty disk.

Also apply the same performance tests to the old existing Boreas and compare with Notos (performance is the entire reason why we purchased the new Notos).
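A hedged sketch of a repeatable benchmark run; fio and sysbench are one possible tool choice (not something already decided) and the parameters are placeholders. Keeping the exact command lines in the procedure doc is what will make the XFS vs ZFS, Boreas vs Notos and empty vs full disk comparisons meaningful:

    # Sequential write and random read throughput on the target filesystem.
    fio --name=seqwrite --directory=/data/benchmark --rw=write --bs=1M \
        --size=4G --numjobs=1 --direct=1 --group_reporting
    fio --name=randread --directory=/data/benchmark --rw=randread --bs=4k \
        --size=4G --numjobs=4 --direct=1 --group_reporting

    # Simple CPU benchmark for regression tracking across OS upgrades.
    sysbench cpu --threads="$(nproc)" run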

Remove Twitcher's direct connection to Magpie's DB

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/101

MagpieAdapter should be used to obtain service/resource permissions via a Magpie API request.
Twitcher is actually making direct calls to Magpie's postgres db (in Magpie/adapter/MagpieOWSSecurity) which goes against the whole purpose of using Magpie to manage those permissions.

Since pavics-0.3.10, credentials to connect to Magpie's postgres db have changed, which highlighted this fact. Since they are now required in Twitcher, they have been added (see 069b51abb2d0ebdc90e1e15f4ce03d15ecb8b91b / PR #99 ).

Notebook autodeploy wipes all existing deployed notebooks if GitHub is down

It should check that the source exists before deleting the destination.

It should even fail earlier, when the GitHub download fails.

notebookdeploy START_TIME=2020-04-23T10:01:01-0400
++ mktemp -d -t notebookdeploy.XXXXXXXXXXXX
+ TMPDIR=/tmp/notebookdeploy.ICk70Vto2LaE
+ cd /tmp/notebookdeploy.ICk70Vto2LaE
+ mkdir tutorial-notebooks
+ cd tutorial-notebooks
+ wget --quiet https://raw.githubusercontent.com/Ouranosinc/PAVICS-e2e-workflow-tests/master/downloadrepos
+ chmod a+x downloadrepos
chmod: cannot access ‘downloadrepos’: No such file or directory
+ wget --quiet https://raw.githubusercontent.com/Ouranosinc/PAVICS-e2e-workflow-tests/master/default_build_params
+ wget --quiet https://raw.githubusercontent.com/Ouranosinc/PAVICS-e2e-workflow-tests/master/binder/reorg-notebooks
+ chmod a+x reorg-notebooks
chmod: cannot access ‘reorg-notebooks’: No such file or directory
+ wget --quiet --output-document - https://github.com/Ouranosinc/PAVICS-e2e-workflow-tests/archive/master.tar.gz
+ tar xz

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
+ ./downloadrepos
/etc/cron.hourly/PAVICS-deploy-notebooks: line 63: ./downloadrepos: No such file or directory
+ ./reorg-notebooks
/etc/cron.hourly/PAVICS-deploy-notebooks: line 64: ./reorg-notebooks: No such file or directory
+ mv -v 'PAVICS-e2e-workflow-tests-master/notebooks/*.ipynb' ./
mv: cannot stat ‘PAVICS-e2e-workflow-tests-master/notebooks/*.ipynb’: No such file or directory
+ rm -rfv PAVICS-e2e-workflow-tests-master
+ rm -rfv downloadrepos default_build_params reorg-notebooks
+ TMP_SCRIPT=/tmp/notebookdeploy.ICk70Vto2LaE/deploy-notebook
+ cat
+ chmod a+x /tmp/notebookdeploy.ICk70Vto2LaE/deploy-notebook
+ docker pull bash
Using default tag: latest
latest: Pulling from library/bash
Digest: sha256:febb3d74f41f2405fe21b7c7b47ca1aee0eda0a3ffb5483ebe3423639d30d631
Status: Image is up to date for bash:latest
+ docker run --rm --name deploy_tutorial_notebooks -u root -v /tmp/notebookdeploy.ICk70Vto2LaE/deploy-notebook:/deploy-notebook:ro -v /tmp/notebookdeploy.ICk70Vto2LaE/tutorial-notebooks:/tutorial-notebooks:ro -v /data/jupyterhub_user_data:/notebook_dir:rw --entrypoint /deploy-notebook bash
+ cd /notebook_dir
+ rm -rf tutorial-notebooks/WCS_example.ipynb tutorial-notebooks/WFS_example.ipynb tutorial-notebooks/WMS_example.ipynb tutorial-notebooks/WPS_example.ipynb tutorial-notebooks/catalog_search.ipynb tutorial-notebooks/dap_subset.ipynb tutorial-notebooks/esgf-compute-api-examples-devel tutorial-notebooks/esgf-dap.ipynb tutorial-notebooks/finch-usage.ipynb tutorial-notebooks/hummingbird.ipynb tutorial-notebooks/opendap.ipynb tutorial-notebooks/pavics_thredds.ipynb tutorial-notebooks/raven-master tutorial-notebooks/rendering.ipynb tutorial-notebooks/subsetting.ipynb
+ cp -rv '/tutorial-notebooks/*' tutorial-notebooks
cp: can't stat '/tutorial-notebooks/*': No such file or directory
+ chown -R root:root tutorial-notebooks
+ set +x
removed directory: ‘/tmp/notebookdeploy.ICk70Vto2LaE/tutorial-notebooks’
removed ‘/tmp/notebookdeploy.ICk70Vto2LaE/deploy-notebook’
removed directory: ‘/tmp/notebookdeploy.ICk70Vto2LaE’

notebookdeploy finished START_TIME=2020-04-23T10:01:01-0400
notebookdeploy finished   END_TIME=2020-04-23T10:02:12-0400
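A hedged sketch of the guard being suggested (URLs and directory names follow the log above; the exact layout of the real cron script may differ):

    # Make the downloads fatal and verify the new content exists *before*
    # wiping the currently deployed notebooks.
    set -e   # any failed download aborts the run, leaving the old notebooks untouched

    wget --quiet https://raw.githubusercontent.com/Ouranosinc/PAVICS-e2e-workflow-tests/master/downloadrepos
    wget --quiet --output-document - \
      https://github.com/Ouranosinc/PAVICS-e2e-workflow-tests/archive/master.tar.gz | tar xz

    # Refuse to touch the destination unless the freshly downloaded source is there.
    if ! ls PAVICS-e2e-workflow-tests-master/notebooks/*.ipynb > /dev/null 2>&1; then
      echo "ERROR: GitHub download incomplete, keeping existing notebooks" >&2
      exit 1
    fi
    # ...only now proceed with the rm/cp of the deployed tutorial-notebooks dir.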

Update to TDS 4.6.16.1

According to https://github.com/Unidata/thredds/releases

All TDS administrators are strongly encouraged to move to 4.6.16.1. For more information about the 4.6 release and how to upgrade from previous versions, please see Upgrading to TDS 4.6.

I don't see any bug fix or features that look critical for us.

Jupyterhub container creation fails when username contains a dot

On our CRIM instances, something fails when the username contains a dot (ex: david.caron)

These 2 folders are created in /data/jupyterhub_user_data: david_2Ecaron and david.caron, and both folders are empty. Not sure if the issue is with this configuration or with the magpie-jupyterhub authentication.

I didn't have time to investigate more.

Migrate existing pieces to new pluggable component architecture, part 1 clean up the unused pieces

The following are scheduled to be removed:

  frontend:
    image: pavics/pavics-frontend:1.0.5
  project-api:
    image: pavics/pavics-project-api:0.9.0
  catalog:
    image: pavics/pavics-datacatalog:0.6.11
  solr:
    image: pavics/solr:5.2.1
  ncwms2:
    image: pavics/ncwms2:2.0.4

Includes finding and removing all references from:

  • the configs of those components above, also removing the unused directory birdhouse/config/ncops
  • canarie-api, magpie and other remaining components, if any references exist
  • notebooks tested by Jenkins only. Other notebooks are not included.

Migrate existing pieces to new pluggable component architecture, part 2 the simple pieces

The goal, for the CCDP deployment on the CRIM side which needs only a subset of PAVICS (nginx proxy, finch, weaver), is to use the actual PAVICS stack and avoid duplicating code (it currently has finch-compose.sh, which is more or less a copy of pavics-compose.sh with the same env.local and config/ dir mechanism).

The CCDP deployment has log monitoring that PAVICS does not have, but PAVICS has system metrics monitoring and autodeploy that CCDP does not have.

Basically both sides have new developments that are compatible with each other and can mutually benefit each other.

The simple standalone pieces are thredds, geoserver and jupyterhub.

Includes:

  • moving many configs from default.env and env.local.example into each component's own default.env. Everything related should live within the component, with nothing leaking outside.
  • plugging into magpie for permissions, into the proxy for exposing the service, and anything else. These mechanisms already exist, hopefully no new ones are required.

Backward compat:

  • Creation of a DEFAULT_COMPONENTS list that will activate these by default so this re-architecture is completely transparent
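A hedged sketch of what that backward-compatible default could look like (DEFAULT_COMPONENTS is the name proposed above; the component names and the EXTRA_COMPONENTS override are illustrative):

    # Hypothetical excerpt of default.env: components that used to be hardcoded
    # stay enabled unless a deployment explicitly trims the list.
    export DEFAULT_COMPONENTS="proxy magpie twitcher thredds geoserver jupyterhub finch"

    # env.local could then extend (or override) the list per deployment, e.g.:
    # export EXTRA_COMPONENTS="weaver monitoring"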

Add a Docker image cleaner as a component

The local Docker image cache on the host becomes bigger and bigger as new versions of components are released.
Also, unused image versions keep using space even though we don't use them anymore.

We therefore need a component (optional or not) which periodically cleans up the Docker images.

One possible solution would be to use docker-gc, as it is what we currently use to clean up a big image history and keep only the latest versions.

See sudo DRY_RUN=1 MINIMUM_IMAGES_TO_SAVE=1 ./docker-gc

It would need to be configured over here, with an appropriate schedule parameter: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/components/scheduler/config.yml.template

Testing this would only require creating dummy image tags on IaC, then checking, after the task has been triggered, that the old tags for a single image are gone.
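A hedged sketch of the recurring task itself (the spotify/docker-gc image and env vars match the command shown above; how the scheduler component wraps it still has to be worked out):

    # Dry run first to see what would be removed; drop DRY_RUN for real cleanups.
    docker run --rm \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -e DRY_RUN=1 \
      -e MINIMUM_IMAGES_TO_SAVE=1 \
      spotify/docker-gc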

Build and deploy intake catalogs

Build and deploy intake catalogs from pavics-vdb

  • Should it be triggered on merge or on tag?
  • Where should the catalogs live? pavics.ouranos.ca/catalogs/? pavics.ouranos.ca/intake?

Intake catalogs are our interim solution until STAC is more mature. So maybe we should reserve /catalog for the future.

:bug: [BUG]: CERTIFICATE_VERIFY_FAILED error with canarie-api monitoring

Summary

Using a LetsEncrypt SSL certificate, the canarie-api monitoring page https://lvupavicsdev.ouranos.ca/canarie/node/service/status fails with a bunch of CERTIFICATE_VERIFY_FAILED errors, even though all the services are working properly.

This is most likely due to https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/ meaning the openssl library in the proxy container is too old.

This bug only impacts PAVICS deployments using a LetsEncrypt certificate, and only if the canarie-api monitoring matters. Otherwise all other services are still working fine (Jenkins runs all pass: http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1253/console).

Ouranos production has been switched to use another SSL certificate provider as a work-around.

@moulab88 FYI

Details

(screenshot: 2021-09-30, Ouranos - Node Service status page showing the errors)

How to reproduce

$ docker exec proxy python -c "import requests; requests.request('GET', 'https://lvupavicsmaster.ouranos.ca/geoserver'
> )"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 433, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)

Productize Notos: Clean up existing data on Boreas

Boreas currently has 56TB of data, but the max storage size for the new Notos is 55TB!

Cleanup candidates: Thredds, Geoserver, Jupyter user data, backups, ...

The ideal would be to reduce to 35TB (65% of the max 55TB).

This issue involves many people, to decide on the various types of data to clean up.

Replace ncWMS2 by THREDDS WMS service

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/167; it had lots of discussion, summarized below.

Find where in the platform links to ncWMS2 are used and replace them with links to the THREDDS WMS service.

Related to https://github.com/Ouranosinc/PAVICS/issues/108

More testing is needed to be confident that the THREDDS WMS can replace ncWMS2, both in terms of integration with the rest of the components and in terms of performance, before we pull the plug on ncWMS2.

Need to add some monitoring and notification to the automated deployment system and PAVICS in general

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/140

Automated deployment was triggered but not performed on boreas because of

++ git status -u --porcelain
+ '[' '!' -z '?? birdhouse/old_docker-compose.override.yml_18062019' ']'
+ echo 'ERROR: unclean repo'
ERROR: unclean repo
+ exit 1

We need some system to monitor the logs and send notifications if there are any errors. This log-file error monitoring and notification can later be generalized to watch any system, so each system is not forced to reimplement monitoring and notification.

This problem has triggered this issue https://github.com/Ouranosinc/PAVICS/issues/176

There are basically 4 types of monitoring that I think we need:

  • Monitor system-wide resource usage (CPU, ram, disk, I/O, processes, ...): we already have this one

  • Monitor per container resource usage (CPU, ram, disk, I/O, processes, ...): we already have this one

  • Monitor application logs for errors and unauthorized access: we do not have this one. Useful for proactively catching errors instead of waiting for users to log bugs.

  • Monitor the end-to-end workflow of all deployed applications to ensure they are working properly together (no config errors): we partially have this one, with tutorial notebooks being tested by Jenkins daily. Unfortunately not all apps have associated notebooks, or the notebooks exist but have problems being run non-interactively under Jenkins.

Monitoring settings in env.local are required even if the monitoring component is not enabled

Given

$ALERTMANAGER_ADMIN_EMAIL_RECEIVER
$SMTP_SERVER

Users are forced to set those vars even if the Monitoring component is not enabled.

Furthermore, the following data volumes are also created even when the Monitoring component is not enabled.

docker volume create prometheus_persistence # metrics db
docker volume create grafana_persistence # dashboard and config db
docker volume create alertmanager_persistence # storage

We need a way for each component to inject its various requirements without hardcoding them in the pavics-compose.sh script or anywhere else outside of its own component folder.

Found out by @nikola-rados.
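A hedged sketch of the kind of per-component hook that could replace the hardcoded volume creation (the hook file name and the discovery mechanism are purely hypothetical):

    # Hypothetical: pavics-compose.sh runs an optional pre-up hook for each
    # enabled component instead of hardcoding volume creation for all of them.
    for component in $ENABLED_COMPONENTS; do
      hook="./components/$component/pre-compose-up.sh"
      [ -x "$hook" ] && "$hook"
    done

    # components/monitoring/pre-compose-up.sh would then be the only place knowing:
    #   docker volume create prometheus_persistence   # metrics db
    #   docker volume create grafana_persistence      # dashboard and config db
    #   docker volume create alertmanager_persistence # storage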

Configurable Alerting threshold to avoid false positive alerts

Too many false positive alerts will decrease the importance and usefulness of each alert. Alerts should not feel like spam.

Not all deployments are equal, so they cannot all have the same thresholds.

We basically have to use variables instead of hardcoding threshold values in this file: https://github.com/bird-house/birdhouse-deploy/blob/cf055ccf7c06439de36aba5686758ff37db2d864/birdhouse/components/monitoring/prometheus.rules.template

Now that each component can have its own localized defaults (PR #64), we can implement this. It was left out during the initial implementation since it would literally have spammed the global defaults file.
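A hedged sketch of how the thresholds could become overridable; the variable names and default values are placeholders, and the component-level default.env mechanism from PR #64 is what would hold the defaults:

    # Hypothetical monitoring component default.env: deployments override these
    # in env.local without touching prometheus.rules.template itself.
    export ALERT_DISK_USAGE_PERCENT="${ALERT_DISK_USAGE_PERCENT:-90}"
    export ALERT_MEMORY_USAGE_PERCENT="${ALERT_MEMORY_USAGE_PERCENT:-85}"
    # prometheus.rules.template would then reference the variables, e.g.
    #   expr: ... > ${ALERT_DISK_USAGE_PERCENT}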
