
hpc-cluster-lsf

Repository for the HPC Cluster LSF implementation files. Learn more

Deployment with Schematics on IBM Cloud

Initial configuration:

$ cp sample/configs/hpc_workspace_config.json config.json
$ ibmcloud iam api-key-create trl-tyos-api-key --file ~/.ibm-api-key.json -d "my api key"
$ cat ~/.ibm-api-key.json | jq -r ."apikey"
# copy your apikey
$ vim config.json
# paste your apikey and set entitlements for LSF
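The copy-and-paste step above can also be scripted. A minimal sketch, assuming a key file with the same shape as ~/.ibm-api-key.json (a temp file with a placeholder key stands in here so the snippet is self-contained):

```shell
# Sketch: extract the API key non-interactively with jq. The temp file below
# simulates ~/.ibm-api-key.json; on a real workstation, point jq at that file.
key_file=$(mktemp)
printf '{"apikey":"EXAMPLE-KEY"}' > "$key_file"
apikey=$(jq -r '.apikey' "$key_file")
echo "$apikey"
```

The extracted value can then be pasted (or templated) into config.json.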

You also need to generate a GitHub token if you use a private GitHub repository.

Deployment:

$ ibmcloud schematics workspace new -f config.json --github-token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$ ibmcloud schematics workspace list
Name               ID                                            Description   Status     Frozen
hpcc-lsf-test       us-east.workspace.hpcc-lsf-test.7cbc3f6b                     INACTIVE   False

OK
$ ibmcloud schematics apply --id us-east.workspace.hpcc-lsf-test.7cbc3f6b
Do you really want to perform this action? [y/N]> y

Activity ID b0a909030f071f51d6ceb48b62ee1671

OK
$ ibmcloud schematics logs --id us-east.workspace.hpcc-lsf-test.7cbc3f6b
...
 2021/04/05 09:44:54 Terraform apply | Apply complete! Resources: 14 added, 0 changed, 0 destroyed.
 2021/04/05 09:44:54 Terraform apply |
 2021/04/05 09:44:54 Terraform apply | Outputs:
 2021/04/05 09:44:54 Terraform apply |
 2021/04/05 09:44:54 Terraform apply | sshcommand = ssh -J [email protected]  [email protected]
 2021/04/05 09:44:54 Command finished successfully.
 2021/04/05 09:45:00 Done with the workspace action

OK
$ ssh -J [email protected]  [email protected]

$ ibmcloud schematics destroy --id us-east.workspace.hpcc-lsf-test.7cbc3f6b

Steps to validate the cluster post provisioning

  • Log in to the controller node using the SSH command
  • Check the existing machines that are part of the cluster using the bhosts or bhosts -w command; this should show all instances
  • Submit a job that spins up 1 VM and sleeps for 10 seconds: bsub -n 1 sleep 10. Once submitted, the command line shows a job ID
  • Check the status of the job using bjobs -l
  • Check the log file under /opt/ibm/lsf/log/ibmgen2... for any messages around provisioning of the new machine
  • Continue to check the status of the nodes using the lshosts, bjobs, or bhosts commands
  • To test multiple VMs, run multiple sleep jobs: bsub -n 10 sleep 10. This will create 10 VMs, and each job will sleep for 10 seconds
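The repeated status checks above can be wrapped in a small polling helper. A sketch, where wait_for is a hypothetical helper name (demonstrated with a stand-in command so the snippet runs anywhere; on the cluster you would poll something like bjobs for a DONE state):

```shell
# Hypothetical helper: poll a command until its output matches a pattern,
# e.g. wait_for "bjobs <jobID>" "DONE" on the controller node.
wait_for() {
  cmd=$1; pattern=$2; tries=${3:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if eval "$cmd" | grep -q "$pattern"; then
      echo "matched"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "timed out"
  return 1
}

# Stand-in command so the sketch is self-contained:
wait_for "echo DONE" "DONE"
```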

Storage Node and NFS Setup

The storage node is configured as an NFS server and the data volume is mounted to the /data directory which is exported to share with LSF cluster nodes.

Steps to validate NFS Storage node

1. To validate that the NFS storage is set up and exported correctly
  • Login to the storage node using SSH (ssh -J [email protected] [email protected])
  • The below command shows that the data volume, /dev/vdd, is mounted to /data on the storage node.
# df -k | grep data
/dev/vdd       104806400 1828916 102977484   2% /data
  • The command below shows that /data is exported as an NFS shared directory.
# exportfs -v
/data         	10.242.66.0/23(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
  • On the NFS clients (the LSF cluster nodes in this case), the /data directory on the NFS server is mounted to the local directory /mnt/data.
# df -k | grep data
10.242.66.4:/data 104806400  1828864 102977536   2% /mnt/data

The command above shows that the local directory, /mnt/data, is mounted to the remote /data directory on the NFS server, 10.242.66.4.

For ease of use, we create a soft link, /home/lsfadmin/shared, pointed to /mnt/data. The data that needs to be shared across the cluster can be placed in /home/lsfadmin/shared.

/home/lsfadmin>ls -l
total 0
lrwxrwxrwx. 1 root root 9 May 27 14:52 shared -> /mnt/data
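The mount check above can be scripted instead of eyeballed. A sketch that parses df-style output with awk; the variable below reuses the sample line shown earlier so the snippet is self-contained, while on a live node you would pipe df -k directly:

```shell
# Confirm that /mnt/data is served by the NFS server. The sample line stands
# in for real output; on a cluster node use:  df -k | awk '$NF == "/mnt/data"'
df_output='10.242.66.4:/data 104806400  1828864 102977536   2% /mnt/data'
echo "$df_output" | awk '$NF == "/mnt/data" {print $1}'
```

An empty result would mean the NFS share is not mounted at /mnt/data.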
2. Steps to validate whether the clients are able to write to the NFS storage
  • Login to the controller as shown in the ssh_command output
  • Submit a job to write the host name to the /home/lsfadmin/shared directory on the NFS server
$ bsub sh -c 'echo $HOSTNAME > /home/lsfadmin/shared/hello.txt'
  • Wait until the job is finished and then run the command to confirm the hostname is written to the file on the NFS share
$ bjobs
$ cat /home/lsfadmin/shared/hello.txt
ibm-gen2host-10-241-0-21  # worker hostname
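The write check can be simulated locally to see what the job does. A sketch, where a temp directory stands in for the NFS share and the hostname command replaces the $HOSTNAME variable so the snippet runs outside the cluster:

```shell
# Local simulation of the write check: the real job runs via bsub on a worker
# node and writes into /home/lsfadmin/shared; a temp dir stands in here.
shared=$(mktemp -d)
hostname > "$shared/hello.txt"   # mirrors: bsub sh -c 'echo $HOSTNAME > .../hello.txt'
cat "$shared/hello.txt"
```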
3. Steps to validate Spectrum Scale integration
  • Login to the Scale storage node using SSH. (ssh -J [email protected] [email protected]; details are available in the logs output under the key spectrum_scale_storage_ssh_command)
  • The below command shows the GPFS cluster setup on the Scale storage node.
# /usr/lpp/mmfs/bin/mmlscluster
  • The below command shows the file systems mounted and the number of nodes on which they are mounted.
# /usr/lpp/mmfs/bin/mmlsmount all
  • The below command shows the file system attributes. This command can be used to validate the inode size (in bytes).
# /usr/lpp/mmfs/bin/mmlsfs all -i
  • Login to the controller node using SSH. (ssh -J [email protected] [email protected])
  • The below command shows the GPFS cluster setup on the compute nodes. This should contain the controller, controller-candidate, and worker nodes.
# /usr/lpp/mmfs/bin/mmlscluster
  • Create a file on the mount point path (e.g., /gpfs/fs1) and verify on other nodes that the file can be accessed.
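The cross-node check in the last step can be sketched as follows; a temp directory stands in for the /gpfs/fs1 mount point so the snippet is self-contained, while on the cluster you would write on one node and read on another:

```shell
# Sketch of the cross-node check: write a marker file on the GPFS mount point
# and read it back, as you would from another node in the cluster.
mountpoint_dir=$(mktemp -d)   # stands in for /gpfs/fs1
echo "scale-check" > "$mountpoint_dir/verify.txt"
cat "$mountpoint_dir/verify.txt"
```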
4. Steps to access the Scale cluster GUI
  • Open a new command line terminal.
  • Run the following command to access the storage cluster:
# ssh -L 22443:localhost:443 -J root@{FLOATING_IP_ADDRESS} root@{STORAGE_NODE_IP_ADDRESS}
  • Replace STORAGE_NODE_IP_ADDRESS with the storage node IP address associated with hpc-pc-scale-storage-0, which you gathered earlier, and FLOATING_IP_ADDRESS with the floating IP address that you identified.

  • Open a browser on the local machine and go to https://localhost:22443. You will get a self-signed SSL certificate warning from your browser the first time that you access this URL.

  • Enter the login credentials that you set up when you created your workspace to access the Spectrum Scale GUI.

Accessing the compute cluster

  • Open a new command line terminal.

  • Run the following command to access the compute cluster:

# ssh -L 21443:localhost:443 -J root@{FLOATING_IP_ADDRESS} root@{COMPUTE_NODE_IP_ADDRESS}
  • Replace COMPUTE_NODE_IP_ADDRESS with the IP address associated with hpc-pc-primary-0, which you gathered earlier, and FLOATING_IP_ADDRESS with the floating IP address that you identified.

  • Open a browser on the local machine and go to https://localhost:21443. You will get a self-signed SSL certificate warning from your browser the first time that you access this URL.

  • Enter the login credentials that you set up when you created your workspace to access the Spectrum Scale GUI.

Terraform Documentation

Requirements

Name Version
http 3.0.1
ibm 1.41.0

Providers

Name Version
http 3.0.1
ibm 1.41.0
null n/a
template n/a

Modules

Name Source Version
compute_nodes_wait ./resources/scale_common/wait n/a
invoke_compute_playbook ./resources/scale_common/ansible_compute_playbook n/a
invoke_remote_mount ./resources/scale_common/ansible_remote_mount_playbook n/a
invoke_storage_playbook ./resources/scale_common/ansible_storage_playbook n/a
login_ssh_key ./resources/scale_common/generate_keys n/a
permission_to_lsfadmin_for_mount_point ./resources/scale_common/add_permission n/a
prepare_spectrum_scale_ansible_repo ./resources/scale_common/git_utils n/a
remove_ssh_key ./resources/scale_common/remove_ssh n/a
schematics_sg_tcp_rule ./resources/ibmcloud/security n/a
storage_nodes_wait ./resources/scale_common/wait n/a

Resources

Name Type
ibm_is_dedicated_host.worker resource
ibm_is_dedicated_host_group.worker resource
ibm_is_floating_ip.login_fip resource
ibm_is_instance.controller resource
ibm_is_instance.controller_candidate resource
ibm_is_instance.login resource
ibm_is_instance.spectrum_scale_storage resource
ibm_is_instance.storage resource
ibm_is_instance.worker resource
ibm_is_public_gateway.mygateway resource
ibm_is_security_group.login_sg resource
ibm_is_security_group.sg resource
ibm_is_security_group_rule.egress_all resource
ibm_is_security_group_rule.ingress_all_local resource
ibm_is_security_group_rule.ingress_tcp resource
ibm_is_security_group_rule.ingress_vpn resource
ibm_is_security_group_rule.login_egress_tcp resource
ibm_is_security_group_rule.login_egress_tcp_rhsm resource
ibm_is_security_group_rule.login_egress_udp_rhsm resource
ibm_is_security_group_rule.login_ingress_tcp resource
ibm_is_security_group_rule.login_ingress_tcp_rhsm resource
ibm_is_security_group_rule.login_ingress_udp_rhsm resource
ibm_is_subnet.login_subnet resource
ibm_is_subnet.subnet resource
ibm_is_volume.nfs resource
ibm_is_vpc.vpc resource
ibm_is_vpn_gateway.vpn resource
ibm_is_vpn_gateway_connection.conn resource
null_resource.delete_schematics_ingress_security_rule resource
http_http.fetch_myip data source
ibm_iam_auth_token.token data source
ibm_is_dedicated_host_profiles.worker data source
ibm_is_image.image data source
ibm_is_image.scale_image data source
ibm_is_image.stock_image data source
ibm_is_instance_profile.controller data source
ibm_is_instance_profile.login data source
ibm_is_instance_profile.spectrum_scale_storage data source
ibm_is_instance_profile.storage data source
ibm_is_instance_profile.worker data source
ibm_is_region.region data source
ibm_is_ssh_key.ssh_key data source
ibm_is_volume_profile.nfs data source
ibm_is_vpc.existing_vpc data source
ibm_is_vpc.vpc data source
ibm_is_zone.zone data source
ibm_resource_group.rg data source
template_file.controller_user_data data source
template_file.login_user_data data source
template_file.metadata_startup_script data source
template_file.storage_user_data data source
template_file.worker_user_data data source

Inputs

Name Description Type Default Required
TF_PARALLELISM Parallelism/concurrent operations limit. Valid values are between 1 and 256, both inclusive. Learn more. string "250" no
TF_VERSION The version of the Terraform engine that's used in the Schematics workspace. string "1.1" no
TF_WAIT_DURATION wait duration time set for the storage and worker node to complete the entire setup string "180s" no
api_key This is the IBM Cloud API key for the IBM Cloud account where the IBM Spectrum LSF cluster needs to be deployed. For more information on how to create an API key, see Managing user API keys. string n/a yes
cluster_prefix Prefix that is used to name the IBM Spectrum LSF cluster and IBM Cloud resources that are provisioned to build the IBM Spectrum LSF cluster instance. You cannot create more than one instance of the lsf cluster with the same name. Make sure that the name is unique. string "hpcc-lsf" no
dedicated_host_enabled Set to true to use dedicated hosts for compute hosts (default: false). Note that lsf still dynamically provisions compute hosts at public VSIs and dedicated hosts are used only for static compute hosts provisioned at the time the cluster is created. The number of dedicated hosts and the profile names for dedicated hosts are calculated from worker_node_min_count and dedicated_host_type_name. bool false no
dedicated_host_placement Specify 'pack' or 'spread'. The 'pack' option will deploy VSIs on one dedicated host until full before moving on to the next dedicated host. The 'spread' option will deploy VSIs in round-robin fashion across all the dedicated hosts. The second option should result in mostly even distribution of VSIs on the hosts, while the first option could result in one dedicated host being mostly empty. string "spread" no
hyperthreading_enabled Setting this to true will enable hyper-threading in the worker nodes of the cluster (default). Otherwise, hyper-threading will be disabled. Note: Only a value of true is supported for this release. See this FAQ for an explanation of why that is the case. bool true no
image_name Name of the custom image that you want to use to create virtual server instances in your IBM Cloud account to deploy the IBM Spectrum LSF cluster. By default, the automation uses a base image with additional software packages documented here. If you would like to include your application-specific binary files, follow the instructions in Planning for custom images to create your own custom image and use that to build the IBM Spectrum LSF cluster through this offering. string "hpcc-lsf10-scale5131-rhel84-2-0-5" no
login_node_instance_type Specify the VSI profile type name to be used to create the login node for Spectrum LSF cluster. Learn more. string "bx2-2x8" no
ls_entitlement Entitlement file content for Spectrum LSF license scheduler. string "LS_Standard 10.1 () () () () 18b1928f13939bd17bf25e09a2dd8459f238028f" no
lsf_entitlement Entitlement file content for core Spectrum LSF software. string "LSF_Standard 10.1 () () () pa 3f08e215230ffe4608213630cd5ef1d8c9b4dfea" no
lsf_license_confirmation If you have confirmed the availability of the Spectrum LSF license for a production cluster on IBM Cloud or if you are deploying a non-production cluster, enter true. Note: Failure to comply with licenses for production use of software is a violation of the IBM International Program License Agreement. string n/a yes
management_node_count Number of management nodes. This is the total number of management and management candidates. Enter a value in the range 1 - 3. number 2 no
management_node_instance_type Specify the virtual server instance profile type to be used to create the management nodes for the Spectrum LSF cluster. For choices on profile types, see Instance profiles. string "bx2-4x16" no
resource_group Resource group name from your IBM Cloud account where the VPC resources should be deployed. For additional information on resource groups, see Managing resource groups. string "Default" no
scale_compute_cluster_filesystem_mountpoint Compute cluster (accessingCluster) file system mount point. The accessingCluster is the cluster that accesses the owningCluster. For more information, see Mounting a remote GPFS file system. string "/gpfs/fs1" no
scale_compute_cluster_gui_password Password for the compute cluster GUI. Note: The password should be at least 8 characters and must have one number, one lowercase letter, one uppercase letter, and at least one unique character. The password should not contain the username. string "" no
scale_compute_cluster_gui_username GUI user to perform system management and monitoring tasks on compute cluster. Note: Username should be at least 4 characters, any combination of lowercase and uppercase letters. string "" no
scale_filesystem_block_size File system block size. Spectrum Scale supported block sizes (in bytes) include: 256K, 512K, 1M, 2M, 4M, 8M, 16M. string "4M" no
scale_storage_cluster_filesystem_mountpoint Spectrum Scale Storage cluster (owningCluster) Filesystem mount point. The owningCluster is the cluster that owns and serves the file system to be mounted. Mounting a remote GPFS file system. string "/gpfs/fs1" no
scale_storage_cluster_gui_password Password for the Spectrum Scale storage cluster GUI. Note: The password should be at least 8 characters and must have one number, one lowercase letter, one uppercase letter, and at least one unique character. The password should not contain the username. string "" no
scale_storage_cluster_gui_username GUI user to perform system management and monitoring tasks on storage cluster. Note: Username should be at least 4 characters, any combination of lowercase and uppercase letters. string "" no
scale_storage_image_name Name of the custom image that you would like to use to create virtual machines in your IBM Cloud account to deploy the Spectrum Scale storage cluster. By default, the automation uses a base image plus the Spectrum Scale software and any other software packages that it requires. If you would like, you can follow the instructions for Planning for custom images to create your own custom image and use that to build the Spectrum Scale storage cluster through this offering. string "hpcc-scale5131-rhel84" no
scale_storage_node_count The number of Spectrum Scale storage nodes that will be provisioned at the time the cluster is created. Enter a value in the range 2 - 18. It must be divisible by 2. number 4 no
scale_storage_node_instance_type Specify the virtual server instance storage profile type name to be used to create the Spectrum Scale storage nodes for the Spectrum Storage cluster. For more information, see Instance profiles. string "cx2d-8x16" no
spectrum_scale_enabled Setting this to true will enable Spectrum Scale integration with the cluster. Otherwise, Spectrum Scale integration will be disabled (default). By entering 'true' for the property, you have also agreed to one of the two conditions: (1) You are using the software in production and confirm you have sufficient licenses to cover your use under the International Program License Agreement (IPLA). (2) You are evaluating the software and agree to abide by the International License Agreement for Evaluation of Programs (ILAE). Note: Failure to comply with licenses for production use of software is a violation of the IBM International Program License Agreement. bool false no
ssh_allowed_ips Comma-separated list of IP addresses that can access the Spectrum LSF instance through SSH interface. The default value allows any IP address to access the cluster. string "0.0.0.0/0" no
ssh_key_name Comma-separated list of names of the SSH key configured in your IBM Cloud account that is used to establish a connection to the LSF management node. Ensure that the SSH key is present in the same resource group and region where the cluster is being provisioned. If you do not have an SSH key in your IBM Cloud account, create one by using the instructions given at SSH Keys. string n/a yes
storage_node_instance_type Specify the virtual server instance profile type to be used to create the storage nodes for the Spectrum LSF cluster. The storage nodes are the ones that are used to create an NFS instance to manage the data for HPC workloads. For choices on profile types, see Instance profiles. string "bx2-2x8" no
volume_capacity Size in GB for the block storage that will be used to build the NFS instance and will be available as a mount on the Spectrum LSF controller node. Enter a value in the range 10 - 16000. number 100 no
volume_iops Number to represent the IOPS (Input Output Per Second) configuration for block storage to be used for the NFS instance (valid only for ‘volume_profile=custom’, dependent on ‘volume_capacity’). Enter a value in the range 100 - 48000. For possible options of IOPS, see Custom IOPS Profile. number 300 no
volume_profile Name of the block storage volume type to be used for NFS instance. For possible options, see Block storage profiles. string "general-purpose" no
vpc_name Name of an existing VPC in which the cluster resources will be deployed. If no value is given, then a new VPC will be provisioned for the cluster. Learn more. string "" no
vpn_enabled Set to true to deploy a VPN gateway for VPC in the cluster. bool false no
vpn_peer_address The peer public IP address to which the VPN will be connected. string "" no
vpn_peer_cidrs Comma separated list of peer CIDRs (e.g., 192.168.0.0/24) to which the VPN will be connected. string "" no
vpn_preshared_key The pre-shared key for the VPN. string "" no
worker_node_instance_type Specify the virtual server instance profile type name to be used to create the worker nodes for the Spectrum LSF cluster. The worker nodes are the ones where the workload execution takes place and the choice should be made according to the characteristic of workloads. For choices on profile types, see Instance Profiles. Note: If dedicated_host_enabled == true, available instance prefix (e.g., bx2 and cx2) can be limited depending on your target region. Check ibmcloud target -r {region_name}; ibmcloud is dedicated-host-profiles. string "bx2-4x16" no
worker_node_max_count The maximum number of worker nodes that can be deployed in the Spectrum LSF cluster. In order to use the Resource Connector feature to dynamically create and delete worker nodes based on workload demand, the value selected for this parameter must be larger than worker_node_min_count. If you plan to deploy only static worker nodes in the LSF cluster, e.g., when using Spectrum Scale storage, the value for this parameter should be equal to worker_node_min_count. Enter a value in the range 1 - 500. number 10 no
worker_node_min_count The minimum number of worker nodes. This is the number of static worker nodes that will be provisioned at the time the cluster is created. If using NFS storage, enter a value in the range 0 - 500. If using Spectrum Scale storage, enter a value in the range 1 - 64. NOTE: Spectrum Scale requires a minimum of 3 compute nodes (combination of controller, controller-candidate, and worker nodes) to establish a quorum and maintain data consistency in the event of a node failure. Therefore, the minimum value of 1 may need to be larger if the value specified for management_node_count is less than 2. number 0 no
zone IBM Cloud zone name within the selected region where the Spectrum LSF cluster should be deployed. To get a full list of zones within a region, see Get zones by using the CLI. string n/a yes

Outputs

Name Description
region_name n/a
spectrum_scale_storage_ssh_command n/a
ssh_command n/a
vpc_name n/a
vpn_config_info n/a
