aws-samples / 1click-hpc
Deploy your HPC Cluster on AWS in 20 min. with just 1-Click.
License: MIT No Attribution
A while ago I ran into LBInit issues on deletion: when I delete a stack, LBInit usually fails to delete. The workaround is to wait a few more minutes and retry the stack delete, which then works.
But today I started having problems with its creation. In the CloudWatch log I find this:
{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "PhysicalResourceId": "2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "StackId": "arn:aws:cloudformation:us-east-1:842865360552:stack/origtest/0cdfe300-f1fa-11ec-b068-121de38a7e19",
    "RequestId": "10fc583d-c908-41c1-af07-751ba3a4b563",
    "LogicalResourceId": "LBInit",
    "NoEcho": false,
    "Data": {
        "ClientErrorCode": "NoSuchEntity",
        "ClientErrorMessage": "The Server Certificate with name origtest-981587795.us-east-1.elb.amazonaws.com cannot be found."
    }
}
I have another HPC cluster active with a different name; it should not interfere with the creation of another cluster in the same account. The above error still appears with everything set to AUTO.
We need to fix https://github.com/aws-samples/1click-hpc/blob/main/modules/07.configure.slurm.tagging.headnode.sh and use the existing prolog dir.
As of today, the template URL https://enginframe.s3.amazonaws.com/AWS-HPC-Cluster.yaml returns an Access Denied error.
Yesterday, I was able to deploy the 1click-hpc solution without problems.
The Lambda function created by the stack failed with the error below:
26 Jan 2022 09:05:13,763 [INFO] (/var/runtime/bootstrap.py) main started at epoch 1643187913763
26 Jan 2022 09:05:13,961 [INFO] (/var/runtime/bootstrap.py) init complete at epoch 1643187913961
Traceback (most recent call last):
File "/var/task/index.py", line 48, in lambda_handler
responseData = {'Error': traceback.format_exc(e)}
File "/var/lang/lib/python3.6/traceback.py", line 167, in format_exc
return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
File "/var/lang/lib/python3.6/traceback.py", line 121, in format_exception
type(value), value, tb, limit=limit).format(chain=chain))
File "/var/lang/lib/python3.6/traceback.py", line 509, in __init__
capture_locals=capture_locals)
File "/var/lang/lib/python3.6/traceback.py", line 338, in extract
if limit >= 0:
TypeError: '>=' not supported between instances of 'ClientError' and 'int'
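The TypeError at the end of the trace is not the real failure; it is raised while formatting the original error. In Python 3, traceback.format_exc() accepts no exception argument (its signature is format_exc(limit=None, chain=True)), so the ClientError passed at index.py line 48 is treated as the limit parameter. A minimal sketch of the corrected error path (the function and exception below are simulated stand-ins, not the stack's actual code):

```python
import traceback

def handler_error_data():
    """Build the error payload the way index.py line 48 should have.

    Passing the caught exception to traceback.format_exc() makes it the
    `limit` argument, which triggers exactly the TypeError seen above
    ("'>=' not supported between instances of 'ClientError' and 'int'").
    """
    try:
        raise RuntimeError("simulated ClientError")  # stand-in for botocore's ClientError
    except Exception:
        # Wrong: traceback.format_exc(e)
        # Right: no argument; format_exc() reads sys.exc_info() itself.
        return {"Error": traceback.format_exc()}
```

With this change the handler reports the underlying ClientError instead of crashing inside the error handler.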
Updating the cluster config from the Cloud9 instance removes the slurmdbd service.
I am not sure; perhaps there is a customized update procedure via the EnginFrame portal instead?
When we use a non-zero minimum in the cluster config for resources, the nodes come alive at cluster launch, so this job-related check will never be True:
Because this must run in the root context, the only way to do it is in the prolog script, attaching it to a job. So the plan would basically be:
The problem is how to send a signal about the job to the prolog and epilog, since custom user environment variables are not passed through, and neither is the job comment. Per the Slurm manuals, we should not run scontrol from the prolog; that would impair job scaling, similarly to the API calls (this is related to #34).
Looking at the variables available at prolog/epilog time, I only have two ideas so far:
#SBATCH --nice 0
or some sensible value to uniquely identify the intention, then use TaskProlog and TaskEpilog to start/stop the monitoring container; [GM] or set my job name, then pick up and interpret SLURM_JOB_NAME ("Name of the job. Available in PrologSlurmctld, SrunProlog, TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog"), which also means using TaskProlog and TaskEpilog to start/stop the monitoring container.
Not sure if there's a setup step that I'm missing here, but when I run the included Windows or Linux DCV job I get:
sbatch failed (parameters: -J Linux_Desktop -D /fsx/nice/enginframe/sessions/ec2-user/tmp4716553958834750820.session.ef -C dcv2, exit value: 1)
You can see this in bootstrap.log:
+ /home/ec2-user/.local/bin/pcluster create-cluster --cluster-name hpc-1click-hpc365 --cluster-configuration config.us-east-1.yaml --rollback-on-failure false --wait
{
"message": "The security token included in the request is expired"
}
It seems that if the command runs beyond 10-15 minutes, the EC2/Cloud9 credentials rotate and the pcluster CLI doesn't take this into account. The --wait option seems to be deprecated in pcluster, so we probably need to move to a polling approach that allows the credentials to refresh.
This causes the outer CloudFormation stack to fail initially, but it succeeds if it is retried.
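A polling loop along these lines could replace --wait. This is a sketch under the assumption of the ParallelCluster v3 CLI (pcluster describe-cluster reporting clusterStatus) with jq available; the function names and cluster name are illustrative:

```shell
# Map a clusterStatus value to an action for the poll loop.
check_status() {
    case "$1" in
        CREATE_COMPLETE) echo done ;;
        CREATE_FAILED)   echo failed ;;
        *)               echo wait ;;
    esac
}

# Poll until the cluster settles. Each describe-cluster invocation picks up
# freshly rotated EC2/Cloud9 credentials, unlike one long-lived `--wait`.
wait_for_cluster() {
    local name="$1" status
    while true; do
        status=$(pcluster describe-cluster --cluster-name "$name" | jq -r '.clusterStatus')
        case "$(check_status "$status")" in
            done)   echo "cluster ready"; return 0 ;;
            failed) echo "cluster create failed" >&2; return 1 ;;
            wait)   sleep 60 ;;
        esac
    done
}

# Usage (after `pcluster create-cluster ...` without --wait):
#   wait_for_cluster hpc-1click-hpc365
```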
Hi, I am adding a customization to implement https://docs.aws.amazon.com/parallelcluster/latest/ug/launch-instances-odcr-v3.html
I am creating a uniquely named policy and attaching it to the HeadNode just fine.
I am also creating a resource group to hold all existing targeted capacity reservations. Should I use a query for that, or can I just attach an ARN containing a wildcard in the last section?
The second and harder problem: I need to create the JSON that overrides the Slurm compute node settings. I can retrieve the current zone ID and account ID from the head node itself, but I need some way to pass in the cluster name or the group name so I don't have to hardcode it in the file. Currently that script looks like this: https://github.com/rvencu/1click-hpc/blob/main/modules/50.install.capacity.reservation.pool.sh
#!/bin/bash
set -e

ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '."Account"')
EC2_AVAIL_ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
EC2_REGION=$(echo "$EC2_AVAIL_ZONE" | sed 's/[a-z]$//')   # strip the trailing AZ letter

# Override run_instances attributes.
# Name of the group is still hardcoded; need a way to get a variable from CloudFormation here.
cat > /opt/slurm/etc/pcluster/run_instances_overrides.json << EOF
{
    "compute-od-gpu": {
        "p4d-24xlarge": {
            "CapacityReservationSpecification": {
                "CapacityReservationTarget": {
                    "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:$EC2_REGION:$ACCOUNT_ID:group/EC2CRGroup"
                }
            }
        }
    }
}
EOF
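One possible way to avoid hardcoding the group name: ParallelCluster v3 tags each instance with parallelcluster:cluster-name, so the head node can look its own cluster name up at boot. A hedged sketch — the "EC2CRGroup-&lt;cluster&gt;" naming convention and the helper names are assumptions of this sketch, not part of the module:

```shell
# Assemble the resource-group ARN from its parts; the "EC2CRGroup-<cluster>"
# naming convention is an assumption.
build_group_arn() {
    # args: region account-id cluster-name
    echo "arn:aws:resource-groups:$1:$2:group/EC2CRGroup-$3"
}

# Read this instance's cluster name from its EC2 tags. Assumes ParallelCluster
# v3, which tags instances with "parallelcluster:cluster-name".
get_cluster_name() {
    local instance_id
    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 describe-tags \
        --filters "Name=resource-id,Values=$instance_id" \
                  "Name=key,Values=parallelcluster:cluster-name" \
        --query 'Tags[0].Value' --output text
}

# In the module, after EC2_REGION and ACCOUNT_ID are set:
#   GROUP_ARN=$(build_group_arn "$EC2_REGION" "$ACCOUNT_ID" "$(get_cluster_name)")
```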
Hi, I can see an NVLink panel; is there any way to also monitor EFA metrics?
EnginFrame is interesting as a workspace, but it seems to require defining users locally, while we already use AD to manage users.
Is there any existing integration, or can I get some hints about such a potential integration so I can develop it myself?
I get the error: « ERROR 502: Bad Gateway »
The CFT installation seems to work fine; I have no error message.
On the head node I get the same message:
[ec2-user@ip-XXXX ~]$ wget https://XXXX.eu-west-1.elb.amazonaws.com/ --no-check-certificate
--2022-01-18 17:05:38-- https://XXXX.eu-west-1.elb.amazonaws.com/
Resolving XXXXX.eu-west-1.elb.amazonaws.com (XXXXX.eu-west-1.elb.amazonaws.com) [SNIP] connected.
WARNING: cannot verify XXXXX.eu-west-1.elb.amazonaws.com's certificate, issued by ‘/C=US/ST=WA/L=Seattle/O=AWS WWSO/OU=HPC/CN=EnginFrame’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 502 Bad Gateway
2022-01-18 17:05:38 ERROR 502: Bad Gateway.
Ciao Nicola!
As you know, in the context of an HPC POC in AWS for a French company, UCit (mainly myself) has slightly modified and used 1Click-HPC to run the POC's HPC environment.
I have faced a strange authentication issue while trying to start a DCV session on a g4dn instance with CentOS 7.9.2009 + DCV 2022.0 r12760 + EnginFrame (EF) 2021.0-r1592 + Slurm 21.08.8-2 on AWS. The issue is specific to users stored in the AD attached to the cluster.
I have finally found a fix for the issue, but I think it's important to discuss it with you to understand the underlying behaviour here...
The symptom is the following:
The error message in slurm-$JobID.out is:
[2022/06/09 14:40:15] INFO Starting DCV session...
[2022/06/09 14:40:15] INFO DCV version supports --gl-display parameter
[2022/06/09 14:40:15] INFO Creating DCV session "dcv create-session --type=virtual tmp7339021573904918669 "
Could not create session. Could not get the system bus: Exhausted all available authentication mechanisms (tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) (available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)
[2022/06/09 14:40:15] ERROR Failed to launch DCV session (exit code: 1)
[2022/06/09 14:40:15] FATAL Error: DCV failed to create session
[2022/06/09 14:40:15] FATAL Exiting with code 1
After a lot of tests, described below, I finally found a solution, which consists of adding the following line at the very beginning of the file $EF_ROOT/plugins/interactive/lib/remote/linux.jobscript.functions:
id "${USER}"
Indeed, this performs a kind of "first connection" for the user trying to start a session on the targeted system, so that the user is known at the system level...
With the help of Benjamin Depardon, I tested the issue by issuing the following command on the command line of the cluster's head node:
srun -N 1 -p dcv-gpu --exclusive -C "[g4dn.xlarge*1]" dcv create-session my_session
And then, we tried all of the following options:

restarting gdm only after dcvserver was restarted at the end of the installation process
=> NOT working

restarting dbus + dbus-org.+ gdm after dcvserver was restarted at the end of the installation process
=> NOT working

changing /etc/pam.d/dcv to the following contents:
#%PAM-1.0
# Default NICE DCV PAM configuration.
# This file is auto-generated, user changes will be destroyed at
# installation/update time.
# To make changes, create a file named dcv.custom in this
# directory and set the 'pam-service-name' parameter in the
# [security] section of dcv.conf to 'dcv.custom'
#auth include password-auth
#account include password-auth
auth include password-auth
account required pam_access.so
account required pam_unix.so
account sufficient pam_localuser.so
account sufficient pam_usertype.so issystem
account [default=bad success=ok user_unknown=ignore] pam_sss.so
account required pam_permit.so
=> NOT working

running the following commands on the remote system:
$> getent passwd | grep username
or
$> getent passwd -s sss | grep username
or
$> sssctl cache-upgrade
=> NOT working

adding the following command at the very beginning of Slurm's prolog.sh script:
$> id "${SLURM_JOB_USER}"
=> NOT working

running the following command on the DCV node before the session was created:
$> id username
or
$> sssctl user-checks username
=> SUCCESSFUL

connecting to the DCV node over SSH as the user username (or as any other user and then switching with $> su - username) before the session was created
=> SUCCESSFUL
Our conclusion is that the user must be known to the system (and cached somewhere) for the authentication process to allow execution of the tasks required by the Slurm job.
Our questions are:
Please don't hesitate to ask for any complementary information and to let us know what you think.
Best regards,
Vincent.
Hi, I noticed that EnginFrame creates a database on the same DB server as Slurm accounting.
While Slurm can use a single accounting DB per organization (https://aws.amazon.com/blogs/compute/enabling-job-accounting-for-hpc-with-aws-parallelcluster-and-amazon-rds/), is the same true for EnginFrame? I get a 404 error when I open the EnginFrame link on a second deployed cluster.
Since the /fsx/nice location is not unique to the cluster, starting multiple clusters with the same FSx will overwrite the portal data of older clusters.
Getting this after leaving everything on AUTO except Active Directory:
Template format error: Unresolved resource dependencies [ActiveDirectory] in the Resources block of the template
1click-hpc doesn't work with the latest version of ParallelCluster; the OnNodeConfigured scripts of the HeadNode are failing. This seems to be related to a change to /etc/parallelcluster/cfnconfig introduced in the newer version.
Hi Nicola, Sean / Team,
Is there a way to integrate SSO into this stack?
@nicolaven I tried to integrate Okta, but without success.
Could you please help me a bit with the integration?
Grafana seems to be installed correctly, but all panels say "No data". Even the compute nodes list is empty.
Is there any further configuration step that I missed?
We ran into a scaling issue with the tagging in the prolog script.
I understand the prolog runs at every step, and when many nodes are involved the job fails with timeouts.
We need to find another place to do the tagging. I understand the comment is job-related, but some other tags can be applied only once, when the instances are created, whether because of the min value in the configuration or because Slurm creates them.
I am looking at places where this could be done.
Maybe it can be done on the head node instead, in PrologSlurmctld: https://slurm.schedmd.com/prolog_epilog.html
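Moving the per-job work into PrologSlurmctld would run it once per job on the head node instead of on every compute node. A hedged sketch under that idea — the helper names are illustrative, and the node-to-instance lookup is an assumption to adapt to however your cluster maps hostnames to instance IDs:

```shell
# Build the tag specification for `aws ec2 create-tags`; the tag key is an
# illustrative choice, not an existing convention.
job_tag() {
    echo "Key=slurm:job-id,Value=$1"
}

# Tag every instance allocated to a job, once, from the slurmctld host.
tag_job_instances() {
    local job_id="$1" nodelist="$2" host instance_id
    # scontrol expands a compact hostlist like "compute-[1-3]" to one host per line.
    for host in $(scontrol show hostnames "$nodelist"); do
        # Map the node name to an EC2 instance ID; this lookup is an assumption.
        instance_id=$(aws ec2 describe-instances \
            --filters "Name=private-dns-name,Values=${host}*" \
            --query 'Reservations[0].Instances[0].InstanceId' --output text)
        aws ec2 create-tags --resources "$instance_id" --tags "$(job_tag "$job_id")"
    done
}

# PrologSlurmctld exports SLURM_JOB_ID and SLURM_JOB_NODELIST:
#   tag_job_instances "$SLURM_JOB_ID" "$SLURM_JOB_NODELIST"
```

Note that PrologSlurmctld runs before the allocation is granted, so heavy API calls here can still delay scheduling; batching the create-tags calls would help for large jobs.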
We need bigger instances: for large fleets, this small db.t4g.micro instance is unable to respond in time when large jobs are launched.
All,
Stumbled on this while struggling to get slurmrestd set up on pcluster. It looks like this provides a lot of friendly wrappers for HPC type problems. Is this code production-ready? Does it support GPU instances? Is there an API provided?
Hi Nicola, Sean / Team,
Please add a note on the CloudFormation page that the ec2-user password should not contain the special character "@". I see CloudFormation rolling back at the SlurmDB creation stage with the status reason below if I use "@" in the password:
"The password for the master user. The password can include any printable ASCII character except "/", """, or "@"."
We noticed that in a multiuser HPC cluster with FSx attached, all users are able to browse and read files of all other users, even though writing is only possible in their own folders.
Is there a way to restrict access similar to the /home folder behaviour?
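A minimal sketch of one approach, assuming per-user directories live under a path like /fsx/home (hypothetical) and that owner-only POSIX permissions are acceptable, mirroring the usual /home behaviour:

```shell
# Give each per-user directory under the shared FSx mount owner-only
# permissions. This only changes POSIX modes on the filesystem; it does not
# touch FSx itself. The default base path is an assumption.
restrict_user_dirs() {
    local base="${1:-/fsx/home}" d
    for d in "$base"/*/; do
        [ -d "$d" ] || continue
        chmod 700 "$d"   # owner may read/write/traverse; group and others get nothing
    done
}

# Usage (as root):
#   restrict_user_dirs /fsx/home
```

Running this once from an OnNodeConfigured script, or periodically, would keep new user directories locked down as they appear.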
I noticed the head node was created in the private subnet. I checked the parameters in the CloudFormation stack where I supplied my custom VPC details:
| Parameter | Value |
|---|---|
| PrivateSubnetAId | subnet-0112272390ac53c95 |
| PublicSubnetAId | subnet-79775a34 |
| PublicSubnetBId | subnet-2fd94770 |
| VpcId | vpc-4678d63b |
And the head node was created in the private subnet.
As a consequence, I cannot SSH into the EIP of the head node.
I can see the code does this deliberately:
1click-hpc/scripts/Cloud9-Bootstrap.sh
Line 74 in 93d1930
If it is supposed to be there, then please explain how to SSH into the head node using the elastic IP.
Missing that bit, mapping job IDs to instance IDs...
And the extra docs covering the customizations seem to lag behind, referring to an older version of ParallelCluster.
Maybe we could host our custom templates on your server too to avoid an extra installation of a custom enginframe?
I think the title is self-explaining.
I had to increase the Idle Timeout setting in the ALB to make it work.
You may want to adjust it in the CF template.
Regards.
PL
We have many p4d.24xlarge pods and we need them in the config.
More than this, we need to be able to pull them from the capacity reservations we have. Without that, there are not many p4d instances available on-demand, and the cluster usually fails to build.
Hi, is there a method to customize the splash page of the enginframe portal?
When I try to create in us-west-1, I get an error:
2021-04-01 14:51:49 UTC-0700 | PublicSubnetB | CREATE_FAILED | Template error: Fn::Select cannot select nonexistent value at index 2
(Likely because us-west-1 exposes only two Availability Zones, so selecting index 2 from Fn::GetAZs is out of range.)