simonsobs-uk / data-centre Goto Github PK

View Code? Open in Web Editor NEW

2.0 4.0 1.0 266 KB

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett

Home Page: https://souk-data-centre.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Makefile 15.62% Python 73.58% Shell 10.81%

data-centre's Introduction

SO:UK Data Centre

This documents the baseline design of the SO:UK Data Centre at Blackett.

The GitHub repository simonsobs-uk/data-centre contains

source of the documentation, including the codes in the documentation that you can run directly,
the Issues tracker as well as Discussions, for any data centre related bugs or questions, and
a Python package for data centre maintenance.

To access our documentation, you have a few options, (in the order from convenience to advanced usages):

SO:UK Data Centre documentation at Read the Docs. This gives you access to all versions of SO:UK Data Centre documentations, as well as multiple output formats including HTML, ePub, PDF.
SO:UK Data Centre documentation at GitHub Pages which is HTML only and points to the latest commits only.
SO:UK Data Centre documentation at GitHub Releases. This gives you additional output formats such as man page and plain text.

Note that Read the Docs serves the searches from server-side powered by Elasticsearch. So searches from Read the Docs and GitHub Pages will gives you different results. Try the Read the Docs first for better results and fall back to GitHub Pages.

Lastly, those single file documentation formats are very suitable for feeding into Large Language Models (LLMs). For example, try downloading our plain text format and upload it to ChatGPT or Claude and start chatting. You can ask them to explain things in the documentations in details that we cannot covers here.

data-centre's People

Contributors

Stargazers

Watchers

Forkers

shaikhhasib

data-centre's Issues

occasionally output & error files comes back empty

Copied from email thread:

occasionally I do not get the stdout and/or stderr back from the job. i.e. the file listed in the job configuration file with output and error are empty files. (This only happens randomly even when the same job is submitted multiple times.) Is it a known problem and how to fix this?

mounting XRootDFS on other systems including NERSC

We will provide wrapper scripts to allow users to mount XRootDFS, similar to what is done on vm77.
Our wrapper script will only let users mount it for their own access.
Our documentation should explicitly prohibit users from mounting using user certificate for other people, such as a group machine. It will mentions that it is a security requirement, but is also a common sense (hopefully it will become common knowledge after documented it!)

C.f. a related discussion: #22

Continuous Deployment (CD) to CVMFS

Currently, accessing the CVMFS publishing node requires one to open a link in the browser to authenticate first.

How to achieve CD then? Is there any option to authenticate without human interaction?

ControlMaster in ssh config can be useful but it requires running it on a instance that are persistent (such as our workstation), but not GitHub Actions.

`$HOME/.ssh/authorized_keys` is centrally managed

Currently, $HOME/.ssh/authorized_keys will be overwritten periodically by central management system. But this is counter-intuitive from the expectation of the Filesystem Hierarchy Standard. A better approach would be putting it under /etc (Host-specific system configuration).

For example, use AuthorizedKeysFile in /etc/ssh/sshd_config:

# /etc/ssh/sshd_config
AuthorizedKeysFile /etc/ssh/%u/authorized_keys .ssh/authorized_keys

and put the centrally managed file in /etc/ssh/$USER/authorized_keys instead.

How to modify TOAST in the provided software stack

Migrated from Slack.

People involved: @earosenberg

Erik Rosenberg
3:53 PM
I'm also having a second issue, my sims need a newer version of TOAST than included in your tarball. I have been using 3.0.0a18 (commit 6a03ec95) at NERSC but newer (ie the most recent toast3) would probably be fine.
I did try: download your tarball to my laptop; unzip to /tmp; complile TOAST with -DMAKE_INSTALL_PREFIX=/tmp/{pmpm-....} and upload back to Blackett, but it complains "GLIBC_2.29 not found" when I try to run there.
More details on how to correctly make the tarball so it's portable (or just an updated tarball for now) would be much great when you get a chance

Intermittent Network Overloads and Connection Drops When Accessing Blackett from UoM

According to conversations between @rwf14f, @shaikhhasib, this is a known problem, and a temporary fix is to connect to it from outside UoM.

Edit: symptom and background: UoM has its own firewall and Blackett while geographically located within UoM, has their own firewall outside UoM. Connections from within UoM to Blackett are mostly fine, but sometimes it will just disconnect and succeeding reconnection refused.

Clarify how wall-clock time is constrained?

First part of #14 is partially addressed in commit 7b88aaf. But I still don't understand how it is counted, copied from #14 below:

About wall clock time, when is the 2nd constraint not the same as the first constraint? How is it going to track CPU hours?

The wallclock time needs to be documented, but I don't think we have all the information to add such page yet. I envisioned it to be something like https://docs.nersc.gov/jobs/policy/ where wall-clock time should be part of the constraints on the computer systems that the user should be aware of. C.f. #6.

From your wordings, it seems it is somehow measuring how much CPU time it is using? Shouldn't wall-clock time be exactly 72 hours in the real world regardless how the job is configured (or using resources)? I'm guessing because "Cgroup-Based Process Tracking" is not done via HTCondor?

Optimizing Resource Access: Implementing 'sss' Security for XrootdFS on User Nodes

We aim to enhance the user experience by mounting the grid storage system (located at root://bohr3226.tier2.hep.manchester.ac.uk:1094/dpm/tier2.hep.manchester.ac.uk/home/souk.ac.uk/) on nodes accessible to our users.

According to the xrootdfs man page, employing an 'sss' security node could be a viable solution:

SECURITY
By default, XrootdFS does not send individual user identity to the Xrootd storage servers. So Xrootd storage thinks that all operations from an XrootdFS instance come
from the user that runs the XrootdFS instance. When the Xrootd "sss" security module (Simple Shared Security) is enabled at both XrootdFS and Xrootd storage system,
XrootdFS will send individual user identity information to the Xrootd storage servers. This info can be used along with the Xrootd ACL to control file/directory access.

To use "sss" security module, both Xrootd data servers and XrootdFS should be configured to use "sss" in a particular way, e.g. both sides should use a key file that con‐
tains the same key generated by the xrdsssadmin program in the following way:

xrdsssadmin -k my_key_name -u anybody -g usrgroup add keyfile

(change only "my_key_name" and "keyfile"). Please refer to environment variable "XrdSecsssKT" in Xrootd "Authentication & Access Control Configuration Reference" for more
information on the location of the keyfile and its unix permission bits. That same document also describes the Xrootd ACL DB file.

To enable "sss" with XrootdFS, use the sss=/keyfile option with XrootdFS.

The following example shows how to use both unix and sss security modules with the Xrootd data servers.

xrootd.seclib /usr/lib64/libXrdSec.so
sec.protocol /usr/lib64 sss -s /keyfile
sec.protocol /usr/lib64 unix
acc.authdb /your_xrootd_ACL_auth_db_file
acc.authrefresh 300
ofs.authorize

@rwf14f, @afortiorama, could you please confirm if a similar setup is in operation at Blackett, and if so, what steps are necessary for us to replicate this configuration? Thank you.

HTCondor ssh setup may fail depends on users' ssh config

C.f. #13 (comment) . The current solution requires coordination from users. Ideally, this should be decoupled with user config.

Check if HTCondor, possibly in a later version, expose a way to configure system-wide how it launches ssh jobs.
If not, consider file an issue / pull request to HTCondor on this matter.

This is of low priority, we may not fix it depending on the scope.

Feedback on user documentation

Just a few notes on the user documentation:

Wallclock Time:

There's a maximum wallclock time configured. It's currently 72 hours, jobs get killed if they exceed this.
The same is true for CPU time for each requested CPU (ie job gets killed if total used CPU hours > requested CPUs * 72 hours)
Need to check how the machine count fits into this, we'll most likely have to update the configuration to take this into account (ie machine count * requested CPUs * 72 hours).

Container support:

The docker universe won't work because there's no docker on the worker nodes.
The worker nodes have apptainer installed though. Jobs can use it to run their workloads inside a container.
Running containers in a vanilla universe job works fine.
No experience with doing the same in parallel universe jobs. There might be communication problems between the mpi processes if mpi is running inside a container.

Making xrootdfs available on all worker nodes?

@rwf14f, what are your thoughts on making xrootdfs available on worker nodes?

Currently, commands like xrdcp and xrdfs are available on worker nodes, but not xrootdfs.

I understand that xrootdfs may not be performant enough to be suitable for heavy data IO, but it can be useful for persistent interactive session for example. I.e. people can start an interactive job, and have the xroodtfs mounted with some of their dotfiles for example. Say when they have some notebooks running, they can quickly save the notebook on disk and have them be persistent in the next interactive session.

Mounting grid from vm77 not working

Trying to mount the grid storage from vm77 as per these instructions https://souk-data-centre.readthedocs.io/en/latest/user/pipeline/4-IO/1-grid-storage-system/4-xrootdfs/.
This worked a few weeks ago. Now when I run /opt/simonsobservatory/xrootdfs.sh start it creates the directory souk.ac.uk in my $HOME and the script doesn't complain.
Then running ls in $HOME gives the error:
ls: cannot access souk.ac.uk: No such file or directory.
After /opt/simonsobservatory/xrootdfs.sh stop the error message disappears and souk.ac.uk becomes a normal empty directory.

gfal-ls root://bohr3226.tier2.hep.manchester.ac.uk:1094//dpm/tier2.hep.manchester.ac.uk/home/souk.ac.uk/ works fine on the other hand.

Deal with heterogeneous nodes

A meta-issue collecting issues with heterogeneous nodes first.

software deployment: AMD vs Intel, AVX-512 availability, no-mkl. etc.
- possibly look for ways to continue to use MKL while disguising AMD CPU as Intel's.
- dispatch software environment per node by detecting architecture on-the-fly. (Possibly make debugging harder.)
  - dispatch over x86-64-v3, x86-64-v4?
load-balancing per job. C.f. #6

Excessive Data Transfer Time from UoM Laptop to Blackett: 1GB Python Tarball Taking ~45 Mins

I've had slow data transfer rates (~45 mins to transfer 1GB python tarball from my laptop by sftp) but not sure if this is the same problem

Originally posted by @earosenberg in #8 (comment)

Cannot constraint on Arch and/or Microarch

Hi, @rwf14f, I just try constraining on Arch and/or Microarch but is not successful:

Requirements = (Arch == "INTEL") && (Microarch == "x86_64-v4")

The reason is that apparently the attributes are not available:

❯ sudo condor_status -autoformat Arch Microarch
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined
X86_64 undefined

Is it possible to make it available to the system?

P.S. We know the implication of constraining jobs which reduces the pool of nodes it can be submitted to and therefore increase the queue time for example.

Warnings when running OpenMPI

Copied from email thread:

in running MPI jobs with Open MPI 3, I have the following warnings. Is it normal or was there any configuration problem?

From stderr:

--------------------------------------------------------------------------
[[29587,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
 Host: wn1906370

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[wn1906370.in.tier2.[hep.manchester.ac.uk](http://hep.manchester.ac.uk/):149808] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[wn1906370.in.tier2.[hep.manchester.ac.uk](http://hep.manchester.ac.uk/):149808] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

From stdout:

[1689097992.669727] [wn5914090:142928:0] sys.c:618 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1689097992.671404] [wn5916090:65908:0] sys.c:618 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1689097992.696781] [wn1906370:149888:0] sys.c:618 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1689097992.697059] [wn1906370:149889:0] sys.c:618 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'

Jobs can over-subscribed into each other when sharing the same worker node

@rwf14f, I am encountering some strange issues which looks like an interactive node and worker node is oversubscribing the same node.

This concerns the following 2 jobs:

OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
khcheung ID: 309      7/19 13:55      _      1      _      1 309.0
khcheung ID: 331      7/22 00:49      _      1      _      1 331.0

For job 309, it is an interactive job landed on <wn1906370.in.tier2.hep.manchester.ac.uk>. This job was started by the following config:

RequestMemory=16384
RequestCpus=8
queue

This node has 8 physical CPU (2 socket of Intel(R) Xeon(R) Gold 5122 CPU each has 4 cores). So this reservation should have made the node exclusive to the job.

Strangely, job 331 has landed on the same node <wn1906370.in.tier2.hep.manchester.ac.uk> apparently from this line of log:

001 (331.000.000) 2023-07-22 00:50:11 Job executing on host: <195.194.108.112:9618?addrs=195.194.108.112-9618+[2001-630-22-d0ff-b226-28ff-fe53-755c]-9618&alias=wn1906370.in.tier2.hep.manchester.ac.uk&noUDP&sock=startd_2449_6386>

This was submitted to the vanilla universe with this bit of relevant config:

request_cpus            = 8
request_memory          = 32G
request_disk            = 32G

So apparently these 2 jobs are oversubscribed to the same node.

(De)duplicating data on grid storage systems

As users are starting to share files with each other, we are now seeing people copying files within our VO (at root://bohr3226.tier2.hep.manchester.ac.uk:1094//dpm/tier2.hep.manchester.ac.uk/home/souk.ac.uk/).

What are the best practice here? We want to be able to not costing n copy of files as n collaborators are making copies.

For example, when gfal-copy, behind the scene, would it costs 2 times as much storage? Or was it doing COW (Copy On Write) behind the scene? If not COW, is there any way we can make some sort of hard-link so that the copy is cheap, such as the cp --reflink behavior?

Would the answers to these be different once we migrate from DPM to Ceph?

Thanks.

how to "chgrp" in grid storage system?

In traditional POSIX compliant FS, we can just chgrp and the group permission bits to share to other collaborators in the same group. In a grid storage system, how to achieve that?

In particular, under the SO:UK VO, @earosenberg is trying to share a sub-directory erosenberg/toast_sims to others, but to me it is empty.

HOME does not seem to be set up correctly

C.f. #4

For example, on wn5916340.in.tier2.hep.manchester.ac.uk, HOME points to /scratch/condor_pool/condor/dir_16550 for a particular job. But there is a /home/$USER directory. According to #4 (comment), HOME should points to /home/$USER instead.

I suspect in the shell initialization process, probably coupled with how HTCondor is set up too, HOME has been overridden?

Establishing a Dedicated Login Node for Interactive Computing and JupyterHub Integration

@DanielBThomas, @rwf14f, @afortiorama,

Introduction and Summary

Following our recent internal meeting, we are planning to deploy a JupyterHub instance, inspired by the model implemented at the National Energy Research Scientific Computing Center (NERSC). For more information, refer to NERSC's Jupyter Documentation and JupyterHub@NERSC.

Project Objective

Our primary goal is to establish a robust Jupyter environment, mirroring NERSC's implementation. Jupyter, an advanced computing platform, supports digital notebooks that combine executable code, equations, data visualization, interactive interfaces, and text. This deployment is particularly vital for enabling exploratory data analysis and visualization, especially with data stored at NERSC. This initiative gains added importance as the Simons Observatory collaborators plan to release Jupyter Notebooks for training, tutorials, and exploratory data analysis (EDA). It's essential that our infrastructure effectively supports these activities.

System Requirements

Hardware Configuration: We require a single physical node, robustly equipped with substantial CPUs, RAM, and disk space.
Home Directory Configuration:
- We aim for the login node to also serve as the JupyterHub node, allowing users to maintain their HOME environment consistently, similar to their setup on vm77, whether accessing via SSH or JupyterHub.
- This setup does not imply a merger of vm77 into this system; vm77 will continue functioning as a job submission node, providing redundancy.
- A shared HOME environment between vm77 and the new node is not part of our current plan.
Intended Use of the Login Node:
- The node is designated mainly for lightweight computation, interactive use, and visualization.
- It will support running Jupyter Notebooks via JupyterHub and command-line tools, possibly with X11Forwarding.
- Additionally, it's suitable for compilation tasks and running short, small-scale tests, potentially for continuous integration (CI) purposes.
Domain Assignment: A future consideration includes assigning a domain, such as jupyter.souk.ac.uk.

Login Process

Our preferred method involves opening a specific port in the firewall, allowing UNIX username and password authentication. This approach might present security challenges, which we aim to address with Multi-Factor Authentication (MFA) or OAuth. We welcome suggestions regarding security requirements.
As an alternative, SSH forwarding is an option, though it might be less convenient for users.

Maintenance and Security

Our Data Centre software engineers will maintain the JupyterHub instance, in line with the JupyterHub Security Guidelines.
We are prepared to initially deploy using SSH tunneling, considering the opening of a firewall port as a subsequent step.
We will commence with a single node, distinct from worker nodes, meaning that job launches from Notebooks will not be possible. We plan to reevaluate this arrangement as future needs and feedback dictate.

transition between pre-production CephFS to production CephFS

From a meeting between SO:UK DC and Blackett people today:

Blackett is currently using DPM and will transition into Ceph in 2024.
Currently, XRootD is serving from DPM, which will transition into serving from Ceph instead.
Ceph exposes the filesystem via CephFS (currently xrootdfs can be used to mount XRootD as FS via FUSE, but not on worker nodes), available on all nodes including worker nodes.
pre-production CephFS will be put on our machines (the SO:UK DC purchased machines), but may be subject to purging when transitioning into production
- This will not be a problem for us (stopping us using pre-production CephFS ASAP), and we plan to copy data for our users. Both can be mounted, and then say we make the pre-production CephFS read-only, and replicates it to the production CephFS. We should provide env. var. to hide the differences in absolute path (see #23)

Edit: relevant context: I find this when surfing the net, probably relevant: Ceph Deployment and Monitoring at Lancaster, GridP47, 23 March 2022, Gerard Hand, Steven Simpson, Matt Doidge. I like this bit:

There’s only so many times you can declare a million files lost and not rethink all your life
choices.

support software deployment via conda environment file

consider merging https://github.com/simonsobs-uk/so-software-environment to this repo

test how long it takes to create such environment on worker nodes

MPI_ERR_NO_SUCH_FILE because different physical nodes do not have a shared file system

Transferred from a Slack conversation.

People involved: @DanielBThomas, @chervias.

Dan Thomas
8:03 PM
Hi both
Kolen as I mentioned, Carlos has encountered some issues on Blackett that we are hoping you can help with.
This is his latest message:
Do you know in Blackett if I launch with 2 MPI jobs (machine_count = 2) the /tmp dir is the same one seen in both jobs?
I cannot successfully run the toast tests python -c 'import toast.tests; toast.tests.run()' with MPI as described here https://souk-data-centre.readthedocs.io/en/latest/user/pipeline/3-MPI-applications/1-OpenMPI/
I get a IO error, I suspect a file cannot be accessed since at least one of the MPI jobs cannot see a directory. But the toast tests work fine in a single job.

Carlos Hervías
8:06 PM
this is the output from running the mpi.ini example as exactly described in the instructions. You can see the code failing at the end
This file was deleted.
8:07
Plain Text

mpi-0.out.txt
Plain Text

8:09
I get this MPI_ERR_NO_SUCH_FILE: no such file or directory which I suspect because its trying to write a file to a directory that is not there (edited)
8:13
this runs successfully in a single job, only fails for me when I try parallel

Kolen Cheung
6:28 PM
Hi, Carlos. Sorry for the late reply, comments below:

different HTCondor processes sees a different /tmp directory, even when they lands on the same physical nodes (/tmp is actually a symlink to somewhere within scratch there, and scratch is unique to HTCondor processes.) I will add this to our documentation.
regarding your MPI_ERR_NO_SUCH_FILE situation, I am able to reproduce your error and will investigate into it. (This error is stateful as I haven’t seen it last I set it up.) For the interim, you can run MPI applications with 1 node. It can be done either by changing machine_count to 1 in https://simonsobs-uk.github.io/data-centre/user/pipeline/1-classad/3-classad-parallel/, or by adapting the Vanilla Universe job to call mpirun directly without calling my wrapper script. I’ll document this later method in the documentation.

Carlos Hervías
2:16 AM
Thank you Kolen, so if I set machine_count=1 would that be a single MPI job?

Carlos Hervías
2:22 AM
So the only way I can take full advantage of many processors is by running a single MPI job with 64 threads or something?

Kolen Cheung
11:26 AM
Machine count is the number of nodes. I think the largest node we have right now has 20 threads, which equals to 10 cores. Once you request machine count equals to 1 and number of cpu equals to 20, in that job you can use mpirun/mpiexec -n 10 …

11:28
Correction: 10 physical cores per CPU, but it has 2 sockets. So you can request 40 cpu and set -n 20 in MPI.

Dan Thomas
11:33 AM
we have 8 64 thread machines; I think 4 can run vanilla universe and they can all run parallel universe.

Kolen Cheung
11:41 AM
If you set machine_count=1 and request parallel universe, it should works. Remember the request_cpu corresponds to number of logical cores, so requesting 64 of them corresponds to -n 32.

Carlos Hervías
1:17 PM
ok thanks, I did some tests in the vanilla universe where I requested 32 cpus and then ran toast with mpiexec -n 8 … for example (I’m running each job with 4 threads). What is the maximum cpus I can ask for right now? I could not ask more than 32 in my tests, the job would go into the idle queue. So if I wanted to run 32 mpi jobs with 2 threads each for example, I could do it requesting 64 cpus?

Kolen Cheung
1:29 PM
You may need to change it to the parallel universe in order to use the 64 threads machine. I think you mean an MPI job with 32 processes? When you say 2 threads each, do you mean OMP_NUM_THREADS? Bear in mind if you requests 64 cpus, that corresponds to 32 physical core, so the recommended setting would be 32 MPI processes with OMP_NUM_THREADS=1 (which is set automatically if you use my wrapper script for parallel universe.)

1:31
Ie N_MPI_PROCESSES times OMP_NUM_THREADS should equals to total no of physical cores. Otherwise there will be oversubscription and you will find it slower (in most cases except those that are IO bound.)

Carlos Hervías
1:31 PM
ah ok I get it, thanks! I thought the hyperthreading was on, but it is best not to use it

Kolen Cheung
1:40 PM
Yes, hyperthreading is enabled, but only in specialized cases that would be beneficial, so the recommendation is not to setup your OMP threading like that unless it is proven to be useful.

Kolen Cheung
10:01 PM
Hi,
@Carlos Hervías
, just to go back to your previous issue about the MPI error. You can safely ignore that. The TOAST 3 test assumes that different MPI processes can see the filesystem at the same path, which is not true in Blackett. I.e. the master processes created a directory at a certain path, and one of the other processes are writing to that directory. It also means that any script making that assumption would break, but that should be ok most of the time, as MPI processes shouldn’t rely on the filesystem to communicate anyway (hence a script trying to have one process write to a file and have another process to read it is slim). For scripts that assumes this, it should be a quick and easy fix.

CVMFS publishing server not working

/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/hello_world.sh is created on cvmfs-uploader02.gridpp.rl.ac.uk at Nov 17 15:58. As of writing, it still cannot be seen at vm77.

We need to document the expected time scale that the users can see deployed softwares.

C.f. #20

Elevated privilege on grid storage system?

What are our options to manage the grid storage with "elevated privilege" such as deleting any files within the pool? This would be similar to a group user account such that they can control everything inside that pool. Some of the things that we'd want to be able to do are:

deleting files,
moving files around,
fixing file permissions.

Thanks.

cbatch_openmpi does not handle MPI abortion correctly

The provided wrapper script to launch multi-node OpenMPI job in /opt/simonsobservatory/cbatch_openmpi does not handle MPI abortion correctly.

Symptom: Job hangs after MPI abortion, leaving node idle in the queue

TODO:

make a MWE to reproduce this bug first

Failed to transfer files

Copied from email thread:

I’m following https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html#simplest-example to test submitting jobs to the parallel universe.

Upon submitting that, the error I got is

007 (041.000.000) 2023-07-06 13:00:04 Shadow exception!
 Error from [[email protected]](mailto:[email protected]): Failed to transfer files

And then the job would seems to be stuck in the queue and idle forever.

If I submitted a slightly modified example of

#############################################
## submit description file for a parallel universe job
#############################################
universe = parallel
executable = /bin/sleep
arguments = 1
machine_count = 2
log = log
should_transfer_files = NO
request_cpus = 1
request_memory = 1024M
request_disk = 10240K

queue

Then the job would not fail, but seems to be stuck in the queue and idle forever too.

How to solve this?

HTCondor RequestMemory constraint does not reflect actual DetectedMemory

I submit the following ClassAd file

universe = parallel
executable = single.sh

log = single.log
output = single.out
error = single.err
stream_error = True
stream_output = True

use_x509userproxy = True

should_transfer_files = No

machine_count = 1
request_cpus = 4
request_memory = 48G
request_disk = 8G
queue

and the job stays idle forever.

Document how to load the CVMFS environment

Everything is ready except to

document it,
- what would be a good user experience here? Wrapper scripts?
- think about constraining to x86-64-v4
think about long term support policy, we shouldn't be keeping all of them forever...

(base) [northgridsgm@cvmfs-uploader02 simonsobservatory]$ pwd
/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory
(base) [northgridsgm@cvmfs-uploader02 simonsobservatory]$ ls -1 pmpm conda
conda:
so-conda-20231121
so-conda-20231122
so-conda-20231123
so-conda-py310-20231211
so-conda-py310-20231212
so-conda-py38-20231211
so-conda-py39-20231211

pmpm:
so-pmpm-py310-mkl-x86-64-v3-mpich-20231212
so-pmpm-py310-mkl-x86-64-v3-openmpi-20231212
so-pmpm-py310-mkl-x86-64-v4-mpich-20231212
so-pmpm-py310-mkl-x86-64-v4-openmpi-20231212
so-pmpm-py310-nomkl-x86-64-v3-mpich-20231212
so-pmpm-py310-nomkl-x86-64-v3-openmpi-20231212
so-pmpm-py310-nomkl-x86-64-v4-mpich-20231212
so-pmpm-py310-nomkl-x86-64-v4-openmpi-20231212

Setting up CVMFS on souk.ac.uk

C.f. #20. It requires additional approval from GridPP for our new VO souk.ac.uk. The original issue is named to track setting it up on North Grid.

passwordless and sudo power combined seems dangerous

@rwf14f, currently we use ssh key to authenticate and get into the login node. And we're granted sudo (wheel group) in #9. Consider the following scenario:

downloaded a script, containing

...
sudo do_something_harmful
...

and we run this script.

If there's password, then the password authentication needed will alarm the user something fishy may be going on. So the existence of password would provide an extra layer of protection.

VOMS proxy expiration time always set to 12 hours regardless vomslife

Hi, @rwf14f,

MWE, running on vm77:

❯ voms-proxy-init --voms souk.ac.uk --vomslife 10000:0; date 
Enter GRID pass phrase for this identity:
Contacting voms02.gridpp.ac.uk:15519 [/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=voms02.gridpp.ac.uk] "souk.ac.uk"...
Remote VOMS server contacted succesfully.

voms02.gridpp.ac.uk:15519: The validity of this VOMS AC in your proxy is shortened to 604800 seconds!

Created proxy in /tmp/x509up_u511.

Your proxy is valid until Tue Nov 07 11:00:29 GMT 2023
Mon Nov  6 23:00:30 GMT 2023

Explanation:

A very long vomslife is requested for it to shorten to max config value, which is revealed as 604800 seconds (1 week) here.
The date command is run immediately after that. From that we can deduce the time is actually only 43200 seconds (12 hours).

When running instead

❯ voms-proxy-init --voms souk.ac.uk --vomslife 1:0; date 
Enter GRID pass phrase for this identity:
Contacting voms02.gridpp.ac.uk:15519 [/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=voms02.gridpp.ac.uk] "souk.ac.uk"...
Remote VOMS server contacted succesfully.


Created proxy in /tmp/x509up_u511.

Your proxy is valid until Tue Nov 07 11:07:16 GMT 2023
Mon Nov  6 23:07:17 GMT 2023

i.e. Requesting for 1 hour also resulted in 12 hours.

some services require elevated permissions which is not permitted on our testbed vm77

E.g. @shaikhhasib needs to run PostgreSQL but we don't have sudo power on vm77. @shaikhhasib already asked @rwf14f if a VM with sudo power can be granted for us.

Illegal instruction

Migrated from Slack

People involved: @earosenberg

Erik Rosenberg
3:57 PM
Hi Kolen, new question: I am trying to run toast_ground_schedule on Blackett. I have a script to do this that seems to run fine on the interactive node but fails when I submit it as a job to the worker node. The error is on the toast_ground_schedule line with the error message "Illegal instruction". Is this something you've seen before? This is with your anaconda tarball btw

Kolen Cheung
8:34 PM
Oh, wow. I need a reproducible working example for this. I bet if you rerun it you might not be able to get the same error right away. My bet is that it lands on a node with very old processors that lacks some newer instructions? Probably related to AVX512.

8:35
If you still have the log, could you please send me the log? Specifically I want to have the address of the node. I want to see what that node is capable of.

New

Erik Rosenberg
8:47 PM
I did submit it a couple times with the same result. Here is the address line (I think) Transferring to host: <195.194.109.209:9618?addrs=195.194.109.209-9618+[2001-630-22-d0ff-5054-ff-fee9-c3d]-9618&alias=vm75.in.tier2.hep.manchester.ac.uk&noUDP&sock=slot1_4_1883_7e66_92777>

8:48
(the full log file and a bit more info here if helpful)
Gzip

preprocess_schedule_info.tar.gz

Setup global rc files for users

which includes env var, e.g. SCRATCH pointing to pre-production CephFS and later changed to production CephFS path.
document this
decide supported shells (bash, zsh?)

Non-interactive nodes has no HOME

Currently, interactive node has a HOME, pointing to the scratch directory. But non-interactive node, such as those sent to vanilla universe or parallel universe, has no HOME defined. Defining one manually seems to be overridden inside subprocesses (i.e. may be some system level rc scripts un-define it?).

This breaks some scripts that assumes the presence of HOME, e.g. mamba.

This then makes it difficult to submit jobs that worksas CI (Continuous Integration) or some other routine compilation work (recall that the login node cannot be used to compile things, as it does not have access to modules for example and has no gcc compiler.)

Having a equivalent of export HOME=$_CONDOR_SCRATCH_DIR is good enough for our purpose. It should not cause (any more) confusions as the interactive node is already like this.

Setting up CVMFS on northgrid.gridpp.ac.uk

@rwf14f, what are the required steps to setup CVMFS with souk.ac.uk? If there are blocking items, can we write to northgrid for now? What are the necessary steps to have write access? Thanks.

SSH Config Issue: Excessive Key Accumulation in SSH Agent with HTCondor Jobs Leading to Authentication Failures

Hi, @rwf14f,

I’m experiencing this issue for a few days: "Too many authentication failures” occurs when I requested an interactive node.

❯ cat example.ini
RequestMemory=32999
RequestCpus=16
queue
❯ condor_submit -i example.ini
Submitting job(s).
1 job(s) submitted to cluster 463.
Waiting for job to start...
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535

I’m not sure if this is related to testing sshd.sh in parallel universe recently in #12.

TOAST test failing with env. so-pmpm-py310-mkl-x86-64-v3-mpich-latest

See attached files: job seems to run but many tests seem to fail, including a bunch of assertion errors.
The loaded CVMFS environment is MPICH not OpenMPI.

mpi_singlenode.sh.txt
mpi_singlenode.ini.txt
mpi_singlenode.out.txt
mpi_singlenode.err.txt
mpi_singlenode.log.txt

Edit by @ickc: relevant page: https://docs.souk.ac.uk/en/latest/user/pipeline/3-MPI-applications/0-Vanilla-MPI/

Basic documentation on architecture of Blackett

Running multi-nodes applications at Blackett would benefits from some knowledge on the architecture of Blackett. E.g.

(compute capability) load-balancing w.r.t. heterogeneous nodes
(network capability) how the nodes are connected and if there are bottle necks, say, communications between multiple nodes will be choking between 2 network switches

I think something like this would be a example to follow: Architecture - NERSC Documentation (but obviously don't need to be as detailed.)

I think we can help with documentations. If information is passed to us then we can compile a documentation like that. Information needed are probably something like:

configuration of each node, e.g. N nodes has n-socket "MODEL" CPU with X amount of RAM. A raw table of hardware information is probably enough for us to compile a summary table.
network topology, e.g. how they are connected, bandwidth, etc. This would sheds some light on how many nodes we can launch an MPI application until the network cannot cope up.

@rwf14f, what do you think about this? Thanks.

Setting up ssh connections between worker nodes in parallel universe

@rwf14f, I'm trying to establish ssh connections between worker nodes in parallel universe, which is a prerequisite to bootstrap MIPCH.

The following is a Minimal Working Example (MWE) demonstrating the problem. The key is that it follows the example of mp1script which uses sshd.sh provided by HTCondor to bootstrap a "contact" file between the processes. And then a command performing ssh -p PORT -i KEY_FILE USER@HOSTNAME date to show the error Host key verification failed.. Different variations of this example is performed, including less descriptive variants such as ssh HOSTNAME date.

Is it a configuration issue at Blackett?

MWE:

In mpi.ini,

universe = parallel
executable = mp1script
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = ON_EXIT_OR_EVICT
request_cpus   = 2
request_memory = 1G
request_disk   = 10G

log = mpi.log
output = mpi-$(Node).out
error = mpi-$(Node).err
stream_error = True
stream_output = True

queue

In a modified mp1script,

#!/bin/sh

# modified from /usr/share/doc/condor-9.0.17/examples/mp1script

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS 

# If not the head node, just sleep forever, to let the
# sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
		wait
		sshd_cleanup
		exit 0
fi

export P4_RSHCOMMAND=$CONDOR_SSH

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

echo "Created the following contact file:"
cat $CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

export idkey
awk '{
    print "ssh -p " $3 " -i " ENVIRON["idkey"] " " $4 "@" $2 " date";
}' $CONDOR_CONTACT_FILE > ssh.bash
echo "Trying to reach other hosts via ssh..."
cat ssh.bash
bash ./ssh.bash

sshd_cleanup
rm -f machines

exit $?

Submitting the job:

condor_submit mpi.ini

which results in error:

Host key verification failed.
Host key verification failed.

condor_ssh_to_job resulted in interactive job ending immediately

If condor_ssh_to_job is used to ssh into an interactive job, it will terminates immediately.

MWE:

On vm77, in the 1st process,

❯ cat example.ini
RequestMemory = 32999
RequestCpus = 16
queue
❯ condor_submit -i example.ini
Submitting job(s).
1 job(s) submitted to cluster 1883.
Waiting for job to start...
Welcome to [email protected]!

Then in a 2nd process,

❯ condor_ssh_to_job 1883
Welcome to [email protected]!
Connection to condor-job.wn3806190.tier2.hep.manchester.ac.uk closed by remote host.
Connection to condor-job.wn3806190.tier2.hep.manchester.ac.uk closed.

Then immediately in the 1st process,

bash-4.2$ Connection to condor-job.wn3806190.tier2.hep.manchester.ac.uk closed by remote host.
Connection to condor-job.wn3806190.tier2.hep.manchester.ac.uk closed.

interactive node: `Failed to read address of starter for this job`

When requesting an interactive node, on some assigned host, this error will occur. MWE:

$ cat example.ini
RequestMemory = 32999
RequestCpus = 16
use_x509userproxy = True
queue
$ condor_submit -i example.ini
Submitting job(s).
1 job(s) submitted to cluster 2231.
Failed to read address of starter for this job
$ condor_q 2231 -long
RemoteHost = "[email protected]"
...

simonsobs-uk / data-centre Goto Github PK

data-centre's Introduction

SO:UK Data Centre

data-centre's People

Contributors

Stargazers

Watchers

Forkers

data-centre's Issues

Introduction and Summary

Project Objective

System Requirements

Login Process

Maintenance and Security

Recommend Projects

Recommend Topics

Recommend Org