
mpibind's Introduction

A Memory-Driven Mapping Algorithm for Heterogeneous Systems

mpibind is a memory-driven algorithm to map parallel hybrid applications to the underlying hardware resources transparently, efficiently, and portably. Unlike other mappings, its primary design point is the memory system, including the cache hierarchy. Compute elements are selected based on a memory mapping and not vice versa. In addition, mpibind embodies a global awareness of hybrid programming abstractions as well as heterogeneous systems with accelerators.

Getting started

The easiest way to get mpibind is with Spack:

spack install mpibind

# On systems with NVIDIA GPUs
spack install mpibind+cuda

# On systems with AMD GPUs
spack install mpibind+rocm

# More details
spack info mpibind

Alternatively, one can build the package manually as described below.

Building and installing

This project uses GNU Autotools.

$ ./bootstrap

$ ./configure --prefix=<install_dir>

$ make

$ make install

If building from a release tarball, please specify MPIBIND_VERSION appropriately. For example:

$ MPIBIND_VERSION=0.15.1 ./bootstrap

$ ./configure --prefix=<install_dir>

$ make

$ make install

The resulting library is <install_dir>/lib/libmpibind, and a simple program that uses it is src/main.c.

Test suite

$ make check

Dependencies

  • GNU Autotools is the build system.

  • hwloc version 2 is required to detect the machine topology.

    Before building mpibind, make sure hwloc can be detected with pkg-config:

    pkg-config --variable=libdir --modversion hwloc
    

    If this fails, add hwloc's pkg-config directory to PKG_CONFIG_PATH, e.g.,

    export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:<hwloc-prefix>/lib/pkgconfig
    
  • libtap is required to build the test suite.

    To verify that libtap can be detected with pkg-config, follow the same procedure as for hwloc above.

Contributing

Contributions for bug fixes and new features are welcome and follow the GitHub fork and pull model. Contributors develop on a branch of their personal fork and create pull requests to merge their changes into the main repository.

The steps are similar to those of the Flux framework:

  1. Fork mpibind.
  2. Clone your fork: git clone git@github.com:[username]/mpibind.git
  3. Create a topic branch for your changes: git checkout -b new_feature
  4. Create your feature or fix (and add tests if possible)
  5. Make sure everything still passes: make check
  6. Push the branch to your GitHub repo: git push origin new_feature
  7. Create a pull request against mpibind and describe what your changes do and why you think they should be merged. List any outstanding to-do items.

Authors

mpibind was created by Edgar A. León.

Citing mpibind

To reference mpibind, please cite one of the following papers:

  • Edgar A. León and Matthieu Hautreux. Achieving Transparency Mapping Parallel Applications: A Memory Hierarchy Affair. In International Symposium on Memory Systems, MEMSYS'18, Washington, DC, October 2018. ACM.

  • Edgar A. León. mpibind: A Memory-Centric Affinity Algorithm for Hybrid Applications. In International Symposium on Memory Systems, MEMSYS'17, Washington, DC, October 2017. ACM.

  • Edgar A. León, Ian Karlin, and Adam T. Moody. System Noise Revisited: Enabling Application Scalability and Reproducibility with SMT. In International Parallel & Distributed Processing Symposium, IPDPS'16, Chicago, IL, May 2016. IEEE.

Other references:

  • J. P. Dahm, D. F. Richards, A. Black, A. D. Bertsch, L. Grinberg, I. Karlin, S. Kokkila-Schumacher, E. A. León, R. Neely, R. Pankajakshan, and O. Pearce. Sierra Center of Excellence: Lessons learned. In IBM Journal of Research and Development, vol. 64, no. 3/4, May-July 2020.

  • Edgar A. León. Cross-Architecture Affinity of Supercomputers. In International Supercomputing Conference (Research Poster), ISC’19, Frankfurt, Germany, June 2019.

  • Edgar A. León. Mapping MPI+X Applications to Multi-GPU Architectures: A Performance-Portable Approach. In GPU Technology Conference, GTC'18, San Jose, CA, March 2018.

BibTeX file.

License

mpibind is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE and NOTICE for details.

SPDX-License-Identifier: MIT.

LLNL-CODE-812647.

mpibind's People

Contributors

eleon, gonsie, grondo, nickhrdy, xorjane


mpibind's Issues

add manual page for flux

Problem: A user noted that mpibind options are not documented outside of tutorial information.

It may be helpful for mpibind to provide a flux-mpibind(1) man page that contains

  • SYNOPSIS showing various flux run -o mpibind=... options
  • DESCRIPTION describing mpibind's role
  • SHELL OPTIONS section describing options in detail, following a similar pattern to flux-shell(1)
  • SEE ALSO referring to flux-shell(1), flux-run(1).

Test Suite for Python Bindings

Per our 8/13/2020 discussion, we would like to test the Python bindings by leveraging the work Nick has done in testing the C bindings of mpibind. To make this happen, the following needs to be done:

  • Add wrapper functions around some hwloc functions that return parameters for tests
  • Mimic Nick's test generation on the Python side using the wrapped hwloc functions
  • Write a function that translates the mpibind mapping on the Python side into the expected file format
  • Fix the Python TAP runner

mpibind triggers hwloc assertion error in hwloc_topology_restrict(3)

After the merge of #31, mpibind now triggers an assertion failure in hwloc_topology_restrict(3) under some circumstances.

For example, in the following scenario we have a Flux instance running at depth=2 with assigned OS cores 44-47 (and associated threads). Things seem to work as expected until we try to use all the cores:

$ flux mini run -o initrc=mpibind-flux.lua -o mpibind=verbose:2 -n1 --label-io grep Cpus_allowed_list /proc/self/status
0.060s: flux-shell[0]: mpibind: 
mpibind: task  0 nths  2 gpus 6,7,5,4 cpus 47,95
0: Cpus_allowed_list:	47,95
$ flux mini run -o initrc=mpibind-flux.lua -o mpibind=verbose:2 -n2 --label-io grep Cpus_allowed_list /proc/self/status
0.053s: flux-shell[0]: mpibind: 
mpibind: task  0 nths  1 gpus 6,7 cpus 46
mpibind: task  1 nths  1 gpus 5,4 cpus 47
1: Cpus_allowed_list:	47
0: Cpus_allowed_list:	46
$ flux mini run -o initrc=mpibind-flux.lua -o mpibind=verbose:2 -n3 --label-io grep Cpus_allowed_list /proc/self/status
0.054s: flux-shell[0]: mpibind: 
mpibind: task  0 nths  1 gpus 6,7 cpus 45
mpibind: task  1 nths  1 gpus 5 cpus 46
mpibind: task  2 nths  1 gpus 4 cpus 47
2: Cpus_allowed_list:	47
1: Cpus_allowed_list:	46
0: Cpus_allowed_list:	45
$ flux mini run -o initrc=mpibind-flux.lua -o mpibind=verbose:2 -n4 --label-io grep Cpus_allowed_list /proc/self/status
0.053s: flux-shell[0]: stderr: flux-shell: topology.c:4186: restrict_object_by_cpuset: Assertion `!hwloc_bitmap_intersects(obj->complete_nodeset, droppednodeset) || hwloc_bitmap_iszero(obj->complete_cpuset)' failed.
flux-job: task(s) Aborted

This assertion failure is triggered by hwloc_topology_restrict(3) called in the mpibind shell plugin to restrict the topology cpuset loaded from MPIBIND_TOPOFILE to the current cpu affinity mask of the process.
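
For reference, this is a minimal sketch (not the plugin's actual code) of that kind of restriction using hwloc 2 APIs: read the current CPU binding of the process and restrict the topology to it.

#include <hwloc.h>

/* Restrict a loaded topology to the calling process' current CPU
 * affinity mask; returns 0 on success. */
int restrict_to_current_affinity(hwloc_topology_t topo)
{
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    int rc = -1;

    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0)
        /* Drop objects outside the mask; remove objects left with no CPUs */
        rc = hwloc_topology_restrict(topo, set,
                                     HWLOC_RESTRICT_FLAG_REMOVE_CPULESS);

    hwloc_bitmap_free(set);
    return rc;
}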

If we unset MPIBIND_TOPOFILE in the environment so that the mpibind shell plugin skips the hwloc_topology_restrict(3) call (since the topology is already restricted when it is fetched from the job shell), then we hit the assertion failure in the hwloc_topology_restrict() call in mpibind.c instead:

if ( hwloc_topology_restrict(hdl->topo, set, flags) ) {

It is unclear at this point why the assertion is being triggered. Since this is an assert and not an error returned from libhwloc, this could be a libhwloc bug; however, we should be able to find some way to work around it.

Drop OpenCL in favor of RSMI for AMD GPUs

Use RSMI to detect AMD GPUs rather than OpenCL.

Initially, OpenCL was the only way to identify AMD GPUs, but the GPUs are now tagged as RSMI devices. Several systems have hwloc configured to recognize AMD GPUs as RSMI devices only; on these systems, mpibind would not recognize the GPUs.
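
As a rough illustration (not mpibind's actual detection code), hwloc 2 exposes these GPUs as OS devices named rsmi0, rsmi1, ..., which can be enumerated along these lines, assuming hwloc was built with RSMI support:

#include <hwloc.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t obj = NULL;

    hwloc_topology_init(&topo);
    /* Keep I/O objects such as GPUs in the topology */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load(topo);

    /* Walk OS devices and report GPUs exposed through the RSMI backend */
    while ((obj = hwloc_get_next_osdev(topo, obj)) != NULL)
        if (obj->attr->osdev.type == HWLOC_OBJ_OSDEV_GPU &&
            strncmp(obj->name, "rsmi", 4) == 0)
            printf("Found AMD GPU: %s\n", obj->name);

    hwloc_topology_destroy(topo);
    return 0;
}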

mpibind.c:936:20: error: argument 1 range [...] exceeds maximum object size

In Ubuntu 20.04, with GCC 9, I'm seeing the following error during compilation:

  CC       mpibind.lo
In function ‘distrib_greedy’,
    inlined from ‘mpibind_distrib’ at mpibind.c:1008:10:
mpibind.c:936:20: error: argument 1 range [18446744071562067968, 18446744073709551615] exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=]
  936 |   numas_per_task = calloc(ntasks, sizeof(int));
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~

This can be resolved by using size_t for ntasks. (I guess the implicit conversion of a signed int to the unsigned size_t parameter of calloc() is what is confusing GCC here.)

diff --git a/src/mpibind.c b/src/mpibind.c
index 881adec..506f3b5 100644
--- a/src/mpibind.c
+++ b/src/mpibind.c
@@ -913,7 +913,7 @@ void greedy_singleton(hwloc_topology_t topo,
  * associated with a single NUMA domain. 
  */ 
 static
-int distrib_greedy(hwloc_topology_t topo, int ntasks, int *nthreads_pt, 
+int distrib_greedy(hwloc_topology_t topo, size_t ntasks, int *nthreads_pt,
 		   hwloc_bitmap_t *cpus_pt, hwloc_bitmap_t *gpus_pt)
 {
   int i, task, num_numas; 
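
For context, here is a standalone snippet (not part of mpibind) illustrating where GCC's reported range comes from: when a possibly negative signed int is converted to size_t for calloc(), its lowest possible value becomes (size_t)INT_MIN, i.e., 18446744071562067968 on 64-bit systems, which is exactly the lower bound in the error message.

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int ntasks = INT_MIN;        /* lowest value GCC must assume for a signed int */
    size_t n = (size_t)ntasks;   /* the implicit conversion done by calloc(ntasks, ...) */
    printf("%zu\n", n);          /* prints 18446744071562067968 on LP64 systems */
    return 0;
}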

mpibind Flux plugin

Create a flux plugin for mpibind that will replace the affinity plugin in flux's scheduler.

mpibind at a job level proof of concept

Creating a tool that generates Flux jobspec using mpibind logic for resource allocation is a good first step for exploring possible avenues for applying mpibind at the job level. As a bonus artifact, we can create Python bindings for mpibind.

mpibind + "import torch" reduces CPU affinity

When using mpibind to assign procs to CPU cores, an import torch statement reduces the affinity to only use the first core of the assigned set.

For example, an import_test.py script:

import psutil
p = psutil.Process()
print("cpus before", p.cpu_affinity(), flush=True)
import torch
print("cpus after", p.cpu_affinity(), flush=True)

Run with:

flux alloc -N 1 -q pdebug
flux run --label-io -o mpibind=verbose -N 1 -n 2 -c 8 -g 1 -x python import_test.py

Leads to the following output:

mpibind: task  0 nths 64 gpus 4,5,2,3 cpus 0-31,64-95
mpibind: task  1 nths 64 gpus 6,7,0,1 cpus 32-63,96-127
0: cpus before [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
1: cpus before [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127]
0: cpus after [0]
1: cpus after [32]

This does not happen with -o mpibind=off.

flux run --label-io -o mpibind=off -N 1 -n 2 -c 8 -g 1 -x python import_test.py

From there, I found that this also does not happen if one unsets both OMP_PLACES and OMP_PROC_BIND before importing torch:

import os
del os.environ['OMP_PLACES']
del os.environ['OMP_PROC_BIND']

import psutil
p = psutil.Process()
print("cpus before", p.cpu_affinity(), flush=True)
import torch
print("cpus after", p.cpu_affinity(), flush=True)

Having either variable set is enough to impact affinity.

As a workaround, one can also force the affinity back by capturing it before import torch and restoring it after:

import psutil
p = psutil.Process()
cpus = p.cpu_affinity()
import torch
p.cpu_affinity(cpus)

I'm not sure what package within import torch processes those variables. I suspect there may be a way to avoid that in PyTorch.

Is it possible to avoid setting these OMP variables in mpibind?

add option to disable GPU binding on Flux

A user requested a way to tell mpibind not to bind GPUs, e.g., not to set ROCR_VISIBLE_DEVICES.

I see there is a gpu:0 option for Slurm. Maybe this could be added for Flux? Or maybe there is already a way to do this and I am not seeing it.

nthreads instead of in_nthreads when checking input parameters in mpibind()

In the mpibind() function (mpibind.c:1285), when checking input parameters, hdl->nthreads is checked in the conditional instead of hdl->in_nthreads:

  /* Input parameters check */
  if (hdl->ntasks <= 0 || hdl->nthreads < 0) {
    fprintf(stderr, "Error: ntasks %d or nthreads %d out of range\n",
	    hdl->ntasks, hdl->in_nthreads);
    return 1;
  }
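
A minimal sketch of the presumably intended check, using hdl->in_nthreads consistently in both the conditional and the message:

  /* Input parameters check (sketch of the suggested fix) */
  if (hdl->ntasks <= 0 || hdl->in_nthreads < 0) {
    fprintf(stderr, "Error: ntasks %d or nthreads %d out of range\n",
            hdl->ntasks, hdl->in_nthreads);
    return 1;
  }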

Errors when using make because of -Werror

@eleon I pulled the autotools changes and I'm getting these errors since -Werror is specified:

[hardy]$ make
Making all in src
make[1]: Entering directory `~/mpibind/src'
  CC       mpibind.lo
mpibind.c: In function 'mpibind_distrib':
mpibind.c:663:7: error: 'obj' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       hwloc_obj_type_snprintf(str, sizeof(str), obj, 1); 
       ^
mpibind.c:609:15: note: 'obj' was declared here
   hwloc_obj_t obj;
               ^
mpibind.c:788:12: error: 'io_numa_os_ids' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       if (!hwloc_bitmap_isset(io_numa_os_ids, obj->os_index))
            ^
mpibind.c:754:18: note: 'io_numa_os_ids' was declared here
   hwloc_bitmap_t io_numa_os_ids; 
                  ^
cc1: all warnings being treated as errors
make[1]: *** [mpibind.lo] Error 1
make[1]: Leaving directory `~/mpibind/src'
make: *** [all-recursive] Error 1

This happens when I build locally and on corona. Do you get the same errors when you try to make?

Set environment variable when mpibind is on

Feature request: programs may benefit from knowing whether mpibind has been applied. For example, before setting affinity themselves, they could check whether mpibind has already bound their workers and skip their own binding; when mpibind is not on, they could apply affinity as usual.

When mpibind is called, it could set an environment variable such as MPIBIND_IS_ON=1. Programs could then check whether this variable is set in the environment.
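
A hypothetical consumer-side check (MPIBIND_IS_ON is only a proposed variable, not an existing mpibind feature) could look like this:

#include <stdlib.h>
#include <string.h>

/* Returns nonzero if the proposed MPIBIND_IS_ON variable signals that
 * mpibind has already bound this process. */
static int mpibind_already_applied(void)
{
    const char *v = getenv("MPIBIND_IS_ON");
    return v != NULL && strcmp(v, "1") == 0;
}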

Tagging @benson31 for awareness.

Environment variable option to disable mpibind

In the case of Flux, we don't want Flux processes to be bound by mpibind when launched under Slurm. Instead, we only want mpibind to be activated once Flux has started. In parts of Flux's documentation, we currently recommend users launch Flux under Slurm with srun --mpibind=off flux start, but this only applies to LLNL clusters (open issue).

An alternative would be to suggest that users run MPIBIND=off srun flux start or set MPIBIND=off in an Lmod file, but this requires mpibind to support options via environment variables. Once Flux is started, we could unset the MPIBIND env var when launching jobs (reactivating mpibind via the Flux mpibind job shell plugin is also an option).
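
As a sketch of the proposed behavior (neither the MPIBIND variable nor this helper exists in mpibind today), the library or plugin could check the environment and disable itself when the variable is set to off:

#include <stdlib.h>
#include <strings.h>

/* Returns nonzero if the proposed MPIBIND environment variable asks
 * mpibind to stay out of the way. */
static int mpibind_disabled_by_env(void)
{
    const char *v = getenv("MPIBIND");
    return v != NULL && strcasecmp(v, "off") == 0;
}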

Distribute Python Bindings

It would be awesome if users could install and use the Python bindings.

Python setuptools

This was my first choice because I was familiar with using pip to configure Python virtual environments. Unfortunately, when using wheel to build a .whl distribution, the C extension module that relies on libmpibind does not get its rpath configured correctly. setuptools does not use libtool to invoke the linker, so the resulting extension module behaves differently from the extension module generated by our Autotools setup.

Here are some steps that we could take in this direction:

  • explore the possibility of linking with the static libmpibind.a
  • It seems like the creators of setuptools are aware of this issue, and their auditwheel tool provides functionality for injecting a shared object into the wheel distribution. The auditwheel package is not installed on LC systems, so that would require some additional configuration.

Spack

I have not looked into this too much yet, but it seems like it could be a good alternative to setuptools.
