portals4's People

Contributors

bwbarrett, kevin-pedretti, mjleven, regrant, rpears0n, swelch

portals4's Issues

Counting events delivered into overflow list broken

test_*_*_put_overflow_ct, committed into the test harness in r224, fails.  The 
test creates a list entry in the overflow list which disables unexpected 
headers but counts message delivery.  The counting event will never be 
incremented in this case, causing the test to abort after a timeout (if the 
timeout is made infinite, the test will hang forever).


Original issue reported on code.google.com by [email protected] on 9 Jul 2013 at 7:12

compilation error (minor)

ptl_conn.c:93 calls pthread_cond_destroy(&conn->move_wait) unconditionally.

The call is missing an #if WITH_TRANSPORT_IB || WITH_TRANSPORT_UDP guard.
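
A minimal sketch of the guard being asked for, assuming move_wait is only declared when one of the two transports is compiled in. Only WITH_TRANSPORT_IB, WITH_TRANSPORT_UDP and move_wait come from the report; the struct layout and the conn_sketch_* names are hypothetical stand-ins for the real ptl_conn.c code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Assumption: configure normally sets these; defaults here let the sketch build. */
#ifndef WITH_TRANSPORT_IB
#define WITH_TRANSPORT_IB 1
#endif
#ifndef WITH_TRANSPORT_UDP
#define WITH_TRANSPORT_UDP 0
#endif

/* Hypothetical, cut-down connection object. */
struct conn_sketch {
    pthread_mutex_t mutex;
#if WITH_TRANSPORT_IB || WITH_TRANSPORT_UDP
    pthread_cond_t move_wait;   /* only present in IB/UDP builds */
#endif
};

static int conn_sketch_init(struct conn_sketch *conn)
{
    assert(conn != NULL);
    int err = pthread_mutex_init(&conn->mutex, NULL);
    if (err)
        return err;
#if WITH_TRANSPORT_IB || WITH_TRANSPORT_UDP
    err = pthread_cond_init(&conn->move_wait, NULL);
    if (err) {
        pthread_mutex_destroy(&conn->mutex);
        return err;
    }
#endif
    return 0;
}

static int conn_sketch_fini(struct conn_sketch *conn)
{
    assert(conn != NULL);
    int err = 0;
#if WITH_TRANSPORT_IB || WITH_TRANSPORT_UDP
    /* Same guard as the member declaration, so other builds still compile. */
    err = pthread_cond_destroy(&conn->move_wait);
#endif
    int err2 = pthread_mutex_destroy(&conn->mutex);
    return err ? err : err2;
}
```

Keeping the guard on the declaration, the init and the destroy identical is the point; a build without either transport then never references the member.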


Original issue reported on code.google.com by [email protected] on 15 Oct 2012 at 1:55

missing PTL_FASTLOCK_UNLOCK

Following issue 34 resolution, the function tgt_get_match can now return 
STATE_TGT_DROP without unlocking pt->lock (ptl_tgt.c:686)
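
One way to rule out this class of bug is a single exit path that always releases the lock. The sketch below is an illustration only: it uses a pthread mutex as a stand-in for the fastlock, and the struct, enum and function names are hypothetical, not the code in ptl_tgt.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the target state machine values. */
enum tgt_state_sketch { SKETCH_TGT_DROP, SKETCH_TGT_GET_LENGTH };

struct pt_sketch {
    pthread_mutex_t lock;   /* stand-in for the fastlock in pt->lock */
    int enabled;
};

static enum tgt_state_sketch tgt_get_match_sketch(struct pt_sketch *pt)
{
    assert(pt != NULL);
    enum tgt_state_sketch next;

    pthread_mutex_lock(&pt->lock);
    if (!pt->enabled) {
        next = SKETCH_TGT_DROP;     /* early decision, but no early return */
        goto out;
    }
    next = SKETCH_TGT_GET_LENGTH;
out:
    /* Every path, including the drop path, unlocks here. */
    pthread_mutex_unlock(&pt->lock);
    return next;
}
```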

Original issue reported on code.google.com by [email protected] on 2 May 2013 at 8:05

Stress test of PUT/GET hangs inside the portals library

We have a test case where each thread writes into every other thread's shared 
space. On 8 threads this test occasionally hangs. If I convert the test into the 
strict version (wait for completion of every put/get), it runs without hangs 
(tried a loop with 100 runs). 

I created a simplified C version of the test that shows this behavior and will 
send it separately.  

On our IB system, which has 4 nodes, this test runs with 8 threads. But 
the run is not that smooth (based on the printfs). I believe that it would hang 
occasionally if I disabled the printfs in the code.

However, with 9 or more threads it just hangs. I think this is related to the 
fact that with 9 threads we have 3 of them running on the same node (1). I took 
a backtrace of all of the threads and they are spinning here:

#1  0x00007ffff7ddc7d5 in rdma_send_message (buf=0x7ffff40e2540, 
from_init=<optimized out>)
    at ../../../../p4-ref/src/ib/ptl_rdma.c:129
129                 pthread_yield();
(gdb) l
124         /* If the high water mark is reached, wait until we go back to
125          * the low watermark (=1/2 high WM). */
126         if (atomic_read(&buf->conn->rdma.num_req_posted) >= limit) {
127             limit /= 2;
128             while (atomic_read(&buf->conn->rdma.num_req_posted) >= limit) {
129                 pthread_yield();
130                 SPINLOCK_BODY();
131             }
132         }

Is it possible that we are running out of some system resource and not 
recovering?
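
For debugging, the same high/low watermark wait can be given a spin budget so that a stall is reported instead of spinning forever. This is a sketch under assumptions, not the ptl_rdma.c code: the function name and the max_spins parameter are made up, C11 atomics stand in for atomic_read(), and sched_yield() replaces the non-standard pthread_yield().

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Returns 0 when posting may continue, or -1 if the posted count was still
 * at or above the low watermark after max_spins checks. */
static int wait_below_low_watermark(atomic_int *num_posted, int high_wm,
                                    long max_spins)
{
    assert(num_posted != NULL && high_wm > 0);

    if (atomic_load(num_posted) < high_wm)
        return 0;                   /* high watermark not reached: no wait */

    int low_wm = high_wm / 2;       /* low watermark = 1/2 high watermark */
    for (long i = 0; i < max_spins; i++) {
        if (atomic_load(num_posted) < low_wm)
            return 0;               /* completions drained the queue */
        sched_yield();
    }
    return -1;                      /* nothing completed: report the stall */
}
```

A -1 return in a reproducer would point at completions never being reaped, as opposed to the loop merely being slow.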




Original issue reported on code.google.com by [email protected] on 30 Sep 2012 at 6:24

printf format warnings from ptl_udp.c

With svn r2173 and gcc version "4.1.2 20080704 (Red Hat 4.1.2-44)" I see the 
following warnings:

  CC     libportals_ib_la-ptl_udp.lo
../../../../source/ptl4-trunk/src/ib/ptl_udp.c: In function 'udp_send':
../../../../source/ptl4-trunk/src/ib/ptl_udp.c:401: warning: format '%llu' 
expects type 'long long unsigned int', but argument 7 has type 'uint64_t'
../../../../source/ptl4-trunk/src/ib/ptl_udp.c:539: warning: format '%i' 
expects type 'int', but argument 6 has type 'size_t'
../../../../source/ptl4-trunk/src/ib/ptl_udp.c:570: warning: format '%i' 
expects type 'int', but argument 6 has type 'size_t'
../../../../source/ptl4-trunk/src/ib/ptl_udp.c: In function 'udp_receive':
../../../../source/ptl4-trunk/src/ib/ptl_udp.c:889: warning: format '%llu' 
expects type 'long long unsigned int', but argument 7 has type 'uint64_t'

re lines 401 and 889:
 options include use of the PRIu64 macro, or an explicit cast
re lines 539 and 570: 
 options include use of %zi, or an explicit cast

If the widths of the variables being printed may vary (perhaps why I see these 
on this platform and not on another), then the explicit casts are the safest 
approach.
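
A small sketch of both options, independent of ptl_udp.c (the function names and the message text are made up for illustration). Note that it uses %zu rather than %zi, since size_t is unsigned.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Option 1: format macros that follow the real width of each type. */
static int format_with_macros(char *out, size_t out_len,
                              uint64_t msg_len, size_t frag_len)
{
    assert(out != NULL);
    return snprintf(out, out_len, "len=%" PRIu64 " frag=%zu",
                    msg_len, frag_len);
}

/* Option 2: cast to a type whose conversion specifier never varies. */
static int format_with_casts(char *out, size_t out_len,
                             uint64_t msg_len, size_t frag_len)
{
    assert(out != NULL);
    return snprintf(out, out_len, "len=%llu frag=%llu",
                    (unsigned long long)msg_len,
                    (unsigned long long)frag_len);
}

/* Returns 1 when both routes produce the same text for the given values. */
static int formats_agree(uint64_t msg_len, size_t frag_len)
{
    char a[64], b[64];
    format_with_macros(a, sizeof a, msg_len, frag_len);
    format_with_casts(b, sizeof b, msg_len, frag_len);
    return strcmp(a, b) == 0;
}
```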

-Paul

Original issue reported on code.google.com by [email protected] on 13 Apr 2013 at 4:24

Flow control tests fail on dual core platforms

On small core count platforms, there are random failures in the flow control 
tests (test_flowctl_noeq) due to the reference implementation returning the 
remote PT_DISABLED error code in both SEND and ACK events.

Original issue reported on code.google.com by [email protected] on 20 Nov 2012 at 6:09

PTL_EVENT_PT_DISABLED not delivered when unexpected list full

Once the unexpected list is full, subsequent operations are dropped and the 
ni_fail_type of ack/reply events is set to PTL_NI_PT_DISABLED as expected. 
However, PTL_EVENT_PT_DISABLED is not delivered on the target. This works fine 
with other flow control triggering conditions.

Original issue reported on code.google.com by [email protected] on 12 Jun 2013 at 9:46

Shared memory won't pass physical addressing tests

What steps will reproduce the problem?
1. Configure with shared memory enabled
2. Make
3. Make check

What is the expected output? What do you see instead?
Tests Pass; tests segfault instead.

Segfault in get_transport_buf for rank 0.

Tested on both MacOSX 10.7.4 and Linux 2.6.32.


Original issue reported on code.google.com by [email protected] on 27 Feb 2013 at 6:52

make check THREADS=4 when > 1 rank per remote node.

What steps will reproduce the problem?
cd Portals4-svn.2129

B=/usr/mpi/gcc/openmpi-1.6

./configure --with-implementation=ib --enable-ib-shmem CFLAGS="-ggdb -Wall -O2 
-I$B/include/" LDFLAGS="-L$B/lib64/"


make check THREADS=4    # localhost, PASSes all tests.

Modify test/Makefile to use localhost (ib0) IB-interface and remote host 'ib1' 
Infiniband interface

TESTS_ENVIRONMENT = $(top_builddir)/src/runtime/hydra/yod.hydra -hosts ib0,ib1 
-np $(THREADS)

make check THREADS=2
PASSes all tests.

make check THREADS=4    #hangs in various tests.

<...>
PASS: test_ME_put_multiple_large_overlap
^CCtrl-C caught... cleaning up processes
FAIL: test_LE_get
^CCtrl-C caught... cleaning up processes
FAIL: test_ME_get
PASS: test_LE_atomic
PASS: test_ME_atomic
^CCtrl-C caught... cleaning up processes
FAIL: test_LE_fetchatomic
^CCtrl-C caught... cleaning up processes
FAIL: test_ME_fetchatomic
^CCtrl-C caught... cleaning up processes
FAIL: test_LE_swap
^CCtrl-C caught... cleaning up processes
FAIL: test_ME_swap
^CCtrl-C caught... cleaning up processes
FAIL: test_event
PASS: test_LE_put_truncate
PASS: test_ME_put_truncate
^Z
[1]+  Stopped                 make check THREADS=4

kill %1

make check THREADS=8 # same failure pattern

What is the expected output? What do you see instead?
PASS: test_LE_get
PASS: test_ME_get

What version of the product are you using? On what operating system?
Portals4 svn.2129 on RHEL 6.3 uname -r '2.6.32-279.19.1.el6.x86_64'



Original issue reported on code.google.com by [email protected] on 8 Feb 2013 at 10:10

max_unexpected_headers limit ignored

A high number of operations on overflow list, exceeding the aforementioned 
limit, will cause a crash instead of incrementing the drop count register.

Original issue reported on code.google.com by [email protected] on 26 Apr 2013 at 2:03

Unlink / in_use race

Occasionally, the test_ME_unlink test will fail because PtlMEUnlink returns 
PTL_IN_USE even when the list entry is no longer in use (all expected events 
have been delivered).  Retrying the unlink causes it to succeed, but that's 
less than ideal.
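
Until the race is fixed, a test could bound the retry instead of looping indefinitely. The sketch below shows that workaround only; it does not address the race itself. So that it builds without portals4.h, the return codes and the callback type are stand-ins (assumptions), and fake_unlink exists purely to exercise the helper.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Stand-ins for PTL_OK / PTL_IN_USE (assumed, not the real values). */
enum { SKETCH_OK = 0, SKETCH_IN_USE = 1 };

/* Hypothetical wrapper type around the real unlink call. */
typedef int (*unlink_fn)(void *handle);

/* Retries while the entry is reported in use, at most max_attempts times. */
static int unlink_with_retry(unlink_fn do_unlink, void *handle,
                             int max_attempts)
{
    assert(do_unlink != NULL && max_attempts > 0);
    int rc = SKETCH_IN_USE;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        rc = do_unlink(handle);
        if (rc != SKETCH_IN_USE)
            break;                  /* success or a different error */
        sched_yield();              /* let the progress thread catch up */
    }
    return rc;
}

/* Fake unlink: reports "in use" until the counter behind handle drains. */
static int fake_unlink(void *handle)
{
    int *busy_calls = handle;
    if (*busy_calls > 0) {
        (*busy_calls)--;
        return SKETCH_IN_USE;
    }
    return SKETCH_OK;
}
```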


Original issue reported on code.google.com by [email protected] on 15 Apr 2013 at 1:31

3 of 95 tests failed in test/sfw/test_n2/test-suite.log

What steps will reproduce the problem?
1. checkout r2235
2. make
3. make check

What is the expected output? What do you see instead?
Expected to see all tests pass.  I see 
...
FAIL: test_events_logical-021.xml
...
FAIL: test_events_logical-027.xml
...
FAIL: test_rdma_move_logical-026.xml
================================================
3 of 95 tests failed
See test/sfw/test_n2/test-suite.log
Please report to [email protected]
================================================

=====================================================
   portals4 1.0a1: test/sfw/test_n2/test-suite.log
=====================================================

3 of 95 tests failed.

.. contents:: :depth: 2


FAIL: test_events_logical-021.xml (exit: 1)
===========================================

found event type: 255, expecting: 0
        Total Errors 3


FAIL: test_events_logical-027.xml (exit: 1)
===========================================

found event type: 255, expecting: 8
        Total Errors 3


FAIL: test_rdma_move_logical-026.xml (exit: 1)
==============================================

found event type: 255, expecting: 2
        Total Errors 3



What version of the product are you using? On what operating system?
Using r2235 on Linux workstation running RHEL5 using gcc 4.1.2.  Configured 
using
./configure --prefix=$HOME/portals4-read-only-install --with-implementation=ib 
--disable-ib-ib --enable-ib-shmem --with-ev=$HOME/libev-4.15


Please provide any additional information below.

This might be related to the seg fault I was getting in my Global Arrays code.  
Here's the valgrind output:
==29747== Memcheck, a memory error detector
==29747== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==29747== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==29747== Command: ./testing/test
==29747== Parent PID: 29744
==29747==
==29747== Invalid read of size 8
==29747==    at 0x4C22A20: enqueue (ptl_sync.h:56)
==29747==    by 0x4C23A84: shmem_send_message (ptl_shmem.c:30)
==29747==    by 0x4C1430D: process_init (ptl_init.c:481)
==29747==    by 0x4C19DA8: PtlPut (ptl_move.c:139)
==29747==    by 0x40A65A: comex_puts (comex.c:1486)
==29747==    by 0x4055E9: test_dim (test.c:515)
==29747==    by 0x4057F1: main (test.c:1626)
==29747==  Address 0x1a4777a8e888 is not stack'd, malloc'd or (recently) free'd
==29747==
==29747==
==29747== Process terminating with default action of signal 11 (SIGSEGV)
==29747==  Access not within mapped region at address 0x1A4777A8E888
==29747==    at 0x4C22A20: enqueue (ptl_sync.h:56)
==29747==    by 0x4C23A84: shmem_send_message (ptl_shmem.c:30)
==29747==    by 0x4C1430D: process_init (ptl_init.c:481)
==29747==    by 0x4C19DA8: PtlPut (ptl_move.c:139)
==29747==    by 0x40A65A: comex_puts (comex.c:1486)
==29747==    by 0x4055E9: test_dim (test.c:515)
==29747==    by 0x4057F1: main (test.c:1626)
==29747==  If you believe this happened as a result of a stack
==29747==  overflow in your program's main thread (unlikely but
==29747==  possible), you can try to increase the size of the
==29747==  main thread stack using the --main-stacksize= flag.
==29747==  The main thread stack size used in this run was 10485760.
==29747==
==29747== HEAP SUMMARY:
==29747==     in use at exit: 2,403,300 bytes in 100 blocks
==29747==   total heap usage: 63,191 allocs, 63,091 frees, 111,330,807 bytes 
allocated
==29747==
==29747== LEAK SUMMARY:
==29747==    definitely lost: 0 bytes in 0 blocks
==29747==    indirectly lost: 0 bytes in 0 blocks
==29747==      possibly lost: 48,576 bytes in 3 blocks
==29747==    still reachable: 2,354,724 bytes in 97 blocks
==29747==         suppressed: 0 bytes in 0 blocks
==29747== Rerun with --leak-check=full to see details of leaked memory
==29747==
==29747== For counts of detected and suppressed errors, rerun with: -v
==29747== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)

Original issue reported on code.google.com by [email protected] on 26 Jul 2013 at 2:55

UDP test_flowctl tests hang

test_flowctl_nohdr and test_flowctl_noeq can hang when using UDP.

This is caused by running out of send/recv buffer space.


Original issue reported on code.google.com by [email protected] on 22 Jul 2013 at 4:57

Missing Portals4.h

What steps will reproduce the problem?
1. Configure with no options
2. Type make

What is the expected output? What do you see instead?
Expected a successful compile. Instead I'm getting:
make[2]: Entering directory `/home/knusbau2/sandbox/portals/reg-build/test'
  CC     support.lo
../../svn/test/support.c:26:22: fatal error: portals4.h: No such file or 
directory
compilation terminated.
make[2]: *** [support.lo] Error 1
make[2]: Leaving directory `/home/knusbau2/sandbox/portals/reg-build/test'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/knusbau2/sandbox/portals/reg-build/test'
make: *** [all-recursive] Error 1


What version of the product are you using? On what operating system?
I'm using the latest pull from the svn repo. uname -a
Linux taub144 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 11:13:45 CDT 2011 
x86_64 x86_64 x86_64 GNU/Linux


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 7 Mar 2013 at 8:35

max_volatile_size set wrong on IB implementation

If a max_volatile_size is requested greater than 512 bytes, the actual will be 
the requested value.  However, if an MD with volatile set is used in PtlPut, a 
PTL_ARG_INVALID is returned.  It looks like the problem is the last check in 
check_put, where a check against INLINE is made.  I can't quite figure out the 
relationship between max volatile and inline size, but there clearly is one 
that isn't being accounted for in the actuals returned from PtlNIInit.


Original issue reported on code.google.com by [email protected] on 8 May 2012 at 6:53

Occasionally missing CT updates with SHMEM

It appears that r1828 introduced a regression in the handling of CTs.  With the 
SHMEM project, I'm seeing a CT which counts ACKs to a persistent MD not being 
updated appropriately (acks go missing).  The bcast benchmark in SHMEM works in 
r1827 and breaks in r1828, hanging while waiting for an ACK to arrive.

Original issue reported on code.google.com by [email protected] on 8 May 2012 at 10:03

Errors with log activated

Some examples fail with "UNLINK(entry_h) returned PTL_IN_USE (line 174)" when 
logging is enabled.


Example log:

$ PTL_DEBUG=1 PTL_LOG_LEVEL=5 yod -n 2 ./test/basic/test_ME_unlink && echo $?
info  init_rdma(ptl_iface_ib.c:398): setting ni->id.phys.nid = a010104
info  init_rdma(ptl_iface_ib.c:398): setting ni->id.phys.nid = a010104
info  create_tables(ptl_ni.c:207): mapping table: 
info  create_tables(ptl_ni.c:207): mapping table: 
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = start
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = 
prepare_req
info  prepare_req(ptl_init.c:237): request uses physical: 0 or logical 
addressing: 1 
info  prepare_req(ptl_init.c:258): conn type: 1 
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = 
wait_conn
info  rdma_init_connect(ptl_rdma.c:31): Initiate connect with 401010a:29642
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = start
info  rdma_init_connect(ptl_rdma.c:60): Connection initiated successfully to 
401010a:29642
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = 
prepare_req
info  prepare_req(ptl_init.c:237): request uses physical: 0 or logical 
addressing: 1 
info  prepare_req(ptl_init.c:258): conn type: 1 
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = 
wait_conn
info  rdma_init_connect(ptl_rdma.c:31): Initiate connect with 401010a:29386
info  rdma_init_connect(ptl_rdma.c:60): Connection initiated successfully to 
401010a:29386
info  process_cm_event(ptl_conn.c:678): Rank got CM event 0 for id 0x775b30
info  process_cm_event(ptl_conn.c:678): Rank got CM event 0 for id 0x22036a0
info  process_cm_event(ptl_conn.c:678): Rank got CM event 2 for id 0x775b30
info  process_cm_event(ptl_conn.c:678): Rank got CM event 2 for id 0x22036a0
info  process_cm_event(ptl_conn.c:678): Rank got CM event 4 for id 
0x7fbe3400cce0
info  process_cm_event(ptl_conn.c:678): Rank got CM event 4 for id 
0x7f384c00cce0
info  process_cm_event(ptl_conn.c:678): Rank got CM event 8 for id 0x775b30
info  process_cm_event(ptl_conn.c:678): Rank got CM event 9 for id 0x22036a0
info  process_cm_event(ptl_conn.c:678): Rank got CM event 9 for id 
0x7fbe3400cce0
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = 
send_req
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = 
wait_comp
info  process_recv_rdma(ptl_recv.c:464): tid:7f385331a700 buf:0x7f3853332340: 
state = send_comp
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = 
wait_comp
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = 
early_send_event
info  process_init(ptl_init.c:1105): [19582]0x7f3853332340: init state = cleanup
info  process_recv_rdma(ptl_recv.c:464): tid:7fbe39ee4700 buf:0x7fbe3d588300: 
state = recv_packet_rdma
info  process_recv_rdma(ptl_recv.c:464): tid:7fbe39ee4700 buf:0x7fbe3d588300: 
state = recv_packet
info  process_recv_rdma(ptl_recv.c:464): tid:7fbe39ee4700 buf:0x7fbe3d588300: 
state = recv_req
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_start event 
mask: 0
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_get_match 
event mask: 0
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_get_length 
event mask: 16384
info  init_local_offset(ptl_tgt.c:357): buf start determined to be: 0x775b18 
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_data event 
mask: 16384
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_data_in 
event mask: 16384
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_comm_event 
event mask: 16384
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_cleanup 
event mask: 0
info  process_tgt(ptl_tgt.c:1985): 0x7fbe3d588300: tgt state = tgt_cleanup_2 
event mask: 0
info  process_recv_rdma(ptl_recv.c:464): tid:7fbe39ee4700 buf:0x7fbe3d588300: 
state = recv_repost
info  process_recv_rdma(ptl_recv.c:464): tid:7fbe39ee4700 buf:0x7fbe3d588300: 
state = recv_done
info  process_recv_rdma(ptl_recv.c:464): tid:7f385331a700 buf:0x7f3853332340: 
state = recv_done
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = 
send_req
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = 
wait_comp
info  process_recv_rdma(ptl_recv.c:464): tid:7fbe39ee4700 buf:0x7fbe39efc340: 
state = send_comp
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = 
wait_comp
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = 
early_send_event
info  process_recv_rdma(ptl_recv.c:464): tid:7f385331a700 buf:0x7f38569be300: 
state = recv_packet_rdma
info  process_recv_rdma(ptl_recv.c:464): tid:7f385331a700 buf:0x7f38569be300: 
state = recv_packet
info  process_recv_rdma(ptl_recv.c:464): tid:7f385331a700 buf:0x7f38569be300: 
state = recv_req
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_start event 
mask: 0
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_get_match 
event mask: 0
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_get_length 
event mask: 16384
info  init_local_offset(ptl_tgt.c:357): buf start determined to be: 0x21dce08 
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_data event 
mask: 16384
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_data_in 
event mask: 16384
info  process_init(ptl_init.c:1105): [19581]0x7fbe39efc340: init state = cleanup
info  process_recv_rdma(ptl_recv.c:464): tid:7fbe39ee4700 buf:0x7fbe39efc340: 
state = recv_done
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_comm_event 
event mask: 16384
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_cleanup 
event mask: 0
info  process_tgt(ptl_tgt.c:1985): 0x7f38569be300: tgt state = tgt_cleanup_2 
event mask: 0
info  process_recv_rdma(ptl_recv.c:464): tid:7f385331a700 buf:0x7f38569be300: 
state = recv_repost
info  process_recv_rdma(ptl_recv.c:464): tid:7f385331a700 buf:0x7f38569be300: 
state = recv_done
=> UNLINK(entry_h) returned PTL_IN_USE (line 174)

Original issue reported on code.google.com by [email protected] on 25 Apr 2013 at 11:57

Triggered ME ops need to be implemented for PPE

Plan is to implement triggered ME ops for PPE,
always compile triggered ops (remove config option),
require a special define (e.g., P4_EXPERIMENTAL) to access the triggered ops to 
make it clear that they aren't part of the normal p4 spec.

Original issue reported on code.google.com by [email protected] on 28 Feb 2013 at 9:11

PTL_EVENT_FETCH_ATOMIC unused

Fetch atomic operations deliver PTL_EVENT_ATOMIC events instead of 
PTL_EVENT_FETCH_ATOMIC ones. Also, these are not mentioned in the specification 
(4.0.1) when disabling communication events (PTL_LE/ME_EVENT_COMM_DISABLE).

Original issue reported on code.google.com by [email protected] on 16 May 2013 at 8:56

cannot specify PTL_SIZE_MAX for ptl_ni_limits_t

For PtlNIInit(...), if you specify PTL_SIZE_MAX for any of the ptl_ni_limits_t 
members of type ptl_size_t (e.g. max_msg_size), the actual values 
returned by the reference implementation are set to the minimums.  Looking at 
src/ib/ptl_ni.c, the static function set_limits(...) 
calls chk_param(int,long).  Since the requested value is passed as an unsigned 
64bit type (ptl_size_t) it is truncated to 0 when automatically cast to a long.
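
PTL_SIZE_MAX does not fit in a long, so the result of that conversion is implementation-defined. A range check that stays in an unsigned 64-bit type avoids the conversion altogether. The helper below is a hypothetical sketch, not the chk_param() from ptl_ni.c, and the bounds used with it are made-up numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a requested limit into [min, max] without ever narrowing it to a
 * signed type, so a request of UINT64_MAX maps to max rather than min. */
static uint64_t clamp_limit_u64(uint64_t requested, uint64_t min, uint64_t max)
{
    assert(min <= max);
    if (requested < min)
        return min;
    if (requested > max)
        return max;
    return requested;
}
```

With this shape, a request of UINT64_MAX against bounds of 64 and 1048576 yields 1048576, the largest supported value.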

Source code to reproduce:
    ptl_ni_limits_t ptl_ni_limits_requested;
    ptl_ni_limits_requested.max_entries = INT_MAX;
    ptl_ni_limits_requested.max_unexpected_headers = INT_MAX;
    ptl_ni_limits_requested.max_mds = INT_MAX;
    ptl_ni_limits_requested.max_eqs = INT_MAX;
    ptl_ni_limits_requested.max_cts = INT_MAX;
    ptl_ni_limits_requested.max_pt_index = INT_MAX;
    ptl_ni_limits_requested.max_iovecs = INT_MAX;
    ptl_ni_limits_requested.max_list_size = INT_MAX;
    ptl_ni_limits_requested.max_triggered_ops = INT_MAX;
    ptl_ni_limits_requested.max_msg_size = PTL_SIZE_MAX;
    ptl_ni_limits_requested.max_atomic_size = PTL_SIZE_MAX;
    ptl_ni_limits_requested.max_fetch_atomic_size = PTL_SIZE_MAX;
    ptl_ni_limits_requested.max_waw_ordered_size = PTL_SIZE_MAX;
    ptl_ni_limits_requested.max_war_ordered_size = PTL_SIZE_MAX;
    ptl_ni_limits_requested.max_volatile_size = PTL_SIZE_MAX;
    ptl_ni_limits_requested.features = PTL_TARGET_BIND_INACCESSIBLE;

    status = PtlNIInit(PTL_IFACE_DEFAULT,
            PTL_NI_NO_MATCHING | PTL_NI_LOGICAL,
            PTL_PID_ANY,
            &ptl_ni_limits_requested,
            &l_state.ptl_ni_limits,
            &l_state.ptl_ni_handle);
    assert(PTL_OK == status);

    printf("max_entries = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_entries,
            ptl_ni_limits_requested.max_entries);
    printf("max_unexpected_headers = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_unexpected_headers,
            ptl_ni_limits_requested.max_unexpected_headers);
    printf("max_mds = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_mds,
            ptl_ni_limits_requested.max_mds);
    printf("max_eqs = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_eqs,
            ptl_ni_limits_requested.max_eqs);
    printf("max_cts = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_cts,
            ptl_ni_limits_requested.max_cts);
    printf("max_pt_index = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_pt_index,
            ptl_ni_limits_requested.max_pt_index);
    printf("max_iovecs = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_iovecs,
            ptl_ni_limits_requested.max_iovecs);
    printf("max_list_size = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_list_size,
            ptl_ni_limits_requested.max_list_size);
    printf("max_triggered_ops = %d (%d requested)\n",
            l_state.ptl_ni_limits.max_triggered_ops,
            ptl_ni_limits_requested.max_triggered_ops);
    printf("max_msg_size = %llu (%llu requested)\n",
            (long long unsigned)l_state.ptl_ni_limits.max_msg_size,
            (long long unsigned)ptl_ni_limits_requested.max_msg_size);
    printf("max_atomic_size = %llu (%llu requested)\n",
            (long long unsigned)l_state.ptl_ni_limits.max_atomic_size,
            (long long unsigned)ptl_ni_limits_requested.max_atomic_size);
    printf("max_fetch_atomic_size = %llu (%llu requested)\n",
            (long long unsigned)l_state.ptl_ni_limits.max_fetch_atomic_size,
            (long long unsigned)ptl_ni_limits_requested.max_fetch_atomic_size);
    printf("max_waw_ordered_size = %llu (%llu requested)\n",
            (long long unsigned)l_state.ptl_ni_limits.max_waw_ordered_size,
            (long long unsigned)ptl_ni_limits_requested.max_waw_ordered_size);
    printf("max_war_ordered_size = %llu (%llu requested)\n",
            (long long unsigned)l_state.ptl_ni_limits.max_war_ordered_size,
            (long long unsigned)ptl_ni_limits_requested.max_war_ordered_size);
    printf("max_volatile_size = %llu (%llu requested)\n",
            (long long unsigned)l_state.ptl_ni_limits.max_volatile_size,
            (long long unsigned)ptl_ni_limits_requested.max_volatile_size);
    printf("features = %u (%u requested)\n",
            l_state.ptl_ni_limits.features,
            ptl_ni_limits_requested.features);


What version of the product are you using? On what operating system?

URL: http://portals4.googlecode.com/svn/trunk/src/ib
Repository Root: http://portals4.googlecode.com/svn
Repository UUID: 4e5cb362-e38f-11de-a053-3b9a60e514ca
Revision: 2209
Node Kind: directory
Schedule: normal
Last Changed Author: [email protected]
Last Changed Rev: 2209
Last Changed Date: 2013-05-29 10:26:30 -0700 (Wed, 29 May 2013)

Running on OSX 10.7.5 but I imagine this will happen almost anywhere.

Original issue reported on code.google.com by [email protected] on 11 Jun 2013 at 5:06

Invalid use of CMPXCHG16B on older CPU

As shown in the fragment of configure output below, I have a system on which gcc 
accepts the -mcx16 flag and supports the CMPXCHG16B instruction, while my older 
Opteron CPU does NOT.

checking if compiler accepts -mcx16... yes
checking ia64intrin.h usability... no
checking ia64intrin.h presence... no
checking for ia64intrin.h... no
checking ia32intrin.h usability... no
checking ia32intrin.h presence... no
checking for ia32intrin.h... no
checking whether compiler supports builtin atomic CAS-32... yes
checking whether compiler supports builtin atomic CAS-64... yes
checking whether compiler supports builtin atomic CAS-ptr... yes
checking whether compiler supports builtin atomic incr... yes
checking whether ia64intrin.h is required... no
checking whether the compiler supports CMPXCHG16B... yes
checking whether the CPU supports CMPXCHG16B... no


Configure ends up putting -mcx16 in CFLAGS even though it has been determined that 
the CPU doesn't support it:

$ grep mcx16 Makefile
CFLAGS = -Wall -Wno-strict-aliasing -Wmissing-prototypes -Wstrict-prototypes -g 
-O2 -mcx16

And, more importantly, the code in src/ib/ptl_lockfree.h uses 
__sync_val_compare_and_swap() on 16-byte values because SANDIA_BUILTIN_CAS128 
has been defined (by the logic in config/sandia_check_atomics.m4).

The result is lots of SIGILL failures from "make check".

An effective (for me) work-around is to add "sandia_cv_c_mcx16=no" to the 
configure command line.

The proper fix is to define SANDIA_BUILTIN_CAS128 only if the HAVE_CMPXCHG16B 
test for CPU support has ALSO passed (including the --enable-cross-cmpxchg16b 
case).
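
As an illustration of the rule above, here is a runtime analogue of the two configure checks. It is a sketch only: the function names are hypothetical, the real fix belongs in config/sandia_check_atomics.m4, and the probe relies on the cpuid.h header shipped with GCC and Clang on x86.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID reports CMPXCHG16B support, 0 otherwise (including on
 * non-x86 targets, where the instruction does not exist). */
static int cpu_has_cmpxchg16b(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx & (1u << 13)) != 0;   /* CPUID leaf 1, ECX bit 13 */
#else
    return 0;
#endif
}

/* Enable the 128-bit CAS path only when BOTH checks pass. */
static int use_cas128(int compiler_accepts_mcx16, int cpu_supports_cx16)
{
    return compiler_accepts_mcx16 && cpu_supports_cx16;
}
```

On the system described here the first argument would be 1 and the second 0, so the 16-byte compare-and-swap path would stay disabled.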

-Paul

Original issue reported on code.google.com by [email protected] on 17 Apr 2013 at 12:08

PTL_EVENT_AUTO_UNLINK posted before all comm events delivered

The PTL_EVENT_AUTO_UNLINK is getting posted immediately, before the last 
PTL_EVENT_PUT/GET/... is posted.  This makes handling flow control cases more 
difficult, since it becomes necessary to know when all potential events have 
been delivered for an auto unlinked ME/LE.

The attached patch modifies things so that the posting of the 
PTL_EVENT_AUTO_UNLINK is deferred until after the comm event is posted.  It 
works for my simple test case, but needs more thorough testing.

Original issue reported on code.google.com by [email protected] on 21 Mar 2013 at 11:09

Possible lost trigger operation

While running some tests I experienced an occasional hang in a test that does 
a lot of broadcasts.  I went back into our library tests and created a small 
example that manifests the problem.

The example is rather simple and involves only two threads and one triggered 
operation (the enclosed code is able to run multiple threads):

Thread 0 - sets up a triggered op to send a value to Thread 1 once it has 
received a message that Thread 1 is ready.
Thread 1 - sends a message to Thread 0 and waits to receive a broadcast value 
back from Thread 0.

The enclosed Makefile builds the "test_broadcast" executable, which does not
use triggered ops and works as expected. "test_broadcast-t" uses triggered ops 
and hangs after running for a while. If I introduce a 1us delay in any of the 
threads, the test with triggered ops passes (the triggered op is called before 
or after the message from Thread 1 arrived). Run as "yod -n 2 test_broadcast-t".

You might need to run the program a few times. I ran the test with threads on 
the same or different nodes.

When the system hangs I see that Thread 0 received the message from Thread 1 
but somehow the triggered op never took place: Thread 0 waits for the MD ack, 
while Thread 1 waits to receive the value from Thread 0. The message from T1 to 
T0 arrived (which is confirmed by the CTWait and the actual value in the buffer). 

Original issue reported on code.google.com by [email protected] on 19 Jul 2012 at 11:13

UDP does not support messages over 64K

Portals with the UDP transport does not support messages larger than a single 
UDP datagram (64K).

Support for fragmentation/reassembly of large messages using multiple UDP 
datagrams needs to be added in order to send large messages.
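
The bookkeeping such a scheme needs is small. The sketch below only computes how a message would be split; the names are hypothetical, and a real implementation would also need a per-fragment header (message id, fragment index) and reassembly state, which are not shown. 65507 is the largest UDP payload over IPv4 (65535 minus the 20-byte IP header and the 8-byte UDP header).

```c
#include <assert.h>
#include <stddef.h>

#define UDP_MAX_PAYLOAD ((size_t)65507)

/* Number of datagrams needed for msg_len bytes at frag_payload bytes each. */
static size_t frag_count(size_t msg_len, size_t frag_payload)
{
    assert(frag_payload > 0);
    if (msg_len == 0)
        return 1;   /* a zero-length message still takes one datagram */
    return msg_len / frag_payload + (msg_len % frag_payload != 0);
}

/* Payload length of fragment number index (0-based). */
static size_t frag_len(size_t msg_len, size_t frag_payload, size_t index)
{
    assert(frag_payload > 0);
    size_t offset = index * frag_payload;
    assert(offset <= msg_len);
    size_t left = msg_len - offset;
    return left < frag_payload ? left : frag_payload;
}
```

For example, a 150000-byte message at the full 65507-byte payload splits into three datagrams, the last carrying 18986 bytes.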

Original issue reported on code.google.com by [email protected] on 4 Mar 2013 at 7:27

Rev 2079 build issue

Starting with Rev 2079 we are unable to run code unless Portals IB shared 
memory is disabled. These are the options we have to use to make UPC code run:

../src/configure CFLAGS="-g -O3 -Wall" \
--with-implementation=ib --disable-ib-shmem \
--prefix=/usr/local/portals4

If I omit "--disable-ib-shmem" I get IB shmem enabled and nothing runs. It 
seems that everybody is getting stuck in the init code, although in one test 
it got stuck in the fini code, which implies 
that we initialized and ran some code (with two threads only, however).

I'll try to narrow down the problem a bit more.

Original issue reported on code.google.com by [email protected] on 10 Oct 2012 at 5:43

Counting acks not properly delivered if SUCCESS_DISABLE and ACK_REQ

If an MD requests counting of acks, disables full events on success, and 
requests an ack with PTL_ACK_REQ, the ack is never delivered.  If everything 
remains the same, but the ack is requested with PTL_CT_ACK_REQ, the ack is 
delivered.  The test test_ct_ack checks for this behavior and hangs in the 
first scenario (even with just one process).


Original issue reported on code.google.com by [email protected] on 3 Jul 2012 at 8:49

Failing Check Tests

What steps will reproduce the problem?
0. qsub -I so that you're interactively on one of the compute nodes
1. Configure portals with the options detailed in the attached config log
2. run make
3. run make check

What is the expected output? What do you see instead?
Expected to see all tests passing. Instead the last 4 tests just time out. I 
have to interrupt them.

^CCtrl-C caught... cleaning up processes
FAIL: CT_LE_rtt_latency
^CCtrl-C caught... cleaning up processes
FAIL: CT_ME_rtt_latency
^CCtrl-C caught... cleaning up processes
FAIL: EQ_LE_rtt_latency
^CCtrl-C caught... cleaning up processes
FAIL: EQ_ME_rtt_latency


What version of the product are you using? On what operating system?
Portals 4 from the svn on Scientific Linux release 6.1 (Carbon)




Original issue reported on code.google.com by [email protected] on 26 Feb 2013 at 9:52

Occasional shut-down crash

The ptl_ME_unlink test (and others) will sometimes cause yod to report a failed 
test when it actually succeeds.  This happens most when the nodes are 
oversubscribed.  Running the test in a tight loop seems to make it happen 
within a couple of minutes.  It seems to happen most inside a slurm job.

Original issue reported on code.google.com by [email protected] on 15 Apr 2013 at 1:35

Confused by "Atomics are not implemented portably"

I have a system with older Opteron CPUs which lack the CMPXCHG16B instruction, 
combined with an older gcc w/o builtin atomics (at least according to 
configure).  As a result, configuring portals4 ends with:

checking if compiler accepts -mcx16... no
checking ia64intrin.h usability... no
checking ia64intrin.h presence... no
checking for ia64intrin.h... no
checking ia32intrin.h usability... no
checking ia32intrin.h presence... no
checking for ia32intrin.h... no
checking whether compiler supports builtin atomic CAS-32... no
checking whether compiler supports builtin atomic CAS-64... no
checking whether compiler supports builtin atomic CAS-ptr... no
checking whether compiler supports builtin atomic incr... no
checking whether the compiler supports CMPXCHG16B... yes
checking whether the CPU supports CMPXCHG16B... no
configure: error: Atomics are not implemented portably

So, I am left confused because "Atomics are not implemented portably" doesn't 
give me any clue what to do next.

Is CMPXCHG16B required (in which case I am out of luck)?
Would installing a newer compiler work?

Original issue reported on code.google.com by [email protected] on 13 Apr 2013 at 4:12

make check fails in test/sfw/test_n1

After building Rev 2081 of the Portals library I noticed that some of the tests 
do not run due to makefile problems. I cannot recall when this last worked.

# make check
[...]
Average time around the loop: 4.09444 microseconds
Average catch-to-toss latency: 4.09444 microseconds
PASS: EQ_LE_rtt_latency
Final value of potato = 1000000
Total time: 4.05016 secs
Average time around the loop: 4.05016 microseconds
Average catch-to-toss latency: 4.05016 microseconds
PASS: EQ_ME_rtt_latency
======================
All 34 tests passed
(4 tests were not run)
======================
make[3]: Leaving directory `/eng/upc/portals4/bld/thor/test'
make[2]: Leaving directory `/eng/upc/portals4/bld/thor/test'
Making check in sfw
make[2]: Entering directory `/eng/upc/portals4/bld/thor/test/sfw'
Making check in test_n1
make[3]: Entering directory `/eng/upc/portals4/bld/thor/test/sfw/test_n1'
make  check-TESTS
make[4]: Entering directory `/eng/upc/portals4/bld/thor/test/sfw/test_n1'
make[4]: *** No rule to make target `*.xml', needed by `check-TESTS'.  Stop.
make[4]: Leaving directory `/eng/upc/portals4/bld/thor/test/sfw/test_n1'
make[3]: *** [check-am] Error 2
make[3]: Leaving directory `/eng/upc/portals4/bld/thor/test/sfw/test_n1'
make[2]: *** [check-recursive] Error 1
make[2]: Leaving directory `/eng/upc/portals4/bld/thor/test/sfw'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/eng/upc/portals4/bld/thor/test'
make: *** [check-recursive] Error 1

Original issue reported on code.google.com by [email protected] on 10 Oct 2012 at 4:21

New tests and missing portals4.h

Looks like a regression of issue #20 on some newly added tests.
With automake-1.11 and autoconf-2.68 I had to make the changes below in order 
to get the new stuff compiled.

Additionally, WHY are these two new directories built by "make" instead of 
"make check"?  Is that intentional?

-Paul

Index: test/benchmarks/msg_rate/Makefile.inc
===================================================================
--- test/benchmarks/msg_rate/Makefile.inc       (revision 2194)
+++ test/benchmarks/msg_rate/Makefile.inc       (working copy)
@@ -10,4 +10,4 @@
     msg_rate/test_prepostME.c             \
     msg_rate/test_prepostLE.c

-P4msgrate_CPPFLAGS = $(AM_CPPFLAGS) -Imsg_rate
+P4msgrate_CPPFLAGS = $(AM_CPPFLAGS) -Imsg_rate -I$(top_srcdir)/include
Index: test/benchmarks/rtt_latency/Makefile.inc
===================================================================
--- test/benchmarks/rtt_latency/Makefile.inc    (revision 2194)
+++ test/benchmarks/rtt_latency/Makefile.inc    (working copy)
@@ -8,13 +8,13 @@
 noinst_PROGRAMS += $(RTT_TESTS)

 CT_LE_rtt_latency_SOURCES = rtt_latency/ct_hotpotato.c
-CT_LE_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=0
+CT_LE_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=0 -I$(top_srcdir)/include

 CT_ME_rtt_latency_SOURCES = rtt_latency/ct_hotpotato.c
-CT_ME_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=1
+CT_ME_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=1 -I$(top_srcdir)/include

 EQ_LE_rtt_latency_SOURCES = rtt_latency/events_hotpotato.c
-EQ_LE_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=0
+EQ_LE_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=0 -I$(top_srcdir)/include

 EQ_ME_rtt_latency_SOURCES = rtt_latency/events_hotpotato.c
-EQ_ME_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=1
+EQ_ME_rtt_latency_CPPFLAGS = $(AM_CPPFLAGS) -DINTERFACE=1 -I$(top_srcdir)/include


Original issue reported on code.google.com by [email protected] on 2 May 2013 at 6:10

Overflow events on PT without EQ causes segfault

In relation with issue 35.
Appending a list entry (without overflow events disabled) on the priority list 
with some headers in the unexpected list on a portal table with PTL_EQ_NONE 
results in a segfault (__check_overflow > flush_from_unexpected_list > 
tgt_overflow_event > make_target_event with NULL eq).

Original issue reported on code.google.com by [email protected] on 29 May 2013 at 1:55

Missing implementation of PTL_MD_EVENT_SEND_DISABLE

The latest released portals4 spec [SAND2012-10087 Unlimited Release Printed 
November 2012], section 3.10 (Memory Descriptors) PDF page 53 under MD options, 
states

PTL_MD_EVENT_SEND_DISABLE Specifies that this memory descriptor should not 
generate send events
    (PTL_EVENT_SEND). This flag does not affect counting events.

portals4.h does not define PTL_MD_EVENT_SEND_DISABLE?

What is the ETA on defining and implementing this functionality?


Original issue reported on code.google.com by [email protected] on 15 Mar 2013 at 5:22

Persistent LE/ME search broken

test_persistent_search.c exposes an issue with PtlSearch with a persistent 
LE/ME and SEARCH_ONLY.  Only one PTL_EVENT_SEARCH is generated, instead of one 
per message in the unexpected queue.  This does not occur when the search is 
SEARCH_DELETE.

Original issue reported on code.google.com by [email protected] on 27 Jun 2013 at 9:47

missing asprintf prototype

Building from r2194 I see:

../../../../source/ptl4-trunk/test/sfw/main.c: In function 'main':
../../../../source/ptl4-trunk/test/sfw/main.c:150:13: warning: implicit 
declaration of function 'asprintf' [-Wimplicit-function-declaration]


Since asprintf() is a GNU extension, one must define _GNU_SOURCE before 
including stdio.h in order to get its prototype.
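
A minimal sketch of that fix, assuming glibc: `_GNU_SOURCE` is defined before the first system header, so `asprintf()` gets a proper prototype. The helper `make_label` and its format string are made up for illustration and are not taken from test/sfw/main.c.

```c
/* Must come before the first system header, or the prototype is hidden. */
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: builds "<prefix>.<rank>" in a newly allocated
 * string. Returns NULL on failure; the caller must free() the result. */
static char *make_label(const char *prefix, int rank)
{
    char *out = NULL;

    assert(prefix != NULL);
    /* asprintf() returns -1 on failure and leaves `out` undefined. */
    if (asprintf(&out, "%s.%d", prefix, rank) < 0)
        return NULL;
    return out;
}
```

Passing `-D_GNU_SOURCE` in CPPFLAGS would have the same effect without touching the source file.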

Original issue reported on code.google.com by [email protected] on 2 May 2013 at 6:12

Test issue

What steps will reproduce the problem?
1. cry
2. wail
3. moan

What is the expected output? What do you see instead?


Please use labels and text to provide additional information.


Original issue reported on code.google.com by [email protected] on 29 Nov 2011 at 8:07

libev.a v4.15 rejected by configure

I am using libev 4.15 which I built and installed from source.
That build was configured with --enable-static --disable-shared.

When I configure portals4 (svn r2172) with --with-ev=[DIR] the libev I built is 
rejected due to a dependence on floor() from libm:

configure:16186: checking for ev_loop_new in -lev
configure:16211: gcc -std=gnu99 -o conftest -Wall -Wno-strict-aliasing 
-Wmissing-prototypes -Wstrict-prototypes -g -O2  
-I/global/homes/h/hargrove/portals4-install/include  
-L/global/homes/h/hargrove/portals4-install/lib conftest.c -lev  -lrt 
-lbsd-compat -lpthread -ldl  >&5
conftest.c:113: warning: function declaration isn't a prototype
conftest.c:116: warning: function declaration isn't a prototype
/global/homes/h/hargrove/portals4-install/lib/libev.a(ev.o): In function 
`periodic_recalc':
/global/homes/h/hargrove/portals4-read-only/libev-4.15/ev.c:3069: undefined 
reference to `floor'
collect2: ld returned 1 exit status

I can work around this by adding LIBS=-lm to the configure command line, but 
that should not be necessary.  Instead the configure script should be careful 
to include any dependencies in LIBS.

-Paul


Original issue reported on code.google.com by [email protected] on 11 Apr 2013 at 1:33

Multiple NIs in the same Portals client bring about performance issues

What steps will reproduce the problem?
1. Build Portals for UDP+PPE
2. Build OpenSHMEM to use Portals + PMI
3. Run any OpenSHMEM test

What is the expected output? What do you see instead?

Well, it's not that this behavior is unexpected, but the problem is that any 
Portals client that opens up multiple NIs is going to be in danger of losing 
substantial performance from the UDP layer due to socket multiplexing.

The way it's currently architected, a Portals client can create up to 4 NIs 
(one for each combination of (NI_PHYSICAL | NI_LOGICAL) x (NI_MATCHING | 
NI_NO_MATCHING)) for the same NID/PID, and all of these NIs are multiplexed 
over the same socket. When the progress thread is running, it loops over every 
NI that it has and calls recvfrom on the socket. Because the NIs share the 
same socket, every NI can receive data sent to that specific Portals NID/PID. 
This can result in the same data being received from the socket up to 4 times: 
the first 3 times, Portals will realize, based on information in the buf_t, 
that it should not have received the packet and will drop it.

This is ok from a functional standpoint because the data is MSG_PEEK'd from the 
socket to begin with, and so it remains in an OS buffer for subsequent 
retrieval, but performance seems like it could be a big problem. Not only are 
cycles wasted processing data for the wrong NIs, but each NI is unnecessarily 
sharing UDP buffer space allocated for the socket in the OS.
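
The MSG_PEEK behavior described above can be seen in isolation with the sketch below. It is not Portals code: it uses an AF_UNIX datagram socketpair as a stand-in for the shared UDP socket so that it needs no network setup, and the function name is hypothetical. Each peek sees the same datagram again; only the final plain recv() removes it from the queue.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends one datagram, reads it `peeks` times with MSG_PEEK, then consumes
 * it with a plain recv(). Returns how many reads saw the datagram, or -1
 * on error or if the queue is not empty afterwards. */
static int count_reads_of_one_datagram(int peeks)
{
    int sv[2];
    const char msg[] = "hdr";
    char buf[16];
    int seen = 0;

    assert(peeks >= 0);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    if (send(sv[0], msg, sizeof msg, 0) != (ssize_t)sizeof msg) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    for (int i = 0; i < peeks; i++) {
        /* MSG_PEEK copies the datagram out but leaves it queued. */
        if (recv(sv[1], buf, sizeof buf, MSG_PEEK) == (ssize_t)sizeof msg &&
            memcmp(buf, msg, sizeof msg) == 0)
            seen++;
    }
    /* A plain recv() finally dequeues the datagram. */
    if (recv(sv[1], buf, sizeof buf, 0) == (ssize_t)sizeof msg)
        seen++;
    /* Nothing should be left: a non-blocking recv() must now fail. */
    if (recv(sv[1], buf, sizeof buf, MSG_DONTWAIT) >= 0)
        seen = -1;
    close(sv[0]);
    close(sv[1]);
    return seen;
}
```

With three peeks the single datagram is read four times in total, which mirrors the "up to 4 times" cost described in this report.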

The OpenSHMEM tests bring about this situation because they open up a logical 
NI to map all ranks on the node, but they also open up a physical NI to service 
PMI requests. When running these tests, you should see that _every_ PMI message 
delivered via UDP to the process is first received and dropped by the logical 
NI before being handled by the physical NI.

Original issue reported on code.google.com by [email protected] on 24 Jul 2013 at 8:19

Can't unlink an overflow list entry with unexpected headers disabled

1- an operation is performed on an overflow list entry with unexpected headers 
disabled
2- the unexpected list is empty, as expected
3- appending a list entry on the priority list doesn't trigger an overflow 
event (I guess that's also to be expected since the unexpected list is empty, 
but the specification is not really clear on how this feature affects the 
normal workflow)
4- attempting to unlink the overflow list entry results in a PTL_IN_USE
5- if the overflow list entry is set with the USE_ONCE option, it auto-unlinks 
fine

Original issue reported on code.google.com by [email protected] on 17 May 2013 at 12:18

ib/ptl_tgt.c warnings at r2172

Rather than reopen issue #9, please note the compiler warnings coming from the 
current SVN trunk.  This is without any special settings of CFLAGS (defaults 
include -Wall and -O2).

  CC     libportals_ib_la-ptl_tgt.lo
../../../src/ib/ptl_tgt.c: In function 'process_tgt':
../../../src/ib/ptl_tgt.c:1597: warning: 'ack_hdr' may be used uninitialized in 
this function
../../../src/ib/ptl_tgt.c:1596: warning: 'ack_buf' may be used uninitialized in 
this function
../../../src/ib/ptl_tgt.c:1121: warning: 'was_done' may be used uninitialized 
in this function



Original issue reported on code.google.com by [email protected] on 11 Apr 2013 at 2:22

Status Register Naming Inconsistency

What steps will reproduce the problem?
1. write code with a call to PtlNIStatus which queries the 
PTL_SR_PERMISSION_VIOLATIONS or PTL_SR_OPERATION_VIOLATIONS index


What is the expected output? What do you see instead?
I expect to see a correct compile. Instead I see:

../../git/src/net/portals4/portals4.c: In function 
'qthread_internal_net_driver_print_status':
../../git/src/net/portals4/portals4.c:423:21: error: 
'PTL_SR_PERMISSION_VIOLATIONS' undeclared (first use in this function)
../../git/src/net/portals4/portals4.c:423:21: note: each undeclared identifier 
is reported only once for each function it appears in
../../git/src/net/portals4/portals4.c:424:21: error: 
'PTL_SR_OPERATION_VIOLATIONS' undeclared (first use in this function)


This is because the portals implementation has mistakenly named them 
PTL_SR_PERMISSIONS_VIOLATIONS and PTL_SR_OPERATIONS_VIOLATIONS (note the extra 
"S" at the end of PERMISSION and OPERATION). Either the documentation should be 
changed or the implementation should be corrected.

What version of the product are you using? On what operating system?
I'm using the latest SVN version on Scientific Linux.

Original issue reported on code.google.com by [email protected] on 27 Mar 2013 at 5:08

svn.2081 'make check' fails when ranks are off node, OK when on node.

What steps will reproduce the problem?
Portals4 svn.2081
    ./configure  --with-implementation=ib --with-knem=/opt/knem --enable-fast

Test/Makefile
    TESTS_ENVIRONMENT=$(top_builddir)/src/runtime/hydra/yod.hydra -f $(top_builddir)/test/hostfile -np $(THREADS)

make check THREADS=2

4 of 38 tests failed...

What is the expected output? What do you see instead?

38 of 38 tests passed.

4 of 38 tests failed...control-C'ed to continue test suite.

As long as the Portals test ranks are node local then all 'make check' tests 
pass.
As soon as ranks go off node, Portals 'make check' tests hang.
Same story when running portals-shmem 'make check' tests over Portals svn.2081; 
localhost OK, offnode circular_shift hangs.

All of the above tests (Portals or portals-shmem) work fine for Portals 
svn.2069.

What version of the product are you using? On what operating system?
RHEL 6.3

Original issue reported on code.google.com by [email protected] on 10 Oct 2012 at 10:40

Remove compiler warnings from IB and Hydra source files

We build the portals library with CFLAGS="-g -O3 -Wall", and the code needs to 
be cleaned up so that it builds without any compiler warnings. I have to say 
that some recent updates did clean up the IB source code substantially.

The following warnings come out of the IB source code (Rev 2081):

  CC     libportals_ib_la-ptl_eq_common.lo
../../../../p4-ref/src/ib/ptl_conn.c: In function 'process_connect_request':
../../../../p4-ref/src/ib/ptl_conn.c:514:6: warning: variable 'ret' set but not 
used [-Wunused-but-set-variable]
../../../../p4-ref/src/ib/ptl_ct.c: In function 'PtlCTPoll':
../../../../p4-ref/src/ib/ptl_ct.c:435:6: warning: 'ni' may be used 
uninitialized in this function [-Wuninitialized] 

The runtime and hydra source code have many other warnings. A sample:

../../../../../p4-ref/src/runtime/hydra/utils/sock/sock.c: In function 
'HYDU_sock_remote_access':
../../../../../p4-ref/src/runtime/hydra/utils/sock/sock.c:638:3: warning: label 
'fn_fail' defined but not used [-Wunused-label]
  CC     string.lo
../../../../../p4-ref/src/runtime/hydra/utils/string/string.c: In function 
'HYDU_strsplit':
../../../../../p4-ref/src/runtime/hydra/utils/string/string.c:92:9: warning: 
zero-length gnu_printf format string [-Wformat-zero-length]
  CC     topo.lo
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c: In function 
'init_topolib':
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c:45:3: warning: label 
'fn_fail' defined but not used [-Wunused-label]
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c: In function 
'handle_user_binding':
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c:101:3: warning: label 
'fn_fail' defined but not used [-Wunused-label]
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c: In function 
'HYDT_topo_finalize':
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c:625:3: warning: label 
'fn_fail' defined but not used [-Wunused-label]
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c: In function 
'assign_proc_units':
../../../../../p4-ref/src/runtime/hydra/tools/topo/topo.c:162:5: warning: 
'bindmap' may be used uninitialized in this function [-Wuninitialized]
  CC     bsci_init.lo

Original issue reported on code.google.com by [email protected] on 10 Oct 2012 at 4:17

PtlPut() to a variable that is also a target of PtlSwap() is getting lost

We implemented a simple spin lock on an int variable:

- acquire - PtlSwap - with PTL_CSWAP operation - expect value of 0, write value 
of 1
- release - PtlPut - write value of 0

Occasionally our test fails with the lock still in the acquired state, even 
though it was released with a PtlPut. This happens about once in 100 runs.

I changed the code to use PtlSwap (PTL_CSWAP, expect 1 and set it to 0) instead 
of PtlPut, with an error abort if CSWAP operation fails. This seems to work as 
I was not able to crash the test.
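
For readers unfamiliar with the protocol, here is a local analogy of the two release variants written with C11 atomics. It does not call Portals and cannot reproduce the reported race: `atomic_compare_exchange_strong` stands in for PtlSwap with PTL_CSWAP, `atomic_store` stands in for PtlPut, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Acquire: compare-and-swap, expect 0 (free), write 1 (held).
 * Returns true if this caller now holds the lock. */
static bool lock_try_acquire(atomic_int *lock)
{
    int expected = 0;

    assert(lock);
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

/* Release, first variant: unconditional write of 0 (the PtlPut analogue). */
static void lock_release_store(atomic_int *lock)
{
    assert(lock);
    atomic_store(lock, 0);
}

/* Release, second variant: compare-and-swap, expect 1, write 0.
 * Returns false if the lock was not held, so the caller can abort. */
static bool lock_release_cswap(atomic_int *lock)
{
    int expected = 1;

    assert(lock);
    return atomic_compare_exchange_strong(lock, &expected, 0);
}
```

The second variant has the side benefit that a release which finds the lock in an unexpected state is reported instead of passing silently.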


Original issue reported on code.google.com by [email protected] on 30 Aug 2012 at 4:34
