dml's Issues

Possible use-after-free when handler is destroyed before operation completes?

Consider the following code:

{
    auto h = dml::submit<path>(...);
}

The handler returned from submit may be destroyed before the operation completes. I don't see any mention in the documentation that the handler should be kept alive until the operation completes, so I would assume this code is valid.

However, I believe this can cause a use-after-free in the hardware path, since DML will try to write the completion status to the descriptor (which, if I understand correctly, is owned by the handler). The same may be true for the C API and dml_finalize_job().

Is my understanding correct, or is this already handled by DML somehow? If not, I think the best way to avoid this would be to wait for the operation to complete inside the handler destructor.
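The suggestion above (wait inside the destructor) can be sketched as follows. This is a minimal illustration only: completion_record, mock_device and waiting_handler are hypothetical stand-ins, not DML types, and the mock device completes the operation when polled so that the sketch runs without threads or hardware.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a hardware completion record (not a DML type).
struct completion_record {
    volatile std::uint8_t status = 0;  // 0 means the operation is still in flight
};

// Hypothetical stand-in for the device: it remembers where to write the status.
class mock_device {
public:
    void submit(completion_record* record) { pending_ = record; }

    // Simulates the device finishing: it writes into the record it was given.
    // If the record had already been destroyed, this write would be a use-after-free.
    void complete() {
        if (pending_ != nullptr) {
            pending_->status = 1;
            pending_ = nullptr;
        }
    }

    bool has_pending() const { return pending_ != nullptr; }

private:
    completion_record* pending_ = nullptr;
};

// Handler that owns the completion record and does not release it
// while the device may still write to it.
class waiting_handler {
public:
    explicit waiting_handler(mock_device& device) : device_(device) {
        device_.submit(&record_);
    }
    waiting_handler(const waiting_handler&) = delete;
    waiting_handler& operator=(const waiting_handler&) = delete;

    // Block until completion so the record outlives the device write.
    ~waiting_handler() { wait(); }

    void wait() {
        while (record_.status == 0) {
            // A real implementation would poll or pause here; the mock device
            // is driven explicitly so that the loop terminates.
            device_.complete();
        }
    }

    bool is_finished() const { return record_.status != 0; }

private:
    mock_device&      device_;
    completion_record record_;
};
```

With this shape, letting a waiting_handler go out of scope right after submission is safe, at the cost of a destructor that may block.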

HW path reports error

I'm unable to use the HW path for mem move even after configuring the DSA devices:

$ sudo ./hl_mem_move_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
dml-diag: DML version TODO
dml-diag: Struct size: 3328 B
dml-diag: loading driver: libaccel-config.so.1
Failure occurred.

When manually calling dml::mem_move, I get error code 16, which corresponds to an internal library error. Is there a way to debug this? Any help would be really appreciated. Thanks!

System Configuration

Processor: Intel(R) Xeon(R) Silver 4416+

I have configured DSA using the python script:

$ sudo python3 accel_conf.py --load=../configs/1n1d1e1w-s-n1.conf
Filter:
Disabling active devices
    dsa0 - done
Loading configuration - done
Additional configuration steps
    Force block on fault: False
Enabling configured devices
    dsa0 - done
        wq0.0 - done
Checking configuration
    node: 0; device: dsa0; group: group0.0
        wqs:     wq0.0
        engines: engine0.0

I'm also running a relatively recent kernel version:

$  uname -a
Linux machinename 6.8.0-rc7 #1 SMP PREEMPT_DYNAMIC Thu Mar  7 11:11:46 PST 2024 x86_64 x86_64 x86_64 GNU/Linux

Kernel cmdline:

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-rc7 root=UUID=4f739d8f-4f15-4fc3-b419-bbb0202131b3 ro splash earlyprintk=ttyS1,115200 console=ttyS1,115200 console=ttyS0,115200 memmap=8G!16G nokaslr movable_node=2 intel_iommu=on,sm_on iommu=on vt.handoff=7

lspci output for one of the two devices available:

$ sudo lspci -vvv -s 75:01.0
75:01.0 System peripheral: Intel Corporation Device 0b25
        Subsystem: Intel Corporation Device 0000
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        NUMA node: 0
        IOMMU group: 1
        Region 0: Memory at 21bffff50000 (64-bit, prefetchable) [size=64K]
        Region 2: Memory at 21bffff20000 (64-bit, prefetchable) [size=128K]
        Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag+ RBE+ FLReset+
                DevCtl: CorrErr- NonFatalErr- FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+ LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
        Capabilities: [80] MSI-X: Enable+ Count=9 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [90] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [150 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [160 v1] Transaction Processing Hints
                Device specific mode supported
                Steering table in TPH capability structure
        Capabilities: [170 v1] Virtual Channel
                Caps:   LPEVC=1 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
                VC1:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=1 ArbSelect=Fixed TC/VC=02
                        Status: NegoPending- InProgress-
        Capabilities: [200 v1] Designated Vendor-Specific: Vendor=8086 ID=0005 Rev=0 Len=24 <?>
        Capabilities: [220 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable+, Smallest Translation Unit: 00
        Capabilities: [230 v1] Process Address Space ID (PASID)
                PASIDCap: Exec- Priv+, Max PASID Width: 14
                PASIDCtl: Enable+ Exec- Priv+
        Capabilities: [240 v1] Page Request Interface (PRI)
                PRICtl: Enable+ Reset-
                PRISta: RF- UPRGI- Stopped+
                Page Request Capacity: 00000200, Page Request Allocation: 00000200
        Kernel driver in use: idxd
        Kernel modules: idxd

Error when using hardware path

When I run the command, it shows an error like this:
./ll_crc_example hardware_path
The example will be run on the hardware path.
Starting CRC job example.
Caclulating CRC for region of size 1KB.
An error (100) occured during job execution.

So I tried two steps:

  1. Check that the shared library resolves correctly:
    ldd /usr/bin/accel-config
    linux-vdso.so.1 (0x00007fffe05d5000)
    libaccel-config.so.1 => /usr/lib64/libaccel-config.so.1 (0x00007f14c4bdf000)
    libjson-c.so.4 => /lib/x86_64-linux-gnu/libjson-c.so.4 (0x00007f14c4bba000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f14c49c8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f14c4c0e000)
  2. Load the device configuration:
    sudo python3 accel_conf.py --load=../configs/1n1d1e1w-s-n1.conf
    Filter:
    No active devices
    Loading configuration - done
    Additional configuration steps
    Force block on fault: False
    Enabling configured devices
    dsa0 - error
    wq0.0 - error

failed in dsa0/wq0.0
enabled 0 wq(s) out of 1


Checking configuration
No active devices

How can I fix this? And once step 2 succeeds, should ./ll_crc_example hardware_path run correctly?

Incorrect handling of CRC

In the CRC and COPY_WITH_CRC commands, the initial CRC value must not be set after a PAGE_FAULT with no bytes processed.
update_crc_for_continuation() and update_copy_crc_for_continuation() should include the condition:

if (crc_record.bytes_completed() != 0) {
    crc_dsc.crc_seed() = crc_record.crc_value();
}
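A self-contained sketch of the guarded update is shown here. The structs are simplified, hypothetical views of a CRC descriptor and its completion record (the real DML types use accessor functions), and advancing the source address and shrinking the transfer size are assumptions about what the continuation helper does besides seeding.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of a CRC descriptor.
struct crc_descriptor {
    std::uint64_t source_address;
    std::uint32_t transfer_size;
    std::uint32_t crc_seed;
};

// Hypothetical, simplified view of the matching completion record.
struct crc_completion_record {
    std::uint32_t bytes_completed;
    std::uint32_t crc_value;
};

// Prepare the descriptor for resubmission after a page fault.
inline void update_crc_for_continuation_sketch(crc_descriptor&              dsc,
                                               const crc_completion_record& rec) {
    // Skip the part of the buffer that was already processed (assumed behavior).
    dsc.source_address += rec.bytes_completed;
    dsc.transfer_size  -= rec.bytes_completed;

    // Only take the partial CRC as the new seed if some bytes were processed.
    // With zero bytes completed, the original seed stays in place.
    if (rec.bytes_completed != 0) {
        dsc.crc_seed = rec.crc_value;
    }
}
```

A fault on the very first byte therefore leaves the descriptor unchanged, which is the case the issue is about.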

HW path does not work with job_api_example

I switched job_api_example to the HW path. The original code is:

int main(const int argc, char **const argv)
{
    // Variables
    dml_job_t *dml_job_ptr = NULL;
    uint32_t total_fails = 0u;

    // Allocate dml_job_t
    dml_job_ptr = init_dml_job(DML_PATH_SW);

After changing this to DML_PATH_HW, every example fails, and no log output shows the root cause or suggests a next action.

[root@localhost dml_job_api]# ./job_api_samples 
Intel(R) Data Mover Library Job API Examples

============================== LEGALS ==============================

Copyright (C) 2021 Intel Corporation

SPDX-License-Identifier: MIT====================================================================

------------------------------------------
	Run example # 1

	 Example of using Intel DML DML_OP_NOP operation 
	 --- Buffers size to DML_OP_NOP operation: 128
	 --- DML_OP_NOP property: no any specific properties 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 2

	 Example of using Intel DML DML_OP_MEM_MOVE operation 
	 --- Buffers size to DML_OP_MEM_MOVE operation: 128
	 --- DML_OP_MEM_MOVE property: none 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 3

	 Example of using Intel DML DML_OP_FILL operation 
	 --- Buffers size to DML_OP_FILL operation: 128
	 --- DML_OP_FILL property: no any specific properties 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 4

	 Example of using Intel DML DML_OP_COMPARE_PATTERN operation 
	 --- Buffers size to DML_OP_COMPARE_PATTERN operation: 128
	 --- DML_OP_COMPARE_PATTERN property: none

	 Array is equal to pattern 
	 --- Status : 100
	 --- Result : 0
	 --- Offset : 0

	 Array is NOT equal to pattern 
	 --- Status : 100
	 --- Result : 0
	 --- Offset : 0

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 5

	 Example of using Intel DML DML_OP_DIF_UPDATE operation 
	 --- Buffers size to DML_OP_DIF_UPDATE operation: 4104
	 --- DML_OP_DIF_UPDATE property: BLOCK_SIZE is 4096 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 6

	 Example of using Intel DML DML_OP_DIF_INSERT operation 
	 --- Buffers size to DML_OP_DIF_INSERT operation: 4096 and 4104
	 --- DML_OP_DIF_INSERT property: BLOCK_SIZE is 4096 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 7

	 Example of using Intel DML DML_OP_DIF_CHECK operation 
	 --- Buffers size to DML_OP_DIF_CHECK operation: 4096 and 4104
	 --- DML_OP_DIF_CHECK property: BLOCK_SIZE is 4096 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 8

	 Example of using Intel DML DML_OP_DIF_STRIP operation 
	 --- Buffers size to DML_OP_DIF_STRIP operation: 4104 and 4096
	 --- DML_OP_DIF_STRIP property: BLOCK_SIZE is 4096 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 9

	 Example of using Intel DML DML_OP_CACHE_FLUSH operation 
	 --- Buffers size to DML_OP_CACHE_FLUSH operation: 128
	 --- DML_OP_CACHE_FLUSH property: none 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 10

	 Example of using Intel DML DML_OP_BATCH operation 
	 --- Buffers size to DML_OP_BATCH operation: 128
	 --- DML_OP_BATCH property: none 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 11

	 Example of using Intel DML DML_OP_CRC operation 
	 --- Buffers size to DML_OP_CRC operation: 128
	 --- DML_OP_CRC property: none 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 12

	 Example of using Intel DML DML_OP_CRC_COPY operation 
	 --- Buffers size to DML_OP_CRC_COPY operation: 128
	 --- DML_OP_CRC_COPY property: none 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 13

	 Example of using Intel DML DML_OP_CRC operation with dml_submit_job 
	 --- Buffers size to DML_OP_CRC operation: 128
	 --- DML_OP_CRC property: no any specific properties 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 14

	 Example of using Intel DML DML_OP_DELTA_CREATE operation 
	 --- Buffers size to DML_OP_DELTA_CREATE operation: 128
	 --- DML_OP_DELTA_CREATE property: none 

	Example return: FAIL (Status 100)

------------------------------------------
	Run example # 15

	 Example of using Intel DML DML_OP_DELTA_APPLY operation 
	 --- Buffers size to DML_OP_DELTA_APPLY operation: 128
	 --- DML_OP_DELTA_APPLY property: none 

	Example return: FAIL (Status 100)

====== Examples Execution Completed ======
	 --- Total Samples run:                     15
	 --- Samples completed with OK status:      0
	 --- Samples completed with FAIL status:    15
[root@localhost dml_job_api]# 

Incorrect patterns in update_**_for_continuation

FILL: to avoid a "shifted pattern" at a page boundary, round the processed byte count down to a multiple of the pattern size (I assume that the 16B pattern is not handled yet, so 8 is OK):

uint32_t processed = 8 * (fill_record.bytes_completed() / 8);
fill_dsc.transfer_size() -= processed;
fill_dsc.destination_address() += processed;

Does the same apply to COMPARE_PATTERN?

DIF_INSERT:

uint32_t blocks_processed = bytes_completed / block_size;
bytes_completed -= blocks_processed  * block_size;
source_address += blocks_processed  * block_size;
destination_address += blocks_processed  * (block_size + DSA_DIF_SIZE);

DIF_STRIP should follow a very similar pattern.
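The arithmetic for both cases can be checked with a small standalone sketch. The state structs and function names are hypothetical; dif_size = 8 is an assumption based on the 4096 and 4104 byte buffer sizes printed by the DIF examples, and reducing transfer_size by the whole blocks processed is an assumption about the intended bookkeeping.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t fill_pattern_size = 8;  // 8-byte pattern, as assumed in the issue
constexpr std::uint32_t dif_size          = 8;  // assumption: 8-byte DIF per block

// Hypothetical view of the FILL descriptor fields that change on continuation.
struct fill_state {
    std::uint64_t destination_address;
    std::uint32_t transfer_size;
};

// Resume a FILL on a pattern boundary so the pattern is not shifted.
inline void continue_fill(fill_state& s, std::uint32_t bytes_completed) {
    const std::uint32_t processed =
        fill_pattern_size * (bytes_completed / fill_pattern_size);  // round down
    s.transfer_size       -= processed;
    s.destination_address += processed;
}

// Hypothetical view of the DIF_INSERT descriptor fields that change on continuation.
struct dif_insert_state {
    std::uint64_t source_address;
    std::uint64_t destination_address;
    std::uint32_t transfer_size;
};

// Resume a DIF_INSERT on a block boundary: the source advances by whole
// blocks, the destination by whole blocks plus one DIF per block.
inline void continue_dif_insert(dif_insert_state& s,
                                std::uint32_t     bytes_completed,
                                std::uint32_t     block_size) {
    const std::uint32_t blocks_processed = bytes_completed / block_size;
    s.transfer_size       -= blocks_processed * block_size;
    s.source_address      += std::uint64_t{blocks_processed} * block_size;
    s.destination_address += std::uint64_t{blocks_processed} * (block_size + dif_size);
}
```

For DIF_STRIP the roles would presumably swap: the source advances by block_size + dif_size per block and the destination by block_size.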

Compiler optimization issue when allocating the transfer buffer on the heap

Example code: mem_move
Change the source and destination buffers from stack allocation to heap allocation and initialize them:

uint8_t* source = (uint8_t *)malloc(BUFFER_SIZE);
uint8_t* destination = (uint8_t *)malloc(BUFFER_SIZE);
memset(source, 1, BUFFER_SIZE);
memset(destination, 0, BUFFER_SIZE);

Compile and run: the example returns error 102.

If the compiler optimization level is changed, it runs successfully, for example:
cmake -DCMAKE_BUILD_TYPE=Debug ..
or
cmake -DCMAKE_BUILD_TYPE=-O0 ..

Fails to build from source (FTBFS) on Windows

Release: v1.1.0
Environment: Windows (GCC 13.2.1)

This appears to be caused by:

#if defined(__linux__)
#include "libaccel_config.h"
#endif

Because of the #if, the necessary definitions are missing, causing a lot of errors:

FAILED: sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj 
/usr/local/bin/x86_64-w64-mingw32ucrt-gcc -DDML_GIT_REVISION=\"N/A\" -D_FORTIFY_SOURCE=2 -I/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/../../../../include -I/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/. -O3 -mcrtdll=ucrt -D_UCRT -O2 -DNDEBUG -fPIC -fstack-protector --param=ssp-buffer-size=8 -fstack-clash-protection -MD -MT sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj -MF sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj.d -o sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj -c /build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c
In file included from /build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:9:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:56:47: warning: 'struct accfg_ctx' declared inside parameter list will not be visible outside of this definition or declaration
   56 | int32_t DML_HW_API(driver_new_context)(struct accfg_ctx **ctx);
      |                                               ^~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:58:66: warning: 'struct accfg_ctx' declared inside parameter list will not be visible outside of this definition or declaration
   58 | struct accfg_device *DML_HW_API(context_get_first_device)(struct accfg_ctx *ctx);
      |                                                                  ^~~~~~~~~
In file included from /build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:10:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_driver_new_context'; have 'int32_t(struct accfg_ctx **)' {aka 'int(struct accfg_ctx **)'}
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:134:9: note: in expansion of macro 'DML_HW_API'
  134 | int32_t DML_HW_API(driver_new_context)(struct accfg_ctx **ctx)
      |         ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_driver_new_context' with type 'int32_t(struct accfg_ctx **)' {aka 'int(struct accfg_ctx **)'}
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:56:9: note: in expansion of macro 'DML_HW_API'
   56 | int32_t DML_HW_API(driver_new_context)(struct accfg_ctx **ctx);
      |         ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_context_get_first_device'; have 'struct accfg_device *(struct accfg_ctx *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:143:22: note: in expansion of macro 'DML_HW_API'
  143 | struct accfg_device *DML_HW_API(context_get_first_device)(struct accfg_ctx *ctx)
      |                      ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_context_get_first_device' with type 'struct accfg_device *(struct accfg_ctx *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:58:22: note: in expansion of macro 'DML_HW_API'
   58 | struct accfg_device *DML_HW_API(context_get_first_device)(struct accfg_ctx *ctx);
      |                      ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: return type is an incomplete type
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:188:21: note: in expansion of macro 'DML_HW_API'
  188 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq)
      |                     ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_work_queue_get_state'; have 'void(struct accfg_wq *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:188:21: note: in expansion of macro 'DML_HW_API'
  188 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq)
      |                     ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_work_queue_get_state' with type 'enum accfg_wq_state(struct accfg_wq *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:82:21: note: in expansion of macro 'DML_HW_API'
   82 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq);
      |                     ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: In function 'dsa_work_queue_get_state':
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:193:12: warning: 'return' with a value, in function returning void
  193 |     return -1;
      |            ^
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: declared here
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:188:21: note: in expansion of macro 'DML_HW_API'
  188 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq)
      |                     ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: At top level:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: return type is an incomplete type
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:197:20: note: in expansion of macro 'DML_HW_API'
  197 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq)
      |                    ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_work_queue_get_mode'; have 'void(struct accfg_wq *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:197:20: note: in expansion of macro 'DML_HW_API'
  197 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq)
      |                    ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_work_queue_get_mode' with type 'enum accfg_wq_mode(struct accfg_wq *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:84:20: note: in expansion of macro 'DML_HW_API'
   84 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq);
      |                    ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: In function 'dsa_work_queue_get_mode':
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:202:12: warning: 'return' with a value, in function returning void
  202 |     return 2;
      |            ^
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: declared here
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:197:20: note: in expansion of macro 'DML_HW_API'
  197 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq)
      |                    ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: At top level:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: return type is an incomplete type
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:224:25: note: in expansion of macro 'DML_HW_API'
  224 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device)
      |                         ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_device_get_state'; have 'void(struct accfg_device *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:224:25: note: in expansion of macro 'DML_HW_API'
  224 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device)
      |                         ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_device_get_state' with type 'enum accfg_device_state(struct accfg_device *)'
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:68:25: note: in expansion of macro 'DML_HW_API'
   68 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device);
      |                         ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: In function 'dsa_device_get_state':
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:229:12: warning: 'return' with a value, in function returning void
  229 |     return -1;
      |            ^
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: declared here
   49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
      |                                         ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:224:25: note: in expansion of macro 'DML_HW_API'
  224 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device)
      |                         ^~~~~~~~~~

Configured with:

cmake -GNinja -S . -B build -DCMAKE_BUILD_TYPE=Release '-DCMAKE_C_FLAGS_RELEASE=-O2 -DNDEBUG' '-DCMAKE_CXX_FLAGS_RELEASE=-O2 -DNDEBUG' -DCMAKE_INSTALL_PREFIX=/usr/x86_64-w64-mingw32ucrt/sys-root/local -DDML_BUILD_EXAMPLES=OFF -DDML_BUILD_TESTS=OFF

A question about dml_finalize_job

In the example code "multi_socket_example.c", the cleanup loop finalizes a job pointer computed with a constant offset, "jobs + SOCKET_COUNT":

cleanup:
    for (uint32_t i = 0; i < SOCKET_COUNT; ++i)
    {
        dml_finalize_job(jobs + SOCKET_COUNT);
    }

Why is a constant used here rather than a variable offset such as "jobs + i"?

cleanup:
    for (uint32_t i = 0; i < SOCKET_COUNT; ++i)
    {
        dml_finalize_job(jobs + i);
    }
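One related point, offered as a sketch rather than a confirmed fix: the multi-socket example quoted elsewhere on this page addresses job i at byte offset job_size * i, so a cleanup loop would presumably need the same stride instead of plain pointer arithmetic on the job type. The helper below is hypothetical and uses raw bytes in place of dml_job_t so that it runs without the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Visit every job in a buffer that stores `count` jobs back to back,
// each occupying `job_size` bytes (the size reported at run time).
// Hypothetical helper: real code would cast each pointer to dml_job_t*
// and pass it to dml_finalize_job().
template <typename Fn>
void for_each_job(std::uint8_t* jobs, std::size_t job_size, std::size_t count, Fn&& fn) {
    for (std::size_t i = 0; i < count; ++i) {
        fn(jobs + job_size * i);  // byte stride, independent of sizeof(job type)
    }
}
```

The design point is that the index i, not the constant SOCKET_COUNT, selects the job, and that the stride is the run-time job size.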

Incorrect handling of PAGE_FAULT_MASK

The condition in core_interconnect.cpp:138 is incorrect. It should be:

if (is_finished && (status & 0x7f) == page_fault_mask)

(to mask out the READ/WRITE page fault bit, 0x80).
The same applies at line 107.
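A standalone sketch of the masked comparison is given here. The constant names are hypothetical, and page_fault_code = 0x03 is an assumption about the value behind page_fault_mask; only the 0x7f mask and the 0x80 read/write bit come from the issue text.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint8_t page_fault_code  = 0x03;  // assumption: status code for a page fault
constexpr std::uint8_t fault_rw_bit     = 0x80;  // READ/WRITE page fault bit named in the issue
constexpr std::uint8_t status_code_mask = 0x7f;  // everything except the READ/WRITE bit

// Compare only the status code, ignoring the READ/WRITE bit.
constexpr bool is_page_fault(std::uint8_t status) {
    return (status & status_code_mask) == page_fault_code;
}

// The unmasked comparison, shown for contrast: it misses faults
// that are reported with the READ/WRITE bit set.
constexpr bool is_page_fault_unmasked(std::uint8_t status) {
    return status == page_fault_code;
}
```

With these definitions, a status of page_fault_code | fault_rw_bit is recognized by the masked check but not by the unmasked one.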

An issue about Multi-Socket sample code

Issue description:
The multi-socket sample code sets the socket count to 4 by default, and it produces different results on different configurations/SKUs:
#define SOCKET_COUNT 4u

Config 1:
CPU: Intel(R) Xeon(R) Platinum 8490H
Socket : 2
DSA device per Socket: 4
Enable 1 device: dsa0 (on socket0)

All SOCKET_COUNT values from 1 to 4 run successfully.

Config 2:
CPU: Intel(R) Xeon(R) Platinum 8470
Socket : 2
DSA device per Socket: 1

setup1: // error failed to submit to node0
Enable 1 device: dsa0 (on socket0)
SOCKET_COUNT=4

setup2: // error failed to submit to node1
Enable 1 device: dsa0 (on socket0)
Enable 1 device: dsa1 (on socket1)
SOCKET_COUNT=4

setup3: // successful
Enable 1 device: dsa0 (on socket0)
Enable 1 device: dsa1 (on socket1)
SOCKET_COUNT=2

setup4: // successful
Enable 1 device: dsa0 (on socket0)
SOCKET_COUNT=4
Commented out code: current_job->numa_id = i

for (uint32_t i = 0; i < SOCKET_COUNT; ++i)
   {
       const uint32_t chunk_size = transfer_size / SOCKET_COUNT;

       dml_job_t* current_job = (dml_job_t*)((uint8_t*)jobs + (job_size * i));

       current_job->operation             = DML_OP_MEM_MOVE;
       current_job->source_first_ptr      = src + (chunk_size * i);
       current_job->destination_first_ptr = dst + (chunk_size * i);
       current_job->source_length         = chunk_size;
       current_job->flags                 = DML_FLAG_PREFETCH_CACHE;
       //current_job->numa_id               = i;
   }

Why do these setups give different results? Is there a logic issue in DML's NUMA node check?

Question about async mode of DML

We have some questions about the async mode of DML; thanks in advance for any comments:
(1) If one device is configured with two or more SWQs, does DML's async mode submit jobs to every WQ? How are jobs selected and allocated to these WQs, and is the load balanced across them?
(2) If a WQ has two or more engines, do jobs complete in order or out of order? In other words, is there a flag that makes dml_wait_job() preserve the job execution sequence?

Job size seems too big

The job size returned by the dml_get_job_size(PATH, &job_size_ptr) function seems too big. I get the value '98816' regardless of the path used (DML_PATH_HW/DML_PATH_SW).
Is ~100kB memory allocation for job buffer within the expected range?

SIGILL on many Intel CPUs

Hi! I'm afraid that the detection of supported CPU extensions is flaky. On a bunch of machines I've tested, most examples die at an AVX instruction not supported by the CPU. This includes even Pentium 4 which you explicitly list as supported!

Machines I've tested:

uarch           model
Gemini Lake     J4115
Skylake         i7-6700K
Pinnacle Ridge  TR 2990WX
Pentium 4       f15 m4 s3
Braswell        N3160
Pentium 4       f15 m4 s1 DNC (32-bit)
Phenom 2        Thuban

You check cpuid, but take only cache info. The presence of AVX/AVX2/AVX512/etc is readily available there.
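As an illustration of run-time feature detection, the sketch below uses the GCC/Clang built-ins __builtin_cpu_init and __builtin_cpu_supports on x86 and falls back to a scalar level everywhere else. The simd_level enum and kernel_name function are hypothetical names for this example and do not reflect how DML dispatches internally.

```cpp
#include <cassert>

// Hypothetical dispatch levels for this sketch.
enum class simd_level { scalar, avx2, avx512 };

// Pick the widest instruction set the running CPU reports.
// On compilers or architectures without the x86 built-ins,
// the scalar level is returned.
inline simd_level detect_simd_level() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) {
        return simd_level::avx512;
    }
    if (__builtin_cpu_supports("avx2")) {
        return simd_level::avx2;
    }
#endif
    return simd_level::scalar;
}

// Name of the kernel that would be selected for a given level.
inline const char* kernel_name(simd_level level) {
    switch (level) {
        case simd_level::avx512: return "avx512";
        case simd_level::avx2:   return "avx2";
        case simd_level::scalar: return "scalar";
    }
    return "scalar";
}
```

A library would call detect_simd_level() once at start-up and route every operation through the matching kernel, so that a CPU without AVX never reaches an AVX instruction.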

Unknown buffer size limitation for CRC operation

What is the acceptable input buffer size for the CRC operation?

I experimented with different CRC buffer sizes.
The DML library accepts different sizes, but in some cases it returns an error or even crashes with a segmentation fault.

Example execution:

[bgrzesko@fl31ca105bs0411 build]$ ./examples/low-level-api/ll_crc_example_1KB hardware_path
The example will be run on the hardware path.
Starting CRC job example.
Caclulating CRC for region of size 1KB.
Calculated CRC is: 0x2cdf6e8f
Finished successfully.
[bgrzesko@fl31ca105bs0411 build]$ ./examples/low-level-api/ll_crc_example_4MB hardware_path
The example will be run on the hardware path.
Starting CRC job example.
Caclulating CRC for region of size 4MB.
An error (15) occured during job execution.
[bgrzesko@fl31ca105bs0411 build]$ ./examples/low-level-api/ll_crc_example_16MB hardware_path
Segmentation fault (core dumped)
[bgrzesko@fl31ca105bs0411 build]$ 

How to reproduce (apply the diff below and compile the example):

[bgrzesko@fl31ca105bs0411 build]$ git diff
diff --git a/examples/low-level-api/crc_example.c b/examples/low-level-api/crc_example.c
index 3c12df2..ad03704 100644
--- a/examples/low-level-api/crc_example.c
+++ b/examples/low-level-api/crc_example.c
@@ -9,7 +9,8 @@
 #include "dml/dml.h"
 #include "examples_utils.h"
 
-#define BUFFER_SIZE 1024 // 1 KB
+//#define BUFFER_SIZE 4 * 1024 * 1024 // 4 MB
+#define BUFFER_SIZE 16 * 1024 * 1024 // 16 MB
 
 /*
 * This example demonstrates how to create and run a crc operation.
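
A side note on the reproduction, offered as a sketch rather than a diagnosis: the diff does not show how the example declares its buffers. If they are local arrays sized by BUFFER_SIZE (an assumption), a 16 MB array would exceed the 8 MB default stack limit common on Linux, and the process could crash before DML is involved at all. The hypothetical helper below allocates the source buffer on the heap instead, and parenthesizes the macro so that it expands safely inside larger expressions. It does not address the error 15 seen at 4 MB.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parenthesized so that e.g. `n / BUFFER_SIZE` divides by the whole product. */
#define BUFFER_SIZE (16u * 1024u * 1024u) /* 16 MB */

/* Hypothetical helper (not part of the DML examples): heap-allocate and
 * fill the CRC source buffer. Returns NULL if the allocation fails. */
uint8_t *alloc_source_buffer(void) {
    uint8_t *buf = malloc(BUFFER_SIZE);
    if (buf != NULL) {
        for (size_t i = 0; i < BUFFER_SIZE; ++i) {
            buf[i] = (uint8_t)i; /* repeating 0..255 pattern */
        }
    }
    return buf;
}
```

The caller owns the returned buffer and releases it with free() once the job has completed.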

Is DML a wrapper for Intel DSA?

I've read the document, but I'm still confused about the hardware and software options.

Is DML a wrapper for Intel DSA?
Since DMA also supports asynchronous data movement, does DML support DMA as well?

And is dml::software implemented by starting another thread (or coroutine)?

Thanks in advance!

Debugging hardware path on Sapphire Rapids

Hi,

I am unable to run hardware mode examples/tests. I did a fresh clone from master and built using GCC.

To configure the DSA and kernel, I followed the DSA user guide. I believe I have configured the DSA correctly because I can run the dsa_perf_micros scripts
e.g.

sudo ./src/dsa_perf_micros -n128 -s16k -j -c -f -i8000 -k5 -w0 -zF,F -o3
[sudo] password for user1:
./src/dsa_perf_micros -n128 -s16k -j -c -f -i8000 -k5 -w0 -zF,F -o3
-j option is deprecated (default behavior)
blen                      16384
bstride                   16384
bstride                   16384
nb_bufs                     128
pg_size                       0
wq_type                       0
batch_sz                      1
iter                       8000
nb_cpus                       1
var_mmio                      1
dma                           1
verify                        1
misc_flags                    0
access_op[0]               Write
access_op[1]               Write
place_op[0]              Memory
place_op[1]              Memory
flags_cmask            ffffffff
flags_smask                   0
flags_nth_desc                1
nb_numa_node                 16
cpu_desc_work                 0
Memory affinity
CPUs in node 0:		-1 -1
Buffer Offsets 		0 0
GB per sec = 31.170166 cpu 6.270452 kopsrate = 1902

However, I cannot run any of the tests/examples in DML with hardware mode, e.g.

[user1@sprnode5 high-level-api]$ ./hl_mem_move_example_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Failure occurred.
[user1@sprnode5 high-level-api]$ ./hl_mem_move_example_example software_path
Executing using dml::software path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Finished successfully.

(Note that I get the same output regardless of whether I use sudo; I have chowned the work queues to set the group ownership to my user's group.)
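
One quick way to double-check the ownership change is to ask the kernel whether the current user may read and write the work queue device node. The sketch below is a generic POSIX check, not a DML API; the helper name check_rw_access is hypothetical, and the device path /dev/dsa/wq0.0 in the usage note is an assumption based on the usual idxd layout for user-type work queues.

```c
#include <errno.h>
#include <unistd.h>

/* Returns 0 when the calling user may read and write `path`,
 * otherwise the errno value reported by access(). */
int check_rw_access(const char *path) {
    if (access(path, R_OK | W_OK) == 0) {
        return 0;
    }
    return errno;
}
```

For example, check_rw_access("/dev/dsa/wq0.0") returning EACCES would point at permissions, while ENOENT would suggest the work queue is not enabled or not exposed as a character device. Since access() uses the real user ID, running the check under sudo only tests root's permissions.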

Similarly, all tests pass with ./tests --path=sw, and I get a very large stream of failing output with ./tests --path=hw. A small sample here

Details:
CPU: Intel (R) Xeon (R) CPU Max 9480

[user1@sprnode5 dsa_perf_micros]$ uname -r
6.3.0-2.el9.elrepo.x86_64
[user1@sprnode5 dsa_perf_micros]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.1 (Blue Onyx)"
[user1@sprnode5 dsa_perf_micros]$ gcc --version
gcc (GCC) 12.2.0

Full DSA config here

Is there anything in the setup I am forgetting/missing?

Thanks in advance,
Hamish

Segfault with multi-thread since the port ptr becomes null

Using PR #18, run the multi-threaded performance test with the following command:
./examples/dml_example_c_api_perftest 128 10000 4096 0 16
allocatate the 4096 aligned src=0x7f3a7a388000, dst=0x7f3a79b86000

jobs=0x7f3a7c82f010
jobs=0x7f3a7c83ac10
jobs=0x7f3a7c846810
jobs=0x7f3a7c852410
jobs=0x7f3a7c85e010
jobs=0x7f3a7c869c10
jobs=0x7f3a7c875810
jobs=0x7f3a7c881410
jobs=0x7f3a7c88d010
jobs=0x7f3a7c898c10
Starting example for multi-job memory move jobs=0x7f3a7c846810:
jobs=0x7f3a7c8a4810
Starting example for multi-job memory move jobs=0x7f3a7c83ac10:
Starting example for multi-job memory move jobs=0x7f3a7c82f010:
jobs=0x7f3a7c8b0410
jobs=0x7f3a7c8bc010
Starting example for multi-job memory move jobs=0x7f3a7c869c10:
Starting example for multi-job memory move jobs=0x7f3a7c85e010:
jobs=0x7f3a7c8c7c10
Starting example for multi-job memory move jobs=0x7f3a7c852410:
jobs=0x7f3a7c8d3810
Starting example for multi-job memory move jobs=0x7f3a7c8c7c10:
jobs=0x7f3a7c8df410
Starting example for multi-job memory move jobs=0x7f3a7c8d3810:
Starting example for multi-job memory move jobs=0x7f3a7c898c10:
Starting example for multi-job memory move jobs=0x7f3a7c8b0410:
Starting example for multi-job memory move jobs=0x7f3a7c8df410:
Starting example for multi-job memory move jobs=0x7f3a7c881410:
Starting example for multi-job memory move jobs=0x7f3a7c88d010:
Starting example for multi-job memory move jobs=0x7f3a7c8a4810:
Starting example for multi-job memory move jobs=0x7f3a7c8bc010:
Starting example for multi-job memory move jobs=0x7f3a7c875810:
Segmentation fault (core dumped)

Using gdb, we find that the port ptr is null:
Thread 13 "dml_example_c_a" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef1f4700 (LWP 123148)]
0x0000000000422057 in dml::core::dispatcher::hw_queue::enqueue_descriptor (this=0x62dc38 dml::core::dispatcher::instance+56, desc_ptr=0x7ffff7fac4c0) at /home/dennis/DML/sources/core/src/hw_dispatcher/hw_queue.cpp:92
92 : "a"(current_place_ptr), "d"(desc_ptr));
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.5.0-4.el8_5.x86_64 libpmem-1.6.1-1.el8.x86_64 libstdc++-8.5.0-4.el8_5.x86_64 libuuid-2.32.1-28.el8.x86_64
(gdb) p current_place_ptr
$1 = (void *) 0x0
(gdb)

From the logic, the port ptr can't be null because portal_mask_ never changes. In this case, however, portal_mask_ changed to zero, so I suspect that a data overflow overwrote this data and caused the issue.

examples not getting compiled by default

According to the documentation here, the examples should be compiled; however, this is not specified in CMakeLists.txt.

Please add this line to CMakeLists.txt so that the examples are compiled when DML is built.

image
