Intel® Data Mover Library (Intel® DML)
Home Page: https://intel.github.io/DML/
License: MIT License
Consider the following code:
{
auto h = dml::submit<path>(...);
}
The handler returned from submit may be destroyed before the operation completes. I don't see any mention in the documentation that the handler should be kept alive until the operation completes, so I would assume this code is valid.
However, I believe this can cause use-after-free in the hardware path since DML will try to write the completion status to the descriptor (which, if I understand correctly, is owned by the handler). Possibly the same is true for C API and dml_finalize_job().
Is my understanding correct, or is this already handled by DML somehow? If not, I think the best way to avoid this would be to wait for the operation to complete inside the handler destructor.
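To make the proposed destructor behaviour concrete, here is a minimal, self-contained sketch. It does not use DML at all: `completion_record`, `fake_device` and `waiting_handler` are hypothetical stand-ins, and the "device" is a single-threaded simulation that writes the status only when `step()` is called. The point is just that the handler keeps its record alive until the status has been written.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <memory>

// Hypothetical stand-in for the completion record that the device writes to.
struct completion_record {
    std::uint8_t status = 0;  // 0 = still pending, non-zero = finished
};

// Hypothetical single-threaded "device": step() finishes one queued operation
// by writing into its completion record, the way hardware would.
class fake_device {
public:
    void submit(completion_record* record) { pending_.push_back(record); }

    bool step() {
        if (pending_.empty()) return false;
        pending_.front()->status = 1;  // a use-after-free if the record were already gone
        pending_.pop_front();
        ++writes_;
        return true;
    }

    int writes() const { return writes_; }

private:
    std::deque<completion_record*> pending_;
    int writes_ = 0;
};

// Hypothetical handler that owns the record and waits for completion on destruction.
class waiting_handler {
public:
    explicit waiting_handler(fake_device& device)
        : device_(&device), record_(new completion_record) {
        device_->submit(record_.get());
    }

    waiting_handler(waiting_handler&&) = default;
    waiting_handler& operator=(waiting_handler&&) = delete;

    ~waiting_handler() {
        // A moved-from handler owns nothing and has nothing to wait for.
        while (record_ && record_->status == 0) {
            const bool progressed = device_->step();
            assert(progressed && "a pending record must still be queued");
            (void)progressed;
        }
    }

    bool is_finished() const { return record_->status != 0; }

private:
    fake_device* device_;
    std::unique_ptr<completion_record> record_;
};
```

With this shape, letting a handler go out of scope right after submission is safe in the simulation, at the cost of a blocking destructor. Whether that trade-off is acceptable for the real library is a design question for the maintainers.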
libdmlhl.a gets installed into /usr/lib instead of /usr/lib64 (on systems where libdir is lib64).
I'm unable to use the HW path for mem move even after configuring the DSA devices:
$ sudo ./hl_mem_move_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
dml-diag: DML version TODO
dml-diag: Struct size: 3328 B
dml-diag: loading driver: libaccel-config.so.1
Failure occurred.
When manually calling dml::memmove, I get error code 16 that corresponds to internal library error. Is there a way to debug this? Any help would be really appreciated. Thanks!
Processor: Intel(R) Xeon(R) Silver 4416+
I have configured DSA using the python script:
$ sudo python3 accel_conf.py --load=../configs/1n1d1e1w-s-n1.conf
Filter:
Disabling active devices
dsa0 - done
Loading configuration - done
Additional configuration steps
Force block on fault: False
Enabling configured devices
dsa0 - done
wq0.0 - done
Checking configuration
node: 0; device: dsa0; group: group0.0
wqs: wq0.0
engines: engine0.0
I'm also running a relatively recent kernel version:
$ uname -a
Linux machinename 6.8.0-rc7 #1 SMP PREEMPT_DYNAMIC Thu Mar 7 11:11:46 PST 2024 x86_64 x86_64 x86_64 GNU/Linux
Kernel cmdline:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-rc7 root=UUID=4f739d8f-4f15-4fc3-b419-bbb0202131b3 ro splash earlyprintk=ttyS1,115200 console=ttyS1,115200 console=ttyS0,115200 memmap=8G!16G nokaslr movable_node=2 intel_iommu=on,sm_on iommu=on vt.handoff=7
lspci output for one of the two devices available:
$ sudo lspci -vvv -s 75:01.0
75:01.0 System peripheral: Intel Corporation Device 0b25
  Subsystem: Intel Corporation Device 0000
  Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
  Latency: 0, Cache Line Size: 64 bytes
  NUMA node: 0
  IOMMU group: 1
  Region 0: Memory at 21bffff50000 (64-bit, prefetchable) [size=64K]
  Region 2: Memory at 21bffff20000 (64-bit, prefetchable) [size=128K]
  Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
    DevCap: MaxPayload 512 bytes, PhantFunc 0
      ExtTag+ RBE+ FLReset+
    DevCtl: CorrErr- NonFatalErr- FatalErr+ UnsupReq-
      RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
      MaxPayload 512 bytes, MaxReadReq 512 bytes
    DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
    DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
      10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
      EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
      FRS-
      AtomicOpsCap: 32bit- 64bit- 128bitCAS-
    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+ LTR- OBFF Disabled,
      AtomicOpsCtl: ReqEn-
  Capabilities: [80] MSI-X: Enable+ Count=9 Masked-
    Vector table: BAR=0 offset=00002000
    PBA: BAR=0 offset=00003000
  Capabilities: [90] Power Management version 3
    Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
  Capabilities: [100 v2] Advanced Error Reporting
    UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
    UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
    CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
    CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
    AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
      MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
    HeaderLog: 00000000 00000000 00000000 00000000
  Capabilities: [150 v1] Latency Tolerance Reporting
    Max snoop latency: 0ns
    Max no snoop latency: 0ns
  Capabilities: [160 v1] Transaction Processing Hints
    Device specific mode supported
    Steering table in TPH capability structure
  Capabilities: [170 v1] Virtual Channel
    Caps: LPEVC=1 RefClk=100ns PATEntryBits=1
    Arb: Fixed+ WRR32- WRR64- WRR128-
    Ctrl: ArbSelect=Fixed
    Status: InProgress-
    VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
      Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
      Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
      Status: NegoPending- InProgress-
    VC1: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
      Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
      Ctrl: Enable+ ID=1 ArbSelect=Fixed TC/VC=02
      Status: NegoPending- InProgress-
  Capabilities: [200 v1] Designated Vendor-Specific: Vendor=8086 ID=0005 Rev=0 Len=24 <?>
  Capabilities: [220 v1] Address Translation Service (ATS)
    ATSCap: Invalidate Queue Depth: 00
    ATSCtl: Enable+, Smallest Translation Unit: 00
  Capabilities: [230 v1] Process Address Space ID (PASID)
    PASIDCap: Exec- Priv+, Max PASID Width: 14
    PASIDCtl: Enable+ Exec- Priv+
  Capabilities: [240 v1] Page Request Interface (PRI)
    PRICtl: Enable+ Reset-
    PRISta: RF- UPRGI- Stopped+
    Page Request Capacity: 00000200, Page Request Allocation: 00000200
  Kernel driver in use: idxd
  Kernel modules: idxd
When I run the command, it shows an error like this:
./ll_crc_example hardware_path
The example will be run on the hardware path.
Starting CRC job example.
Caclulating CRC for region of size 1KB.
An error (100) occured during job execution.
So I tried 2 steps:
failed in dsa0/wq0.0
enabled 0 wq(s) out of 1
Checking configuration
No active devices
How can I fix it? And after step 2, should it be OK to run the command ./ll_crc_example hardware_path?
In CRC and COPY_WITH_CRC commands it is not allowed to set initial CRC values after PAGE_FAULT with no bytes processed.
update_crc_for_continuation() and update_copy_crc_for_continuation() should include the condition:
if (crc_record.bytes_completed() != 0) {
crc_dsc.crc_seed() = crc_record.crc_value();
}
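A self-contained sketch of that guard follows. The `crc_descriptor` and `crc_completion` structs are hypothetical, simplified stand-ins for the library's descriptor and completion-record views, and advancing the source address and remaining size is an assumption added only to make the example complete. The part taken from the report is the `bytes_completed != 0` check around the seed update.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the CRC completion record.
struct crc_completion {
    std::uint32_t bytes_completed;
    std::uint32_t crc_value;
};

// Hypothetical, simplified stand-in for the CRC descriptor.
struct crc_descriptor {
    std::uint64_t source_address;
    std::uint32_t transfer_size;
    std::uint32_t crc_seed;
};

// Prepares the descriptor for resubmission after a page fault.
inline void continue_crc_after_fault(crc_descriptor& dsc, const crc_completion& rec) {
    assert(rec.bytes_completed <= dsc.transfer_size);

    // Guard from the report: with no bytes processed there is no partial CRC,
    // so the original seed must be left untouched.
    if (rec.bytes_completed != 0) {
        dsc.crc_seed = rec.crc_value;
    }

    // ASSUMPTION for illustration: skip the bytes that were already processed.
    dsc.source_address += rec.bytes_completed;
    dsc.transfer_size -= rec.bytes_completed;
}
```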
The shared library gets installed as /usr/lib64/libdml.so instead of /usr/lib64/libdml.so.1.0.
This violates Linux shared-library versioning conventions and should be fixed (it also prevents managing future compatibility).
The descriptor struct contains a redundant initialization of the bytes array in its default ctor. I believe the following code can be safely removed: https://github.com/intel/DML/blob/develop/include/dml/detail/common/types.hpp#L21-L24. The bytes array is already zero-initialized.
The same applies to completion_record.
I am not sure about the performance impact. I hope the compiler optimizes it out when compiled with -O2.
I changed job_api_example to use the HW path:
int main(const int argc, char **const argv)
{
// Variables
dml_job_t *dml_job_ptr = NULL;
uint32_t total_fails = 0u;
// Allocate dml_job_t
dml_job_ptr = init_dml_job(DML_PATH_SW);
After changing this to DML_PATH_HW, I just get a lot of errors and failures, without any log to show the root cause or suggest a next action.
[root@localhost dml_job_api]# ./job_api_samples
Intel(R) Data Mover Library Job API Examples
============================== LEGALS ==============================
Copyright (C) 2021 Intel Corporation
SPDX-License-Identifier: MIT
====================================================================
------------------------------------------
Run example # 1
Example of using Intel DML DML_OP_NOP operation
--- Buffers size to DML_OP_NOP operation: 128
--- DML_OP_NOP property: no any specific properties
Example return: FAIL (Status 100)
------------------------------------------
Run example # 2
Example of using Intel DML DML_OP_MEM_MOVE operation
--- Buffers size to DML_OP_MEM_MOVE operation: 128
--- DML_OP_MEM_MOVE property: none
Example return: FAIL (Status 100)
------------------------------------------
Run example # 3
Example of using Intel DML DML_OP_FILL operation
--- Buffers size to DML_OP_FILL operation: 128
--- DML_OP_FILL property: no any specific properties
Example return: FAIL (Status 100)
------------------------------------------
Run example # 4
Example of using Intel DML DML_OP_COMPARE_PATTERN operation
--- Buffers size to DML_OP_COMPARE_PATTERN operation: 128
--- DML_OP_COMPARE_PATTERN property: none
Array is equal to pattern
--- Status : 100
--- Result : 0
--- Offset : 0
Array is NOT equal to pattern
--- Status : 100
--- Result : 0
--- Offset : 0
Example return: FAIL (Status 100)
------------------------------------------
Run example # 5
Example of using Intel DML DML_OP_DIF_UPDATE operation
--- Buffers size to DML_OP_DIF_UPDATE operation: 4104
--- DML_OP_DIF_UPDATE property: BLOCK_SIZE is 4096
Example return: FAIL (Status 100)
------------------------------------------
Run example # 6
Example of using Intel DML DML_OP_DIF_INSERT operation
--- Buffers size to DML_OP_DIF_INSERT operation: 4096 and 4104
--- DML_OP_DIF_INSERT property: BLOCK_SIZE is 4096
Example return: FAIL (Status 100)
------------------------------------------
Run example # 7
Example of using Intel DML DML_OP_DIF_CHECK operation
--- Buffers size to DML_OP_DIF_CHECK operation: 4096 and 4104
--- DML_OP_DIF_CHECK property: BLOCK_SIZE is 4096
Example return: FAIL (Status 100)
------------------------------------------
Run example # 8
Example of using Intel DML DML_OP_DIF_STRIP operation
--- Buffers size to DML_OP_DIF_STRIP operation: 4104 and 4096
--- DML_OP_DIF_STRIP property: BLOCK_SIZE is 4096
Example return: FAIL (Status 100)
------------------------------------------
Run example # 9
Example of using Intel DML DML_OP_CACHE_FLUSH operation
--- Buffers size to DML_OP_CACHE_FLUSH operation: 128
--- DML_OP_CACHE_FLUSH property: none
Example return: FAIL (Status 100)
------------------------------------------
Run example # 10
Example of using Intel DML DML_OP_BATCH operation
--- Buffers size to DML_OP_BATCH operation: 128
--- DML_OP_BATCH property: none
Example return: FAIL (Status 100)
------------------------------------------
Run example # 11
Example of using Intel DML DML_OP_CRC operation
--- Buffers size to DML_OP_CRC operation: 128
--- DML_OP_CRC property: none
Example return: FAIL (Status 100)
------------------------------------------
Run example # 12
Example of using Intel DML DML_OP_CRC_COPY operation
--- Buffers size to DML_OP_CRC_COPY operation: 128
--- DML_OP_CRC_COPY property: none
Example return: FAIL (Status 100)
------------------------------------------
Run example # 13
Example of using Intel DML DML_OP_CRC operation with dml_submit_job
--- Buffers size to DML_OP_CRC operation: 128
--- DML_OP_CRC property: no any specific properties
Example return: FAIL (Status 100)
------------------------------------------
Run example # 14
Example of using Intel DML DML_OP_DELTA_CREATE operation
--- Buffers size to DML_OP_DELTA_CREATE operation: 128
--- DML_OP_DELTA_CREATE property: none
Example return: FAIL (Status 100)
------------------------------------------
Run example # 15
Example of using Intel DML DML_OP_DELTA_APPLY operation
--- Buffers size to DML_OP_DELTA_APPLY operation: 128
--- DML_OP_DELTA_APPLY property: none
Example return: FAIL (Status 100)
====== Examples Execution Completed ======
--- Total Samples run: 15
--- Samples completed with OK status: 0
--- Samples completed with FAIL status: 15
[root@localhost dml_job_api]#
example: /usr/dml/cmake/DmlConfig.cmake, which is very much an invalid location!
The current examples include no detailed guide or code showing how to use the hardware path.
FILL: to avoid a "shifted pattern" at a page boundary, advance only by whole multiples of 8 bytes (I assume that the 16B pattern is not handled yet and 8 is OK):
uint32_t processed = 8 * (fill_record.bytes_completed() / 8);
fill_dsc.transfer_size() -= processed;
fill_dsc.destination_address() += processed;
Does it apply to COMPARE_PATTERN?
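The FILL arithmetic above can be checked in isolation with a small sketch. `fill_descriptor` and `advance_fill` are hypothetical names, not library types. The sketch assumes an 8-byte pattern and a destination that initially starts on a pattern boundary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the FILL descriptor.
struct fill_descriptor {
    std::uint64_t destination_address;
    std::uint32_t transfer_size;
};

constexpr std::uint32_t pattern_size = 8;  // 8-byte pattern, as assumed in the report

// Advances the descriptor past the bytes already filled, rounded down to whole
// patterns so that the resubmitted fill starts on a pattern boundary.
inline std::uint32_t advance_fill(fill_descriptor& dsc, std::uint32_t bytes_completed) {
    assert(bytes_completed <= dsc.transfer_size);
    const std::uint32_t processed = pattern_size * (bytes_completed / pattern_size);
    dsc.transfer_size -= processed;
    dsc.destination_address += processed;
    return processed;
}
```

For example, if 29 bytes completed before the fault, only 24 are counted as done. The remaining 5 bytes are filled again, which is harmless because the same pattern bytes are rewritten.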
DIF_INSERT:
uint32_t blocks_processed = bytes_completed / block_size;
bytes_completed -= blocks_processed * block_size;
source_address += blocks_processed * block_size;
destination_address += blocks_processed * (block_size + DSA_DIF_SIZE);
A very similar pattern should apply to DIF_STRIP.
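A minimal sketch of the DIF_INSERT adjustment, for checking the arithmetic only. The struct and function names are hypothetical. The sketch interprets the "bytes_completed -=" line as shrinking the remaining transfer size, which is an assumption, and it takes the DIF size as 8 bytes, consistent with the 4096 and 4104 buffer sizes printed by the examples.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t dif_size = 8;  // bytes of DIF appended to each block

// Hypothetical, simplified stand-in for the DIF_INSERT descriptor.
struct dif_insert_descriptor {
    std::uint64_t source_address;
    std::uint64_t destination_address;
    std::uint32_t transfer_size;  // remaining source bytes
};

// Advances the descriptor past fully processed blocks and returns their count.
// A partially processed block is redone from its start.
inline std::uint32_t advance_dif_insert(dif_insert_descriptor& dsc,
                                        std::uint32_t bytes_completed,
                                        std::uint32_t block_size) {
    assert(block_size != 0);
    const std::uint32_t blocks_processed = bytes_completed / block_size;
    const std::uint32_t consumed = blocks_processed * block_size;
    assert(consumed <= dsc.transfer_size);

    dsc.transfer_size -= consumed;
    dsc.source_address += consumed;
    // Each source block grows by dif_size bytes on the destination side.
    dsc.destination_address += std::uint64_t{blocks_processed} * (block_size + dif_size);
    return blocks_processed;
}
```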
Example code: mem_move
Change the source and destination buffers from stack allocation to heap allocation, and initialize them:
uint8_t* source = (uint8_t *)malloc(BUFFER_SIZE);
uint8_t* destination = (uint8_t *)malloc(BUFFER_SIZE);
memset(source, 1, BUFFER_SIZE);
memset(destination, 0, BUFFER_SIZE);
Compile and run: it returns error 102.
If I change the compile optimization, it runs successfully, for example:
cmake -DCMAKE_BUILD_TYPE=Debug ..
or
cmake -DCMAKE_BUILD_TYPE=-O0 ..
The link to Issue Reporting is wrong at https://github.com/intel/DML/tree/develop#how-to-report-issues
/builddir/build/BUILD/DML-1.0.0/sources/c_api/../../include/dml/detail/ml/buffer.hpp:53:36: error: ‘exchange’ is not a member of ‘std’
std::exchange is declared in the <utility> header, which isn't included.
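For reference, a small self-contained example of `std::exchange` with the include it needs. `owned_buffer` is a hypothetical class written for this illustration and is not the library's buffer type.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>  // declares std::exchange (C++14)

// Hypothetical move-only buffer that resets the moved-from object with std::exchange.
class owned_buffer {
public:
    explicit owned_buffer(std::size_t size)
        : data_(new unsigned char[size]()), size_(size) {}

    owned_buffer(owned_buffer&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}

    owned_buffer(const owned_buffer&) = delete;
    owned_buffer& operator=(const owned_buffer&) = delete;
    owned_buffer& operator=(owned_buffer&&) = delete;

    ~owned_buffer() { delete[] data_; }

    std::size_t size() const noexcept { return size_; }
    bool empty() const noexcept { return data_ == nullptr; }

private:
    unsigned char* data_;
    std::size_t size_;
};
```

Some standard library versions pull in `<utility>` transitively through other headers, which is likely why a missing include can go unnoticed until a different compiler or library version is used.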
Release: v1.1.0
Environment: Windows (GCC 13.2.1)
This appears to be caused by:
Because of the #if, the necessary definitions are missing, causing a lot of errors:
FAILED: sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj
/usr/local/bin/x86_64-w64-mingw32ucrt-gcc -DDML_GIT_REVISION=\"N/A\" -D_FORTIFY_SOURCE=2 -I/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/../../../../include -I/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/. -O3 -mcrtdll=ucrt -D_UCRT -O2 -DNDEBUG -fPIC -fstack-protector --param=ssp-buffer-size=8 -fstack-clash-protection -MD -MT sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj -MF sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj.d -o sources/core/src/hw_dispatcher/CMakeFiles/dml_hw_dispatcher.dir/hw_configuration_driver.c.obj -c /build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c
In file included from /build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:9:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:56:47: warning: 'struct accfg_ctx' declared inside parameter list will not be visible outside of this definition or declaration
56 | int32_t DML_HW_API(driver_new_context)(struct accfg_ctx **ctx);
| ^~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:58:66: warning: 'struct accfg_ctx' declared inside parameter list will not be visible outside of this definition or declaration
58 | struct accfg_device *DML_HW_API(context_get_first_device)(struct accfg_ctx *ctx);
| ^~~~~~~~~
In file included from /build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:10:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_driver_new_context'; have 'int32_t(struct accfg_ctx **)' {aka 'int(struct accfg_ctx **)'}
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:134:9: note: in expansion of macro 'DML_HW_API'
134 | int32_t DML_HW_API(driver_new_context)(struct accfg_ctx **ctx)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_driver_new_context' with type 'int32_t(struct accfg_ctx **)' {aka 'int(struct accfg_ctx **)'}
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:56:9: note: in expansion of macro 'DML_HW_API'
56 | int32_t DML_HW_API(driver_new_context)(struct accfg_ctx **ctx);
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_context_get_first_device'; have 'struct accfg_device *(struct accfg_ctx *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:143:22: note: in expansion of macro 'DML_HW_API'
143 | struct accfg_device *DML_HW_API(context_get_first_device)(struct accfg_ctx *ctx)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_context_get_first_device' with type 'struct accfg_device *(struct accfg_ctx *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:58:22: note: in expansion of macro 'DML_HW_API'
58 | struct accfg_device *DML_HW_API(context_get_first_device)(struct accfg_ctx *ctx);
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: return type is an incomplete type
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:188:21: note: in expansion of macro 'DML_HW_API'
188 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_work_queue_get_state'; have 'void(struct accfg_wq *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:188:21: note: in expansion of macro 'DML_HW_API'
188 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_work_queue_get_state' with type 'enum accfg_wq_state(struct accfg_wq *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:82:21: note: in expansion of macro 'DML_HW_API'
82 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq);
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: In function 'dsa_work_queue_get_state':
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:193:12: warning: 'return' with a value, in function returning void
193 | return -1;
| ^
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: declared here
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:188:21: note: in expansion of macro 'DML_HW_API'
188 | enum accfg_wq_state DML_HW_API(work_queue_get_state)(struct accfg_wq *wq)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: At top level:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: return type is an incomplete type
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:197:20: note: in expansion of macro 'DML_HW_API'
197 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_work_queue_get_mode'; have 'void(struct accfg_wq *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:197:20: note: in expansion of macro 'DML_HW_API'
197 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_work_queue_get_mode' with type 'enum accfg_wq_mode(struct accfg_wq *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:84:20: note: in expansion of macro 'DML_HW_API'
84 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq);
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: In function 'dsa_work_queue_get_mode':
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:202:12: warning: 'return' with a value, in function returning void
202 | return 2;
| ^
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: declared here
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:197:20: note: in expansion of macro 'DML_HW_API'
197 | enum accfg_wq_mode DML_HW_API(work_queue_get_mode)(struct accfg_wq *wq)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: At top level:
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: return type is an incomplete type
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:224:25: note: in expansion of macro 'DML_HW_API'
224 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: error: conflicting types for 'dsa_device_get_state'; have 'void(struct accfg_device *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:224:25: note: in expansion of macro 'DML_HW_API'
224 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device)
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: previous declaration of 'dsa_device_get_state' with type 'enum accfg_device_state(struct accfg_device *)'
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_configuration_driver.h:68:25: note: in expansion of macro 'DML_HW_API'
68 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device);
| ^~~~~~~~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c: In function 'dsa_device_get_state':
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:229:12: warning: 'return' with a value, in function returning void
229 | return -1;
| ^
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/legacy_headers/hardware_definitions.h:49:41: note: declared here
49 | #define DML_HW_API(name) DML_HW_STDCALL dsa_##name /**< Declaration macros to manipulate function name */
| ^~~~
/build/BUILD/DML-develop/sources/core/src/hw_dispatcher/hw_configuration_driver.c:224:25: note: in expansion of macro 'DML_HW_API'
224 | enum accfg_device_state DML_HW_API(device_get_state)(struct accfg_device *device)
| ^~~~~~~~~~
Configured with:
cmake -GNinja -S . -B build -DCMAKE_BUILD_TYPE=Release '-DCMAKE_C_FLAGS_RELEASE=-O2 -DNDEBUG' '-DCMAKE_CXX_FLAGS_RELEASE=-O2 -DNDEBUG' -DCMAKE_INSTALL_PREFIX=/usr/x86_64-w64-mingw32ucrt/sys-root/local -DDML_BUILD_EXAMPLES=OFF -DDML_BUILD_TESTS=OFF
In the example code "multi_socket_example.c", the cleanup loop finalizes a pointer offset by a constant, "jobs + SOCKET_COUNT":
cleanup:
for (uint32_t i = 0; i < SOCKET_COUNT; ++i)
{
dml_finalize_job(jobs + SOCKET_COUNT);
}
Why is a constant used here instead of the variable offset "jobs + i"?
cleanup:
for (uint32_t i = 0; i < SOCKET_COUNT; ++i)
{
dml_finalize_job(jobs + i);
}
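One caveat worth checking, as a hedged aside: the setup loop of the same example, quoted elsewhere on this page, locates each job by a byte offset of job_size * i rather than by typed pointer arithmetic. If the job size is only known at run time and is larger than the size of the job struct, then "jobs + i" would not land on the i-th job either. The sketch below illustrates byte-stride indexing with a hypothetical `fake_job` type and `job_at` helper. It is not the library's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the job struct. The real job may be followed by
// extra internal state, so its stride comes from a size queried at run time.
struct fake_job {
    std::uint32_t operation;
};

// Returns the index-th job in a buffer laid out with a run-time stride.
inline fake_job* job_at(fake_job* jobs, std::size_t job_size, std::size_t index) {
    assert(job_size >= sizeof(fake_job));
    return reinterpret_cast<fake_job*>(
        reinterpret_cast<std::uint8_t*>(jobs) + job_size * index);
}
```

Under that assumption, the cleanup loop would call the finalize function on the address computed this way for each i.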
The condition in core_interconnect.cpp:138 is incorrect. It should be:
if (is_finished && (status & 0x7f) == page_fault_mask)
(to get rid of the READ/WRITE page fault bit, 0x80).
The same applies at :107.
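The effect of the mask can be shown with a tiny sketch. The 0x7f mask and the 0x80 READ/WRITE bit come from the report. The value used for `page_fault_code` is an assumption chosen for illustration, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint8_t status_code_mask = 0x7f;     // keeps the status code, drops bit 0x80
constexpr std::uint8_t fault_direction_bit = 0x80;  // READ/WRITE page fault bit
constexpr std::uint8_t page_fault_code = 0x03;      // ASSUMPTION: illustrative value

static_assert((status_code_mask | fault_direction_bit) == 0xff,
              "the mask and the direction bit together cover the whole status byte");

// True when a finished operation reports a page fault, whichever direction faulted.
constexpr bool is_page_fault(bool is_finished, std::uint8_t status) {
    return is_finished && (status & status_code_mask) == page_fault_code;
}
```

Without the mask, a status with the 0x80 bit set compares unequal to the page fault code, so faults in one direction would be missed.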
Issue description:
The multi-socket sample code sets the socket count to 4 by default, and running it on different configs/SKUs gives different results:
#define SOCKET_COUNT 4u
Config 1:
CPU: Intel(R) Xeon(R) Platinum 8490H
Socket : 2
DSA device per Socket: 4
Enable 1 device: dsa0 (on socket0)
All SOCKET_COUNT values from 1 to 4 run successfully.
Config 2:
CPU: Intel(R) Xeon(R) Platinum 8470
Socket : 2
DSA device per Socket: 1
setup1: // error failed to submit to node0
Enable 1 device: dsa0 (on socket0)
SOCKET_COUNT=4
setup2: // error failed to submit to node1
Enable 1 device: dsa0 (on socket0)
Enable 1 device: dsa1 (on socket1)
SOCKET_COUNT=4
setup3: // successful
Enable 1 device: dsa0 (on socket0)
Enable 1 device: dsa1 (on socket1)
SOCKET_COUNT=2
setup4: // successful
Enable 1 device: dsa0 (on socket0)
SOCKET_COUNT=4
Commented out code: current_job->numa_id = i
for (uint32_t i = 0; i < SOCKET_COUNT; ++i)
{
const uint32_t chunk_size = transfer_size / SOCKET_COUNT;
dml_job_t* current_job = (dml_job_t*)((uint8_t*)jobs + (job_size * i));
current_job->operation = DML_OP_MEM_MOVE;
current_job->source_first_ptr = src + (chunk_size * i);
current_job->destination_first_ptr = dst + (chunk_size * i);
current_job->source_length = chunk_size;
current_job->flags = DML_FLAG_PREFETCH_CACHE;
//current_job->numa_id = i;
}
Why are the results different? Is there a logic issue in DML's NUMA node check?
We have some questions about the async mode of DML. Thanks for any comments:
(1) If one device is configured with two or more SWQs, does DML async mode submit jobs to every WQ? How are jobs selected and allocated to these WQs, and does each WQ get a balanced share of the jobs?
(2) If a WQ has two or more engines, do jobs complete out of order or sequentially? In other words, is there a flag that makes dml_wait_job() preserve the job execution sequence?
The job size returned by the dml_get_job_size(PATH, &job_size_ptr) function seems too big. I get the value '98816' regardless of the used path (DML_PATH_HW/DML_PATH_SW).
Is a ~100 kB memory allocation for the job buffer within the expected range?
Hi! I'm afraid that the detection of supported CPU extensions is flaky. On a bunch of machines I've tested, most examples die at an AVX instruction not supported by the CPU. This includes even Pentium 4 which you explicitly list as supported!
Machines I've tested:
uarch | model | examples run |
---|---|---|
Gemini Lake | J4115 | ✗ |
Skylake | i7-6700K | ✓ |
Pinnacle Ridge | TR 2990WX | ✓ |
Pentium 4 | f15 m4 s3 | ✗ |
Braswell | N3160 | ✗ |
Pentium 4 | f15 m4 s1 | DNC (32-bit) |
Phenom 2 | Thuban | ✗ |
You check cpuid, but read only the cache info. The presence of AVX/AVX2/AVX512/etc. is readily available there.
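As a rough illustration of what such a check involves, the sketch below decodes the relevant feature bits from raw register values. The bit positions are the architectural ones (CPUID leaf 1 ECX bit 27 OSXSAVE and bit 28 AVX, leaf 7 subleaf 0 EBX bit 5 AVX2 and bit 16 AVX512F, plus the XCR0 state bits). The struct and function names are hypothetical, and obtaining the register values (for example with `__get_cpuid_count` from GCC/Clang's `<cpuid.h>` and an XGETBV read) is left out so the example stays portable.

```cpp
#include <cassert>
#include <cstdint>

// Raw values a caller would read from CPUID and XGETBV (hypothetical container).
struct cpu_registers {
    std::uint32_t leaf1_ecx;  // CPUID.(EAX=1):ECX
    std::uint32_t leaf7_ebx;  // CPUID.(EAX=7,ECX=0):EBX
    std::uint64_t xcr0;       // XGETBV(0), meaningful only if OSXSAVE is set
};

constexpr bool bit(std::uint64_t value, unsigned index) {
    return ((value >> index) & 1u) != 0;
}

// The OS must have enabled XSAVE and must save both SSE and AVX (YMM) state.
constexpr bool os_saves_ymm(const cpu_registers& r) {
    return bit(r.leaf1_ecx, 27) && (r.xcr0 & 0x6) == 0x6;
}

constexpr bool has_avx(const cpu_registers& r) {
    return bit(r.leaf1_ecx, 28) && os_saves_ymm(r);
}

constexpr bool has_avx2(const cpu_registers& r) {
    return has_avx(r) && bit(r.leaf7_ebx, 5);
}

// AVX-512 additionally needs the opmask and ZMM state bits (XCR0 bits 5-7).
constexpr bool has_avx512f(const cpu_registers& r) {
    return has_avx(r) && bit(r.leaf7_ebx, 16) && (r.xcr0 & 0xE6) == 0xE6;
}
```

A dispatcher built on checks like these would fall back to a baseline code path when the flags are absent, instead of executing an unsupported instruction.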
What is the acceptable input buffer size for the CRC operation?
I experimented with different CRC buffer sizes.
The DML library accepts different sizes, but in some cases it returns an error or even crashes with a segmentation fault.
Example execution:
[bgrzesko@fl31ca105bs0411 build]$ ./examples/low-level-api/ll_crc_example_1KB hardware_path
The example will be run on the hardware path.
Starting CRC job example.
Caclulating CRC for region of size 1KB.
Calculated CRC is: 0x2cdf6e8f
Finished successfully.
[bgrzesko@fl31ca105bs0411 build]$ ./examples/low-level-api/ll_crc_example_4MB hardware_path
The example will be run on the hardware path.
Starting CRC job example.
Caclulating CRC for region of size 4MB.
An error (15) occured during job execution.
[bgrzesko@fl31ca105bs0411 build]$ ./examples/low-level-api/ll_crc_example_16MB hardware_path
Segmentation fault (core dumped)
[bgrzesko@fl31ca105bs0411 build]$
How to reproduce (apply the diff below and compile the example):
[bgrzesko@fl31ca105bs0411 build]$ git diff
diff --git a/examples/low-level-api/crc_example.c b/examples/low-level-api/crc_example.c
index 3c12df2..ad03704 100644
--- a/examples/low-level-api/crc_example.c
+++ b/examples/low-level-api/crc_example.c
@@ -9,7 +9,8 @@
#include "dml/dml.h"
#include "examples_utils.h"
-#define BUFFER_SIZE 1024 // 1 KB
+//#define BUFFER_SIZE 4 * 1024 * 1024 // 4 MB
+#define BUFFER_SIZE 16 * 1024 * 1024 // 16 MB
/*
* This example demonstrates how to create and run a crc operation.
I've read the document, but I'm still confused about the hardware and software options.
Is DML a wrapper for Intel DSA?
Since DMA also supports asynchronous data movement, does DML support DMA as well?
And is dml::software implemented by starting another thread (or coroutine)?
Thanks in advance!
Hi,
I am unable to run hardware mode examples/tests. I did a fresh clone from master and built using GCC.
To configure the DSA and kernel, I followed the DSA user guide. I believe I have configured the DSA correctly because I can run the dsa_perf_micros scripts
e.g.
sudo ./src/dsa_perf_micros -n128 -s16k -j -c -f -i8000 -k5 -w0 -zF,F -o3
[sudo] password for user1:
./src/dsa_perf_micros -n128 -s16k -j -c -f -i8000 -k5 -w0 -zF,F -o3
-j option is deprecated (default behavior)
blen 16384
bstride 16384
bstride 16384
nb_bufs 128
pg_size 0
wq_type 0
batch_sz 1
iter 8000
nb_cpus 1
var_mmio 1
dma 1
verify 1
misc_flags 0
access_op[0] Write
access_op[1] Write
place_op[0] Memory
place_op[1] Memory
flags_cmask ffffffff
flags_smask 0
flags_nth_desc 1
nb_numa_node 16
cpu_desc_work 0
Memory affinity
CPUs in node 0: -1 -1
Buffer Offsets 0 0
GB per sec = 31.170166 cpu 6.270452 kopsrate = 1902
However, I cannot run any of the tests/examples in DML with hardware mode, e.g.
[user1@sprnode5 high-level-api]$ ./hl_mem_move_example_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Failure occurred.
[user1@sprnode5 high-level-api]$ ./hl_mem_move_example_example software_path
Executing using dml::software path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Finished successfully.
(Note I do get the same output regardless of whether I use sudo or not, I have chowned the work queues to set the group ownership to my users group.)
Similarly, all tests pass with ./tests --path=sw, and I get a very large stream of unsuccessful output with ./tests --path=hw. A small sample here
Details:
CPU: Intel (R) Xeon (R) CPU Max 9480
[user1@sprnode5 dsa_perf_micros]$ uname -r
6.3.0-2.el9.elrepo.x86_64
[user1@sprnode5 dsa_perf_micros]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.1 (Blue Onyx)"
[user1@sprnode5 dsa_perf_micros]$ gcc --version
gcc (GCC) 12.2.0
Is there anything in the setup I am forgetting/missing?
Thanks in advance,
Hamish
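The work-queue permissions mentioned above can be double-checked independently of DML with a small sketch like the one below. It is not part of DML. It assumes the work queues are exposed as device nodes named wq* under /dev/dsa (the directory is a parameter, in case the system differs), and check_work_queues and WqAccess are hypothetical names used only for illustration. It reports whether the calling user has read/write access to each node, and says nothing about whether the queue itself is configured in a way DML accepts.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>
#include <unistd.h>  // access(), R_OK, W_OK (POSIX)

namespace fs = std::filesystem;

// Hypothetical helper type: one entry per work-queue device node found.
struct WqAccess {
    std::string path;   // full path of the device node
    bool read_write;    // true if the calling user may read and write it
};

// Scans `dir` for entries whose name starts with "wq" and checks
// read/write permission for the calling user.
// Assumption: DSA work queues appear as /dev/dsa/wqX.Y on this system.
std::vector<WqAccess> check_work_queues(const std::string& dir = "/dev/dsa") {
    std::vector<WqAccess> result;
    std::error_code ec;

    // A missing directory simply yields an empty result.
    if (!fs::is_directory(dir, ec)) {
        return result;
    }

    for (fs::directory_iterator it(dir, ec), end; !ec && it != end; it.increment(ec)) {
        const std::string name = it->path().filename().string();
        if (name.rfind("wq", 0) != 0) {
            continue;  // not a work-queue node
        }
        const bool ok = ::access(it->path().c_str(), R_OK | W_OK) == 0;
        result.push_back({it->path().string(), ok});
    }
    return result;
}
```

Running this as the same user that runs the example (without sudo) and printing each path with its read_write flag shows whether a permission problem can be ruled out. An empty result would suggest that no work queue is enabled, or that the device nodes live somewhere else.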
Please update comments, or define cache-control flags for other operations.
Line 133 in 2050879
Use the performance test from PR #18 and run it multi-threaded with the following command:
./examples/dml_example_c_api_perftest 128 10000 4096 0 16
allocatate the 4096 aligned src=0x7f3a7a388000, dst=0x7f3a79b86000
jobs=0x7f3a7c82f010
jobs=0x7f3a7c83ac10
jobs=0x7f3a7c846810
jobs=0x7f3a7c852410
jobs=0x7f3a7c85e010
jobs=0x7f3a7c869c10
jobs=0x7f3a7c875810
jobs=0x7f3a7c881410
jobs=0x7f3a7c88d010
jobs=0x7f3a7c898c10
Starting example for multi-job memory move jobs=0x7f3a7c846810:
jobs=0x7f3a7c8a4810
Starting example for multi-job memory move jobs=0x7f3a7c83ac10:
Starting example for multi-job memory move jobs=0x7f3a7c82f010:
jobs=0x7f3a7c8b0410
jobs=0x7f3a7c8bc010
Starting example for multi-job memory move jobs=0x7f3a7c869c10:
Starting example for multi-job memory move jobs=0x7f3a7c85e010:
jobs=0x7f3a7c8c7c10
Starting example for multi-job memory move jobs=0x7f3a7c852410:
jobs=0x7f3a7c8d3810
Starting example for multi-job memory move jobs=0x7f3a7c8c7c10:
jobs=0x7f3a7c8df410
Starting example for multi-job memory move jobs=0x7f3a7c8d3810:
Starting example for multi-job memory move jobs=0x7f3a7c898c10:
Starting example for multi-job memory move jobs=0x7f3a7c8b0410:
Starting example for multi-job memory move jobs=0x7f3a7c8df410:
Starting example for multi-job memory move jobs=0x7f3a7c881410:
Starting example for multi-job memory move jobs=0x7f3a7c88d010:
Starting example for multi-job memory move jobs=0x7f3a7c8a4810:
Starting example for multi-job memory move jobs=0x7f3a7c8bc010:
Starting example for multi-job memory move jobs=0x7f3a7c875810:
Segmentation fault (core dumped)
Using gdb, we find that the port pointer is null:
Thread 13 "dml_example_c_a" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef1f4700 (LWP 123148)]
0x0000000000422057 in dml::core::dispatcher::hw_queue::enqueue_descriptor (this=0x62dc38 dml::core::dispatcher::instance+56, desc_ptr=0x7ffff7fac4c0) at /home/dennis/DML/sources/core/src/hw_dispatcher/hw_queue.cpp:92
92 : "a"(current_place_ptr), "d"(desc_ptr));
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.5.0-4.el8_5.x86_64 libpmem-1.6.1-1.el8.x86_64 libstdc++-8.5.0-4.el8_5.x86_64 libuuid-2.32.1-28.el8.x86_64
(gdb) p current_place_ptr
$1 = (void *) 0x0
(gdb)
From the logic, the port pointer cannot be null, since portal_mask_ never changes. In this case, however, portal_mask_ changed to zero, so I suspect that an overflow elsewhere overwrote this data and caused the issue.
According to the documentation here, the examples should be compiled; however, this is not specified in the CMakeLists.txt.
Please add this line to CMakeLists.txt so that the examples are compiled when DML is built.