
Comments (7)

botika commented on July 23, 2024

I think a match on lengths 0 to 8, with a for loop as the default arm, may win a little more. I don't know, I'll look at it. Very good work on these benchmarks, thank you.

from ntex.

fafhrd91 commented on July 23, 2024

I am not very proficient in asm. The new code works with bytes, the old code works with qwords. Whether qwords are faster, I don't know; you should run benchmarks.


botika commented on July 23, 2024

@pickfire The loop only ever runs a few steps; I think it would go faster without the for.

fn xor_short(buf: ShortSlice<'_>, mut mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        match buf.0.len() {
            8 => {
                *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
            }
            4 => {
                *(buf.0.as_mut_ptr() as *mut u32) ^= mask as u32;
            }
            2 => {
                *(buf.0.as_mut_ptr() as *mut u16) ^= mask as u16;
            }
            1 => {
                *(buf.0.as_mut_ptr() as *mut u8) ^= mask as u8;
            }
            _ => {
                for i in buf.0.iter_mut() {
                    *i ^= mask as u8;
                    mask >>= 8;
                }
            }
        }
    }
}

This is what works best for me.


pickfire commented on July 23, 2024

I wrote a new one (I was using a cast before, since I didn't know as_mut_ptr could be used this way):

pub fn xor_short_unsafe(buf: ShortSlice<'_>, mask: u64) {
    // This may have flipped other bits, not sure if this is safe
    unsafe {
        *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
    }
}

This should be faster since it works with a qword, but I'm not sure whether it's safe to flip the extra bytes. Can the CPU flip multiple bytes at once?

example::xor_short_unsafe:
        xor     qword ptr [rdi], rdx
        ret

The assembly is also short.

@botika The way you showed has a lot of branches, and if the length is an odd size we have to go through those branches anyway while getting the same result. I wonder if we could do it in a binary way, from 8 to 4 to 2 to 1, getting smaller each time. I didn't benchmark it to know which one is faster.
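The 8/4/2/1 idea could be sketched like this (illustrative only, not code from this thread): peel the buffer into power-of-two chunks based on the bits of its length, XORing each chunk with one wide operation. This version uses from_le_bytes/to_le_bytes so it stays in safe Rust, and assumes the mask is laid out little-endian as in the snippets here.

```rust
fn xor_short_binary(buf: &mut [u8], mut mask: u64) {
    debug_assert!(buf.len() <= 8);
    let mut rest = buf;
    for width in [8usize, 4, 2, 1] {
        if rest.len() & width == 0 {
            continue;
        }
        // Split off the next power-of-two chunk.
        let (chunk, tail) = std::mem::take(&mut rest).split_at_mut(width);
        match width {
            8 => {
                let v = u64::from_le_bytes((&*chunk).try_into().unwrap()) ^ mask;
                chunk.copy_from_slice(&v.to_le_bytes());
            }
            4 => {
                let v = u32::from_le_bytes((&*chunk).try_into().unwrap()) ^ mask as u32;
                chunk.copy_from_slice(&v.to_le_bytes());
            }
            2 => {
                let v = u16::from_le_bytes((&*chunk).try_into().unwrap()) ^ mask as u16;
                chunk.copy_from_slice(&v.to_le_bytes());
            }
            _ => chunk[0] ^= mask as u8,
        }
        // Advance the mask past the bytes just consumed.
        mask = mask.rotate_right(8 * width as u32);
        rest = tail;
    }
}
```

At most three branches are taken for any length, and every write stays within the slice.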


botika commented on July 23, 2024

The latter is not safe for len <= 4; theoretically it is also unsafe for any len < 8, but it amounts to the same thing.
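A safe route to a single-qword XOR for short slices is to round-trip through a zero-padded u64, so no out-of-bounds bytes are ever written. This is a sketch under the same little-endian mask layout as the snippets above, not code from the PR:

```rust
fn xor_short_padded(buf: &mut [u8], mask: u64) {
    debug_assert!(buf.len() <= 8);
    let len = buf.len();
    let mut tmp = [0u8; 8];
    tmp[..len].copy_from_slice(buf);        // pad with zeros
    let v = u64::from_le_bytes(tmp) ^ mask; // one wide XOR
    buf.copy_from_slice(&v.to_le_bytes()[..len]); // write back only len bytes
}
```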


pickfire commented on July 23, 2024

I did a benchmark. The one @botika showed seemed to be slightly faster when the buffer is exactly 16 elements, but at 15 elements it is as slow as the fallback. I think some other parts may cause the slowness in the smallest timings. Updated godbolt: https://rust.godbolt.org/z/bTK85W

test apply_mask_01          ... bench:      16,459 ns/iter (+/- 164)
test apply_mask_02          ... bench:      16,940 ns/iter (+/- 301)
test apply_mask_03          ... bench:      19,552 ns/iter (+/- 239)
test apply_mask_04          ... bench:      16,478 ns/iter (+/- 84)
test apply_mask_07          ... bench:      19,957 ns/iter (+/- 412)
test apply_mask_08          ... bench:      13,183 ns/iter (+/- 489)
test apply_mask_15          ... bench:      22,591 ns/iter (+/- 351)
test apply_mask_16          ... bench:      22,584 ns/iter (+/- 36)
test apply_mask_fallback_01 ... bench:      10,907 ns/iter (+/- 510)
test apply_mask_fallback_02 ... bench:      11,325 ns/iter (+/- 484)
test apply_mask_fallback_03 ... bench:      11,642 ns/iter (+/- 299)
test apply_mask_fallback_04 ... bench:      11,979 ns/iter (+/- 318)
test apply_mask_fallback_07 ... bench:      13,423 ns/iter (+/- 393)
test apply_mask_fallback_08 ... bench:      13,708 ns/iter (+/- 337)
test apply_mask_fallback_15 ... bench:      16,172 ns/iter (+/- 693)
test apply_mask_fallback_16 ... bench:      16,555 ns/iter (+/- 326)
test apply_mask_multi_01    ... bench:      12,532 ns/iter (+/- 316)
test apply_mask_multi_02    ... bench:      12,553 ns/iter (+/- 363)
test apply_mask_multi_03    ... bench:      14,420 ns/iter (+/- 513)
test apply_mask_multi_04    ... bench:      12,406 ns/iter (+/- 540)
test apply_mask_multi_07    ... bench:      14,992 ns/iter (+/- 345)
test apply_mask_multi_08    ... bench:      13,272 ns/iter (+/- 620)
test apply_mask_multi_15    ... bench:      15,604 ns/iter (+/- 314)
test apply_mask_multi_16    ... bench:      16,316 ns/iter (+/- 221)
test apply_mask_new_01      ... bench:      13,514 ns/iter (+/- 58)
test apply_mask_new_02      ... bench:      13,798 ns/iter (+/- 241)
test apply_mask_new_03      ... bench:      13,950 ns/iter (+/- 1,403)
test apply_mask_new_04      ... bench:      13,760 ns/iter (+/- 265)
test apply_mask_new_07      ... bench:      14,833 ns/iter (+/- 122)
test apply_mask_new_08      ... bench:      13,458 ns/iter (+/- 436)
test apply_mask_new_15      ... bench:      15,204 ns/iter (+/- 95)
test apply_mask_new_16      ... bench:      15,656 ns/iter (+/- 186)
test apply_mask_unsafe_01   ... bench:      12,949 ns/iter (+/- 59)
test apply_mask_unsafe_02   ... bench:      12,285 ns/iter (+/- 89)
test apply_mask_unsafe_03   ... bench:      12,944 ns/iter (+/- 22)
test apply_mask_unsafe_04   ... bench:      12,987 ns/iter (+/- 44)
test apply_mask_unsafe_07   ... bench:      12,699 ns/iter (+/- 98)
test apply_mask_unsafe_08   ... bench:      13,198 ns/iter (+/- 766)
test apply_mask_unsafe_15   ... bench:      13,451 ns/iter (+/- 28)
test apply_mask_unsafe_16   ... bench:      13,711 ns/iter (+/- 22)

Fallback seemed to perform quite well with smaller buffers. From what I can see it doesn't even contain panic code, and it uses qword operations.
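For comparison, a fully safe dword loop can be written with chunks_exact_mut; the optimizer typically widens or vectorizes this without any unsafe. This is a sketch, not code from this thread:

```rust
fn apply_mask_chunks(buf: &mut [u8], mask: [u8; 4]) {
    let mask_u32 = u32::from_ne_bytes(mask);
    let mut chunks = buf.chunks_exact_mut(4);
    for chunk in &mut chunks {
        // One 4-byte XOR per chunk; native endianness cancels out
        // because the mask is converted the same way.
        let v = u32::from_ne_bytes((&*chunk).try_into().unwrap()) ^ mask_u32;
        chunk.copy_from_slice(&v.to_ne_bytes());
    }
    // Handle the 0..=3 leftover bytes one at a time.
    for (i, byte) in chunks.into_remainder().iter_mut().enumerate() {
        *byte ^= mask[i];
    }
}
```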

Code:
#![feature(test)]

use std::ptr::copy_nonoverlapping;
use std::slice;

pub struct ShortSlice<'a>(&'a mut [u8]);

impl<'a> ShortSlice<'a> {
    unsafe fn new(slice: &'a mut [u8]) -> Self {
        // Sanity check for debug builds
        debug_assert!(slice.len() < 8);
        ShortSlice(slice)
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

#[inline]
#[allow(clippy::needless_pass_by_value)]
pub fn xor_short(buf: ShortSlice<'_>, mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        let (ptr, len) = (buf.0.as_mut_ptr(), buf.0.len());
        let mut b: u64 = 0;
        #[allow(trivial_casts)]
        copy_nonoverlapping(ptr, &mut b as *mut _ as *mut u8, len);
        b ^= mask;
        #[allow(trivial_casts)]
        copy_nonoverlapping(&b as *const _ as *const u8, ptr, len);
    }
}

#[inline]
pub fn xor_short_new(buf: ShortSlice<'_>, mut mask: u64) {
    for i in buf.0.iter_mut() {
        *i ^= mask as u8;
        mask >>= 8;
    }
}

#[inline]
pub fn xor_short_multi(buf: ShortSlice<'_>, mut mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        match buf.0.len() {
            8 => {
                *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
            }
            4 => {
                *(buf.0.as_mut_ptr() as *mut u32) ^= mask as u32;
            }
            2 => {
                *(buf.0.as_mut_ptr() as *mut u16) ^= mask as u16;
            }
            1 => {
                *(buf.0.as_mut_ptr() as *mut u8) ^= mask as u8;
            }
            _ => {
                for i in buf.0.iter_mut() {
                    *i ^= mask as u8;
                    mask >>= 8;
                }
            }
        }
    }
}

#[inline]
pub fn xor_short_unsafe(buf: ShortSlice<'_>, mask: u64) {
    // This may have flipped other bits, not sure if this is safe
    unsafe {
        *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short(tail, mask_u64);
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask_new(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_new(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_new(tail, mask_u64);
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask_multi(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_multi(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_multi(tail, mask_u64);
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask_unsafe(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_unsafe(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_unsafe(tail, mask_u64);
    }
}

#[inline]
// Splits a slice into three parts: an unaligned short head and tail, plus an aligned
// u64 mid section.
fn align_buf(buf: &mut [u8]) -> (ShortSlice<'_>, &mut [u64], ShortSlice<'_>) {
    let start_ptr = buf.as_ptr() as usize;
    let end_ptr = start_ptr + buf.len();

    // Round *up* to next aligned boundary for start
    let start_aligned = (start_ptr + 7) & !0x7;
    // Round *down* to last aligned boundary for end
    let end_aligned = end_ptr & !0x7;

    if end_aligned >= start_aligned {
        // We have our three segments (head, mid, tail)
        let (tmp, tail) = buf.split_at_mut(end_aligned - start_ptr);
        let (head, mid) = tmp.split_at_mut(start_aligned - start_ptr);

        // Unsafe: we know the middle section is correctly aligned, and the outer
        // sections are smaller than 8 bytes
        unsafe { (ShortSlice::new(head), cast_slice(mid), ShortSlice::new(tail)) }
    } else {
        // We didn't cross even one aligned boundary!

        // Unsafe: The outer sections are smaller than 8 bytes
        unsafe { (ShortSlice::new(buf), &mut [], ShortSlice::new(&mut [])) }
    }
}

#[inline]
// Unsafe: caller must ensure the buffer has the correct size and alignment
unsafe fn cast_slice(buf: &mut [u8]) -> &mut [u64] {
    // Assert correct size and alignment in debug builds
    debug_assert!(buf.len().trailing_zeros() >= 3);
    debug_assert!((buf.as_ptr() as usize).trailing_zeros() >= 3);

    slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut u64, buf.len() >> 3)
}

pub fn apply_mask_fallback(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}

extern crate test;

use test::{black_box, Bencher};

macro_rules! bench {
    ($fnname:ident, $func:ident, $unmasked:expr) => {
        #[bench]
        fn $fnname(b: &mut Bencher) {
            let mask = [0x6d, 0xb6, 0xb2, 0x80];
            let mask_u32 = u32::from_le_bytes(mask);

            b.iter(|| {
                for _ in 0..1000 {
                    let mut unmasked = black_box($unmasked);

                    $func(&mut unmasked, mask_u32);
                }
            });
        }
    }
}

macro_rules! bench_fallback {
    ($fnname:ident, $unmasked:expr) => {
        #[bench]
        fn $fnname(b: &mut Bencher) {
            let mask = [0x6d, 0xb6, 0xb2, 0x80];

            b.iter(|| {
                for _ in 0..1000 {
                    let mut unmasked = black_box($unmasked);

                    apply_mask_fallback(&mut unmasked, mask);
                }
            });
        }
    }
}

bench!(apply_mask_16, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_new_16, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_multi_16, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_unsafe_16, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench_fallback!(apply_mask_fallback_16, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);

bench!(apply_mask_15, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_new_15, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_multi_15, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_unsafe_15, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench_fallback!(apply_mask_fallback_15, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);

bench!(apply_mask_08, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_new_08, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_multi_08, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_unsafe_08, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench_fallback!(apply_mask_fallback_08, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);

bench!(apply_mask_07, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_new_07, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_multi_07, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_unsafe_07, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench_fallback!(apply_mask_fallback_07, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);

bench!(apply_mask_04, apply_mask, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_new_04, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_multi_04, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_unsafe_04, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02]);
bench_fallback!(apply_mask_fallback_04, vec![0xf3, 0x00, 0x01, 0x02]);

bench!(apply_mask_03, apply_mask, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_new_03, apply_mask_new, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_multi_03, apply_mask_multi, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_unsafe_03, apply_mask_unsafe, vec![0xf3, 0x00, 0x01]);
bench_fallback!(apply_mask_fallback_03, vec![0xf3, 0x00, 0x01]);

bench!(apply_mask_02, apply_mask, vec![0xf3, 0x00]);
bench!(apply_mask_new_02, apply_mask_new, vec![0xf3, 0x00]);
bench!(apply_mask_multi_02, apply_mask_multi, vec![0xf3, 0x00]);
bench!(apply_mask_unsafe_02, apply_mask_unsafe, vec![0xf3, 0x00]);
bench_fallback!(apply_mask_fallback_02, vec![0xf3, 0x00]);

bench!(apply_mask_01, apply_mask, vec![0xf3]);
bench!(apply_mask_new_01, apply_mask_new, vec![0xf3]);
bench!(apply_mask_multi_01, apply_mask_multi, vec![0xf3]);
bench!(apply_mask_unsafe_01, apply_mask_unsafe, vec![0xf3]);
bench_fallback!(apply_mask_fallback_01, vec![0xf3]);


fafhrd91 commented on July 23, 2024

Closing due to no activity.

