
Comments (7)

botika commented on July 23, 2024

I think a match on lengths 0 to 8, with a for loop as the default arm, may win a little more. I don't know, I'll look at it. Very good work on these benchmarks, thank you.

from ntex.

fafhrd91 commented on July 23, 2024

I am not very proficient in asm. The new code works with bytes, the old code works with qwords. Whether qwords are faster, I don't know; you should run benchmarks.


botika commented on July 23, 2024

@pickfire The loop only ever runs a few steps; I think it would go faster without the for.

fn xor_short(buf: ShortSlice<'_>, mut mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        match buf.0.len() {
            8 => {
                *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
            }
            4 => {
                *(buf.0.as_mut_ptr() as *mut u32) ^= mask as u32;
            }
            2 => {
                *(buf.0.as_mut_ptr() as *mut u16) ^= mask as u16;
            }
            1 => {
                *(buf.0.as_mut_ptr() as *mut u8) ^= mask as u8;
            }
            _ => {
                for i in buf.0.iter_mut() {
                    *i ^= mask as u8;
                    mask >>= 8;
                }
            }
        }
    }
}

This is what works best for me.


pickfire commented on July 23, 2024

I wrote a new one (I was using a cast before, since I didn't know as_mut_ptr could be used this way):

pub fn xor_short_unsafe(buf: ShortSlice<'_>, mask: u64) {
    // This may have flipped other bits, not sure if this is safe
    unsafe {
        *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
    }
}

This should be faster since it works with a qword, but I'm not sure whether it's safe to flip the extra bytes. Can the CPU flip multiple bytes at once?

example::xor_short_unsafe:
        xor     qword ptr [rdi], rdx
        ret

The assembly is also short.

@botika The way you showed has a lot of branches, and if the length is an odd size we have to go through those branches anyway while getting the same result. I wonder if we could do it in a binary way, from 8 to 4 to 2 to 1, getting smaller each time. I didn't benchmark it to know which one is faster.
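The 8/4/2/1 idea could be sketched like this (illustrative only, not code from this thread): peel the buffer into power-of-two chunks based on the bits of its length, XORing each chunk with one wide operation. This version uses from_le_bytes/to_le_bytes so it stays in safe Rust, and assumes the mask is laid out little-endian as in the snippets here.

```rust
fn xor_short_binary(buf: &mut [u8], mut mask: u64) {
    debug_assert!(buf.len() <= 8);
    let mut rest = buf;
    for width in [8usize, 4, 2, 1] {
        if rest.len() & width == 0 {
            continue;
        }
        // Split off the next power-of-two chunk.
        let (chunk, tail) = std::mem::take(&mut rest).split_at_mut(width);
        match width {
            8 => {
                let v = u64::from_le_bytes((&*chunk).try_into().unwrap()) ^ mask;
                chunk.copy_from_slice(&v.to_le_bytes());
            }
            4 => {
                let v = u32::from_le_bytes((&*chunk).try_into().unwrap()) ^ mask as u32;
                chunk.copy_from_slice(&v.to_le_bytes());
            }
            2 => {
                let v = u16::from_le_bytes((&*chunk).try_into().unwrap()) ^ mask as u16;
                chunk.copy_from_slice(&v.to_le_bytes());
            }
            _ => chunk[0] ^= mask as u8,
        }
        // Advance the mask past the bytes just consumed.
        mask = mask.rotate_right(8 * width as u32);
        rest = tail;
    }
}
```

At most three branches are taken for any length, and every write stays within the slice.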


botika commented on July 23, 2024

The latter is not safe for len <= 4; theoretically it is also unsafe for any len < 8, but it amounts to the same thing.
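A safe route to a single-qword XOR for short slices is to round-trip through a zero-padded u64, so no out-of-bounds bytes are ever written. This is a sketch under the same little-endian mask layout as the snippets above, not code from the PR:

```rust
fn xor_short_padded(buf: &mut [u8], mask: u64) {
    debug_assert!(buf.len() <= 8);
    let len = buf.len();
    let mut tmp = [0u8; 8];
    tmp[..len].copy_from_slice(buf);        // pad with zeros
    let v = u64::from_le_bytes(tmp) ^ mask; // one wide XOR
    buf.copy_from_slice(&v.to_le_bytes()[..len]); // write back only len bytes
}
```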


pickfire commented on July 23, 2024

I did a benchmark. The one @botika showed seemed to be slightly faster when the buffer is exactly 16 elements, but at 15 elements it is as slow as the fallback. I think some other parts may cause the slowness in the smallest timings. Updated godbolt: https://rust.godbolt.org/z/bTK85W

test apply_mask_01          ... bench:      16,459 ns/iter (+/- 164)
test apply_mask_02          ... bench:      16,940 ns/iter (+/- 301)
test apply_mask_03          ... bench:      19,552 ns/iter (+/- 239)
test apply_mask_04          ... bench:      16,478 ns/iter (+/- 84)
test apply_mask_07          ... bench:      19,957 ns/iter (+/- 412)
test apply_mask_08          ... bench:      13,183 ns/iter (+/- 489)
test apply_mask_15          ... bench:      22,591 ns/iter (+/- 351)
test apply_mask_16          ... bench:      22,584 ns/iter (+/- 36)
test apply_mask_fallback_01 ... bench:      10,907 ns/iter (+/- 510)
test apply_mask_fallback_02 ... bench:      11,325 ns/iter (+/- 484)
test apply_mask_fallback_03 ... bench:      11,642 ns/iter (+/- 299)
test apply_mask_fallback_04 ... bench:      11,979 ns/iter (+/- 318)
test apply_mask_fallback_07 ... bench:      13,423 ns/iter (+/- 393)
test apply_mask_fallback_08 ... bench:      13,708 ns/iter (+/- 337)
test apply_mask_fallback_15 ... bench:      16,172 ns/iter (+/- 693)
test apply_mask_fallback_16 ... bench:      16,555 ns/iter (+/- 326)
test apply_mask_multi_01    ... bench:      12,532 ns/iter (+/- 316)
test apply_mask_multi_02    ... bench:      12,553 ns/iter (+/- 363)
test apply_mask_multi_03    ... bench:      14,420 ns/iter (+/- 513)
test apply_mask_multi_04    ... bench:      12,406 ns/iter (+/- 540)
test apply_mask_multi_07    ... bench:      14,992 ns/iter (+/- 345)
test apply_mask_multi_08    ... bench:      13,272 ns/iter (+/- 620)
test apply_mask_multi_15    ... bench:      15,604 ns/iter (+/- 314)
test apply_mask_multi_16    ... bench:      16,316 ns/iter (+/- 221)
test apply_mask_new_01      ... bench:      13,514 ns/iter (+/- 58)
test apply_mask_new_02      ... bench:      13,798 ns/iter (+/- 241)
test apply_mask_new_03      ... bench:      13,950 ns/iter (+/- 1,403)
test apply_mask_new_04      ... bench:      13,760 ns/iter (+/- 265)
test apply_mask_new_07      ... bench:      14,833 ns/iter (+/- 122)
test apply_mask_new_08      ... bench:      13,458 ns/iter (+/- 436)
test apply_mask_new_15      ... bench:      15,204 ns/iter (+/- 95)
test apply_mask_new_16      ... bench:      15,656 ns/iter (+/- 186)
test apply_mask_unsafe_01   ... bench:      12,949 ns/iter (+/- 59)
test apply_mask_unsafe_02   ... bench:      12,285 ns/iter (+/- 89)
test apply_mask_unsafe_03   ... bench:      12,944 ns/iter (+/- 22)
test apply_mask_unsafe_04   ... bench:      12,987 ns/iter (+/- 44)
test apply_mask_unsafe_07   ... bench:      12,699 ns/iter (+/- 98)
test apply_mask_unsafe_08   ... bench:      13,198 ns/iter (+/- 766)
test apply_mask_unsafe_15   ... bench:      13,451 ns/iter (+/- 28)
test apply_mask_unsafe_16   ... bench:      13,711 ns/iter (+/- 22)

Fallback seemed to perform quite well with smaller buffers. From what I can see it doesn't even contain panic code, and it uses qword operations.
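For comparison, a fully safe dword loop can be written with chunks_exact_mut; the optimizer typically widens or vectorizes this without any unsafe. This is a sketch, not code from this thread:

```rust
fn apply_mask_chunks(buf: &mut [u8], mask: [u8; 4]) {
    let mask_u32 = u32::from_ne_bytes(mask);
    let mut chunks = buf.chunks_exact_mut(4);
    for chunk in &mut chunks {
        // One 4-byte XOR per chunk; native endianness cancels out
        // because the mask is converted the same way.
        let v = u32::from_ne_bytes((&*chunk).try_into().unwrap()) ^ mask_u32;
        chunk.copy_from_slice(&v.to_ne_bytes());
    }
    // Handle the 0..=3 leftover bytes one at a time.
    for (i, byte) in chunks.into_remainder().iter_mut().enumerate() {
        *byte ^= mask[i];
    }
}
```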

Code:
#![feature(test)]

use std::ptr::copy_nonoverlapping;
use std::slice;

pub struct ShortSlice<'a>(&'a mut [u8]);

impl<'a> ShortSlice<'a> {
    unsafe fn new(slice: &'a mut [u8]) -> Self {
        // Sanity check for debug builds
        debug_assert!(slice.len() < 8);
        ShortSlice(slice)
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

#[inline]
#[allow(clippy::needless_pass_by_value)]
pub fn xor_short(buf: ShortSlice<'_>, mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        let (ptr, len) = (buf.0.as_mut_ptr(), buf.0.len());
        let mut b: u64 = 0;
        #[allow(trivial_casts)]
        copy_nonoverlapping(ptr, &mut b as *mut _ as *mut u8, len);
        b ^= mask;
        #[allow(trivial_casts)]
        copy_nonoverlapping(&b as *const _ as *const u8, ptr, len);
    }
}

#[inline]
pub fn xor_short_new(buf: ShortSlice<'_>, mut mask: u64) {
    for i in buf.0.iter_mut() {
        *i ^= mask as u8;
        mask >>= 8;
    }
}

#[inline]
pub fn xor_short_multi(buf: ShortSlice<'_>, mut mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        match buf.0.len() {
            8 => {
                *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
            }
            4 => {
                *(buf.0.as_mut_ptr() as *mut u32) ^= mask as u32;
            }
            2 => {
                *(buf.0.as_mut_ptr() as *mut u16) ^= mask as u16;
            }
            1 => {
                *(buf.0.as_mut_ptr() as *mut u8) ^= mask as u8;
            }
            _ => {
                for i in buf.0.iter_mut() {
                    *i ^= mask as u8;
                    mask >>= 8;
                }
            }
        }
    }
}

#[inline]
pub fn xor_short_unsafe(buf: ShortSlice<'_>, mask: u64) {
    // This may have flipped other bits, not sure if this is safe
    unsafe {
        *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short(tail, mask_u64);
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask_new(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_new(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_new(tail, mask_u64);
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask_multi(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_multi(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_multi(tail, mask_u64);
    }
}

#[allow(clippy::cast_lossless)]
pub fn apply_mask_unsafe(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_unsafe(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }
    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }
    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_unsafe(tail, mask_u64);
    }
}

#[inline]
// Splits a slice into three parts: an unaligned short head and tail, plus an aligned
// u64 mid section.
fn align_buf(buf: &mut [u8]) -> (ShortSlice<'_>, &mut [u64], ShortSlice<'_>) {
    let start_ptr = buf.as_ptr() as usize;
    let end_ptr = start_ptr + buf.len();

    // Round *up* to next aligned boundary for start
    let start_aligned = (start_ptr + 7) & !0x7;
    // Round *down* to last aligned boundary for end
    let end_aligned = end_ptr & !0x7;

    if end_aligned >= start_aligned {
        // We have our three segments (head, mid, tail)
        let (tmp, tail) = buf.split_at_mut(end_aligned - start_ptr);
        let (head, mid) = tmp.split_at_mut(start_aligned - start_ptr);

        // Unsafe: we know the middle section is correctly aligned, and the outer
        // sections are smaller than 8 bytes
        unsafe { (ShortSlice::new(head), cast_slice(mid), ShortSlice::new(tail)) }
    } else {
        // We didn't cross even one aligned boundary!

        // Unsafe: The outer sections are smaller than 8 bytes
        unsafe { (ShortSlice::new(buf), &mut [], ShortSlice::new(&mut [])) }
    }
}

#[inline]
// Unsafe: caller must ensure the buffer has the correct size and alignment
unsafe fn cast_slice(buf: &mut [u8]) -> &mut [u64] {
    // Assert correct size and alignment in debug builds
    debug_assert!(buf.len().trailing_zeros() >= 3);
    debug_assert!((buf.as_ptr() as usize).trailing_zeros() >= 3);

    slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut u64, buf.len() >> 3)
}

pub fn apply_mask_fallback(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}

extern crate test;

use test::{black_box, Bencher};

macro_rules! bench {
    ($fnname:ident, $func:ident, $unmasked:expr) => {
        #[bench]
        fn $fnname(b: &mut Bencher) {
            let mask = [0x6d, 0xb6, 0xb2, 0x80];
            let mask_u32 = u32::from_le_bytes(mask);

            b.iter(|| {
                for _ in 0..1000 {
                    let mut unmasked = black_box($unmasked);

                    $func(&mut unmasked, mask_u32);
                }
            });
        }
    }
}

macro_rules! bench_fallback {
    ($fnname:ident, $unmasked:expr) => {
        #[bench]
        fn $fnname(b: &mut Bencher) {
            let mask = [0x6d, 0xb6, 0xb2, 0x80];

            b.iter(|| {
                for _ in 0..1000 {
                    let mut unmasked = black_box($unmasked);

                    apply_mask_fallback(&mut unmasked, mask);
                }
            });
        }
    }
}

bench!(apply_mask_16, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_new_16, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_multi_16, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_unsafe_16, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench_fallback!(apply_mask_fallback_16, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);

bench!(apply_mask_15, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_new_15, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_multi_15, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_unsafe_15, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench_fallback!(apply_mask_fallback_15, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);

bench!(apply_mask_08, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_new_08, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_multi_08, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_unsafe_08, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench_fallback!(apply_mask_fallback_08, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);

bench!(apply_mask_07, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_new_07, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_multi_07, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_unsafe_07, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench_fallback!(apply_mask_fallback_07, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);

bench!(apply_mask_04, apply_mask, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_new_04, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_multi_04, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_unsafe_04, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02]);
bench_fallback!(apply_mask_fallback_04, vec![0xf3, 0x00, 0x01, 0x02]);

bench!(apply_mask_03, apply_mask, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_new_03, apply_mask_new, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_multi_03, apply_mask_multi, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_unsafe_03, apply_mask_unsafe, vec![0xf3, 0x00, 0x01]);
bench_fallback!(apply_mask_fallback_03, vec![0xf3, 0x00, 0x01]);

bench!(apply_mask_02, apply_mask, vec![0xf3, 0x00]);
bench!(apply_mask_new_02, apply_mask_new, vec![0xf3, 0x00]);
bench!(apply_mask_multi_02, apply_mask_multi, vec![0xf3, 0x00]);
bench!(apply_mask_unsafe_02, apply_mask_unsafe, vec![0xf3, 0x00]);
bench_fallback!(apply_mask_fallback_02, vec![0xf3, 0x00]);

bench!(apply_mask_01, apply_mask, vec![0xf3]);
bench!(apply_mask_new_01, apply_mask_new, vec![0xf3]);
bench!(apply_mask_multi_01, apply_mask_multi, vec![0xf3]);
bench!(apply_mask_unsafe_01, apply_mask_unsafe, vec![0xf3]);
bench_fallback!(apply_mask_fallback_01, vec![0xf3]);


fafhrd91 commented on July 23, 2024

Closing due to no activity.

