bitinformation.jl's Introduction

BitInformation.jl


BitInformation.jl is a package for bitwise information analysis and manipulation of Julia arrays. Based on counting the occurrences of bits in floats (or generally any bits type) across various dimensions, the package calculates quantities like the bitwise real information content, the mutual information, the redundancy, or the preserved information between arrays. From v0.5 onwards, masked arrays are also supported.
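
For a first impression, a minimal sketch of the analysis entry point (the array here is random stand-in data; the masked_value keyword follows the usage shown in the issues further down):

using BitInformation

A = rand(Float32, 100, 100)               # any bits type works, Float32 shown here
bitinformation(A)                         # 32-element vector: real information per bit position
bitinformation(A, masked_value=NaN32)     # from v0.5: entries equal to NaN32 are treated as masked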

For bitwise manipulation, BitInformation.jl also implements various rounding modes (IEEE round, shave, set_one, etc.) efficiently with bitwise operations for any number of bits. E.g. round(x,i) implements IEEE's round-to-nearest, tie-to-even for any float, retaining i mantissa bits. Furthermore, transformations like XOR-delta, bittranspose (aka bit shuffle), or signed/biased exponents are implemented.
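
As a sketch of the manipulation side (function names as listed above; exact signatures may differ between versions):

using BitInformation

A = rand(Float32, 1000)
Ar = round(A, 7)        # round-to-nearest, tie-to-even, keeping 7 mantissa bits
As = shave(A, 7)        # shave: truncate all mantissa bits beyond the 7th
At = bittranspose(A)    # bit shuffle: regroup the array by bit position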

If you'd like to propose changes or contribute in any form, create a pull request or raise an issue. Contributions are highly appreciated!

Functionality

For an overview of the functionality and an explanation, see the documentation.

Installation

BitInformation.jl is registered in the Julia Registry, so just do

julia>] add BitInformation

where ] opens the package manager. The latest version is automatically installed.

Funding

This project is funded by the Copernicus Programme through the ECMWF Summer of Weather Code 2020 and 2021.

Reference

If you use this package, please cite the following publication:

M Klöwer, M Razinger, JJ Dominguez, PD Düben and TN Palmer, 2021. Compressing atmospheric data into its real information content. Nature Computational Science 1, 713–724. doi:10.1038/s43588-021-00156-2


bitinformation.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Compressing zarr data store for simulation data

Hi,

I'm trying to work out if/how to compress simulation data stored in zarr using the xbitinfo python wrapper. For now I'm testing on a case of ~900 MB, stored as float32:
[image]

The information content looks like this:
[image]
[image]
(had to be done in two steps - different longitude coordinates)

By default, precipitation is saved in kg/m^2/s, which goes from zero (plus a few negative points) to 1e-3.
Following the basic pipeline led to compressed precipitation that looked quite a bit different (blockier, like a filled colour plot) than the uncompressed:
[image]

Physically, I prefer precipitation in mm/day, which involves multiplying by a factor of 86400. I've done a bit of experimenting with converting the data to float64 and/or multiplying with a factor (as int, the data becomes float64; as float, the data stays float32), which yields quite different results in how many bits the algorithm likes to keep:
[image]
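
To isolate the effect of the unit conversion from the float64 promotion, the two can be compared directly; a minimal Julia sketch with a random stand-in array (not the actual precipitation data):

using BitInformation

pr = rand(Float32, 360, 180) .* 1f-3     # stand-in for precipitation in kg/m^2/s
bitinformation(pr)                       # information content in the original units
bitinformation(pr .* 86400f0)            # same field in mm/day, kept as Float32
bitinformation(Float64.(pr) .* 86400)    # converted to Float64 before scaling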

I was wondering if you had some insights. For example

  • Would you recommend converting 32- to 64 bit before doing the compression?
  • I guess you would recommend applying factors before compression?
  • What's your intuition on when to accept the result of the algorithm, and when to choose a different threshold (especially with such heterogeneous data, that seems a little tricky)?
  • I'm potentially interested in applying this compression to a pipeline that keeps adding data to a zarr archive as the simulations complete. Have you tested the "appending to a zarr archive" functionalities with xbitinfo?

I also did a bit of testing of saving the tutorial (https://xbitinfo.readthedocs.io/en/latest/quick-start.html) as .nc and .zarr. I noticed that while for netCDF the file size decreases with "better" compression,
[image]
this is not the case with zarr, where, weirdly, the uncompressed store is smallest, the initial compression increases the size, and the rounded compression is smaller again but still almost a factor of 2 larger than netCDF.
[image]

Did you find this too, and do you have an explanation?

Thanks!
Matthias

Check for NaNs and raise warning

As just discussed with @observingClouds and @ayoubft, we should have a check that elements in an array have only finite values and raise a warning if not. While the algorithm does not care about the meaning of the bit patterns, the user may not get what they expect if there are many NaNs in the dataset. For example, currently we have

julia> A = rand(100,100);
julia> A[1] = NaN
NaN

julia> bitinformation(A)
64-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮

but we should throw something like

julia> for a in A
           if !isfinite(a)
               @warn "Non-finite value detected."
           end
       end
┌ Warning: Non-finite value detected.
└ @ Main REPL[8]:3
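
A compact version of that check could live in a small helper; a hypothetical sketch (name and wording illustrative, not part of the package):

function warn_nonfinite(A::AbstractArray)
    n = count(!isfinite, A)
    n > 0 && @warn "$n non-finite value(s) detected; bitinformation may not mean what you expect."
    return nothing
end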

Applying BitInformation to compress WRF model results

As mentioned by @rkouznetsov in #25 (comment), he has applied the BitInformation approach to compressing WRF model output, using this ncks script:

#!/bin/bash
infile=$1
outfile=$2

list3d='U,V,T,QVAPOR,QCLOUD,QRAIN,QICE,QSNOW,QGRAUP,CLDFRA'
prc3d="--ppc U,V,T=.3 --ppc QVAPOR,QCLOUD,QRAIN,QICE,QSNOW,QGRAUP=.7 --ppc CLDFRA=.3"

list2d='Q2,T2,PSFC,U10,V10,TSLB,SMOIS,SEAICE,SNOW,SNOWH,SSTSK,LAI,TSK,RAINC,RAINNC,SWDOWN,SWDNB,ALBEDO,UST,PBLH,HFX,QFX,ACHFX,SST'
prc2d='--ppc U10,V10,T2,SMOIS,SEAICE,SSTSK,LAI,TSK,RAINC,RAINNC,ALBEDO,SST,TSLB=.2 --ppc Q2,PBLH,UST=2'

gridvars="SINALPHA,COSALPHA,MAPFAC_MX,MAPFAC_MY,XLONG,XLAT,XLONG_U,XLAT_U,XLONG_V,XLAT_V,MAPFAC_UX,MAPFAC_UY,MAPFAC_VX,MAPFAC_VY"
othervars='XLON.*,XLAT.*,Times,ZNU,ZNW,ZS,DZS,HFX_FORCE,LH_FORCE,TSK_FORCE,HFX_FORCE_TEND,LH_FORCE_TEND,TSK_FORCE_TEND,FNM,FNP,RDNW,RDN,DNW,DN,CFN,CFN1,THIS_IS_AN_IDEAL_RUN,RDX,RDY,RESM,ZETATOP,CF1,CF2,CF3,ITIMESTEP,XTIME,P_TOP,T00,P00,TLP,TISO,TLP_STRAT,P_STRAT,SAVE_TOPO_FROM_REAL,ISEEDARR_SPPT,ISEEDARR_SKEBS,ISEEDARR_RAND_PERTURB,ISEEDARRAY_SPP_CONV,ISEEDARRAY_SPP_PBL,ISEEDARRAY_SPP_LSM,C1H,C2H,C1F,C2F,C3H,C4H,C3F,C4F'

ncks -4 -L 5 --baa=5 -v $gridvars,$othervars,$list3d,$list2d $prc3d $prc2d --cnk_dmn bottom_top,1 $infile $outfile

Understanding latitudinal bounds of bitrounding absolute error

I'm trying, and failing, to understand the error introduced by bitrounding when comparing poles and tropics.

I take surface temperature tos data from the CMIP6 historical simulation of MPI-ESM1.2-HR; the analysis says I should keep 7 mantissa bits, and I bitround. Finally, I plot the time anomaly (to get rid of the pole-tropics gradient). Similar to Klöwer et al. Fig. 3, I enforce decreasing levels of real bitwise information from left to right. (Decreasing by keepbits would be more useful.)
[figure: Fig3_ano_HR]

Below is the absolute error due to bitrounding.
[figure: Fig3_ano_error_HR]
Note: keepbits=6 might have been too strong here, as the absolute error shows spatial features and not pure random noise.

What I cannot explain is the very low absolute errors at high latitudes compared to the higher rounding errors in the tropics, and the quite abrupt cutoff at frontal lines. Does anyone have an idea? The temperature values at the poles are lower than in the tropics, but that doesn't explain this IMO.
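
One thing that may help the interpretation (it may not be the full story): for round-to-nearest with a fixed number of mantissa bits, the absolute error is bounded by half a unit in the last kept bit, so it scales with the power-of-two binade of the value. If tos is stored in °C, polar values near 0 sit in much smaller binades than tropical values around 25-30, and binade boundaries can also produce abrupt jumps in the error. A quick synthetic check (stand-in values, not the CMIP6 data):

using BitInformation

keepbits = 7
warm = 25f0 .+ 5f0 .* rand(Float32, 10_000)    # tropics-like values in the 2^4 binade
cold = 0.5f0 .* rand(Float32, 10_000)          # pole-like values near zero (if in °C)
maximum(abs.(round(warm, keepbits) .- warm))   # bounded by 2^(4-7-1) = 0.0625
maximum(abs.(round(cold, keepbits) .- cold))   # orders of magnitude smaller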

data: CMIP6 tos historical last 10 years

all slides: https://docs.google.com/presentation/d/1VeFqeCTqsFc68dN4KuKsqMXVpieFpQblu6kLioijf_U/edit#slide=id.g1257457c99f_0_78

Improve Error message when `dim` in `bitinformation(data, dim)` too short

What do you think about improving the error message when the dimension is too short to run bitinformation over it? I.e. when dim only has one or two elements, let's say because the user wants to run bitinformation over a time dimension (where physically adjacent elements are recommended and they should rather use lon or lat).

Right now the error message is: Mask has 347040 unmasked values, 0 entries are adjacent., which is correct. But maybe the user could get a more explanatory message, or at least a warning like You try to calculate bitinformation along a (too) short dimension of {length(data[dim])} elements. Maybe even continuing ... bitinformation assumes that adjacent analysis elements are also physically adjacent...
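
A hypothetical sketch of such a check (names and wording illustrative only, not the package's actual code):

function check_dim_length(A::AbstractArray, dim::Int)
    n = size(A, dim)
    if n < 2
        @warn "You try to calculate bitinformation along a (too) short dimension of $n element(s). " *
              "bitinformation assumes that adjacent analysis elements are also physically adjacent."
    end
    return nothing
end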

Came up in observingClouds/xbitinfo#97 (comment)

Smallest chunk based on statistics of random information

@rsignell-usgs asked: How small could I make the chunks if I wanted to recalculate for every chunk of a dataset?

Answer based on the statistics of random information:

TL;DR: Use n>1280 data points to calculate the keepbits of a 32-bit number format for only a small chance that the estimated bits of information are significantly higher than the true real information.

Long story, definitions:

  • random information is the information that is created because, by chance, a finite bitstream of length n can have non-zero entropy as the occurrences of 0 and 1 bits aren't exactly n/2 and n/2. In other words: there's a non-zero chance that 10 coin flips produce more or fewer than 5 tails. Each deviation from 50/50 tails/heads creates entropy, such that random information causes a higher estimate of the true real bitwise information. Think of real information + false information = entropy, where the random information is the (positive-only) error bar on the real information.

assumptions:

  • 32-bit number format with 1 bit containing 1 bit of information that’s 100% of true real information. (This is a worst case assumption: 1 bit of information spread across several bits would reduce the number of bits which can introduce random information. So does the relative contribution from other bit positions if there’s already more than 1 bit of real information)
  • We don’t want the expected random information to add more than q=0.001 i.e. 0.1% of information (one magnitude lower than the 99% we use for keepbits)

Then:
If for a given bit we use a 1% confidence level for that bit to introduce random information, then for 31 bits there's

julia> p = 1-(0.99^31)
0.26769663034560265

a p = 0.27... , i.e. ~27% chance that any bit position introduces random information. The acceptable expected random information is q = 0.001 in total, so q = p*h with h being the entropy a given bit position can add as random information:

julia> h = q/p
0.0037355718624809604

this h is reached once there are at least about n = 1280 data points:

julia> confidence = 0.99    # the 1% chance level used above
julia> n = 1280
julia> BitInformation.binom_free_entropy(n,confidence)
0.0037423512353189636

@observingClouds this ☝🏼 is also relevant for our ambitions to make keepbits spatio-temporally variable.
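
To make the n ≈ 1280 figure reproducible, a small sketch that scans n until the free entropy drops below the acceptable h (assuming binom_free_entropy behaves as in the REPL excerpt above):

using BitInformation

# scan n until the free entropy of a random 50/50 bitstream drops below h
function smallest_n(h; confidence=0.99, nmax=100_000)
    for n in 2:nmax
        BitInformation.binom_free_entropy(n, confidence) <= h && return n
    end
    return nmax
end

q = 0.001               # acceptable expected random information in total
p = 1 - 0.99^31         # ≈ 0.27, chance that any of 31 bits adds random information
smallest_n(q / p)       # ≈ 1280 under the assumptions above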

Bitinformation of masked data

As requested by @aaronspring in #25, we need to extend the bitinformation function with a method for masked arrays. This is an issue to discuss a potential implementation and to track progress. The methods we would need are

function bitinformation(A::AbstractArray{T},     # input array A
                        mask::BitArray,          # mask of A
                        kwargs...) where {T<:Union{Integer,AbstractFloat}}

Then I further suggest separating out the permutation for information along dimension dim,

function roll_dim_forward(A::AbstractArray,dim::Int)

which is basically this bit

if dim > 1
    permu = collect(1:ndims(A))           # permutation
    perm0 = vcat(permu[2:end],permu[1])   # used to permute permu
    for _ in 1:dim-1                      # create permutation array
        permute!(permu,perm0)
    end
    A = PermutedDimsArray(A,permu)        # permute A, such that desired dim is 1st dim
end
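
Assembled into the proposed helper, this could look like the following (a sketch based on the lines above):

function roll_dim_forward(A::AbstractArray, dim::Int)
    dim == 1 && return A
    permu = collect(1:ndims(A))             # permutation
    perm0 = vcat(permu[2:end], permu[1])    # used to permute permu
    for _ in 1:dim-1                        # create permutation array
        permute!(permu, perm0)
    end
    return PermutedDimsArray(A, permu)      # permute A, such that desired dim is 1st dim
end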

and moving the zeroing of insignificant information into its own function, like so

function set_insignificant_zero!(H::Vector,nelements::Int,confidence::Real)

which should basically include these lines

if set_zero_insignificant
    Hf = binom_free_entropy(nelements,confidence)  # free entropy of random 50/50 at trial size
    # get chance p for 1 (or 0) from binom distr
    @inbounds for i in eachindex(M)
        M[i] = M[i] <= Hf ? 0 : M[i]               # set zero what's insignificant
    end
end
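
Assembled into the proposed helper (with the quoted M renamed to H to match the suggested signature), a sketch could be:

function set_insignificant_zero!(H::Vector, nelements::Int, confidence::Real)
    Hf = binom_free_entropy(nelements, confidence)  # free entropy of random 50/50 at trial size
    @inbounds for i in eachindex(H)
        H[i] = H[i] <= Hf ? 0 : H[i]                # set zero what's insignificant
    end
    return H
end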

Method definition triggers warning in precompilation

[ Info: Precompiling BitInformation [de688a37-743e-4ac2-a6f0-bd62414d1aa7]
WARNING: Method definition (::Type{Base.BitArray{N} where N})(AbstractArray{T, N} where N) where {T} in module Base at bitarray.jl:503 overwritten in module BitInformation at /Users/milan/.julia/packages/BitInformation/VpaaY/src/bitarray.jl:1.
  ** incremental compilation may be fatally broken for this module **

How to implement boundary conditions with `masked_value`

Hi,
great package!

I just realised that the set_zero_insignificant argument is effectively always true when masked_value is supplied. I would expect

using BitInformation
bitinformation(Array{Float32}([1,2,3,4]), set_zero_insignificant=false)
32-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.25162916738782287
 0.25162916738782287
 0.0
 0.0
....

to be the same as

bitinformation(Array{Float32}([1,2,3,4]), set_zero_insignificant=false, masked_value=convert(Float32,NaN))
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
....

This is, however, not the case for me. I tested this behaviour with versions 0.5.0 and 0.5.1. There is probably only the set_zero_insignificant keyword missing in one of the function definitions, but my eyes have not yet adapted to the Julia language enough to pinpoint this in the code 🤣.
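
If that suspicion is right, the fix might be as small as forwarding the keyword when masked_value is given; a purely hypothetical sketch of such a forwarding (the dispatch, the names, and whether the mask method accepts this keyword are assumptions, not the package's actual code):

# purely hypothetical: build the mask and forward set_zero_insignificant to the mask method
function bitinformation_masked(A::AbstractArray{T}; masked_value::T,
                               set_zero_insignificant::Bool=true, kwargs...) where T
    mask = isequal.(A, masked_value)    # isequal so that NaN matches NaN
    return bitinformation(A, mask; set_zero_insignificant, kwargs...)
end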

Discuss best practices for `xr.Dataset.to_netcdf()`

I want to use BitInformation and BitRound on NetCDF output. Anyone with experience with xarray?

What I do: see also #25 (comment)

My routine:

  1. run plot_bitinformation.jl and save keepbits between 0 and 23 to a dict with json
  2. use code from https://github.com/zarr-developers/numcodecs/pull/299/files to apply BitRound for each variable
  3. ds_rounded.to_netcdf(path, encoding, engine='h5netcdf')

I use engine="h5netcdf" because I can use {"compression": True} in the encoding. I do this because some compression is needed for BitRound to be effective on disk storage. (I have only very little knowledge of compression and netCDF4 basics.) But probably there is a better way. What I do not want is for the compression to impact read performance too much, i.e. reading the compressed data should not take much more time (subjective, I know) than before (or is that the trade-off anyway)?

https://xarray.pydata.org/en/latest/generated/xarray.Dataset.to_netcdf.html

# https://github.com/zarr-developers/numcodecs/pull/299/files
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray, ndarray_copy

class BitRound(Codec):
    codec_id = 'bitround'

    def __init__(self, keepbits: int):
        if (keepbits < 0) or (keepbits > 23):
            raise ValueError("keepbits must be between 0 and 23")
        self.keepbits = keepbits

    def encode(self, buf):
        if self.keepbits == 23:
            return buf
        # TODO: figure out if we need to make a copy
        # Currently this appears to be overwriting the input buffer
        # Is that the right behavior?
        a = ensure_ndarray(buf).view()
        assert a.dtype == np.float32
        b = a.view(dtype=np.int32)
        maskbits = 23 - self.keepbits
        mask = (0xFFFFFFFF >> maskbits) << maskbits
        half_quantum1 = (1 << (maskbits - 1)) - 1
        b += ((b >> maskbits) & 1) + half_quantum1
        b &= mask
        return b

    def decode(self, buf, out=None):
        data = ensure_ndarray(buf).view(np.float32)
        out = ndarray_copy(data, out)
        return out

def bitround(data, keepbits):
    codec = BitRound(keepbits=keepbits)
    data = data.copy()  # otherwise overwrites the input
    encoded = codec.encode(data)
    return codec.decode(encoded)


# my code below
import xarray as xr
import sys
print('Require three args: python BitRound_file.py ifile.nc keepbits.json ofile')
assert len(sys.argv) == 3+1
_, ifile, keepbits_json, ofile = sys.argv

import json
print("keepbits_json",keepbits_json)
keepbits_lookup = json.load(open(keepbits_json))

ds = xr.open_dataset(ifile)
ds_round = ds.copy()
for v in ds.data_vars:
    if v in keepbits_lookup.keys():
        keepbits = keepbits_lookup[v]
        # here corrections could be applied
        # some hard fixes maybe
        # if keepbits < 2: 
        #     keepbits = 5
        ds_round[v].values = bitround(ds[v].values,keepbits)

ds_round=ds_round.assign_coords(ds.coords)

encoding={v:{'compression':True} for v in ds.data_vars if v in keepbits_lookup.keys()}
print('BitRound_file.py writes ', ofile)
ds_round.to_netcdf(ofile, mode='w', engine='h5netcdf', encoding=encoding)

EDIT:

Result:

>>> ds_rounded[v].encoding
{'zlib': True, 'shuffle': False, 'complevel': 4, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 5, 28, 64), 'source': '/work/bm1124/m300524/experiments/asp_esmControl_PMassim_1850_ATMsfn_over_2006/outdata/hamocc/asp_esmControl_PMassim_1850_ATMsfn_over_2006_hamocc_data_3d_ym_18590101_18591231_comp99.nc', 'original_shape': (1, 40, 220, 256), 'dtype': dtype('float32'), '_FillValue': nan, 'coordinates': 'lon lat'}

Incorrect round away from zero for keepbits=significand_bits

Originally posted by @milankl in observingClouds/xbitinfo#29 (comment)

In Ryan's numcodecs.bitround zarr-developers/numcodecs#299 it seems that a simple escape is used for keepbits=23 (because no rounding should take place).

def encode(self, buf):
    if self.keepbits == 23:
        return buf

I can easily add that for BitInformation.jl too, because at the moment we have

julia> bitstring.(A,:split)
10-element Vector{String}:
 "0 01111101 00100011110101011001100"
 "1 01111101 11111100000100010010110"
 "0 01111111 00100110111101011010101"
 "0 01111100 00000101000001101001010"
 "1 01111111 10100000111000010100000"
 "0 01111100 01011110111011000110101"
 "0 01111010 01100000001000101011100"
 "1 01111110 00101101011101001110111"
 "1 01111101 10000010000111001000111"
 "1 01111101 11000011000111110011100"

julia> bitstring.(round(A,22),:split)    # keepbits=22, all correct
10-element Vector{String}:
 "0 01111101 00100011110101011001100"    # no rounding
 "1 01111101 11111100000100010010110"    # no rounding
 "0 01111111 00100110111101011010100"    # round to zero=even (tie)
 "0 01111100 00000101000001101001010"    # no rounding
 "1 01111111 10100000111000010100000"    # no rounding
 "0 01111100 01011110111011000110100"    # round to zero=even (tie)
 "0 01111010 01100000001000101011100"    # no rounding
 "1 01111110 00101101011101001111000"    # round away from zero (with carry)
 "1 01111101 10000010000111001001000"    # round away from zero (with carry)
 "1 01111101 11000011000111110011100"    # no rounding

julia> bitstring.(round(A,23),:split)    # keepbits=23
10-element Vector{String}:
 "0 01111101 00100011110101011001100"    # no rounding, correct
 "1 01111101 11111100000100010010110"    # no rounding, correct
 "0 01111111 00100110111101011010110"    # round away from zero, incorrect
 "0 01111100 00000101000001101001010"    # no rounding, correct
 "1 01111111 10100000111000010100000"    # no rounding, correct
 "0 01111100 01011110111011000110110"    # round away from zero, incorrect
 "0 01111010 01100000001000101011100"    # no rounding, correct
 "1 01111110 00101101011101001111000"    # round away from zero, incorrect
 "1 01111101 10000010000111001001000"    # round away from zero, incorrect
 "1 01111101 11000011000111110011100"    # no rounding, correct

Meaning that for the edge case of keepbits=23 there is still some rounding away from zero possible, which obviously shouldn't happen. I'll see in a patch release how best to deal with that.
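
A sketch of what such an escape might look like on the Julia side (illustrative only, not the eventual patch):

# illustrative only: skip rounding when all 23 Float32 mantissa bits are kept
function round_or_copy(X::AbstractArray{Float32}, keepbits::Integer)
    keepbits >= 23 && return copy(X)    # nothing to round
    return round(X, keepbits)           # BitInformation's round-to-nearest otherwise
end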

@inbounds for array rounding

function Base.round(X::AbstractArray{Float32},nsb::Integer)

@inbounds is missing in these loops; adding it shaves off about 2x in benchmarks

julia> @btime round(A,5);
  11.977 ms (2 allocations: 38.15 MiB)

julia> @btime round2(A,5);
  5.447 ms (2 allocations: 38.15 MiB)

with round2 being

function round2(X::AbstractArray{Float32},nsb::Integer)
    semask = setmask32(nsb)
    s = shift32(nsb)
    shmask = ~mask32(nsb)

    Y = similar(X)                              # preallocate
    @inbounds for i in eachindex(X)
        Y[i] = round(X[i],semask,s,shmask)
    end

    return Y
end

BitInformation for data previously reduced in precision

We have a large dataset from WRF, where the providers already reduced precision on some variables before passing to us. For example:

ALBEDO:number_of_significant_digits = 5 ;          
LAI:number_of_significant_digits = 3 ;
LH:number_of_significant_digits = 5 ;   
...

Can we use this knowledge somehow to create an alternate algorithm for calculating the appropriate keepbits?
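
One simple starting point (a heuristic, not a recommendation from the package): d significant decimal digits correspond to at most about d·log2(10) ≈ 3.32·d meaningful mantissa bits, so keepbits need not exceed that.

# heuristic only: decimal significant digits -> upper bound on useful mantissa bits
keepbits_from_digits(d::Integer) = ceil(Int, d * log2(10))

keepbits_from_digits(5)    # 17
keepbits_from_digits(3)    # 10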

Where is the best place to discuss usage/interpretation/best practices?

@milankl, where is the best place to discuss usage/interpretation/best practices with these tools?

For example, I've got two questions applying these techniques to a new CONUS WRF run. I've got a CONUS404 Jupyter notebook where I started with your tutorial and wrote a little loop to go through all the variables:

[image: 2021-12-14_15-18-05]

I'm wondering what it means that many of the variables in the cell [8] figure don't have continuous distributions. Some are continuous, like V, but others, like SST, look pretty strange. Is this expected for some variables, or does it mean I did something wrong, or didn't use enough data?

I only used one 2D field here, so I was thinking of making a more robust assessment by selecting 100 or 1000 fields throughout the year and seeing what the histogram of keepbits for each variable looks like, and then picking something on the conservative side to apply. Does that sound reasonable?
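
For the histogram idea, a rough sketch of how the per-field estimates could be collected (this is not the package's official keepbits rule; the 99% threshold and the 9 sign/exponent bits of Float32 are assumptions of this sketch):

using BitInformation

# mantissa bits needed to capture 99% of the total information of one 2D field
function keepbits99(A::AbstractArray{Float32})
    ic = bitinformation(A)                 # information per bit position (32 entries)
    total = sum(ic)
    total == 0 && return 0
    cdf = cumsum(ic) ./ total
    return max(0, findfirst(>=(0.99), cdf) - 9)   # subtract 1 sign + 8 exponent bits
end

# fields could be e.g. 100 time slices of one variable:
# histogram([keepbits99(f) for f in fields]) shows how stable the estimate is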

use of bitinformation(dim)

I don't quite understand the dim argument in bitinformation and its implications. Can I just ignore it and use the default dim=1?

@test bi1[i] ≈ bi2[i] atol=1e-3
@test bi2[i] ≈ bi3[i] atol=1e-3
@test bi1[i] ≈ bi3[i] atol=1e-3
seems like dim only matters for sorted dimensions, i.e. dim doesn't matter on raw data.

Your example plots in https://doi.org/10.24433/CO.8682392.v1 are using dim=1 meaning longitude. I have data along dimensions longitude, latitude and time and somehow intuitively would run the analysis along time.
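
For reference, a minimal sketch of running the analysis along different dimensions (whether dim is positional or a keyword depends on the package version; it is shown as a keyword here):

using BitInformation

A = rand(Float32, 360, 180, 120)    # lon x lat x time stand-in
bitinformation(A)                   # along the first dimension (longitude), the default
bitinformation(A, dim=2)            # along latitude
bitinformation(A, dim=3)            # along time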
