Code Monkey home page Code Monkey logo

Comments (14)

Firionus avatar Firionus commented on July 18, 2024 1

EDIT: I accidentally sent this out too early, please excuse the edits to my draft.

Interesting suggestion. But I'm not sure what the expected outcome would be with these missing/NaN values.

Option 1: Poison Whole Window

Statistics just returns NaN/missing if just one of the values is NaN/missing:

using Statistics

julia> median([1,2,NaN])
NaN

julia> median([1,2,missing])
missing

Is this what you would expect? That a single NaN/missing value "poisons" the entire window? That is what RollingFunctions does (note they use asymmetric window tapering at the start and no tapering at the end):

julia> using RollingFunctions

julia> runmedian([1,2,3,4], 3)
4-element Vector{Float64}:
 1.0
 1.5
 2.0
 3.0

julia> runmedian([1,2,NaN,4], 3)
4-element Vector{Float64}:
   1.0
   1.5
 NaN
 NaN

julia> runmedian([missing,2,3,4], 3)
4-element Vector{Union{Missing, Float64}}:
  missing
  missing
  missing
 3.0

Option 2: Use Sorting Convention From Base

Base.sort puts NaN as if it was the highest number (see https://docs.julialang.org/en/v1/base/base/#Base.isless):

julia> sort([NaN,1,2,Inf,-Inf])
5-element Vector{Float64}:
 -Inf
   1.0
   2.0
  Inf
 NaN

I have not tested this exhaustively (there are MANY pitfalls with comparisons in the algorithm), but that might be what FastRunningMedian currently does:

julia> using FastRunningMedian

julia> running_median([1,2,NaN,4], 3, :asymmetric)
6-element Vector{Float64}:
   1.0
   1.5
   2.0
   4.0
 NaN
   4.0

However, there's no support for missing at the moment:

julia> running_median([missing,2,3,4], 3, :asymmetric)
ERROR: MethodError: no method matching running_median(::Vector{Union{Missing, Int64}}, ::Int64, ::Symbol)

Closest candidates are:
  running_median(::AbstractVector{T}, ::Integer, ::Any) where T<:Real
   @ FastRunningMedian ~/.julia/packages/FastRunningMedian/TyoOP/src/FastRunningMedian.jl:28
  running_median(::AbstractVector{T}, ::Integer) where T<:Real
   @ FastRunningMedian ~/.julia/packages/FastRunningMedian/TyoOP/src/FastRunningMedian.jl:28

Stacktrace:
 [1] top-level scope
   @ REPL[78]:1

Option 3: Use Inverse Sorting Convention From Base

This is what you get with

julia> sort([NaN,1,2,Inf,-Inf], lt=<)
5-element Vector{Float64}:
 NaN
 -Inf
   1.0
   2.0
  Inf

I think this might be used by SortFilters (again, not tested exhaustively):

julia> using SortFilters

julia> movsort([1,2,3,4], 3, .5)
4-element Vector{Float64}:
 1.0
 1.0
 2.0
 3.0

julia> movsort([1,2,NaN,4], 3, .5)
4-element Vector{Float64}:
 1.0
 1.0
 1.0
 2.0

julia> movsort([4,3,NaN,1], 3, .5)
4-element Vector{Float64}:
   4.0
   4.0
   3.0
 NaN

julia> movsort([missing,2,3,4], 3, .5)
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
 [1] movsort!(out::Vector{Float64}, in::Vector{Union{Missing, Int64}}, window::Int64, p::Float64)
   @ SortFilters ~/.julia/packages/SortFilters/lSAIo/src/SortFilters.jl:39
 [2] movsort(in::Vector{Union{Missing, Int64}}, window::Int64, p::Float64)
   @ SortFilters ~/.julia/packages/SortFilters/lSAIo/src/SortFilters.jl:17
 [3] top-level scope
   @ REPL[56]:1

Option 4: Ignore NaN/missing Unless They are the Only Thing in Window

There's another approach, which would be to skip all NaN/missing values if there's another one in the window and return NaN/missing otherwise:

julia> nonexistent_running_median([1,2,3,4], 3, :asymmetric)
6-element Vector{Float64}:
 1.0
 1.5
 2.0
 3.0
 3.5
 4.0

julia> nonexistent_running_median([1,2,NaN,4], 3, :asymmetric)
6-element Vector{Float64}:
 1.0
 1.5
 1.5
 3.0
 4.0
 4.0

julia> nonexistent_running_median([missing,2,3,4], 3, :asymmetric)
6-element Vector{Union{Missing, Float64}}:
 missing
 2.0
 2.5
 3.0
 3.5
 4.0

The behavior here is that the window still uses the elements from its supporting points but the actual number of used elements varies depending on how many NaN/missing are in there.

In consequence, with this option I would also expect:

julia> nonexistent_running_median([missing,NaN,3,4], 3, :asymmetric)
6-element Vector{Float64}:
 missing
 missing
 3.0
 3.5
 3.5
 4.0

Conclusion

For me, Option 4 sounds the most promising right now, though also probably the hardest to implement.

Option 2/3 I dislike because whether to put NaN as lowest or highest number is a completely arbitrary choice.

Option 1 is the only one that makes any mathematical sense, but is probably less useful in practice than the other options, even if the other options introduce subtle errors (Option 2 biases the median upwards, Option 3 biases the median downwards and Option 3 biases the median left/right depending on whether the missing value is to the right/left of the window center).

Which option would you like?

from fastrunningmedian.jl.

Firionus avatar Firionus commented on July 18, 2024 1

Thanks for the comparison. Interpolating NaNs might be the best workaround yet in terms of performance πŸ‘

We can note that both A and B implementations seem to be shifted by half the window length, I don't know why though

That's because RollingFunctions.jl uses different boundary conditions than FastRunningMedian.jl. If I remeber correctly (untested code ahead), you can match the two with either:

  • running_median(in, N, tapering=:asymmetric)[1:length(in)] and running(median_ignorenan, in, N)
  • running_median(in, N, tapering=:none) and rolling(median_ignorenan, in, N)

Consult the Excel sheet in the README or the docstrings for a detailed overview of the boundary conditions in this library.

from fastrunningmedian.jl.

Firionus avatar Firionus commented on July 18, 2024 1

I got around to the implementation of Option 1 (default) and Option 4 today. πŸŽ‰

It is available on the master branch. You can install it with ]add https://github.com/Firionus/FastRunningMedian.jl.git#master. Ignore the NaNs like this:

running_median(in, w, nan=:ignore)

Default behavior is nan=:include, where a single NaN will turn the entire window NaN.

I'd appreciate if both of you could give it a try and compare it to your existing solutions. Also, if you have Feedback on Performance or the API, please feel free to write what's on your mind.

from fastrunningmedian.jl.

Firionus avatar Firionus commented on July 18, 2024 1

Maybe you would want to detect the presence of NaN or missing values and autoselect the nan=:ignore mode ?

See above:

Option 1 (nan=:include) is the only one that makes any mathematical sense, but is probably less useful in practice than the other options, even if the other options introduce subtle errors (Option 2 biases the median upwards, Option 3 biases the median downwards and Option 4 (nan=:ignore) biases the median left/right depending on whether the missing value is to the right/left of the window center).

MATLAB also does it like this and Iβ€˜m pretty convinced it is the right approach. Thatβ€˜s why nan=:include is the default.

from fastrunningmedian.jl.

Firionus avatar Firionus commented on July 18, 2024 1

Very quick benchmark comparison of how much the NaN handling slowed down the performance:

Before (0.2.0):

julia> @benchmark running_median($x, 101)
BenchmarkTools.Trial: 88 samples with 1 evaluation.
 Range (min … max):  54.663 ms … 67.524 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     56.883 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   57.382 ms Β±  1.701 ms  β”Š GC (mean Β± Οƒ):  0.19% Β± 0.47%

            β–β–β–‚β–ˆβ–„ ▁▁                                           
  β–ƒβ–β–β–β–β–ƒβ–ƒβ–ƒβ–β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–…β–ˆβ–…β–ƒβ–…β–ˆβ–β–ƒβ–…β–ƒβ–…β–β–†β–β–β–…β–ƒβ–ƒβ–ƒβ–β–β–β–β–β–ƒβ–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒ ▁
  54.7 ms         Histogram: frequency by time        62.8 ms <

 Memory estimate: 7.65 MiB, allocs estimate: 25.

After (9b17e62):

julia> @benchmark running_median($x, 101)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  57.673 ms … 72.718 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     58.444 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   58.995 ms Β±  1.753 ms  β”Š GC (mean Β± Οƒ):  0.17% Β± 0.43%

         β–‚β–ˆ   β–‚                                                
  β–„β–ˆβ–‡β–„β–β–‡β–…β–ˆβ–ˆβ–‡β–‡β–…β–ˆβ–„β–„β–β–β–„β–β–β–ˆβ–…β–…β–„β–…β–…β–…β–„β–β–…β–…β–β–β–β–β–„β–β–„β–β–„β–‡β–β–β–„β–‡β–„β–β–β–β–„β–β–…β–‡β–„β–β–„β–β–β–„ ▁
  57.7 ms         Histogram: frequency by time        60.9 ms <

 Memory estimate: 7.65 MiB, allocs estimate: 64.

That's a slowdown of 3 %. It's not great but NaNs need to be handled for correctness. I'll ship it now and we can include possible performance improvements in the next patch release.

You should soon be able to update FastRunningMedian to v0.2.1 and get the new NaN handling feature.

from fastrunningmedian.jl.

wuhuangjianCN avatar wuhuangjianCN commented on July 18, 2024

Thanks for your response. Option 4 is what I need.
Looking forward to your updates~

from fastrunningmedian.jl.

wuhuangjianCN avatar wuhuangjianCN commented on July 18, 2024

EDIT: I accidentally sent this out too early, please excuse the edits to my draft.

Interesting suggestion. But I'm not sure what the expected outcome would be with these missing/NaN values.

Option 1: Poison Whole Window

Statistics just returns NaN/missing if just one of the values is NaN/missing:

using Statistics

julia> median([1,2,NaN])
NaN

julia> median([1,2,missing])
missing

Is this what you would expect? That a single NaN/missing value "poisons" the entire window? That is what RollingFunctions does (note they use asymmetric window tapering at the start and no tapering at the end):

julia> using RollingFunctions

julia> runmedian([1,2,3,4], 3)
4-element Vector{Float64}:
 1.0
 1.5
 2.0
 3.0

julia> runmedian([1,2,NaN,4], 3)
4-element Vector{Float64}:
   1.0
   1.5
 NaN
 NaN

julia> runmedian([missing,2,3,4], 3)
4-element Vector{Union{Missing, Float64}}:
  missing
  missing
  missing
 3.0

Option 2: Use Sorting Convention From Base

Base.sort puts NaN as if it was the highest number (see https://docs.julialang.org/en/v1/base/base/#Base.isless):

julia> sort([NaN,1,2,Inf,-Inf])
5-element Vector{Float64}:
 -Inf
   1.0
   2.0
  Inf
 NaN

I have not tested this exhaustively (there are MANY pitfalls with comparisons in the algorithm), but that might be what FastRunningMedian currently does:

julia> using FastRunningMedian

julia> running_median([1,2,NaN,4], 3, :asymmetric)
6-element Vector{Float64}:
   1.0
   1.5
   2.0
   4.0
 NaN
   4.0

However, there's no support for missing at the moment:

julia> running_median([missing,2,3,4], 3, :asymmetric)
ERROR: MethodError: no method matching running_median(::Vector{Union{Missing, Int64}}, ::Int64, ::Symbol)

Closest candidates are:
  running_median(::AbstractVector{T}, ::Integer, ::Any) where T<:Real
   @ FastRunningMedian ~/.julia/packages/FastRunningMedian/TyoOP/src/FastRunningMedian.jl:28
  running_median(::AbstractVector{T}, ::Integer) where T<:Real
   @ FastRunningMedian ~/.julia/packages/FastRunningMedian/TyoOP/src/FastRunningMedian.jl:28

Stacktrace:
 [1] top-level scope
   @ REPL[78]:1

Option 3: Use Inverse Sorting Convention From Base

This is what you get with

julia> sort([NaN,1,2,Inf,-Inf], lt=<)
5-element Vector{Float64}:
 NaN
 -Inf
   1.0
   2.0
  Inf

I think this might be used by SortFilters (again, not tested exhaustively):

julia> using SortFilters

julia> movsort([1,2,3,4], 3, .5)
4-element Vector{Float64}:
 1.0
 1.0
 2.0
 3.0

julia> movsort([1,2,NaN,4], 3, .5)
4-element Vector{Float64}:
 1.0
 1.0
 1.0
 2.0

julia> movsort([4,3,NaN,1], 3, .5)
4-element Vector{Float64}:
   4.0
   4.0
   3.0
 NaN

julia> movsort([missing,2,3,4], 3, .5)
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
 [1] movsort!(out::Vector{Float64}, in::Vector{Union{Missing, Int64}}, window::Int64, p::Float64)
   @ SortFilters ~/.julia/packages/SortFilters/lSAIo/src/SortFilters.jl:39
 [2] movsort(in::Vector{Union{Missing, Int64}}, window::Int64, p::Float64)
   @ SortFilters ~/.julia/packages/SortFilters/lSAIo/src/SortFilters.jl:17
 [3] top-level scope
   @ REPL[56]:1

Option 4: Ignore NaN/missing Unless They are the Only Thing in Window

There's another approach, which would be to skip all NaN/missing values if there's another one in the window and return NaN/missing otherwise:

julia> nonexistent_running_median([1,2,3,4], 3, :asymmetric)
6-element Vector{Float64}:
 1.0
 1.5
 2.0
 3.0
 3.5
 4.0

julia> nonexistent_running_median([1,2,NaN,4], 3, :asymmetric)
6-element Vector{Float64}:
 1.0
 1.5
 1.5
 3.0
 4.0
 4.0

julia> nonexistent_running_median([missing,2,3,4], 3, :asymmetric)
6-element Vector{Union{Missing, Float64}}:
 missing
 2.0
 2.5
 3.0
 3.5
 4.0

The behavior here is that the window still uses the elements from its supporting points but the actual number of used elements varies depending on how many NaN/missing are in there.

In consequence, with this option I would also expect:

julia> nonexistent_running_median([missing,NaN,3,4], 3, :asymmetric)
6-element Vector{Float64}:
 missing
 missing
 3.0
 3.5
 3.5
 4.0

Conclusion

For me, Option 4 sounds the most promising right now, though also probably the hardest to implement.

Option 2/3 I dislike because whether to put NaN as lowest or highest number is a completely arbitrary choice.

Option 1 is the only one that makes any mathematical sense, but is probably less useful in practice than the other options, even if the other options introduce subtle errors (Option 2 biases the median upwards, Option 3 biases the median downwards and Option 3 biases the median left/right depending on whether the missing value is to the right/left of the window center).

Which option would you like?

I wrote a simple implementation for Option 4:


function medfilter(inData;edgeLen=24*2)
    # running_median that ignore nan
    q_nan=isnan.(inData)
    x=copy(inData)
    x[q_nan] .= -Inf
    a=running_median(x, edgeLen * 2 + 1)
    x[q_nan] .= Inf
    b=running_median(x, edgeLen * 2 + 1)
    # window without valid data will return NaN, because Inf - Inf = NaN
    return (a + b)./2
end

However, the performance can be improved and NaN will return when there are more than half of NaN inside a window.

a=5 .*sin.(LinRange(0,20*pi,1000))+rand(1000)
a[14:27] .= -Inf
a[:4] .= -Inf
b= running_median(a, 5 * 2 + 1)
using PyPlot
plt=get_plt()
plt.clf()
plt.plot(a,labels="input")
plt.plot(b,labels="output")

plt.xlim([0,50])
plt.gcf()

image

from fastrunningmedian.jl.

Firionus avatar Firionus commented on July 18, 2024

Thanks for your response. Option 4 is what I need.
Looking forward to your updates~

In an earlier version of your answer (I got it via mail) you mentioned the behavior of MATLAB. I looked at it and happen to really like it: https://www.mathworks.com/help/matlab/ref/movmedian.html.
By default, they use Option 1, so if there is just 1 NaN in the window, the median is also NaN. This makes a lot of sense as default behavior since it is mathematically the most sound option.
However, MATLAB also allows you to set the flag "omitnan" and get the behavior of Option 4.

I wrote a simple implementation for Option 4:

That's elegant, but not quite right. Inserting Inf or -Inf pushes the median in those windows up or down by some amount that strongly depends on the distribution of the data, and is not guaranteed to average out afterwards. It is not the same as completely omitting the NaN.

Anyway, I'd like to copy MATLAB's behavior. Here's the steps I'd like to take:

  • Write a slow reference implementation for Option 1 and 4 with RollingFunctions (I'll post them here shortly).
  • Write fuzzing tests for Option 1 and 4 based on the reference implementation
  • Implement Option 1 (default)
  • Implement Option 4
  • Documentation
  • Benchmarks

The biggest hurdle will be an elegant implementation that works well with both options. Option 1 will probably want to track the index of the newest NaN in the window, and output NaN until it is out again. Option 4 will likely need the ability to roll with less elements than the maximum window size, which currently is impossible.

I'll start with the reference implementation and I'll also do the Fuzz tests in the near future.

I'll also split out an issue regarding more data types, like Union{Float64, Missing} or Dates.

from fastrunningmedian.jl.

Firionus avatar Firionus commented on July 18, 2024

Alright, reference implementation for Option 4 is done:

julia> using RollingFunctions, Statistics

julia> function median_ignorenan(x)
         mask = @. !isnan(x)
         masked = @view x[mask]
         masked |> isempty && return NaN
         median(masked)
       end
median_ignorenan (generic function with 1 method)

julia> running(median_ignorenan, [NaN,2,3,4], 3)
4-element Vector{Float64}:
 NaN
   2.0
   2.5
   3.0

julia> running(median_ignorenan, [1,2,NaN,4], 3)
4-element Vector{Float64}:
 1.0
 1.5
 1.5
 3.0

Until FastRunningMedian supports it, the above (maybe replace running with rolling, depending on the boundary conditions you want - check the docs of RollingFunctions please!) should probably be your workaround for NaNs, even though it is less performant than this library. If you want Option 1, just use regular RollingFunctions.

from fastrunningmedian.jl.

wuhuangjianCN avatar wuhuangjianCN commented on July 18, 2024

That's elegant, but not quite right. Inserting Inf or -Inf pushes the median in those windows up or down by some amount that strongly depends on the distribution of the data, and is not guaranteed to average out afterwards. It is not the same as completely omitting the NaN.

Yes, that approah misbehaves when encounter too many NaNs inside a running window. Yet, I still use it for now because the performance of running median is critical in my task, and your package is faster than calling the pandas package by times.

Good luck with your future update~

from fastrunningmedian.jl.

berjine avatar berjine commented on July 18, 2024

Hi, could it work to just replace the median calls by nanmedian from NaNStatistics ?
Because the implementation using RollingFunctions seems much slower than the standard one from FastRunningMedian

from fastrunningmedian.jl.

Firionus avatar Firionus commented on July 18, 2024

Hi @berjine πŸ‘‹
you can benchmark whether β€˜nanmedian’ combined with RollingFunctions is faster than what I provided as a workaround above. Feel free to share your results here. I wouldn’t expect big improvements though.

Unfortunately I haven’t had much time for programming the NaN handling. The testing is in place on the β€˜master’ branch though, so if someone wants to give the implementation a go, feel free to do so and put in a PR. If you start working on it, let us know in this issue.

Last time I started working on this I remember being so fed up with modifying my complicated algorithm that I explored SkipLists some more instead πŸ˜‚. I’ll see whether I can carve out some more time for this issue in the future, but it’s a hobby project, so no guarantees.

from fastrunningmedian.jl.

berjine avatar berjine commented on July 18, 2024

Sure !

A is running(median_ignorenan, test_data, windowsize)
B is running(nanmedian, test_data, windowsize)
C is running_median(test_data, windowsize), added to see up to where it's possible to go !

With 1M points, 1% random NaN and 1001-points window :
9.557 s (6999732 allocations: 26.79 GiB) for A
11.550 s (1000007 allocations: 7.64 GiB) for B
69.380 ms (29 allocations: 7.70 MiB) for C, 138x faster than A (!!!)

Same data but 101-points window :
2.075 s (5999986 allocations: 2.61 GiB) for A
3.655 s (1000006 allocations: 869.71 MiB) for B
59.894 ms (25 allocations: 7.65 MiB) for C, 35x faster than A

With 10k points, 1% random NaN and 101-points window :
12.643 ms (59986 allocations: 26.58 MiB) for A
33.365 ms (10006 allocations: 8.66 MiB) for B
547.900 ΞΌs (25 allocations: 94.91 KiB) for C, 23x faster than A

It confirms your workaround works great, and that FastRunningMedian is fantastic haha

A possible workaround is to interpolate the NaNs to suppress them beforehand
This step take 96ms for 1M points at 1% NaNs with SteffenMonotonicInterpolation from Interpolations.jl

This works quite well I think, as long as there are no NaNs at the beginning or end of the data vector x)
image
We can note that both A and B implementations seem to be shifted by half the window length, I don't know why though

from fastrunningmedian.jl.

berjine avatar berjine commented on July 18, 2024

Fantastic ! It works with the same awesome performance !
79.248 ms (5468 allocations: 8.09 MiB) with nan=:ignore on 1M points, 1% NaN and 1001-points window

Maybe you would want to detect the presence of NaN or missing values and autoselect the nan=:ignore mode ?

from fastrunningmedian.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.