question: performant iteration of multi-thousand-points intervals about portion HOT 8 CLOSED

alexandredecan commented on May 26, 2024

question: performant iteration of multi-thousand-points intervals

from portion.

Comments (8)

AlexandreDecan commented on May 26, 2024 1

I already see some "small improvements" (things that could speed up the computation but shouldn't make a huge difference):

Instead of iterating on the keys of interval1 then on the "position", you can iterate on interval1.domain();
Instead of the above, you can avoid the call to interval1[position] by iterating on .items() instead of .keys(). This is likely to double the speed, since you won't have to evaluate interval1[position] to retrieve c1;
It's probably more efficient to think about the distinct values rather than about the intervals (given how they are internally stored). Could you please try to use IntervalDict.combine (probably on a smaller example first) to see if this could do the job in a less time-consuming way?

(Yeah, I know, I said I wouldn't have time until tomorrow, but it was on my mind ;-)


AlexandreDecan commented on May 26, 2024 1

I expected domain to be a bit slow given the high number of values and intervals (you said 2800 values for an average of 216 intervals, so that means that domain has to compute the union of ~604,800 intervals :-)).

I think it's probably better to copy&paste the code of combine and to tweak it according to your needs, to speed up the computation. For example, it seems that interval1 and interval2 share a similar domain, hence there is no need to compute their intersection and their difference. Moreover, by reusing the code of combine, you can add a progress bar on the inner loop to keep track of the progress ;-)

Anyway, I share your feeling that an alternative implementation is probably the way to go :( I'm afraid IntervalDict is not suited for such large datasets :-/


AlexandreDecan commented on May 26, 2024

Hello,

I'll have a look at your code and see how we can do this in a not-so-time-consuming way ;-)
I'll try to do this tomorrow if I can have some spare time ;-)

In the meantime, can you clarify what the value of threshold is, what the purpose of filtered_counter is, and what the purpose of new_interval*_dict is? (They seem to be classical dicts, since neither c1 nor c2 is actually an interval, am I right?)


spock commented on May 26, 2024

threshold is a value I need to compare values from interval1 and interval2 with; I have to do this for all points (or rather "same-value intervals", but I don't want to code that 😛 )
filtered_counter just tells me how many times I had to adjust the values of intervals
yes, new_interval*_dicts are plain dicts; they let me avoid iterative insertion of intervals with different values directly into the IntervalDict - basically, I'm re-using your advice of grouping same-value intervals first, then inserting them into IntervalDict in one operation; thus, c1 and c2 are the values of intervals.

I'll need to think on and try your suggestions tomorrow, thanks!

I thought I could use combine here somehow, but I still need two separate (modified) outputs...
So I'm not sure how to apply it here.
To be clearer, I am comparing values from both interval dicts, and modifying those values if certain conditions are met.
I need to maintain both outputs as separate entities...


AlexandreDecan commented on May 26, 2024

Thanks for the answer. Could you elaborate a little bit on the need to maintain two separate (modified) outputs? What if, for example, we define the function provided to combine in such a way that it stores the "source" of the value (either c1 or c2)? Would it fit your needs? If so, I think we can come up with a not-so-ugly and not-so-time-consuming solution ;-) (e.g. func could be something like (c2, 2) if c2 > 0 and float(c1) / c2 >= threshold else (c1, 1))?
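Spelled out as a standalone function, the expression above could look like the sketch below. The name pick_with_source is hypothetical, and threshold is passed in explicitly because its actual value is not given in the thread:

```python
def pick_with_source(c1, c2, threshold):
    """Return (value, source), where source is 1 for c1 and 2 for c2."""
    # Same condition as the expression suggested above: prefer c2 when
    # the ratio c1 / c2 reaches the threshold (and c2 is positive).
    if c2 > 0 and float(c1) / c2 >= threshold:
        return (c2, 2)
    return (c1, 1)


# Illustrative calls with an assumed threshold of 3.0.
high_ratio = pick_with_source(10, 2, 3.0)
low_ratio = pick_with_source(1, 2, 3.0)
zero_c2 = pick_with_source(5, 0, 3.0)
```

A function like this would be handed to combine through its how argument, with threshold bound beforehand (for instance via a lambda), so that every value in the result carries a tag saying which input it came from.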


spock commented on May 26, 2024

(I was away yesterday, now back to this mini-project.)

I could not figure out how to use interval1.domain(); all it gives me is a single large interval (and for some reason this single operation sometimes takes forever to complete, while other times it is fast; not sure what is happening there)
yes, iterating on .items() instead of .keys() works nicely, and has a side benefit of showing me total sizes in progress bars 😃 ; it is still rather slow, a single position iteration takes about 4 seconds

Using tuples as values is an interesting idea!
I've just implemented and started it (basically, saving both modified c1/c2 values as a single tuple into the combined IntervalDict).
As I cannot see any progress of .combine, I run it on a very small, non-representative sub-sample of only 50k positions.
This seems to be taking a while as well. I'll post when it returns.

(Looking at the code of combine, I wonder if the last return IntervalDict(new_items) will cause issues, too.)

Ok, took roughly 6 minutes.
An optimistic estimate is then 7+ hours for the full interval (not taking increased complexity into account).
I guess I should really implement this differently, after all 🙂


spock commented on May 26, 2024

I've reimplemented the whole thing with numpy arrays (turned out to be much easier than I thought) and ended up with ~30 seconds running time for everything (reading, filtering, merging same-value intervals, writing out in the original format).
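The numpy code itself is not shown in the thread. As a rough illustration of the "merging same-value intervals" step only, here is a pure-Python sketch using itertools.groupby instead of numpy; the function name merge_runs and the half-open (start, end, value) output format are assumptions made for this example:

```python
from itertools import groupby


def merge_runs(values, start=0):
    """Collapse per-position values into (start, end, value) runs.

    Each run covers the half-open range [start, end) of consecutive
    positions that share the same value.
    """
    runs = []
    position = start
    for value, group in groupby(values):
        length = sum(1 for _ in group)  # size of this same-value run
        runs.append((position, position + length, value))
        position += length
    return runs


# Six positions collapse into three same-value intervals.
runs = merge_runs([3, 3, 3, 0, 0, 7])
```

With numpy arrays the same run boundaries could be found by comparing each element with its neighbour, which avoids the Python-level loop on multi-thousand-point inputs.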

I guess I didn't really need the features offered by portion in the first place 🤣 Oh well, it was still a good learning experience.

Thank you for all the support!


AlexandreDecan commented on May 26, 2024

You're welcome. I'm sorry I couldn't help you much with portion. At least, we now know that it cannot deal with large datasets :-D
