
Comments (8)

AlexandreDecan commented on May 26, 2024

I already see some "small improvements" (things that could speed up the computation but shouldn't make a huge difference):

  • Instead of iterating on the keys of interval1 then on the "position", you can iterate on interval1.domain();
  • Instead of the above, you can avoid the call to interval1[position] by iterating on .items() instead of .keys(). This is likely to double the speed, since you won't have to do interval1[position] to retrieve c1;
  • It's probably more efficient to think about the distinct values rather than about the intervals (given how they are internally stored). Could you please try to use IntervalDict.combine (probably on a smaller example first) to see if this could do the job in a less time-consuming way?

(Yeah, I know, I said I wouldn't have time until tomorrow, but it was on my mind ;-)

from portion.

AlexandreDecan commented on May 26, 2024

I expected domain to be a bit slow given the high number of values and intervals (you said 2800 values for an average of 216 intervals, so that means that domain has to union ~604,800 intervals :-)).

I think it's probably better to copy&paste the code of combine and to tweak it according to your needs, to speed up the computation. For example, it seems that interval1 and interval2 share a similar domain, hence there is no need to compute their intersection and their difference. Moreover, by reusing the code of combine, you can add a progress bar on the inner loop to keep track of the progress ;-)

Anyway, I share your feeling that an alternative implementation is probably the way to go :( I'm afraid IntervalDict is not suited for such a large dataset :-/


AlexandreDecan commented on May 26, 2024

Hello,

I'll have a look at your code and see how we can do this in a not-so-time-consuming way ;-)
I'll try to do this tomorrow if I can have some spare time ;-)

In the meantime, can you clarify what the value of threshold is, what the purpose of filtered_counter is, and what the purpose of new_interval*_dict is? (They seem to correspond to classical dicts, since neither c1 nor c2 is actually an interval, am I right?)


spock commented on May 26, 2024

  • threshold is a value I need to compare values from interval1 and interval2 with; I have to do this for all points (or rather "same-value intervals", but I don't want to code that 😛 )
  • filtered_counter just tells me how many times I had to adjust the values of intervals
  • yes, new_interval*_dicts are plain dicts; they let me avoid iterative insertion of intervals with different values directly into the IntervalDict - basically, I'm re-using your advice of grouping same-value intervals first, then inserting them into IntervalDict in one operation; thus, c1 and c2 are the values of intervals.

I'll need to think on and try your suggestions tomorrow, thanks!

I thought I could use combine here somehow, but I still need two separate (modified) outputs...
So I'm not sure how to apply it here.
To be clearer, I am comparing values from both interval dicts, and modifying those values if certain conditions are met.
I need to maintain both outputs as separate entities...


AlexandreDecan commented on May 26, 2024

Thanks for the answer. Could you elaborate a little bit on the need to maintain two separate (modified) outputs? What if, for example, we define the function provided to combine in such a way that it stores the "source" of the value (either c1 or c2)? Would it fit your needs? If so, I think we can end up with a not-so-ugly and not-so-time-consuming solution ;-) (e.g. func could be something like (c2, 2) if c2 > 0 and float(c1) / c2 >= threshold else (c1, 1)) ?
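For illustration, the suggested function can be written out and tried on plain values before handing it to combine. This is only a sketch: the name pick_with_source, the THRESHOLD value and the sample pairs are hypothetical (the thread never states the real threshold), and the function is applied here to ordinary numbers rather than to an IntervalDict.

```python
# Hypothetical threshold; the actual value is not given in the thread.
THRESHOLD = 2.0


def pick_with_source(c1, c2, threshold=THRESHOLD):
    """Return (value, source), where source is 1 if c1 was kept and 2 if c2 was kept.

    Mirrors the expression from the comment above:
    (c2, 2) if c2 > 0 and float(c1) / c2 >= threshold else (c1, 1)
    """
    if c2 > 0 and float(c1) / c2 >= threshold:
        return (c2, 2)
    return (c1, 1)


# Made-up sample pairs (c1, c2), only to exercise both branches.
samples = [(10, 4), (3, 4), (5, 0)]
results = [pick_with_source(c1, c2) for c1, c2 in samples]
print(results)  # [(4, 2), (3, 1), (5, 1)]
```

With portion, a two-argument function like this would presumably be passed as the how argument of IntervalDict.combine, so that each resulting value records which input it came from.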


spock commented on May 26, 2024

(I was away yesterday, now back to this mini-project.)

  • I could not figure out how to use interval1.domain(); all it gives me is a single large interval (and for some reason this single operation sometimes takes forever to complete, but other times it is fast - not sure what is happening there)
  • yes, iterating on .items() instead of .keys() works nicely, and has the side benefit of showing me total sizes in progress bars 😃 ; it is still rather slow, a single position iteration takes about 4 seconds

Using tuples as values is an interesting idea!
I've just implemented and started it (basically, saving both modified c1/c2 values as a single tuple into the combined IntervalDict).
As I cannot see any progress of .combine, I ran it on a very small, non-representative sub-sample of only 50k positions.
This seems to be taking a while as well. I'll post when it returns.

(Looking at the code of combine, I wonder if the last return IntervalDict(new_items) will cause issues, too.)

Ok, took roughly 6 minutes.
An optimistic estimate is then 7+ hours for the full interval (not taking increased complexity into account).
I guess I should really implement this differently, after all 🙂


spock commented on May 26, 2024

I've reimplemented the whole thing with numpy arrays (turned out to be much easier than I thought) and ended up with ~30 seconds running time for everything (reading, filtering, merging same-value intervals, writing out in the original format).
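The actual numpy code is not shown in the thread. As a rough idea of what "merging same-value intervals" can look like, here is a minimal sketch in plain Python using itertools.groupby instead of numpy. It assumes one value per consecutive integer position, which is a guess at the data layout; the function name and the sample data are hypothetical.

```python
from itertools import groupby


def merge_same_value_runs(values, start=0):
    """Collapse consecutive equal values into (first, last, value) triples.

    Positions are consecutive integers beginning at `start`; both ends of
    each triple are inclusive.
    """
    runs = []
    position = start
    for value, group in groupby(values):
        length = sum(1 for _ in group)  # number of positions in this run
        runs.append((position, position + length - 1, value))
        position += length
    return runs


# Made-up per-position values, only to show the merging step.
per_position = [3, 3, 3, 0, 0, 7, 3, 3]
runs = merge_same_value_runs(per_position)
print(runs)  # [(0, 2, 3), (3, 4, 0), (5, 5, 7), (6, 7, 3)]
```

Only adjacent positions are merged, so the value 3 appears in two separate runs; this is a single linear pass over the data with no interval unions involved.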

I guess I didn't really need the features offered by portion in the first place 🤣 Oh well, it was still a good learning experience.

Thank you for all the support!


AlexandreDecan commented on May 26, 2024

You're welcome. I'm sorry I couldn't help you much with portion. At least, we now know that it cannot deal with large datasets :-D

