Comments (8)
I already see some "small improvements" (things that could speed up the computation but shouldn't make a huge difference):
- Instead of iterating on the keys of `interval1` and then on the "position", you can iterate on `interval1.domain()`;
- Instead of the above, you can avoid a call to `interval1[position]` by iterating on `.items()` instead of `.keys()`. This is likely to double the speed, since you won't have to do `interval1[position]` to retrieve `c1`;
- It's probably more efficient to think about the distinct values rather than about the intervals (given how they are internally stored). Could you please try to use `IntervalDict.combine` (probably on a smaller example first) to see if this could do the job in a less time-consuming way?
(Yeah, I know, I said I wouldn't have time until tomorrow, but it was on my mind ;-)
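To make the `.items()` versus `.keys()` point concrete, here is a toy sketch. It does not use `portion` at all: `CountingDict` is a hypothetical plain-`dict` subclass that only counts explicit lookups, and the keys are made-up `(lower, upper)` tuples. With a real `IntervalDict`, each such lookup is presumably far more expensive than a plain dictionary access, which is why avoiding it should help.

```python
class CountingDict(dict):
    """Plain dict that counts explicit key lookups (a stand-in, not an IntervalDict)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def __getitem__(self, key):
        self.lookups += 1
        return super().__getitem__(key)


# Hypothetical data: (lower, upper) bounds mapped to a value.
interval1 = CountingDict({(0, 10): 5, (10, 20): 7, (20, 30): 5})

# Pattern 1: iterate on the keys, then look each value up again.
for position in interval1.keys():
    c1 = interval1[position]
lookups_with_keys = interval1.lookups

# Pattern 2: iterate on the items, so no extra lookup is needed.
interval1.lookups = 0
for position, c1 in interval1.items():
    pass
lookups_with_items = interval1.lookups

print(lookups_with_keys, lookups_with_items)
```

The second loop gets each value for free while iterating, so the per-key lookup disappears entirely.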
from portion.
I expected `domain` to be a bit slow given the high number of values and intervals (you said 2800 values for an average of 216 intervals, so that means that `domain` has to union ~604,800 intervals :-)).
I think it's probably better to copy and paste the code of `combine` and tweak it according to your needs, to speed up the computation. For example, it seems that `interval1` and `interval2` share a similar domain, hence there is no need to compute their intersection and their difference. Moreover, by reusing the code of `combine`, you can add a progress bar on the inner loop to keep track of the progress ;-)
Anyway, I share your feeling that an alternative implementation is probably the way to go :( I'm afraid `IntervalDict` is not suited for such a large dataset :-/
from portion.
Hello,
I'll have a look at your code and see how we can do this in a not-so-time-consuming way ;-)
I'll try to do this tomorrow if I can have some spare time ;-)
In the meantime, can you clarify what's the value of `threshold`, what's the purpose of `filtered_counter`, and what's the purpose of `new_interval*_dict`? (They seem to correspond to classical `dict`s, since neither `c1` nor `c2` is actually an interval, am I right?)
from portion.
- `threshold` is a value I need to compare values from `interval1` and `interval2` with; I have to do this for all points (or rather "same-value intervals", but I don't want to code that 😛)
- `filtered_counter` just tells me how many times I had to adjust the values of intervals
- yes, `new_interval*_dict`s are plain dicts; they let me avoid iterative insertion of intervals with different values directly into the `IntervalDict`. Basically, I'm reusing your advice of grouping same-value intervals first, then inserting them into the `IntervalDict` in one operation; thus, `c1` and `c2` are the values of intervals.
I'll need to think on and try your suggestions tomorrow, thanks!
I thought I could use `combine` here somehow, but I still need two separate (modified) outputs...
So I'm not sure how to apply it here.
To be clearer, I am comparing values from both interval dicts and modifying those values if certain conditions are met.
I need to maintain both outputs as separate entities...
from portion.
Thanks for the answer. Could you elaborate a little bit on the need to maintain two separate (modified) outputs? What if, for example, we define the function provided to `combine` in such a way that it stores the "source" of the value (either `c1` or `c2`)? Would that fit your needs? If so, I think we can end up with a not-so-ugly and not-so-time-consuming solution ;-) (e.g. `func` could be something like `(c2, 2) if c2 > 0 and float(c1) / c2 >= threshold else (c1, 1)`).
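Spelled out as a standalone function, the suggested `func` could look like the sketch below. The name `pick_with_source` and the value `2.0` for `threshold` are assumptions made only for illustration; the actual threshold is not given in the thread.

```python
def pick_with_source(c1, c2, threshold):
    """Return (value, source), where source is 1 for c1 and 2 for c2."""
    # Same condition as the one-liner above: prefer c2 when it is positive
    # and the ratio c1 / c2 reaches the threshold.
    if c2 > 0 and float(c1) / c2 >= threshold:
        return (c2, 2)
    return (c1, 1)


threshold = 2.0  # hypothetical value, not taken from the thread

print(pick_with_source(10, 4, threshold))  # ratio 2.5 reaches the threshold
print(pick_with_source(3, 4, threshold))   # ratio 0.75 is below the threshold
print(pick_with_source(3, 0, threshold))   # c2 is not positive, so c1 is kept
```

Because the source index travels with the value, a single combined result can later be split back into two outputs by filtering on the second tuple element.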
from portion.
(I was away yesterday, now back to this mini-project.)
- I could not figure out how to use `interval1.domain()`: all it gives me is a single large interval (and for some reason this single operation sometimes takes forever to complete, while other times it is fast; not sure what is happening there)
- yes, iterating on `.items()` instead of `.keys()` works nicely, and has the side benefit of showing me total sizes in progress bars 😃; it is still rather slow, though: a single position iteration takes about 4 seconds
Using tuples as values is an interesting idea!
I've just implemented and started it (basically, saving both modified c1/c2 values as a single tuple into the combined IntervalDict).
As I cannot see any progress of `.combine`, I am running it on a very small, non-representative sub-sample of only 50k positions.
This seems to be taking a while as well. I'll post when it returns.
(Looking at the code of `combine`, I wonder if the last `return IntervalDict(new_items)` will cause issues, too.)
Ok, took roughly 6 minutes.
An optimistic estimate is then 7+ hours for the full interval (not taking increased complexity into account).
I guess I should really implement this differently, after all 🙂
from portion.
I've reimplemented the whole thing with numpy arrays (turned out to be much easier than I thought) and ended up with ~30 seconds running time for everything (reading, filtering, merging same-value intervals, writing out in the original format).
I guess I didn't really need the features offered by `portion` in the first place 🤣 Oh well, it was still a good learning experience.
Thank you for all the support!
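The numpy code itself is not shown in this thread. As a rough idea of what the "merging same-value intervals" step involves, here is a pure-Python sketch using `itertools.groupby`; the function name, the inclusive `(first_pos, last_pos, value)` output layout and the sample data are all assumptions, and a numpy version would typically locate the run boundaries in a vectorised way instead.

```python
from itertools import groupby


def merge_same_value_runs(values, start=0):
    """Collapse consecutive equal values into (first_pos, last_pos, value) runs."""
    runs = []
    position = start
    for value, group in groupby(values):
        length = sum(1 for _ in group)  # number of positions in this run
        runs.append((position, position + length - 1, value))
        position += length
    return runs


# Hypothetical per-position values, one entry per position.
per_position = [5, 5, 5, 7, 7, 5, 0, 0]
print(merge_same_value_runs(per_position))
# [(0, 2, 5), (3, 4, 7), (5, 5, 5), (6, 7, 0)]
```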
from portion.
You're welcome. I'm sorry I couldn't help you much with `portion`. At least, we now know that it cannot deal with large datasets :-D
from portion.