Comments (16)
what are the practical differences between a view and a list?
The view only references its source dict
and hence, does not create any duplicated storage (unlike the list
, which allocates its own storage; (partially) duplicating the one of the original dict
). You can think of the view as a reusable iterator; therefore, what comes to my mind in terms of "performance" improvements is:
- to avoid allocating unnecessary memory (especially for large collections);
- to avoid considering the whole collection when only iterating until a given condition;
- to avoid retaining strong references that may prevent garbage collection.
I think it should be quite easy to implement a view object instead of a list
Sure, this can indeed be easily achieved by inheriting from KeysView
, ValuesView
& ItemsView
(located in collections.abc
). Let me know if I can help in implementing that feature.
from portion.
Hello,
Thanks for these explanations. I think I get it now ;-)
I indeed prefer the first option, which seems cleaner and more appropriate. However, since it requires to maintain the intervals sorted and, as such, requires a lot of changes, I think we can opt for the second option as a "temporary workaround". At some point, I hope to find time and courage to write an IntervalDict
based on a kind of self-balancing red-black-tree like structure, so to achieve better performances at the same time. That won't be easy (esp. to keep the current API, since a tree-like structure requires to store atomic intervals only) and I don't have short plan for this now ;-)
from portion.
I was looking at the implementation of xView
in Python (see https://github.com/python/cpython/blob/master/Lib/_collections_abc.py#L729), and I've the impression that we cannot simply reuse the code you suggested in #34 (comment).
For example, the __contains__
method of ValuesView
(see https://github.com/python/cpython/blob/master/Lib/_collections_abc.py#L776) is, I think, unlikely to work out of the box, since keys in IntervalDict
can be Interval
or values, and in the case of an Interval
, the IntervalDict.get
method returns a new IntervalDict
. I need to check, but I guess that if we have an IntervalDict
and we instantiate a ValuesView
from it, checking for containment in this view will fail since the keys that will be tested (see line 777) will be Interval
instances, and as such, any value returned on line 778 will be an IntervalDict
(where parameter value
should, intuitively, be a value), making the comparison on line 779 fails.
Am I right, or did I miss something? :-)
from portion.
Hello,
I found some time this afternoon to have a look at this issue. I've created a PR, see #35.
@hugo-ricateau Could I ask you to have a look at it? I think it properly addresses this issue.
Thank you!
from portion.
Closed by #35
from portion.
Hello,
Thanks for reporting this!
With the exception of the "automatic updated content" of a view, what are the practical differences between a view and a list? I think it should be quite easy to implement a view object instead of a list for .items
, .values
and .keys
, but I do not see how this can lead to performance improvements ;-)
from portion.
Thanks for these explanations!
Feel free to submit a PR with these new "features" if you want, but keep me informed if you plan do to this, otherwise, I'll try to find some time to implement it by myself (I cannot say when it will be implemented, but be sure "it's not for soon" since I've a lot of other tasks to do before ;-))
from portion.
I don't have any bandwidth for now, but I'll try to submit something in the next weeks; I'll let you know when I'll start working on it.
from portion.
Do you have pointers to existing subclasses of DictView
so I can have a look at them?
from portion.
I had a quick look at portion/dict.py
, and the fastest solution to implement is probably something like
from collections.abc import KeysView, ValuesView, ItemsView
class IntervalDict(MutableMapping):
def keys(self):
return KeysView(OrderedDict(sorted(self._items, key=_sort)))
def values(self):
return ValuesView(OrderedDict(sorted(self._items, key=_sort)))
def items(self):
return ItemsView(OrderedDict(sorted(self._items, key=_sort)))
However, it would be considerably better to directly use an OrderedDict
instead of a list
as the internal storage structure (i.e. for . _items
).
Another option, even though it's somehow hacky (the @property
for _mapping
), could be something like
from collections.abc import KeysView, ValuesView, ItemsView
class IntervalDictBaseView:
def __init__(self, mapping):
self.__mapping = mapping
@property
def _mapping(self):
return sorted(self.__mapping._items, key=_sort)
class IntervalDictKeysView(IntervalDictBaseView, KeysView):
pass
class IntervalDictValuesView(IntervalDictBaseView, ValuesView):
pass
class IntervalDictItemsView(IntervalDictBaseView, ItemsView):
pass
class IntervalDict(MutableMapping):
def keys(self):
return IntervalDictKeysView(self)
def values(self):
return IntervalDictValuesView(self)
def items(self):
return IntervalDictItemsView(self)
but I think that the first option with an OrderedDict
storage would definitely be preferable.
from portion.
Thank you!
I prefer the first solution, but in that case, the ordering won't be kept in the views when the IntervalDict is modified.
One possibility is indeed to change the internal structure to a sorted list (I don't see why an OrderedDict will be required here). Moreover, having an ordered list internally will be helpful to achieve better performance in some cases. That's something I planned a long time ago - to switch to an interval-tree-like structure internally - but I couldn't find time to work on it.
from portion.
I prefer the first solution, but in that case, the ordering won't be kept in the views when the IntervalDict is modified.
My previous post was erroneous; the first option is not working as it is. It has the same drawback as the current implementation (with a list
): the view is not kept up-to-date with respect to its IntervalDict
(and it's not only about ordering).
One possibility is indeed to change the internal structure to a sorted list (I don't see why an OrderedDict will be required here).
What do you refer to by "sorted list"? As list
always respect ordering; it feels like I misunderstood your answer.
You should consider using an OrderedDict
for the following reason: view objects are referencing a mapping; therefore, either the internal storage is a mapping and you can instantiate the view with it; or the internal storage is not a mapping and you thus have to insert a mapping in between the storage and the view, breaking the "always up-to-date" feature...
So, first option will only work if the internal storage is a mapping (maintained sorted at all time); if you want to use a non mapping storage you should consider the second option, that allows to insert the mapping object in between the storage and the view on the fly.
If you think that the class can be implemented with an OrderedDict
storage, first option is probably preferable; otherwise, the second option will be the only working one.
Let me know if you start working on it so we can synchronise ourself and avoid work duplication.
from portion.
Perhaps I misunderstood something ;-) As far as I know, OrderedDict
is a mapping in which the order is based on "when" an item was inserted. In the IntervalDict
class, it is expected that .items()
, .values()
and .keys()
return a list of things sorted by "interval" (basically, by their lower bound).
At the time I wrote IntervalDict
, I wanted to have a deterministic order for these three methods (notably to allow to test the outputted content and for convenience). I somewhat arbitrarily decided to sort them by "keys in ascending order" mainly because the insertion order for disjunction of intervals is sometimes a bit difficult to maintain/compute (e.g. assume dict for [1, 3] -> 'a'; add [2] -> 'b' in it; result is [1,2[ | ]2, 3] -> 'a' and [2] -> 'b' where [1,2[ |]2, 3] is "newly created" but should be listed first since it's the result of something applied on the very first interval inserted in this dict). This is even worse in some cases, since an IntervalDict
stores intervals and not atomic intervals. Consider for instance:
Create with [1, 3] -> 'a';
Insert [5] -> 'b';
Insert [7, 8] -> 'a';
Internally, this will be stored as [1, 3] | [7, 8] -> 'a' + [5] -> 'b'. Insertion order will be difficult to provide in this case.
That's mainly why the order in which items are returned in keys()
, values()
and items()
is based on the lower bound of each key, so it is deterministic (and, to some extent, convenient).
from portion.
I totally agree that .keys()
, .values()
& .items()
should be sorted by interval (for both determinism and convenience). I was certainly not clear enough in my previous posts; so, let me try to explain what I have in mind differently.
I think that preliminary was clear to you, but let me recall it anyway (just to be sure :) ).
View objects are instantiated with a reference to a mapping (that is subsequently used to retrieve the data while iterating over the view); this is how is implemented the "always up-to-date" aspect of the views.
Now, there are two options under review:
-
The first one implements the views as
from collections.abc import KeysView, ValuesView, ItemsView class IntervalDict(MutableMapping): def keys(self): return KeysView(self.<storage>) def values(self): return ValuesView(self.<storage>) def items(self): return ItemsView(self.<storage>)
where
self.<storage>
must thus be a mapping.
Note that if the storage is not a mapping, you'll have to create a mapping object to link the view to the storage. However, such a mapping being surely static, the view won't remain up-to-date when theIntervalDict
will change (that was my mistake in the previous comment); this is equivalent to the current implementation.
Thus, as the storage must have a manageable order and must be a mapping, the natural candidate is anOrderedDict
(even though it is not the "when the item is inserted" that is of interest for us, but the ability to manage its ordering).
Note that, with this option, the storage must remain sorted at any moment. -
The second option sets aside the necessity for the storage to be a mapping: the (custom) view objects keep a reference on the
IntervalDict
itself, instead of its storage. Thanks to the_mapping
@property
, every time a custom view object is iterated, a fresh mapping object is created to establish the link between the view (its base class) and the storage.
Note that the storage container can be sorted in the course of this operation and thus it is not required that it remains sorted at any moment.
Looks like you have a preference for the first option (and I do agree ;) ). My point is that this (probably) implies to rework the whole class; this is why it may be worth considering the second option anyway. But you are most likely to estimate the amount of work necessary ;)
from portion.
Indeed, you're right; so none of options 1 or 2 will work out-of-the-box. It might be necessary to override __contains__
(and possibly __iter__
as well) in custom views. So the solution is probably in between the two suggested options; where changing the internal storage structure is not a requirement (and should certainly be skipped as a first step, event though this might improve performances and should definitely be reworked in the futur).
from portion.
I started thinking about this. I think we need, at least, to:
- Change
__iter__
inIntervalDict
so that it does not rely onkeys()
(we can reuse the current definition ofkeys()
to do so) (see https://github.com/AlexandreDecan/portion/blob/master/portion/dict.py#L293). This will allow views to reuse__iter__
with no recursive calls. KeysView
(see https://github.com/python/cpython/blob/master/Lib/_collections_abc.py#L729) can be reused as is.ValuesView
(see https://github.com/python/cpython/blob/master/Lib/_collections_abc.py#L772) will require__contains__
to be adapted. We can rely onIntervalDict.find
which does the job. For__iter__
, we can reuse the current implementation ofIntervalDict.values()
.ItemsView
(see https://github.com/python/cpython/blob/master/Lib/_collections_abc.py#L747) needs also both methods to be redefined. Again, the__iter__
can probably be implemented by taking the code fromIntervalDict.items()
. The__contains__
however is a bit trickier since it depends whether given key is a value or an interval. I have the impression that it would be sufficient, for both cases, to compareself._mapping[key] == IntervalDict([(key, value)])
but since checking for equality betweenIntervalDict
involvesIntervalDict.items()
, it is likely that it will end in recursive calls since__eq__
forItemsView
is based onSet.__eq__
which in turn is based onSet.__le__
which in turn checks forelem not in other
(which means comparing keys, and not items in our case). If I'm right, that means we need something likesorted(self._mapping[key]._items) == sorted(IntervalDict([(key, value)])._items
instead (or, if we do not want to rely on the internal API, to distinguish the cases where key is a value and key is an interval, and for the second case, to check thatself._mapping[key]
exists, is anIntervalDict
containing a single key, that this key is equal to the one provided, and that the values correspond. Another possibility is simply to rewriteIntervalDict.__eq__
so it does not rely onitems()
anymore, making the implementation ofItemsView
easier :-)
Does it seem right? I think we'll need a lot of unit tests to be sure that we do not miss something ;-)
from portion.
Related Issues (20)
- Get number of times multiple intervals overlap HOT 3
- Iteration of empty intervals is inconsistent. in general "empty := (+inf,-inf)" is problematic HOT 15
- Is this library performant enough to work with (non-atomic) intervals which span integers between 1 and 1 billion? HOT 1
- Using an external comparator? HOT 3
- iterate is broken with subclasses of Interval HOT 12
- Add __format__ method to Interval (improvement) HOT 16
- Thoughts about text-annotation use case and Pandas Ext. API HOT 5
- interval diameter (length, width, measure, range, or size) HOT 8
- Add join / merge method HOT 2
- "compatible version" specifier in setup.py confuses poetry HOT 2
- AttributeError: module 'portion.interval' has no attribute 'empty' HOT 3
- IntervalMultiDict HOT 3
- Add example for pandas in README HOT 5
- importlib error with create_api HOT 2
- Error to import interval, inf, imath from interval HOT 1
- importlib.machinery error with create_api HOT 6
- Enclosure Calculation Bug HOT 1
- Empty Calculation Bug HOT 3
- The performance issues of interval calculations in large quantities. HOT 1
- mass/Lebesgue measure? HOT 7
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from portion.