Comments (5)
Hi!
Thanks for your interest in portion! This issue seems to involve a lot of different things and to be, to some extent, related to a kinda specific use case ;-)
Give me some time to process all the information you provided and I'll come back "soon", either with some solution or (more likely) with a lot of questions :-D
also, wanted to point out another group from IBM that did something similar, though their assertions for what constitutes a "span" and what is necessary to create/manipulate them are a lot stronger/larger.
It's possibly of interest to you as well, since I came across #63 and recalled they have an entire "span algebra" system, and published a paper about it.
EDIT: I thought it might be helpful to visualize the existing portion objects and how I foresee them corresponding to pandas objects:

portion | proposed ext. | via pandas? | note |
---|---|---|---|
P.Interval | PortionDType | ExtensionDType | |
P.iterate(...) | PortionArray | ExtensionArray | mostly used for indexing (pd.Index[PortionDType]) |
P.IntervalDict | Series.span.<...> | register_series_accessor() | validates the index is PortionArray |
(A) is somewhat doable since I could manually separate the disjunctions portion creates on duplicate values (like repeated "the"). But...
Indeed, (A) is doable. For example:
>>> import portion as P
>>> d = P.IntervalDict()
>>> d[P.closed(0, 2) | P.closed(10, 12)] = "the"
>>> d
{[0,2] | [10,12]: 'the'}
>>> list(d.find("the"))
[[0,2], [10,12]]
In the above example, each element of the list is an atomic Interval.
(B) is problematic? it looks like adding (0,7):Annotation(...) to an existing dict would overwrite the data at that location, or at least, require some sophisticated machinery to create a lossless "combine" function that doesn't naively join the information the way the orange/banana example does, right?
You're right, again :-) An IntervalDict can only be used to associate one "field" to ranges. I already considered implementing a kind of IntervalMultiDict where more than one "field" can be associated to a range. However, I haven't had enough time to come up with a solution that performs well (IntervalDict is already quite slow, and supporting multiple fields involves, as you said, some sophisticated machinery).
As a workaround, and depending on your exact use case, you can create one instance of IntervalDict for each field and "query" them in parallel. But if you have many fields, or if you need to traverse them to find related data, this won't be convenient.
Note that an IntervalMultiDict is still in my backlog. As a first step in this direction, I have a student who is currently working on converting IntervalDict so that it relies on an interval tree structure internally. The expected speed-up might be enough to consider a naive implementation of IntervalMultiDict on top of the new IntervalDict.
One mechanism to do this is to make a PortionArray that effectively wraps an IntervalDict, since both are "containers of Interval", but I'm now wondering if that really makes sense given (A) and (B). What are your thoughts? And would such a plugin be of interest to keep inside portion, or in its own separate package (which was my original idea)?
Having a pandas accessor for Interval or even for IntervalDict is something I already considered (at least for Interval instances), but since we cannot really vectorize the operations involving intervals, it would be mostly syntactic sugar without any benefit in terms of performance, so I gave up :-)
Any update on this?
I'll reopen if needed.