
Comments (5)

AlexandreDecan commented on July 20, 2024

Hi!

Thanks for your interest in portion! This issue seems to involve a lot of different things and to be, to some extent, related to a kinda specific use case ;-)

Give me some time to process all the information you provided and I'll come back "soon", either with some solution or (more likely) with a lot of questions :-D

from portion.

rtbs-dev commented on July 20, 2024

also, wanted to point out another group from IBM that did something similar, though their assertions for what constitutes a "span" and what is necessary to create/manipulate them are a lot stronger/larger.

It's possibly of interest to you as well, since I came across #63 and recalled they have an entire "span algebra" system, and published a paper about it.

EDIT: I thought it might be helpful to visualize the existing portion objects and how I foresee them corresponding to pandas objects

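| portion        | proposed ext.     | via pandas?                | note                                              |
|----------------|-------------------|----------------------------|---------------------------------------------------|
| P.Interval     | PortionDType      | ExtensionDType             |                                                   |
| P.iterate(...) | PortionArray      | ExtensionArray             | mostly used for indexing (pd.Index[PortionDType]) |
| P.IntervalDict | Series.span.<...> | register_series_accessor() | validates the index is PortionArray               |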


AlexandreDecan commented on July 20, 2024

> (A) is somewhat doable
> since I could manually separate the disjunctions portion creates on duplicate values (like repeated "the"). But...

Indeed, (A) is doable. For example:

>>> import portion as P
>>> d = P.IntervalDict()
>>> d[P.closed(0, 2) | P.closed(10, 12)] = "the"
>>> d
{[0,2] | [10,12]: 'the'}
>>> list(d.find("the"))
[[0,2], [10,12]]

In the above example, each element of the list is an atomic Interval.

> (B) is problematic?
> it looks like adding (0,7):Annotation(...) to an existing dict would overwrite the data at that location, or at least, require some sophisticated machinery to create a lossless "combine" function that doesn't naively join the information the way the orange/banana example does, right?

You're right, again :-) An IntervalDict can only be used to associate one "field" to ranges. I already considered implementing a kind of IntervalMultiDict where more than one "field" can be associated to a range. However, I haven't had enough time to come up with a solution that performs well (IntervalDict is already quite slow, and supporting multiple fields involves, as you said, some sophisticated machinery).

As a workaround, and depending on your exact use case, you can create one instance of IntervalDict for each field and "query" them in parallel. But if you have many fields, or if you need to traverse them to find related data, this won't be convenient.

Note that IntervalMultiDict is still in my backlog. As a first step in this direction, I have a student who is currently working on converting IntervalDict so that it relies on an interval tree structure internally. The expected speedup might be enough to consider a naive implementation of IntervalMultiDict on top of the new IntervalDict.

> One mechanism to do this is to make a PortionArray that effectively wraps an IntervalDict, since both are "containers of Interval", but I'm now wondering if that really makes sense given (A) and (B). What are your thoughts? And would such a plugin be of interest to keep inside portion, or in its own separate package (which was my original idea)?

Having a pandas accessor for Interval or even for IntervalDict is something I already considered (at least for Interval instances) but since we cannot really vectorize the operations involving intervals, it would be mostly syntactic sugar without any benefit in terms of performance, so I gave up :-)


AlexandreDecan commented on July 20, 2024

Any update on this?


AlexandreDecan commented on July 20, 2024

I'll reopen if needed.

