
Comments (7)

skadio commented on June 18, 2024

Exactly!

This is the approach presented in our Dichotomic Pattern Mining (DPM) framework; see Frontiers'2022 and AAAI'22.

If you look at the Quick Start example in the Readme, you can see that we can mine for patterns from POSITIVE and NEGATIVE outcomes. The nice thing is that you can constrain the POS/NEG mining models independently of each other. Then one can look at frequent patterns that are unique to pos/neg, in common between pos/neg, or their union. This is exactly what the dichotomic_pattern_mining() method returns.
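
For reference, here is a minimal sketch of that step. The import paths, constructor arguments, and frequency thresholds are assumptions based on the description above and the Readme's Quick Start, and the toy sequences are illustrative:

```python
# Minimal sketch of dichotomic pattern mining (signatures assumed;
# see the Quick Start in the Readme for the authoritative version).
from sequential.seq2pat import Seq2Pat
from sequential.dpm import dichotomic_pattern_mining

# Sequences with POSITIVE and NEGATIVE outcomes (toy data)
sequences_pos = [["A", "A", "B", "A", "D"], ["A", "B", "C", "D"]]
sequences_neg = [["C", "B", "A"], ["C", "A", "C", "D"]]

# Independent mining models: constraints can be added to each separately
seq2pat_pos = Seq2Pat(sequences=sequences_pos)
seq2pat_neg = Seq2Pat(sequences=sequences_neg)

# Returns frequent patterns aggregated as unique to pos, unique to neg,
# their intersection, and their union
aggregation_to_patterns = dichotomic_pattern_mining(
    seq2pat_pos, seq2pat_neg, min_frequency_pos=2, min_frequency_neg=2)
```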

Then, we can take these patterns and "re-encode" the sequences as one-hot binary encodings denoting whether each sequence exhibits the frequent patterns found.

This reverse encoding process of turning sequences into feature vectors is actually quite involved. That's why we provide the Pattern2Feature() functionality and its get_features() method.
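
A rough sketch of that step, mirroring the names mentioned above (the comment calls it Pattern2Feature(); the class in the codebase may be named slightly differently, e.g. Pat2Feat, so treat the exact import and signature as assumptions):

```python
from sequential.pat2feat import Pat2Feat  # the Pattern2Feature() functionality

# Frequent patterns found by DPM (illustrative), e.g. the union aggregation
dpm_patterns = [["A", "B"], ["C", "D"]]

sequences = sequences_pos + sequences_neg

# Re-encode: one row per sequence, one 0-1 column per pattern,
# where 1 means the sequence exhibits that pattern
pat2feat = Pat2Feat()
encodings = pat2feat.get_features(sequences, dpm_patterns)
```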

These 0-1 encoded vectors can then be used for downstream machine learning tasks such as classification, as in your example.

Our papers use this Dichotomic Pattern Mining framework to classify digital behavior for tasks such as intent prediction and intruder detection.

The cool/novel thing is that, thanks to this transformation, one can apply standard ML algorithms (e.g., XGBoost) that are not designed for sequential data to sequences. It turns out this works quite well and is competitive with LSTMs, RNNs, etc. while remaining "interpretable".
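
For example, a hedged end-to-end sketch on top of the encodings above (standard XGBoost usage; the 'sequence' column name and the labels are assumptions for illustration):

```python
from xgboost import XGBClassifier

# 0-1 pattern-indicator features; this assumes get_features() returns a
# DataFrame with the raw sequence kept in a 'sequence' column (an assumption)
X = encodings.drop(columns=["sequence"])
y = [1] * len(sequences_pos) + [0] * len(sequences_neg)  # outcome labels

model = XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)

# Feature importances map back to explicit patterns, which is what
# keeps the model interpretable
print(dict(zip(X.columns, model.feature_importances_)))
```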

I will add more annotations to the Quick Start DPM example, and there is also a corresponding DPM notebook that you can use.

Hope this helps!

Serdar

Sandy4321 commented on June 18, 2024

Great, thanks!
It would be nice to add an XGBoost example, since there is no XGBoost in the paper.
A start-to-end example would help to understand what exactly needs to be done.

Sandy4321 commented on June 18, 2024

For example, in real data sequences may be very long, let's say 1000 tokens. Then some limitation is needed to prevent tokens located very far from each other from taking part in the same pattern-building process. For example, in one given sequence, if token A is located very far from token B, then A and B should not influence the pattern calculations...

Sandy4321 commented on June 18, 2024

Then one can look at frequent patterns that are unique to pos/neg,

But let's say pattern ABC exists 1234 times in the positive data and 5 times in the negative data.

Then why eliminate ABC from classifier tuning?

takojunior commented on June 18, 2024

For example, in real data sequences may be very long, let's say 1000 tokens. Then some limitation is needed to prevent tokens located very far from each other from taking part in the same pattern-building process. For example, in one given sequence, if token A is located very far from token B, then A and B should not influence the pattern calculations...

Yes, the aforementioned limitation can be applied by using a maximum span constraint on the indices of the items in a pattern. We have added this as a default constraint, and users can set max_span to control the span limit when initializing Seq2Pat; see this line, and the sketch below. Conversely, a user may also use a minimum span to explore patterns with a particularly large span in between.
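
A small sketch of the constraint in use, assuming max_span is a Seq2Pat constructor argument as described above (the toy data and the value 10 are illustrative):

```python
from sequential.seq2pat import Seq2Pat

long_sequences = [["A", "B", "C", "D"] * 250]  # e.g., 1000-token sequences

# Items of a mined pattern may be at most 10 positions apart; tokens
# located farther from each other cannot contribute to the same pattern
seq2pat = Seq2Pat(sequences=long_sequences, max_span=10)
patterns = seq2pat.get_patterns(min_frequency=1)
```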

takojunior commented on June 18, 2024

Then one can look at frequent patterns that are unique to pos/neg,

But let's say pattern ABC exists 1234 times in the positive data and 5 times in the negative data.

Then why eliminate ABC from classifier tuning?

Thanks @Sandy4321 for the question. This is basically hinting that pattern ABC may not be frequent in either the positive or the negative group, but its occurrence counts differ greatly, so there might still be value in including the pattern in the classifier? Comparing patterns between the pos/neg groups this way would need some further analysis. I agree that a sufficiently significant difference between a pattern's occurrences in the two groups might contribute to the classifier; but if the patterns are not frequent enough, the process would also suffer from too many arbitrary patterns and/or try too hard to capture noise, which we would want to avoid.
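
As a purely illustrative sketch of that trade-off (this helper is hypothetical and not part of seq2pat): keep a pattern only if it has some minimum support and a large occurrence imbalance between the two groups.

```python
# Hypothetical post-hoc filter, not part of seq2pat: keep patterns whose
# pos/neg occurrence counts differ strongly, even if the pattern is not
# unique to one group, while requiring minimum support to avoid noise.

def keep_discriminative(pos_counts, neg_counts, min_count=10, min_ratio=5.0):
    """pos_counts/neg_counts: dicts mapping pattern -> occurrence count."""
    kept = []
    for pattern in set(pos_counts) | set(neg_counts):
        p, n = pos_counts.get(pattern, 0), neg_counts.get(pattern, 0)
        # Require some minimum support to avoid chasing arbitrary patterns...
        if max(p, n) < min_count:
            continue
        # ...and a large imbalance between the groups (add-one smoothing)
        if (p + 1) / (n + 1) >= min_ratio or (n + 1) / (p + 1) >= min_ratio:
            kept.append(pattern)
    return kept

# Example: ABC occurs 1234 times in positive and 5 times in negative
print(keep_discriminative({("A", "B", "C"): 1234}, {("A", "B", "C"): 5}))
# -> [('A', 'B', 'C')]
```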

Sandy4321 commented on June 18, 2024

Sure
Thanks
