Comments (7)
Exactly!
This is the approach presented in our Dichotomic Pattern Mining (DPM) framework; see Frontiers'2022 and AAAI'22.
If you look at the Quick Start Example in the Readme, you can see that we can mine patterns from POSITIVE and NEGATIVE outcomes. The nice thing is that you can constrain the POS/NEG mining models independently of each other. Then one can look at frequent patterns that are unique to pos/neg, common to both pos/neg, or in their union. This is exactly what the dichotomic_pattern_mining() method returns.
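To make the four ways of combining the pos/neg patterns concrete, here is a minimal sketch in plain Python. It does not call the library: the pattern lists are made up, and the function name and dictionary keys are only illustrative, not necessarily what dichotomic_pattern_mining() uses.

```python
# Hypothetical frequent patterns, mined separately from each outcome group.
pos_patterns = [[1, 2], [2, 3], [1, 2, 3]]
neg_patterns = [[2, 3], [3, 4]]


def aggregate_patterns(pos, neg):
    """Combine two pattern lists: unique to each group, shared, and union."""
    # Lists are not hashable, so convert each pattern to a tuple first.
    pos_set = {tuple(p) for p in pos}
    neg_set = {tuple(p) for p in neg}
    return {
        "unique_pos": sorted(pos_set - neg_set),
        "unique_neg": sorted(neg_set - pos_set),
        "intersection": sorted(pos_set & neg_set),
        "union": sorted(pos_set | neg_set),
    }


aggregated = aggregate_patterns(pos_patterns, neg_patterns)
for name, patterns in aggregated.items():
    print(name, patterns)
```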
Then, we can take these patterns and "re-encode" the sequences as one-hot binary encodings that indicate whether each sequence exhibits each of the frequent patterns found.
This reverse encoding process of turning sequences into feature vectors is actually quite complicated. That's why we provide the Pattern2Feature() functionality and the get_features() method.
These 0-1 encoded vectors can be used for downstream machine learning tasks for classification, as in your example.
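As a rough illustration of the idea (not the library's implementation), the sketch below encodes each sequence as a 0-1 vector with one entry per pattern. It assumes a pattern "occurs" when its items appear in order with gaps allowed and ignores any constraints; the helper names and the data are hypothetical.

```python
def contains_pattern(sequence, pattern):
    """True if `pattern` appears in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must occur left to right.
    return all(item in it for item in pattern)


def encode(sequences, patterns):
    """Return one 0-1 feature vector per sequence, one column per pattern."""
    return [
        [1 if contains_pattern(seq, pat) else 0 for pat in patterns]
        for seq in sequences
    ]


# Hypothetical data for illustration only.
sequences = [["A", "B", "C", "D"], ["B", "A", "C"], ["D", "C", "B"]]
patterns = [["A", "C"], ["B", "C"], ["C", "D"]]

features = encode(sequences, patterns)
print(features)  # [[1, 1, 1], [1, 1, 0], [0, 0, 0]]
```

The resulting rows can be passed as a feature matrix to any standard classifier, together with the pos/neg labels.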
Our papers use this Dichotomic Pattern Mining framework to classify digital behavior, e.g., for intent prediction and intruder detection.
The cool/novel thing is that, thanks to this transformation, one can apply standard ML algorithms (e.g., XGBoost) that are not designed for sequential data to sequences. It turns out that this works quite well and is competitive with LSTMs, RNNs, etc., while remaining "interpretable".
I will add more annotations to the Quick Start DPM example, and there is also a corresponding DPM notebook that you can use.
Hope this helps!
Serdar
from seq2pat.
Great, thanks.
It would be nice to add an XGBoost example, since the paper does not include one.
A start-to-end example would help to understand exactly what needs to be done.
from seq2pat.
For example, in real data, sequences may be very long, say 1000 tokens.
Some limit is then needed to prevent tokens located very far from each other from ending up in the same pattern.
For example, if token a is located very far from token b in a given sequence, then a and b should not influence the pattern calculation together...
from seq2pat.
"Then one can look at frequent patterns that are unique to pos/neg,"
But let's say pattern ABC occurs 1234 times in the positive data and 5 times in the negative data.
Why then eliminate ABC from classifier tuning?
from seq2pat.
"For example, in real data, sequences may be very long, say 1000 tokens. Some limit is then needed to prevent tokens located very far from each other from ending up in the same pattern. For example, if token a is located very far from token b in a given sequence, then a and b should not influence the pattern calculation together..."
Yes, this limitation can be applied with a maximum span constraint on the indices of the items in a pattern. We have added it as a default constraint, and users can set max_span to control the span limit when initializing seq2pat, see this line. Conversely, users may also set a minimum span to explore patterns with a larger span between items.
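To illustrate what a maximum span on item indices means, here is a small plain-Python sketch. It is not the seq2pat implementation: it assumes the span is the index of the last matched item minus the index of the first, and the function name is hypothetical.

```python
def occurs_within_span(sequence, pattern, max_span):
    """True if `pattern` occurs in order with (last index - first index) <= max_span."""
    if not pattern:
        return True
    for start, item in enumerate(sequence):
        if item != pattern[0]:
            continue
        # For this start, greedily match each next item at its earliest
        # position, which gives the smallest possible end index.
        pos = start
        matched = True
        for wanted in pattern[1:]:
            try:
                pos = sequence.index(wanted, pos + 1)
            except ValueError:
                matched = False
                break
        if matched and pos - start <= max_span:
            return True
    return False


# A 1000-token sequence where "a" and "b" sit at opposite ends.
long_sequence = ["a"] + ["x"] * 998 + ["b"]

far_apart = occurs_within_span(long_sequence, ["a", "b"], max_span=10)
whole_seq = occurs_within_span(long_sequence, ["a", "b"], max_span=999)
print(far_apart, whole_seq)  # False True
```

With a small limit, a and b at opposite ends of the sequence no longer count as an occurrence of the pattern [a, b], which is the behavior asked for above.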
from seq2pat.
"Then one can look at frequent patterns that are unique to pos/neg,
But let's say pattern ABC occurs 1234 times in the positive data and 5 times in the negative data.
Why then eliminate ABC from classifier tuning?"
Thanks @Sandy4321 for the question. This hints that the ABC pattern may not be frequent in either the positive or the negative group, yet its occurrence counts differ substantially, so there might still be value in including it in the classifier. This might need further analysis when we compare the patterns between the pos/neg groups. I agree that a sufficiently significant difference in a pattern's occurrences between the two groups might contribute to the classifier. But if such patterns are not frequent enough, the process would also suffer from too many arbitrary patterns and/or try too hard to capture noise, which we would also want to avoid.
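One possible way to frame this trade-off, purely as a sketch and not something the library provides, is to keep a pattern only if it is reasonably frequent in at least one group and its relative support differs enough between the groups. The group sizes and thresholds below are hypothetical; only the counts 1234 and 5 come from the question.

```python
def is_discriminative(pos_count, neg_count, n_pos, n_neg, min_support, min_gap):
    """Keep a pattern if it is frequent in some group AND its support differs."""
    pos_support = pos_count / n_pos
    neg_support = neg_count / n_neg
    # Guard against noise: the pattern must be frequent in at least one group.
    frequent_somewhere = max(pos_support, neg_support) >= min_support
    # The two groups must also disagree enough to be useful for a classifier.
    large_gap = abs(pos_support - neg_support) >= min_gap
    return frequent_somewhere and large_gap


# Hypothetical group sizes; the counts 1234 and 5 are the ones from the question.
n_pos = n_neg = 10_000

abc_kept = is_discriminative(1234, 5, n_pos, n_neg, min_support=0.10, min_gap=0.05)
rare_kept = is_discriminative(3, 0, n_pos, n_neg, min_support=0.10, min_gap=0.05)
print(abc_kept, rare_kept)  # True False
```

Under these assumed thresholds, ABC would be kept, while a pattern seen only a handful of times in either group would be dropped as likely noise.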
from seq2pat.
Sure
Thanks
from seq2pat.